* WriteBack Throttle kills the performance of the disk
@ 2014-10-13 10:18 Nicheal
  2014-10-13 13:29 ` Mark Nelson
  0 siblings, 1 reply; 10+ messages in thread
From: Nicheal @ 2014-10-13 10:18 UTC (permalink / raw)
  To: ceph-devel

Hi,

I'm currently finding that enabling the WritebackThrottle leads to lower
IOPS for workloads with a large number of small writes. Since the
WritebackThrottle calls fdatasync(fd) to flush an object's content to
disk, a large number of random small writes causes it to submit only one
or two 4k IOs at a time.
That is much slower than the global sync in FileStore::sync_entry().
Note: here I use xfs as the underlying FileStore filesystem. So I would
like to know whether there is any impact if I disable the writeback
throttle. I cannot follow the reasoning on the website
(http://ceph.com/docs/master/dev/osd_internals/wbthrottle/): a large
number of dirty inodes does make a sync take longer, but submitting a
batch of writes to disk is always faster than submitting a few IO
updates at a time.
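
For reference, a minimal ceph.conf snippet to disable it for a test run
(assuming the filestore_wbthrottle_enable option that the code exposes)
would just be:

[osd]
    filestore_wbthrottle_enable = false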

Nicheal


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-13 10:18 WriteBack Throttle kills the performance of the disk Nicheal
@ 2014-10-13 13:29 ` Mark Nelson
  2014-10-13 19:50   ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Nelson @ 2014-10-13 13:29 UTC (permalink / raw)
  To: Nicheal, ceph-devel

On 10/13/2014 05:18 AM, Nicheal wrote:
> Hi,
>
> I'm currently finding that enable WritebackThrottle lead to lower IOPS
> for large number of small io. Since WritebackThrottle calls
> fdatasync(fd) to flush an object content to disk, large number of
> ramdom small io always cause the WritebackThrottle to submit one or
> two 4k io every time.
> Thus, it is much slower than the global sync in
> FileStore::sync_entry().  Note:: here, I use xfs as the FileStore
> underlying filesystem. So I would know that if any impact when I
> disable Writeback throttles. I cannot catch the idea on the website
> (http://ceph.com/docs/master/dev/osd_internals/wbthrottle/).
> Large number of inode will cause longer time to sync, but submitting a
> batch of write to disk always faster than submitting few io update to
> the disk.

Hi Nicheal,

When the wbthrottle code was introduced back around dumpling we had to 
increase the sync intervals quite a bit to get it performing similarly 
to cuttlefish.  Have you tried playing with the various wbthrottle xfs 
tuneables to see if you can improve the behaviour?

OPTION(filestore_wbthrottle_enable, OPT_BOOL, true)
OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 41943040)
OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 419430400)
OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 5000)
OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 500)
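
These map directly to ceph.conf settings, so an experiment might look
something like this (purely illustrative numbers, roughly doubling the
start_flusher targets, not a recommendation):

[osd]
    filestore_wbthrottle_xfs_bytes_start_flusher = 83886080
    filestore_wbthrottle_xfs_ios_start_flusher = 1000
    filestore_wbthrottle_xfs_inodes_start_flusher = 1000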

Mark

>
> Nicheal



* Re: WriteBack Throttle kills the performance of the disk
  2014-10-13 13:29 ` Mark Nelson
@ 2014-10-13 19:50   ` Gregory Farnum
  2014-10-14  5:15     ` Nicheal
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2014-10-13 19:50 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Nicheal, ceph-devel

On Mon, Oct 13, 2014 at 6:29 AM, Mark Nelson <mark.nelson@inktank.com> wrote:
> On 10/13/2014 05:18 AM, Nicheal wrote:
>>
>> Hi,
>>
>> I'm currently finding that enable WritebackThrottle lead to lower IOPS
>> for large number of small io. Since WritebackThrottle calls
>> fdatasync(fd) to flush an object content to disk, large number of
>> ramdom small io always cause the WritebackThrottle to submit one or
>> two 4k io every time.
>> Thus, it is much slower than the global sync in
>> FileStore::sync_entry().  Note:: here, I use xfs as the FileStore
>> underlying filesystem. So I would know that if any impact when I
>> disable Writeback throttles. I cannot catch the idea on the website
>> (http://ceph.com/docs/master/dev/osd_internals/wbthrottle/).
>> Large number of inode will cause longer time to sync, but submitting a
>> batch of write to disk always faster than submitting few io update to
>> the disk.
>
>
> Hi Nichael,
>
> When the wbthrottle code was introduced back around dumpling we had to
> increase the sync intervals quite a bit to get it performing similarly to
> cuttlefish.  Have you tried playing with the various wbthrottle xfs
> tuneables to see if you can improve the behaviour?
>
> OPTION(filestore_wbthrottle_enable, OPT_BOOL, true)
> OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 41943040)
> OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 419430400)
> OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 500)
> OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 5000)
> OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 500)

In particular, these are semi-tuned for a standard spinning hard
drive. If you have an SSD as your backing store, you'll want to raise
them all considerably (something like the illustrative snippet below).
Alternatively, if you have a very large journal, you may see the
flusher slowing down shorter benchmarks, because it's trying to
keep the journal from getting too far ahead of the backing store. But
this is deliberate; it makes you pay a closer approximation to the
true cost up front instead of letting you overload the system and then
have all your writes get very slow as syncfs calls start taking tens
of seconds.
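
Something like this, for instance (illustrative values only, simply the
defaults above scaled up ~10x; tune for your own hardware):

[osd]
    filestore_wbthrottle_xfs_ios_start_flusher = 5000
    filestore_wbthrottle_xfs_ios_hard_limit = 50000
    filestore_wbthrottle_xfs_bytes_start_flusher = 419430400
    filestore_wbthrottle_xfs_bytes_hard_limit = 4194304000
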
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-13 19:50   ` Gregory Farnum
@ 2014-10-14  5:15     ` Nicheal
  2014-10-14 12:19       ` Mark Nelson
  0 siblings, 1 reply; 10+ messages in thread
From: Nicheal @ 2014-10-14  5:15 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Mark Nelson, ceph-devel

Yes, Greg.
But Unix-based systems always have a parameter, dirty_ratio, to prevent
system memory from being exhausted. If the journal is so fast that the
backing store cannot catch up with it, the backing-store writes will be
blocked by the hard limit on dirty pages. The problem then may be that
the sync() system call cannot return, since the system always has lots
of dirty pages. Consequently, 1) FileStore::sync_entry() will time out
and the ceph-osd daemon will abort; 2) even if the thread does not time
out, the journal's committed point cannot be updated, so the journal
will be blocked waiting for sync() to return and update the committed
point. So the throttle was added to solve those problems, right?
However, in my test ARM ceph cluster (3 nodes, 9 OSDs, 3 OSDs/node, SSD
as journal and HDD as data disk, fio 4k random write, iodepth 64), it
causes a problem:
    With WritebackThrottle enabled: based on blktrace, we traced the
back-end HDD IO behaviour. Because the writeback throttle calls
fdatasync() so frequently, every back-end HDD IO takes longer to
complete, which makes the total sync take longer. For example, the
default sync_max_interval is 5 seconds, and the total dirty data
generated in 5 seconds is 10M. If I disable the WritebackThrottle, that
10M of dirty data is synced to disk within 4 seconds, so in
cat /proc/meminfo the dirty data of my system is always near zero.
However, if I enable the WritebackThrottle, fdatasync() slows down the
sync process, and it seems only 8-9M of random IO is synced to disk
within 5s. The dirty data therefore keeps growing towards the critical
point (the system's upper limit), and then sync_entry() keeps timing
out. So in my case, with the WritebackThrottle disabled I consistently
get about 600 IOPS; with it enabled, IOPS drops to about 200 because
fdatasync overloads the back-end HDD.
    So I would like us to throttle the IOPS dynamically in FileStore.
We cannot know the average sync() speed of the backing store, since
different disks have different IO performance. However, we can track
the average write speed in FileStore and in the journal, and we can
also tell whether start_sync() has returned and finished. So if the
journal is currently writing so fast that the backing store cannot
catch up (say 1000 IOPS), we can throttle the journal (to say 800 IOPS)
in the next interval (the interval might be 1 to 5 seconds; by the
third second the throttle becomes 1000*e^-x, where x is the tick
interval). If journal writes reach that limit within the interval, the
subsequent writes wait in the OSD queue. In this way the journal can
still provide a burst of IO, but the back-end sync() will eventually
return and catch up with the journal, because we keep slowing the
journal down after a few seconds.
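
A rough sketch of the decay I have in mind (not Ceph code, just the
1000*e^-x idea with assumed numbers and hypothetical names):

// throttle_decay_sketch.cc -- illustrative only
#include <cmath>
#include <cstdio>

int main() {
  const double peak_journal_iops = 1000.0;  // assumed measured journal rate
  // Pretend the backing-store sync has not caught up for 5 ticks: decay the
  // allowed journal IOPS each tick as peak * e^(-x).
  for (int x = 1; x <= 5; ++x) {
    double allowed = peak_journal_iops * std::exp(-x);
    std::printf("tick %d: throttle journal to about %.0f IOPS\n", x, allowed);
  }
  return 0;
}

Writes arriving above that limit within a tick would simply wait in the
OSD queue until the next tick, as described above.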


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-14  5:15     ` Nicheal
@ 2014-10-14 12:19       ` Mark Nelson
  2014-10-14 12:42         ` Wido den Hollander
                           ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Mark Nelson @ 2014-10-14 12:19 UTC (permalink / raw)
  To: Nicheal, Gregory Farnum; +Cc: Mark Nelson, ceph-devel

On 10/14/2014 12:15 AM, Nicheal wrote:
> Yes, Greg.
> But Unix based system always have a parameter dirty_ratio to prevent
> the system memory from being exhausted. If Journal speed is so fast
> while backing store cannot catch up with Journal, then the backing
> store write will be blocked by the hard limitation of system dirty
> pages. The problem here may be that system call, sync(), cannot return
> since the system always has lots of dirty pages. Consequently, 1)
> FileStore::sync_entry() will be timeout and then ceph_osd_daemon
> abort.  2) Even if the thread is not timed out, Since the Journal
> committed point cannot be updated so that the Journal will be blocked,
> waiting for the sync() return and update Journal committed point.
> So the Throttle is added to solve the above problems, right?

Greg or Sam can correct me if I'm wrong, but I always thought of the 
wbthrottle code as being more an attempt to smooth out spikes in write 
throughput to prevent the journal from getting too far ahead of the 
backing store, i.e. have more frequent, shorter flush periods rather than 
less frequent, longer ones.  For Ceph that's probably a reasonable 
idea, since you want all of the OSDs behaving as consistently as possible 
to prevent hitting the max outstanding client IOs/bytes on the client 
and starving other ready OSDs.  I'm not sure it's worked out in practice 
as well as it might have in theory, though I'm not sure we've really 
investigated what's going on enough to be sure.

> However, in my tested ARM ceph cluster(3nodes, 9osds, 3osds/node), it
> will cause problem (SSD as journal, and HDD as data disk, fio 4k
> ramdom write iodepth 64):
>      WritebackThrottle enable: Based on blktrace, we trace the back-end
> hdd io behaviour. Because of frequently calling fdatasync() in
> Writeback Throttle, it cause every back-end hdd spent more time to
> finish one io. This causes the total sync time longer. For example,
> default sync_max_interval is 5 seconds, total dirty data in 5 seconds
> is 10M. If I disable WritebackThrottle, 10M dirty data will be sync to
> disk within 4 second, So cat /proc/meminfo, the dirty data of my
> system is always clean(near zero). However, If I enable
> WritebackThrottle, fdatasync() slows down the sync process. Thus, it
> seems 8-9M random io will be sync to the disk within 5s. Thus the
> dirty data is always growing to the critical point (system
> up-limitation), and then sync_entry() is always timed out. So I means,
> in my case, disabling WritebackThrottle, I may always have 600 IOPS.
> If enabling WritebackThrottle, IOPS always drop to 200 since fdatasync
> cause back-end HDD disk overloaded.

We never did a blktrace investigation, but we did see pretty bad 
performance with the default wbthrottle code when it was first 
implemented.  We ended up raising the throttles pretty considerably in 
dumpling RC2.  It would be interesting to repeat this test on an Intel 
system.

>     So I would like that we can dynamically throttle the IOPS in
> FileStore. We cannot know the average sync() speed of the back-end
> Store since different disk own different IO performance. However, we
> can trace the average write speed in FileStore and Journal, Also, we
> can know, whether start_sync() is return and finished. Thus, If this
> time, Journal is writing so fast that the back-end cannot catch up the
> Journal(e.g. 1000IOPS/s). We cannot Throttle the Journal speed(e.g.
> 800IOPS/s) in next operation interval(the interval maybe 1 to 5
> seconds, in the third second, Thottle become 1000*e^-x where x is the
> tick interval, ), if in this interval, Journal write reach the
> limitation, the following submitting write should waiting in OSD
> waiting queue.So in this way, Journal may provide a boosting IO, but
> finally, back-end sync() will return and catch up with Journal become
> we always slow down the Journal speed after several seconds.
>

I will wait for Sam's input, but it seems reasonable to me.  Perhaps you 
might write it up as a blueprint for CDS?

Mark


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-14 12:19       ` Mark Nelson
@ 2014-10-14 12:42         ` Wido den Hollander
  2014-10-15  3:10           ` Nicheal
  2014-10-14 13:22         ` Sage Weil
  2014-10-15  5:55         ` Nicheal
  2 siblings, 1 reply; 10+ messages in thread
From: Wido den Hollander @ 2014-10-14 12:42 UTC (permalink / raw)
  To: Mark Nelson, Nicheal, Gregory Farnum; +Cc: ceph-devel

On 10/14/2014 02:19 PM, Mark Nelson wrote:
> On 10/14/2014 12:15 AM, Nicheal wrote:
>> Yes, Greg.
>> But Unix based system always have a parameter dirty_ratio to prevent
>> the system memory from being exhausted. If Journal speed is so fast
>> while backing store cannot catch up with Journal, then the backing
>> store write will be blocked by the hard limitation of system dirty
>> pages. The problem here may be that system call, sync(), cannot return
>> since the system always has lots of dirty pages. Consequently, 1)
>> FileStore::sync_entry() will be timeout and then ceph_osd_daemon
>> abort.  2) Even if the thread is not timed out, Since the Journal
>> committed point cannot be updated so that the Journal will be blocked,
>> waiting for the sync() return and update Journal committed point.
>> So the Throttle is added to solve the above problems, right?
> 
> Greg or Sam can correct me if I'm wrong, but I always thought of the
> wbthrottle code as being more an attempt to smooth out spikes in write
> throughput to prevent the journal from getting too far ahead of the
> backing store.  IE have more frequent, shorter flush periods rather than
> less frequent longer ones.  For Ceph that is's probably a reasonable
> idea since you want all of the OSDs behaving as consistently as possible
> to prevent hitting the max outstanding client IOs/Bytes on the client
> and starving other ready OSDs.  I'm not sure it's worked out in practice
> as well as it might have in theory, though I'm not sure we've really
> investigated what's going on enough to be sure.
> 

I thought that as well. So in the case of an SSD-based OSD, where the
journal is on partition #1 and the data on #2, you would disable
wbthrottle, correct?

Since the journal is just as fast as the data partition.

>> However, in my tested ARM ceph cluster(3nodes, 9osds, 3osds/node), it
>> will cause problem (SSD as journal, and HDD as data disk, fio 4k
>> ramdom write iodepth 64):
>>      WritebackThrottle enable: Based on blktrace, we trace the back-end
>> hdd io behaviour. Because of frequently calling fdatasync() in
>> Writeback Throttle, it cause every back-end hdd spent more time to
>> finish one io. This causes the total sync time longer. For example,
>> default sync_max_interval is 5 seconds, total dirty data in 5 seconds
>> is 10M. If I disable WritebackThrottle, 10M dirty data will be sync to
>> disk within 4 second, So cat /proc/meminfo, the dirty data of my
>> system is always clean(near zero). However, If I enable
>> WritebackThrottle, fdatasync() slows down the sync process. Thus, it
>> seems 8-9M random io will be sync to the disk within 5s. Thus the
>> dirty data is always growing to the critical point (system
>> up-limitation), and then sync_entry() is always timed out. So I means,
>> in my case, disabling WritebackThrottle, I may always have 600 IOPS.
>> If enabling WritebackThrottle, IOPS always drop to 200 since fdatasync
>> cause back-end HDD disk overloaded.
> 
> We never did a blktrace investigation, but we did see pretty bad
> performance with the default wbthrottle code when it was first
> implemented.  We ended up raising the throttles pretty considerably in
> dumpling RC2.  It would be interesting to repeat this test on an Intel
> system.
> 
>>     So I would like that we can dynamically throttle the IOPS in
>> FileStore. We cannot know the average sync() speed of the back-end
>> Store since different disk own different IO performance. However, we
>> can trace the average write speed in FileStore and Journal, Also, we
>> can know, whether start_sync() is return and finished. Thus, If this
>> time, Journal is writing so fast that the back-end cannot catch up the
>> Journal(e.g. 1000IOPS/s). We cannot Throttle the Journal speed(e.g.
>> 800IOPS/s) in next operation interval(the interval maybe 1 to 5
>> seconds, in the third second, Thottle become 1000*e^-x where x is the
>> tick interval, ), if in this interval, Journal write reach the
>> limitation, the following submitting write should waiting in OSD
>> waiting queue.So in this way, Journal may provide a boosting IO, but
>> finally, back-end sync() will return and catch up with Journal become
>> we always slow down the Journal speed after several seconds.
>>
> 
> I will wait for Sam's input, but it seems reasonable to me.  Perhaps you
> might write it up as a blueprint for CDS?
> 
> Mark


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-14 12:19       ` Mark Nelson
  2014-10-14 12:42         ` Wido den Hollander
@ 2014-10-14 13:22         ` Sage Weil
  2014-10-15  2:20           ` Nicheal
  2014-10-15  5:55         ` Nicheal
  2 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2014-10-14 13:22 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Nicheal, Gregory Farnum, ceph-devel

On Tue, 14 Oct 2014, Mark Nelson wrote:
> On 10/14/2014 12:15 AM, Nicheal wrote:
> > Yes, Greg.
> > But Unix based system always have a parameter dirty_ratio to prevent
> > the system memory from being exhausted. If Journal speed is so fast
> > while backing store cannot catch up with Journal, then the backing
> > store write will be blocked by the hard limitation of system dirty
> > pages. The problem here may be that system call, sync(), cannot return
> > since the system always has lots of dirty pages. Consequently, 1)
> > FileStore::sync_entry() will be timeout and then ceph_osd_daemon
> > abort.  2) Even if the thread is not timed out, Since the Journal
> > committed point cannot be updated so that the Journal will be blocked,
> > waiting for the sync() return and update Journal committed point.
> > So the Throttle is added to solve the above problems, right?
> 
> Greg or Sam can correct me if I'm wrong, but I always thought of the
> wbthrottle code as being more an attempt to smooth out spikes in write
> throughput to prevent the journal from getting too far ahead of the backing
> store.  IE have more frequent, shorter flush periods rather than less frequent
> longer ones.  For Ceph that is's probably a reasonable idea since you want all
> of the OSDs behaving as consistently as possible to prevent hitting the max
> outstanding client IOs/Bytes on the client and starving other ready OSDs.  I'm
> not sure it's worked out in practice as well as it might have in theory,
> though I'm not sure we've really investigated what's going on enough to be
> sure.

Right.  The fdatasync strategy means that the overall throughput is lower, 
but the latencies are much more consistent.  Without the throttling we had 
huge spikes, which is even more problematic.

> > However, in my tested ARM ceph cluster(3nodes, 9osds, 3osds/node), it
> > will cause problem (SSD as journal, and HDD as data disk, fio 4k
> > ramdom write iodepth 64):
> >      WritebackThrottle enable: Based on blktrace, we trace the back-end
> > hdd io behaviour. Because of frequently calling fdatasync() in
> > Writeback Throttle, it cause every back-end hdd spent more time to
> > finish one io. This causes the total sync time longer. For example,
> > default sync_max_interval is 5 seconds, total dirty data in 5 seconds
> > is 10M. If I disable WritebackThrottle, 10M dirty data will be sync to
> > disk within 4 second, So cat /proc/meminfo, the dirty data of my
> > system is always clean(near zero). However, If I enable
> > WritebackThrottle, fdatasync() slows down the sync process. Thus, it
> > seems 8-9M random io will be sync to the disk within 5s. Thus the
> > dirty data is always growing to the critical point (system
> > up-limitation), and then sync_entry() is always timed out. So I means,
> > in my case, disabling WritebackThrottle, I may always have 600 IOPS.
> > If enabling WritebackThrottle, IOPS always drop to 200 since fdatasync
> > cause back-end HDD disk overloaded.

It is true.  One could probably disable wbthrottle and carefully tune the 
kernel dirty_ratio and dirty_bytes.  As I recall the problem though was 
that it was inode writeback that was expensive, and there were not good 
kernel knobs for limiting the dirty items in that cache. I would be very 
interested in hearing about successes in this area.
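
For example, something along these lines (illustrative values only; I
haven't verified that they make the timeouts go away):

# hypothetical /etc/sysctl.d/99-osd-writeback.conf
# start background writeback once ~64MB is dirty
vm.dirty_background_bytes = 67108864
# block writers once ~512MB is dirty (setting *_bytes overrides *_ratio)
vm.dirty_bytes = 536870912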

Another promising direction is the batched fsync experiment that Dave 
Chinner did a few months back.  I'm not sure what the status of getting 
that into mainline is, though, so it's not helpful anytime soon.

> >     So I would like that we can dynamically throttle the IOPS in
> > FileStore. We cannot know the average sync() speed of the back-end
> > Store since different disk own different IO performance. However, we
> > can trace the average write speed in FileStore and Journal, Also, we
> > can know, whether start_sync() is return and finished. Thus, If this
> > time, Journal is writing so fast that the back-end cannot catch up the
> > Journal(e.g. 1000IOPS/s). We cannot Throttle the Journal speed(e.g.
> > 800IOPS/s) in next operation interval(the interval maybe 1 to 5
> > seconds, in the third second, Thottle become 1000*e^-x where x is the
> > tick interval, ), if in this interval, Journal write reach the
> > limitation, the following submitting write should waiting in OSD
> > waiting queue.So in this way, Journal may provide a boosting IO, but
> > finally, back-end sync() will return and catch up with Journal become
> > we always slow down the Journal speed after several seconds.

Autotuning these parameters based on observed performance definitely 
sounds promising!

sage


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-14 13:22         ` Sage Weil
@ 2014-10-15  2:20           ` Nicheal
  0 siblings, 0 replies; 10+ messages in thread
From: Nicheal @ 2014-10-15  2:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, Gregory Farnum, ceph-devel

2014-10-14 21:22 GMT+08:00 Sage Weil <sage@newdream.net>:
> On Tue, 14 Oct 2014, Mark Nelson wrote:
>> On 10/14/2014 12:15 AM, Nicheal wrote:
>> > Yes, Greg.
>> > But Unix based system always have a parameter dirty_ratio to prevent
>> > the system memory from being exhausted. If Journal speed is so fast
>> > while backing store cannot catch up with Journal, then the backing
>> > store write will be blocked by the hard limitation of system dirty
>> > pages. The problem here may be that system call, sync(), cannot return
>> > since the system always has lots of dirty pages. Consequently, 1)
>> > FileStore::sync_entry() will be timeout and then ceph_osd_daemon
>> > abort.  2) Even if the thread is not timed out, Since the Journal
>> > committed point cannot be updated so that the Journal will be blocked,
>> > waiting for the sync() return and update Journal committed point.
>> > So the Throttle is added to solve the above problems, right?
>>
>> Greg or Sam can correct me if I'm wrong, but I always thought of the
>> wbthrottle code as being more an attempt to smooth out spikes in write
>> throughput to prevent the journal from getting too far ahead of the backing
>> store.  IE have more frequent, shorter flush periods rather than less frequent
>> longer ones.  For Ceph that is's probably a reasonable idea since you want all
>> of the OSDs behaving as consistently as possible to prevent hitting the max
>> outstanding client IOs/Bytes on the client and starving other ready OSDs.  I'm
>> not sure it's worked out in practice as well as it might have in theory,
>> though I'm not sure we've really investigated what's going on enough to be
>> sure.
>
> Right.  The fdatasync strategy means that the overall throughput is lower,
> but the latencies are much more consistent.  Without the throttling we had
> huge spikes, which is even more problematic.
>
>> > However, in my tested ARM ceph cluster(3nodes, 9osds, 3osds/node), it
>> > will cause problem (SSD as journal, and HDD as data disk, fio 4k
>> > ramdom write iodepth 64):
>> >      WritebackThrottle enable: Based on blktrace, we trace the back-end
>> > hdd io behaviour. Because of frequently calling fdatasync() in
>> > Writeback Throttle, it cause every back-end hdd spent more time to
>> > finish one io. This causes the total sync time longer. For example,
>> > default sync_max_interval is 5 seconds, total dirty data in 5 seconds
>> > is 10M. If I disable WritebackThrottle, 10M dirty data will be sync to
>> > disk within 4 second, So cat /proc/meminfo, the dirty data of my
>> > system is always clean(near zero). However, If I enable
>> > WritebackThrottle, fdatasync() slows down the sync process. Thus, it
>> > seems 8-9M random io will be sync to the disk within 5s. Thus the
>> > dirty data is always growing to the critical point (system
>> > up-limitation), and then sync_entry() is always timed out. So I means,
>> > in my case, disabling WritebackThrottle, I may always have 600 IOPS.
>> > If enabling WritebackThrottle, IOPS always drop to 200 since fdatasync
>> > cause back-end HDD disk overloaded.
>
> It is true.  One could probably disable wbthrottle and carefully tune the
> kernel dirty_ratio and dirty_bytes.  As I recall the problem though was
> that it was inode writeback that was expensive, and there were not good
> kernel knobs for limiting the dirty items in that cache. I would be very
> interested in hearing about successes in this area.
>
Yes, I also find that inode writeback is definitely expensive. I tried
to find the reason and improve it, but failed: it is quite complex, and
different filesystems have different implementations for managing their
inodes in the VFS layer. Furthermore, the filesystem itself maintains a
journal for inodes to accelerate inode writeback and keep it atomic. I
am still looking for useful material on how the filesystems (XFS, EXT4)
interact with the VFS and what strategies they use to maintain and
write back inodes, based on their source code. Can anyone suggest
relevant literature?

> Another promising direction is the batched fsync experiment that Dave
> Chinner did a few months back.  I'm not what the status is in
> getting that into mainline, though, so it's not helpful anytime soon.
>
>> >     So I would like that we can dynamically throttle the IOPS in
>> > FileStore. We cannot know the average sync() speed of the back-end
>> > Store since different disk own different IO performance. However, we
>> > can trace the average write speed in FileStore and Journal, Also, we
>> > can know, whether start_sync() is return and finished. Thus, If this
>> > time, Journal is writing so fast that the back-end cannot catch up the
>> > Journal(e.g. 1000IOPS/s). We cannot Throttle the Journal speed(e.g.
>> > 800IOPS/s) in next operation interval(the interval maybe 1 to 5
>> > seconds, in the third second, Thottle become 1000*e^-x where x is the
>> > tick interval, ), if in this interval, Journal write reach the
>> > limitation, the following submitting write should waiting in OSD
>> > waiting queue.So in this way, Journal may provide a boosting IO, but
>> > finally, back-end sync() will return and catch up with Journal become
>> > we always slow down the Journal speed after several seconds.
>
> Autotuning these parameters based on observed performance definitely
> sounds promising!
>
> sage


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-14 12:42         ` Wido den Hollander
@ 2014-10-15  3:10           ` Nicheal
  0 siblings, 0 replies; 10+ messages in thread
From: Nicheal @ 2014-10-15  3:10 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Mark Nelson, Gregory Farnum, ceph-devel

On 10/14/2014 02:19 PM, Mark Nelson wrote:
> On 10/14/2014 12:15 AM, Nicheal wrote:
>> Yes, Greg.
>> But Unix based system always have a parameter dirty_ratio to prevent
>> the system memory from being exhausted. If Journal speed is so fast
>> while backing store cannot catch up with Journal, then the backing
>> store write will be blocked by the hard limitation of system dirty
>> pages. The problem here may be that system call, sync(), cannot return
>> since the system always has lots of dirty pages. Consequently, 1)
>> FileStore::sync_entry() will be timeout and then ceph_osd_daemon
>> abort.  2) Even if the thread is not timed out, Since the Journal
>> committed point cannot be updated so that the Journal will be blocked,
>> waiting for the sync() return and update Journal committed point.
>> So the Throttle is added to solve the above problems, right?
>
> Greg or Sam can correct me if I'm wrong, but I always thought of the
> wbthrottle code as being more an attempt to smooth out spikes in write
> throughput to prevent the journal from getting too far ahead of the
> backing store.  IE have more frequent, shorter flush periods rather than
> less frequent longer ones.  For Ceph that is's probably a reasonable
> idea since you want all of the OSDs behaving as consistently as possible
> to prevent hitting the max outstanding client IOs/Bytes on the client
> and starving other ready OSDs.  I'm not sure it's worked out in practice
> as well as it might have in theory, though I'm not sure we've really
> investigated what's going on enough to be sure.
>

> I thought that as well. So in the case of a SSD-based OSD where the
> journal is on a partition #1 and the data on #2 you would disable
> wbthrottle, correct?
Yes, Wido. But it also depends; I don't know your environment, but I
can offer a few tips:
    First, if you do a large number of small IOs (e.g. 4k), the
bottleneck may be your CPU: my Xeon E3 1230 v2 can only support 2 SSD
OSDs/node when I test 4k writes. So disabling wbthrottle can save CPU
and improve performance.
    Second, if your CPU is not the bottleneck (say you use a powerful
server with 2x Xeon E5) and your SSD provides power-loss data
protection, you can mount your SSD with nobarrier (if you are not
familiar with filesystem write barriers, please refer to
http://xfs.org/index.php/XFS_FAQ#Write_barrier_support), so that
fdatasync() becomes quite cheap and smooths out your IOPS; see the
example mount line below.
    If you don't want to dig into the ceph source code to improve
performance, my suggestion is to try different tunings in your
environment and choose the better one.
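
For example (illustrative only; assumes XFS, an SSD with power-loss
protection, and the default OSD data path):

mount -t xfs -o rw,noatime,nobarrier /dev/sdb1 /var/lib/ceph/osd/ceph-0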

> Since the journal is just as fast as the data partition.


* Re: WriteBack Throttle kills the performance of the disk
  2014-10-14 12:19       ` Mark Nelson
  2014-10-14 12:42         ` Wido den Hollander
  2014-10-14 13:22         ` Sage Weil
@ 2014-10-15  5:55         ` Nicheal
  2 siblings, 0 replies; 10+ messages in thread
From: Nicheal @ 2014-10-15  5:55 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Gregory Farnum, ceph-devel

2014-10-14 20:19 GMT+08:00 Mark Nelson <mark.nelson@inktank.com>:
> On 10/14/2014 12:15 AM, Nicheal wrote:
>>
>> Yes, Greg.
>> But Unix based system always have a parameter dirty_ratio to prevent
>> the system memory from being exhausted. If Journal speed is so fast
>> while backing store cannot catch up with Journal, then the backing
>> store write will be blocked by the hard limitation of system dirty
>> pages. The problem here may be that system call, sync(), cannot return
>> since the system always has lots of dirty pages. Consequently, 1)
>> FileStore::sync_entry() will be timeout and then ceph_osd_daemon
>> abort.  2) Even if the thread is not timed out, Since the Journal
>> committed point cannot be updated so that the Journal will be blocked,
>> waiting for the sync() return and update Journal committed point.
>> So the Throttle is added to solve the above problems, right?
>
>
> Greg or Sam can correct me if I'm wrong, but I always thought of the
> wbthrottle code as being more an attempt to smooth out spikes in write
> throughput to prevent the journal from getting too far ahead of the backing
> store.  IE have more frequent, shorter flush periods rather than less
> frequent longer ones.  For Ceph that is's probably a reasonable idea since
> you want all of the OSDs behaving as consistently as possible to prevent
> hitting the max outstanding client IOs/Bytes on the client and starving
> other ready OSDs.  I'm not sure it's worked out in practice as well as it
> might have in theory, though I'm not sure we've really investigated what's
> going on enough to be sure.
>
>> However, in my tested ARM ceph cluster(3nodes, 9osds, 3osds/node), it
>> will cause problem (SSD as journal, and HDD as data disk, fio 4k
>> ramdom write iodepth 64):
>>      WritebackThrottle enable: Based on blktrace, we trace the back-end
>> hdd io behaviour. Because of frequently calling fdatasync() in
>> Writeback Throttle, it cause every back-end hdd spent more time to
>> finish one io. This causes the total sync time longer. For example,
>> default sync_max_interval is 5 seconds, total dirty data in 5 seconds
>> is 10M. If I disable WritebackThrottle, 10M dirty data will be sync to
>> disk within 4 second, So cat /proc/meminfo, the dirty data of my
>> system is always clean(near zero). However, If I enable
>> WritebackThrottle, fdatasync() slows down the sync process. Thus, it
>> seems 8-9M random io will be sync to the disk within 5s. Thus the
>> dirty data is always growing to the critical point (system
>> up-limitation), and then sync_entry() is always timed out. So I means,
>> in my case, disabling WritebackThrottle, I may always have 600 IOPS.
>> If enabling WritebackThrottle, IOPS always drop to 200 since fdatasync
>> cause back-end HDD disk overloaded.
>
>
> We never did a blktrace investigation, but we did see pretty bad performance
> with the default wbthrottle code when it was first implemented.  We ended up
> raising the throttles pretty considerably in dumpling RC2.  It would be
> interesting to repeat this test on an Intel system.
>
>>     So I would like that we can dynamically throttle the IOPS in
>> FileStore. We cannot know the average sync() speed of the back-end
>> Store since different disk own different IO performance. However, we
>> can trace the average write speed in FileStore and Journal, Also, we
>> can know, whether start_sync() is return and finished. Thus, If this
>> time, Journal is writing so fast that the back-end cannot catch up the
>> Journal(e.g. 1000IOPS/s). We cannot Throttle the Journal speed(e.g.
>> 800IOPS/s) in next operation interval(the interval maybe 1 to 5
>> seconds, in the third second, Thottle become 1000*e^-x where x is the
>> tick interval, ), if in this interval, Journal write reach the
>> limitation, the following submitting write should waiting in OSD
>> waiting queue.So in this way, Journal may provide a boosting IO, but
>> finally, back-end sync() will return and catch up with Journal become
>> we always slow down the Journal speed after several seconds.
>>
>
> I will wait for Sam's input, but it seems reasonable to me.  Perhaps you
> might write it up as a blueprint for CDS?
OK, Mark, I will consider it. For now it is just a basic idea; I will
think about whether we can use an autotuning throttle to replace the
WritebackThrottle.

>
> Mark


end of thread

Thread overview: 10+ messages
2014-10-13 10:18 WriteBack Throttle kills the performance of the disk Nicheal
2014-10-13 13:29 ` Mark Nelson
2014-10-13 19:50   ` Gregory Farnum
2014-10-14  5:15     ` Nicheal
2014-10-14 12:19       ` Mark Nelson
2014-10-14 12:42         ` Wido den Hollander
2014-10-15  3:10           ` Nicheal
2014-10-14 13:22         ` Sage Weil
2014-10-15  2:20           ` Nicheal
2014-10-15  5:55         ` Nicheal
