* Mysteriously poor write performance
@ 2012-03-17 11:35 Andrey Korolyov
  2012-03-18 18:22 ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: Andrey Korolyov @ 2012-03-17 11:35 UTC (permalink / raw)
  To: ceph-devel

Hi,

I've done some performance tests on the following configuration:

mon0/osd0 and mon1/osd1 run on two twelve-core R410s with 32G RAM; mon2
runs in a mostly idle dom0 with three dedicated cores and 1.5G. On each
R410 the first three disks are arranged into a raid0 that holds the osd
data, while the fourth holds the OS and the osd journal partition; all
ceph-related filesystems are mounted as ext4 without barriers.

First, I noticed a gap between benchmark performance and the write speed
seen through rbd from a small KVM instance running on one of the first two
machines: while the bench gave me about 110 MB/s, writing zeros to the raw
block device inside the VM with dd topped out at about 45 MB/s, and on the
VM's filesystem (ext4 with default options) performance drops to ~23 MB/s.
Things got worse when I started a second VM on the second host and ran the
same dd tests simultaneously - the throughput was simply split in half
between the two instances :). Enabling jumbo frames, playing with CPU
affinity for ceph and the VM instances, and trying different TCP congestion
algorithms had no effect at all - with DCTCP I get a slightly smoother
network load graph and that's all.

Can the list please suggest anything to try to improve performance?

ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2


* Re: Mysteriously poor write performance
  2012-03-17 11:35 Mysteriously poor write performance Andrey Korolyov
@ 2012-03-18 18:22 ` Sage Weil
  2012-03-19 13:46   ` Andrey Korolyov
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2012-03-18 18:22 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> Hi,
> 
> I`ve did some performance tests at the following configuration:
> 
> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> dom0 with three dedicated cores and 1.5G, mostly idle. First three
> disks on each r410 arranged into raid0 and holds osd data when fourth
> holds os and osd` journal partition, all ceph-related stuff mounted on
> the ext4 without barriers.
> 
> Firstly, I`ve noticed about a difference of benchmark performance and
> write speed through rbd from small kvm instance running on one of
> first two machines - when bench gave me about 110Mb/s, writing zeros
> to raw block device inside vm with dd was at top speed about 45 mb/s,
> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> Things get worse, when I`ve started second vm at second host and tried
> to continue same dd tests simultaneously - performance fairly divided
> by half for each instance :). Enabling jumbo frames, playing with cpu
> affinity for ceph and vm instances and trying different TCP congestion
> protocols gave no effect at all - with DCTCP I have slightly smoother
> network load graph and that`s all.
> 
> Can ml please suggest anything to try to improve performance?

Can you try setting

	rbd writeback window = 8192000

or similar, and see what kind of effect that has?  I suspect it'll speed 
up dd; I'm less sure about ext3.
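
For reference, a minimal sketch of where that could live - librbd should
pick it up from ceph.conf on the KVM host, or it can (roughly) be appended
to the qemu rbd device string; the pool/image name below is made up:

	# ceph.conf on the KVM host
	[client]
	        rbd writeback window = 8192000

	# or on the qemu command line
	-drive file=rbd:rbd/vmimage:rbd_writeback_window=8192000,if=virtio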

Thanks!
sage


> 
> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: Mysteriously poor write performance
  2012-03-18 18:22 ` Sage Weil
@ 2012-03-19 13:46   ` Andrey Korolyov
  2012-03-19 16:59     ` Greg Farnum
  0 siblings, 1 reply; 16+ messages in thread
From: Andrey Korolyov @ 2012-03-19 13:46 UTC (permalink / raw)
  To: ceph-devel

Stranger still, the write speed drops by fifteen percent when this
option is set in the VM's config (as opposed to the result in
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
1468d95101adfad44247016a1399aab6b86708d2 - in both cases the crashes
happened under heavy load.

On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil <sage@newdream.net> wrote:
> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>> Hi,
>>
>> I`ve did some performance tests at the following configuration:
>>
>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> disks on each r410 arranged into raid0 and holds osd data when fourth
>> holds os and osd` journal partition, all ceph-related stuff mounted on
>> the ext4 without barriers.
>>
>> Firstly, I`ve noticed about a difference of benchmark performance and
>> write speed through rbd from small kvm instance running on one of
>> first two machines - when bench gave me about 110Mb/s, writing zeros
>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>> Things get worse, when I`ve started second vm at second host and tried
>> to continue same dd tests simultaneously - performance fairly divided
>> by half for each instance :). Enabling jumbo frames, playing with cpu
>> affinity for ceph and vm instances and trying different TCP congestion
>> protocols gave no effect at all - with DCTCP I have slightly smoother
>> network load graph and that`s all.
>>
>> Can ml please suggest anything to try to improve performance?
>
> Can you try setting
>
>        rbd writeback window = 8192000
>
> or similar, and see what kind of effect that has?  I suspect it'll speed
> up dd; I'm less sure about ext3.
>
> Thanks!
> sage
>
>
>>
>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


* Re: Mysteriously poor write performance
  2012-03-19 13:46   ` Andrey Korolyov
@ 2012-03-19 16:59     ` Greg Farnum
  2012-03-19 18:13       ` Andrey Korolyov
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Farnum @ 2012-03-19 16:59 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM). 
If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 (~8MB).

What options are you running dd with? If you run a rados bench from both machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
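
For reference, roughly the commands I mean (pool name and osd ids adjusted
to your setup; the osd bench result shows up in the osd log / ceph -w):

	rados bench 60 write -p data
	ceph osd tell 0 bench
	ceph osd tell 1 bench
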
-Greg


On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

> More strangely, writing speed drops down by fifteen percent when this
> option was set in vm` config(instead of result from
> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> under heavy load.
> 
> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil <sage@newdream.net (mailto:sage@newdream.net)> wrote:
> > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> > > Hi,
> > > 
> > > I`ve did some performance tests at the following configuration:
> > > 
> > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
> > > disks on each r410 arranged into raid0 and holds osd data when fourth
> > > holds os and osd` journal partition, all ceph-related stuff mounted on
> > > the ext4 without barriers.
> > > 
> > > Firstly, I`ve noticed about a difference of benchmark performance and
> > > write speed through rbd from small kvm instance running on one of
> > > first two machines - when bench gave me about 110Mb/s, writing zeros
> > > to raw block device inside vm with dd was at top speed about 45 mb/s,
> > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> > > Things get worse, when I`ve started second vm at second host and tried
> > > to continue same dd tests simultaneously - performance fairly divided
> > > by half for each instance :). Enabling jumbo frames, playing with cpu
> > > affinity for ceph and vm instances and trying different TCP congestion
> > > protocols gave no effect at all - with DCTCP I have slightly smoother
> > > network load graph and that`s all.
> > > 
> > > Can ml please suggest anything to try to improve performance?
> > 
> > Can you try setting
> > 
> > rbd writeback window = 8192000
> > 
> > or similar, and see what kind of effect that has? I suspect it'll speed
> > up dd; I'm less sure about ext3.
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> More majordomo info at http://vger.kernel.org/majordomo-info.html





* Re: Mysteriously poor write performance
  2012-03-19 16:59     ` Greg Farnum
@ 2012-03-19 18:13       ` Andrey Korolyov
  2012-03-19 18:25         ` Greg Farnum
  2012-03-19 18:40         ` Josh Durgin
  0 siblings, 2 replies; 16+ messages in thread
From: Andrey Korolyov @ 2012-03-19 18:13 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel

Nope, I'm using KVM for the rbd guests. I did notice that Sage suggested a
value that is too small, and I changed it to 64M before posting the
previous message, with no success - both 8M and this value cause a
performance drop. When I tried to write an amount of data comparable to
the writeback cache size (both on the raw device and on ext3 with the
sync option), I got the following results:
dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
same without oflag there and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

and so on. A reference test with bs=1M and count=2000 gives slightly worse
results _with_ the writeback cache than without, as I mentioned before.
Here are the bench results; they are almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

Also, since I have not mentioned it before: network performance is enough
to sustain full gigabit connectivity with MTU 1500. It does not seem to be
an interrupt problem or anything like that - even with ceph-osd, the
ethernet card queues and the KVM instance pinned to different sets of
cores, nothing changes.
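
(For what it's worth, raw TCP throughput between the nodes can be
sanity-checked with something like iperf - the addresses below are just my
two osd hosts:

	iperf -s                          # on 172.20.1.33
	iperf -c 172.20.1.33 -t 30 -i 5   # on 172.20.1.32; ~940 Mbit/s is line rate

and that is about what I get here.)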

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
<gregory.farnum@dreamhost.com> wrote:
> It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM).
> If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 (~8MB).
>
> What options are you running dd with? If you run a rados bench from both machines, what do the results look like?
> Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
> -Greg
>
>
> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>
>> More strangely, writing speed drops down by fifteen percent when this
>> option was set in vm` config(instead of result from
>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>> under heavy load.
>>
>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil <sage@newdream.net (mailto:sage@newdream.net)> wrote:
>> > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>> > > Hi,
>> > >
>> > > I`ve did some performance tests at the following configuration:
>> > >
>> > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>> > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
>> > > disks on each r410 arranged into raid0 and holds osd data when fourth
>> > > holds os and osd` journal partition, all ceph-related stuff mounted on
>> > > the ext4 without barriers.
>> > >
>> > > Firstly, I`ve noticed about a difference of benchmark performance and
>> > > write speed through rbd from small kvm instance running on one of
>> > > first two machines - when bench gave me about 110Mb/s, writing zeros
>> > > to raw block device inside vm with dd was at top speed about 45 mb/s,
>> > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>> > > Things get worse, when I`ve started second vm at second host and tried
>> > > to continue same dd tests simultaneously - performance fairly divided
>> > > by half for each instance :). Enabling jumbo frames, playing with cpu
>> > > affinity for ceph and vm instances and trying different TCP congestion
>> > > protocols gave no effect at all - with DCTCP I have slightly smoother
>> > > network load graph and that`s all.
>> > >
>> > > Can ml please suggest anything to try to improve performance?
>> >
>> > Can you try setting
>> >
>> > rbd writeback window = 8192000
>> >
>> > or similar, and see what kind of effect that has? I suspect it'll speed
>> > up dd; I'm less sure about ext3.
>> >
>> > Thanks!
>> > sage
>> >
>> >
>> > >
>> > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > > the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
>> >
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>


* Re: Mysteriously poor write performance
  2012-03-19 18:13       ` Andrey Korolyov
@ 2012-03-19 18:25         ` Greg Farnum
  2012-03-19 18:40         ` Josh Durgin
  1 sibling, 0 replies; 16+ messages in thread
From: Greg Farnum @ 2012-03-19 18:25 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

On Monday, March 19, 2012 at 11:13 AM, Andrey Korolyov wrote:
> Nope, I`m using KVM for rbd guests.

Ah, okay — I'm not sure what your reference to dom0 and mon2 meant, then?
  
> Surely I`ve been noticed that Sage
> mentioned too small value and I`ve changed it to 64M before posting
> previous message with no success - both 8M and this value cause a
> performance drop. When I tried to wrote small amount of data that can
> be compared to writeback cache size(both on raw device and ext3 with
> sync option), following results were made:
> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
> same without oflag there and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>  
> and so on. Reference test with bs=1M and count=2000 has slightly worse
> results _with_ writeback cache than without, as I`ve mentioned before.
> Here the bench results, they`re almost equal on both nodes:
>  
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
Okay, this is all a little odd to me. Can you send along your ceph.conf (along with any other pool config changes you've made) and the output from a rados bench (60 seconds or so)?
-Greg
  
>  
> Also, because I`ve not mentioned it before, network performance is
> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
> is not interrupt problem or something like it - even if ceph-osd,
> ethernet card queues and kvm instance pinned to different sets of
> cores, nothing changes.
>  
> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
> <gregory.farnum@dreamhost.com (mailto:gregory.farnum@dreamhost.com)> wrote:
> > It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM).
> > If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 (~8MB).
> >  
> > What options are you running dd with? If you run a rados bench from both machines, what do the results look like?
> > Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
> > -Greg
> >  
> >  
> > On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
> >  
> > > More strangely, writing speed drops down by fifteen percent when this
> > > option was set in vm` config(instead of result from
> > > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
> > > As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
> > > recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
> > > 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
> > > under heavy load.
> > >  
> > > On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil <sage@newdream.net (mailto:sage@newdream.net)> wrote:
> > > > On Sat, 17 Mar 2012, Andrey Korolyov wrote:
> > > > > Hi,
> > > > >  
> > > > > I`ve did some performance tests at the following configuration:
> > > > >  
> > > > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
> > > > > dom0 with three dedicated cores and 1.5G, mostly idle. First three
> > > > > disks on each r410 arranged into raid0 and holds osd data when fourth
> > > > > holds os and osd` journal partition, all ceph-related stuff mounted on
> > > > > the ext4 without barriers.
> > > > >  
> > > > > Firstly, I`ve noticed about a difference of benchmark performance and
> > > > > write speed through rbd from small kvm instance running on one of
> > > > > first two machines - when bench gave me about 110Mb/s, writing zeros
> > > > > to raw block device inside vm with dd was at top speed about 45 mb/s,
> > > > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
> > > > > Things get worse, when I`ve started second vm at second host and tried
> > > > > to continue same dd tests simultaneously - performance fairly divided
> > > > > by half for each instance :). Enabling jumbo frames, playing with cpu
> > > > > affinity for ceph and vm instances and trying different TCP congestion
> > > > > protocols gave no effect at all - with DCTCP I have slightly smoother
> > > > > network load graph and that`s all.
> > > > >  
> > > > > Can ml please suggest anything to try to improve performance?
> > > >  
> > > > Can you try setting
> > > >  
> > > > rbd writeback window = 8192000
> > > >  
> > > > or similar, and see what kind of effect that has? I suspect it'll speed
> > > > up dd; I'm less sure about ext3.
> > > >  
> > > > Thanks!
> > > > sage
> > > >  
> > > >  
> > > > >  
> > > > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > > > the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > >  
> > >  
> > >  
> > >  
> > >  
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >  
>  
>  
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> More majordomo info at http://vger.kernel.org/majordomo-info.html





* Re: Mysteriously poor write performance
  2012-03-19 18:13       ` Andrey Korolyov
  2012-03-19 18:25         ` Greg Farnum
@ 2012-03-19 18:40         ` Josh Durgin
  2012-03-19 19:30           ` Andrey Korolyov
  2012-03-20 20:37           ` Andrey Korolyov
  1 sibling, 2 replies; 16+ messages in thread
From: Josh Durgin @ 2012-03-19 18:40 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: Greg Farnum, ceph-devel

On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
> mentioned too small value and I`ve changed it to 64M before posting
> previous message with no success - both 8M and this value cause a
> performance drop. When I tried to wrote small amount of data that can
> be compared to writeback cache size(both on raw device and ext3 with
> sync option), following results were made:

I just want to clarify that the writeback window isn't a full writeback 
cache - it doesn't affect reads, and does not help with request merging 
etc. It simply allows a bunch of writes to be in flight while acking the 
write to the guest immediately. We're working on a full-fledged 
writeback cache to replace the writeback window.

> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
> same without oflag there and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>
> and so on. Reference test with bs=1M and count=2000 has slightly worse
> results _with_ writeback cache than without, as I`ve mentioned before.
>   Here the bench results, they`re almost equal on both nodes:
>
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

One thing to check is the size of the writes that are actually being 
sent by rbd. The guest is probably splitting them into relatively small 
(128 or 256k) writes. Ideally it would be sending 4k writes, and this 
should be a lot faster.

You can see the writes being sent by adding debug_ms=1 to the client or 
osd. The format is osd_op(.*[write OFFSET~LENGTH]).
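
For example (the log path is just a placeholder for wherever your osd logs
end up):

	# ceph.conf, osd side
	[osd]
	        debug ms = 1

	# then pull the write sizes out of the log
	grep -oE 'osd_op\(.*\[write [0-9]+~[0-9]+\]' /path/to/osd.0.log | tail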

> Also, because I`ve not mentioned it before, network performance is
> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
> is not interrupt problem or something like it - even if ceph-osd,
> ethernet card queues and kvm instance pinned to different sets of
> cores, nothing changes.
>
> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
> <gregory.farnum@dreamhost.com>  wrote:
>> It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM).
>> If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 (~8MB).
>>
>> What options are you running dd with? If you run a rados bench from both machines, what do the results look like?
>> Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>> -Greg
>>
>>
>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>
>>> More strangely, writing speed drops down by fifteen percent when this
>>> option was set in vm` config(instead of result from
>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>> under heavy load.
>>>
>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net (mailto:sage@newdream.net)>  wrote:
>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>> Hi,
>>>>>
>>>>> I`ve did some performance tests at the following configuration:
>>>>>
>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>> the ext4 without barriers.
>>>>>
>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>> write speed through rbd from small kvm instance running on one of
>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>> network load graph and that`s all.
>>>>>
>>>>> Can ml please suggest anything to try to improve performance?
>>>>
>>>> Can you try setting
>>>>
>>>> rbd writeback window = 8192000
>>>>
>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>> up dd; I'm less sure about ext3.
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>>
>>>>>
>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: Mysteriously poor write performance
  2012-03-19 18:40         ` Josh Durgin
@ 2012-03-19 19:30           ` Andrey Korolyov
  2012-03-20 20:37           ` Andrey Korolyov
  1 sibling, 0 replies; 16+ messages in thread
From: Andrey Korolyov @ 2012-03-19 19:30 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Greg Farnum, ceph-devel

Thanks to Greg, I have noticed a very strange thing - the data pool is
filled with a bunch of objects like rb.0.0.0000000004db, with a typical
size of 4194304, while the original pool for the guest OS has a size of
only 112 (created as 40G). It seems something went wrong, because on 0.42
I had more impressive performance on cheaper hardware. At first I blamed
the recent crash and recreated the cluster from scratch about an hour ago,
but those objects were created in a bare data/ pool with only one VM.




On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>
>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>> mentioned too small value and I`ve changed it to 64M before posting
>> previous message with no success - both 8M and this value cause a
>> performance drop. When I tried to wrote small amount of data that can
>> be compared to writeback cache size(both on raw device and ext3 with
>> sync option), following results were made:
>
>
> I just want to clarify that the writeback window isn't a full writeback
> cache - it doesn't affect reads, and does not help with request merging etc.
> It simply allows a bunch of writes to be in flight while acking the write to
> the guest immediately. We're working on a full-fledged writeback cache that
> to replace the writeback window.
>
>
>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>> same without oflag there and in the following samples)
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>> 20+0 records in
>> 20+0 records out
>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>> 30+0 records in
>> 30+0 records out
>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>
>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>> results _with_ writeback cache than without, as I`ve mentioned before.
>>  Here the bench results, they`re almost equal on both nodes:
>>
>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>
>
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
>
>
>> Also, because I`ve not mentioned it before, network performance is
>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>> is not interrupt problem or something like it - even if ceph-osd,
>> ethernet card queues and kvm instance pinned to different sets of
>> cores, nothing changes.
>>
>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>> <gregory.farnum@dreamhost.com>  wrote:
>>>
>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>> only works for userspace rbd implementations (eg, KVM).
>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>> 8192000 (~8MB).
>>>
>>> What options are you running dd with? If you run a rados bench from both
>>> machines, what do the results look like?
>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>> -Greg
>>>
>>>
>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>
>>>> More strangely, writing speed drops down by fifteen percent when this
>>>> option was set in vm` config(instead of result from
>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>> under heavy load.
>>>>
>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>
>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>
>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>> the ext4 without barriers.
>>>>>>
>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>> network load graph and that`s all.
>>>>>>
>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>
>>>>>
>>>>> Can you try setting
>>>>>
>>>>> rbd writeback window = 8192000
>>>>>
>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>> up dd; I'm less sure about ext3.
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> (mailto:majordomo@vger.kernel.org)
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


* Re: Mysteriously poor write performance
  2012-03-19 18:40         ` Josh Durgin
  2012-03-19 19:30           ` Andrey Korolyov
@ 2012-03-20 20:37           ` Andrey Korolyov
  2012-03-20 22:36             ` Samuel Just
  1 sibling, 1 reply; 16+ messages in thread
From: Andrey Korolyov @ 2012-03-20 20:37 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Greg Farnum, ceph-devel

rados bench 60 write -p data
<skip>
Total time run:        61.217676
Total writes made:     989
Write size:            4194304
Bandwidth (MB/sec):    64.622

Average Latency:       0.989608
Max latency:           2.21701
Min latency:           0.255315

Here is a snippet from the osd log; the write size seems okay.

2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
1220608~4096] 0.17eb9fd8) v4)
2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
(0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
active+clean]    q front is repgather(0x31b5360 applying 10'83
rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)

Sorry for my previous question about rbd chunks, it was really stupid :)
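
(For the record, what I had missed: an rbd image is striped across many
4 MB rados objects, so the rb.0.* objects in the data pool are expected -
something like the following shows it; the image name is hypothetical.)

	rbd info vmimage -p data                # "order 22" means 4 MB objects
	rados ls -p data | grep -c '^rb\.0\.'   # roughly written data / 4 MB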

On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>
>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>> mentioned too small value and I`ve changed it to 64M before posting
>> previous message with no success - both 8M and this value cause a
>> performance drop. When I tried to wrote small amount of data that can
>> be compared to writeback cache size(both on raw device and ext3 with
>> sync option), following results were made:
>
>
> I just want to clarify that the writeback window isn't a full writeback
> cache - it doesn't affect reads, and does not help with request merging etc.
> It simply allows a bunch of writes to be in flight while acking the write to
> the guest immediately. We're working on a full-fledged writeback cache that
> to replace the writeback window.
>
>
>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>> same without oflag there and in the following samples)
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>> 20+0 records in
>> 20+0 records out
>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>> 30+0 records in
>> 30+0 records out
>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>
>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>> results _with_ writeback cache than without, as I`ve mentioned before.
>>  Here the bench results, they`re almost equal on both nodes:
>>
>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>
>
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4k writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
>
>
>> Also, because I`ve not mentioned it before, network performance is
>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>> is not interrupt problem or something like it - even if ceph-osd,
>> ethernet card queues and kvm instance pinned to different sets of
>> cores, nothing changes.
>>
>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>> <gregory.farnum@dreamhost.com>  wrote:
>>>
>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>> only works for userspace rbd implementations (eg, KVM).
>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>> 8192000 (~8MB).
>>>
>>> What options are you running dd with? If you run a rados bench from both
>>> machines, what do the results look like?
>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>> -Greg
>>>
>>>
>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>
>>>> More strangely, writing speed drops down by fifteen percent when this
>>>> option was set in vm` config(instead of result from
>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>> under heavy load.
>>>>
>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>
>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>
>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>> the ext4 without barriers.
>>>>>>
>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>> network load graph and that`s all.
>>>>>>
>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>
>>>>>
>>>>> Can you try setting
>>>>>
>>>>> rbd writeback window = 8192000
>>>>>
>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>> up dd; I'm less sure about ext3.
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> (mailto:majordomo@vger.kernel.org)
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


* Re: Mysteriously poor write performance
  2012-03-20 20:37           ` Andrey Korolyov
@ 2012-03-20 22:36             ` Samuel Just
       [not found]               ` <CABYiri9An0sYP6pP1xU_Xjz7yhXdv1eF-4q-DqtqygYH76rMHw@mail.gmail.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Samuel Just @ 2012-03-20 22:36 UTC (permalink / raw)
  To: ceph-devel

Can you set osd and filestore debugging to 20, restart the osds, run
rados bench as before, and post the logs?
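
Something like this in ceph.conf on both osd hosts, then restart the osds
and rerun the bench as before:

	[osd]
	        debug osd = 20
	        debug filestore = 20

	rados bench 60 write -p data
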
-Sam Just

On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> rados bench 60 write -p data
> <skip>
> Total time run:        61.217676
> Total writes made:     989
> Write size:            4194304
> Bandwidth (MB/sec):    64.622
>
> Average Latency:       0.989608
> Max latency:           2.21701
> Min latency:           0.255315
>
> Here a snip from osd log, seems write size is okay.
>
> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
> 1220608~4096] 0.17eb9fd8) v4)
> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
> active+clean]    q front is repgather(0x31b5360 applying 10'83
> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>
> Sorry for my previous question about rbd chunks, it was really stupid :)
>
> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>
>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>> mentioned too small value and I`ve changed it to 64M before posting
>>> previous message with no success - both 8M and this value cause a
>>> performance drop. When I tried to wrote small amount of data that can
>>> be compared to writeback cache size(both on raw device and ext3 with
>>> sync option), following results were made:
>>
>>
>> I just want to clarify that the writeback window isn't a full writeback
>> cache - it doesn't affect reads, and does not help with request merging etc.
>> It simply allows a bunch of writes to be in flight while acking the write to
>> the guest immediately. We're working on a full-fledged writeback cache that
>> to replace the writeback window.
>>
>>
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>> same without oflag there and in the following samples)
>>> 10+0 records in
>>> 10+0 records out
>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>> 20+0 records in
>>> 20+0 records out
>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>> 30+0 records in
>>> 30+0 records out
>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>
>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>  Here the bench results, they`re almost equal on both nodes:
>>>
>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>
>>
>> One thing to check is the size of the writes that are actually being sent by
>> rbd. The guest is probably splitting them into relatively small (128 or
>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>> lot faster.
>>
>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>
>>
>>> Also, because I`ve not mentioned it before, network performance is
>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>> is not interrupt problem or something like it - even if ceph-osd,
>>> ethernet card queues and kvm instance pinned to different sets of
>>> cores, nothing changes.
>>>
>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>> <gregory.farnum@dreamhost.com>  wrote:
>>>>
>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>> only works for userspace rbd implementations (eg, KVM).
>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>> 8192000 (~8MB).
>>>>
>>>> What options are you running dd with? If you run a rados bench from both
>>>> machines, what do the results look like?
>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>> -Greg
>>>>
>>>>
>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>
>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>> option was set in vm` config(instead of result from
>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>> under heavy load.
>>>>>
>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>>
>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>>
>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>>> the ext4 without barriers.
>>>>>>>
>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>> network load graph and that`s all.
>>>>>>>
>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>
>>>>>>
>>>>>> Can you try setting
>>>>>>
>>>>>> rbd writeback window = 8192000
>>>>>>
>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>> up dd; I'm less sure about ext3.
>>>>>>
>>>>>> Thanks!
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> (mailto:majordomo@vger.kernel.org)
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Mysteriously poor write performance
       [not found]                             ` <CABYiri8TfJXC7j3L5QXXFO2nmtwiyoP=YSGCZse0VrsY+_zbLw@mail.gmail.com>
@ 2012-03-21 21:20                               ` Samuel Just
       [not found]                                 ` <CABYiri8qHCv6=dFUc-8tFo9bxEtUTQhV1cE9K=CK8hbhW3u10A@mail.gmail.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Samuel Just @ 2012-03-21 21:20 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

(CCing the list)

So, the problem isn't the bandwidth.  Before we respond to the client,
we write the operation to the journal.  In this case, that journal write
is taking >1s per operation on osd.1.  Both rbd and rados bench will
only allow a limited number of ops in flight at a time, so this
latency is killing your throughput.  For comparison, the latency for
writing to the journal on osd.0 is < .3s.  Can you measure direct io
latency for writes to your osd.1 journal file?
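
Something along these lines would do - direct, synchronous 4k writes to a
scratch file on the filesystem holding the osd.1 journal (the file name is
made up; elapsed time / 1000 gives a rough per-write latency):

	dd if=/dev/zero of=/ceph/meta/latency-test bs=4k count=1000 oflag=direct,dsync
	rm /ceph/meta/latency-test
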
-Sam

On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> Oh, you may be confused by the Zabbix metrics - the y-axis is in
> megabytes/s, not megabits.
>
> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> [global]
>>       log dir = /ceph/out
>>       log_file = ""
>>       logger dir = /ceph/log
>>       pid file = /ceph/out/$type$id.pid
>> [mds]
>>       pid file = /ceph/out/$name.pid
>>       lockdep = 1
>>       mds log max segments = 2
>> [osd]
>>       lockdep = 1
>>       filestore_xattr_use_omap = 1
>>       osd data = /ceph/dev/osd$id
>>       osd journal = /ceph/meta/journal
>>       osd journal size = 100
>> [mon]
>>       lockdep = 1
>>       mon data = /ceph/dev/mon$id
>> [mon.0]
>>       host = 172.20.1.32
>>       mon addr = 172.20.1.32:6789
>> [mon.1]
>>       host = 172.20.1.33
>>       mon addr = 172.20.1.33:6789
>> [mon.2]
>>       host = 172.20.1.35
>>       mon addr = 172.20.1.35:6789
>> [osd.0]
>>       host = 172.20.1.32
>> [osd.1]
>>       host = 172.20.1.33
>> [mds.a]
>>       host = 172.20.1.32
>>
>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>> Simple performance tests on those filesystems show ~133 MB/s for /ceph
>> and the metadata mount. Also, neither machine holds anything else that
>> might impact the osds.
>>
>> Also, please note the following:
>>
>> http://i.imgur.com/ZgFdO.png
>>
>> The first two peaks correspond to running rados bench, then come the
>> cluster recreation and an automated Debian install, and the final peaks
>> are the dd test. I can certainly produce more precise graphs, but the
>> current one is probably enough to show the situation - rbd is using
>> about a quarter of the possible bandwidth (if we count rados bench as 100%).
>>
>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>> with the osd.1 journal disk?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>> Oh, sorry - they probably inherited permissions from the log files; fixed.
>>>>
>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>> -Sam
>>>>>
>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>
>>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>>>>>> - at osd1.
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>>>>>> post osd.1's logs?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>
>>>>>>>> Sometimes 'cur MB/s' shows zero during rados bench, even with all debug
>>>>>>>> output disabled and log_file set to an empty value - I hope that's okay.
>>>>>>>>
>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run
>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>> -Sam Just
>>>>>>>>>
>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>> <skip>
>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>> Total writes made:     989
>>>>>>>>>> Write size:            4194304
>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>
>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>
>>>>>>>>>> Here a snip from osd log, seems write size is okay.
>>>>>>>>>>
>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>
>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :)
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting
>>>>>>>>>>>> previous message with no success - both 8M and this value cause a
>>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can
>>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with
>>>>>>>>>>>> sync option), following results were made:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback
>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc.
>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to
>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that
>>>>>>>>>>> to replace the writeback window.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>>>>>>>>>>> same without oflag there and in the following samples)
>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>
>>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>>>>>>>>>>  Here the bench results, they`re almost equal on both nodes:
>>>>>>>>>>>>
>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by
>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or
>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>>>>>>>>>>> lot faster.
>>>>>>>>>>>
>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is
>>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd,
>>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of
>>>>>>>>>>>> cores, nothing changes.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>> <gregory.farnum@dreamhost.com>  wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>
>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both
>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>>>>>>>>>>> option was set in vm` config(instead of result from
>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>>>>>>>>>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>>>>>>>>>>>> the ext4 without barriers.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>>>>>>>>>>> network load graph and that`s all.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Mysteriously poor write performance
       [not found]                                 ` <CABYiri8qHCv6=dFUc-8tFo9bxEtUTQhV1cE9K=CK8hbhW3u10A@mail.gmail.com>
@ 2012-03-22 17:26                                   ` Samuel Just
  2012-03-22 18:40                                     ` Andrey Korolyov
  0 siblings, 1 reply; 16+ messages in thread
From: Samuel Just @ 2012-03-22 17:26 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

Our journal writes are actually sequential.  Could you send FIO
results for sequential 4k writes to osd.0's journal and osd.1's journal?
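
For example, something along these lines against each journal file
(just a sketch -- the job name and size are made up, and the path is
the one from your ceph.conf; note that writing directly to the journal
is destructive, so stop the osd first and recreate the journal
afterwards, or point it at a scratch file on the same filesystem):

fio --name=seq4k-journal --filename=/ceph/meta/journal \
    --rw=write --bs=4k --size=160m \
    --ioengine=sync --iodepth=2 --direct=1
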
-Sam

On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
> FIO output for journal partition, directio enabled, seems good(same
> results for ext4 on other single sata disks).
>
> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued r/w: total=0/40960, short=0/0
>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>     lat (msec): 500=0.04%
>
>
>
> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>> (CCing the list)
>>
>> So, the problem isn't the bandwidth.  Before we respond to the client,
>> we write the operation to the journal.  In this case, that operation
>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>> only allow a limited number of ops in flight at a time, so this
>> latency is killing your throughput.  For comparison, the latency for
>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>> latency for writes to your osd.1 journal file?
>> -Sam
>>
>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>> not Megabits.
>>>
>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>> [global]
>>>>       log dir = /ceph/out
>>>>       log_file = ""
>>>>       logger dir = /ceph/log
>>>>       pid file = /ceph/out/$type$id.pid
>>>> [mds]
>>>>       pid file = /ceph/out/$name.pid
>>>>       lockdep = 1
>>>>       mds log max segments = 2
>>>> [osd]
>>>>       lockdep = 1
>>>>       filestore_xattr_use_omap = 1
>>>>       osd data = /ceph/dev/osd$id
>>>>       osd journal = /ceph/meta/journal
>>>>       osd journal size = 100
>>>> [mon]
>>>>       lockdep = 1
>>>>       mon data = /ceph/dev/mon$id
>>>> [mon.0]
>>>>       host = 172.20.1.32
>>>>       mon addr = 172.20.1.32:6789
>>>> [mon.1]
>>>>       host = 172.20.1.33
>>>>       mon addr = 172.20.1.33:6789
>>>> [mon.2]
>>>>       host = 172.20.1.35
>>>>       mon addr = 172.20.1.35:6789
>>>> [osd.0]
>>>>       host = 172.20.1.32
>>>> [osd.1]
>>>>       host = 172.20.1.33
>>>> [mds.a]
>>>>       host = 172.20.1.32
>>>>
>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>>>> Simple performance tests on those fs shows ~133Mb/s for /ceph and
>>>> metadata/. Also both machines do not hold anything else which may
>>>> impact osd.
>>>>
>>>> Also please note of following:
>>>>
>>>> http://i.imgur.com/ZgFdO.png
>>>>
>>>> First two peaks are related to running rados bench, then goes cluster
>>>> recreation, automated debian install and final peaks are dd test.
>>>> Surely I can have more precise graphs, but current one probably enough
>>>> to state a situation - rbd utilizing about a quarter of possible
>>>> bandwidth(if we can count rados bench as 100%).
>>>>
>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>>>> with the osd.1 journal disk?
>>>>> -Sam
>>>>>
>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>> Oh, sorry - they probably inherited rights from log files, fixed.
>>>>>>
>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>>>
>>>>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>>>>>>>> - at osd1.
>>>>>>>>
>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>>>>>>>> post osd.1's logs?
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>>>
>>>>>>>>>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug
>>>>>>>>>> output disabled and log_file set to the empty value, hope it`s okay.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run
>>>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>>>> -Sam Just
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>>>> <skip>
>>>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>>>> Total writes made:     989
>>>>>>>>>>>> Write size:            4194304
>>>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>>>
>>>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>>>
>>>>>>>>>>>> Here a snip from osd log, seems write size is okay.
>>>>>>>>>>>>
>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :)
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>>>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting
>>>>>>>>>>>>>> previous message with no success - both 8M and this value cause a
>>>>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can
>>>>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with
>>>>>>>>>>>>>> sync option), following results were made:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback
>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc.
>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to
>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that
>>>>>>>>>>>>> to replace the writeback window.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>>>>>>>>>>>>> same without oflag there and in the following samples)
>>>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>>>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>>>>>>>>>>>>  Here the bench results, they`re almost equal on both nodes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by
>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or
>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>>>>>>>>>>>>> lot faster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is
>>>>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>>>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd,
>>>>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of
>>>>>>>>>>>>>> cores, nothing changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>>>> <gregory.farnum@dreamhost.com>  wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both
>>>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>>>>>>>>>>>>> option was set in vm` config(instead of result from
>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>>>>>>>>>>>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>>>>>>>>>>>>>> the ext4 without barriers.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>>>>>>>>>>>>> network load graph and that`s all.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Mysteriously poor write performance
  2012-03-22 17:26                                   ` Samuel Just
@ 2012-03-22 18:40                                     ` Andrey Korolyov
       [not found]                                       ` <CABYiri9SYaTFgb7GMPi_VPT1vDWV+O=Q_P-xibsBb-xjRU1E=g@mail.gmail.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Andrey Korolyov @ 2012-03-22 18:40 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-devel

random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
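
For reference, both runs used essentially the same fio job, only rw
differs (a sketch reconstructed from the output above; the target file
name here is an assumption):

[journal-seq]
filename=/ceph/meta/journal
rw=write        # randwrite for the earlier run
bs=4k
size=160m
ioengine=sync
iodepth=2
direct=1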


On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just <sam.just@dreamhost.com> wrote:
> Our journal writes are actually sequential.  Could you send FIO
> results for sequential 4k writes osd.0's journal and osd.1's journal?
> -Sam
>
> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> FIO output for journal partition, directio enabled, seems good(same
>> results for ext4 on other single sata disks).
>>
>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>     lat (msec): 500=0.04%
>>
>>
>>
>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>> (CCing the list)
>>>
>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>> we write the operation to the journal.  In this case, that operation
>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>> only allow a limited number of ops in flight at a time, so this
>>> latency is killing your throughput.  For comparison, the latency for
>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>> latency for writes to your osd.1 journal file?
>>> -Sam
>>>
>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>>> not Megabits.
>>>>
>>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>> [global]
>>>>>       log dir = /ceph/out
>>>>>       log_file = ""
>>>>>       logger dir = /ceph/log
>>>>>       pid file = /ceph/out/$type$id.pid
>>>>> [mds]
>>>>>       pid file = /ceph/out/$name.pid
>>>>>       lockdep = 1
>>>>>       mds log max segments = 2
>>>>> [osd]
>>>>>       lockdep = 1
>>>>>       filestore_xattr_use_omap = 1
>>>>>       osd data = /ceph/dev/osd$id
>>>>>       osd journal = /ceph/meta/journal
>>>>>       osd journal size = 100
>>>>> [mon]
>>>>>       lockdep = 1
>>>>>       mon data = /ceph/dev/mon$id
>>>>> [mon.0]
>>>>>       host = 172.20.1.32
>>>>>       mon addr = 172.20.1.32:6789
>>>>> [mon.1]
>>>>>       host = 172.20.1.33
>>>>>       mon addr = 172.20.1.33:6789
>>>>> [mon.2]
>>>>>       host = 172.20.1.35
>>>>>       mon addr = 172.20.1.35:6789
>>>>> [osd.0]
>>>>>       host = 172.20.1.32
>>>>> [osd.1]
>>>>>       host = 172.20.1.33
>>>>> [mds.a]
>>>>>       host = 172.20.1.32
>>>>>
>>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>>>>> Simple performance tests on those fs shows ~133Mb/s for /ceph and
>>>>> metadata/. Also both machines do not hold anything else which may
>>>>> impact osd.
>>>>>
>>>>> Also please note of following:
>>>>>
>>>>> http://i.imgur.com/ZgFdO.png
>>>>>
>>>>> First two peaks are related to running rados bench, then goes cluster
>>>>> recreation, automated debian install and final peaks are dd test.
>>>>> Surely I can have more precise graphs, but current one probably enough
>>>>> to state a situation - rbd utilizing about a quarter of possible
>>>>> bandwidth(if we can count rados bench as 100%).
>>>>>
>>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>>>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>>>>> with the osd.1 journal disk?
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>> Oh, sorry - they probably inherited rights from log files, fixed.
>>>>>>>
>>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>>>>
>>>>>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>>>>>>>>> - at osd1.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>>>>>>>>> post osd.1's logs?
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>>>>
>>>>>>>>>>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug
>>>>>>>>>>> output disabled and log_file set to the empty value, hope it`s okay.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run
>>>>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>>>>> -Sam Just
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>>>>> <skip>
>>>>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>>>>> Total writes made:     989
>>>>>>>>>>>>> Write size:            4194304
>>>>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>>>>
>>>>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here a snip from osd log, seems write size is okay.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
>>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
>>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
>>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>>>>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting
>>>>>>>>>>>>>>> previous message with no success - both 8M and this value cause a
>>>>>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can
>>>>>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with
>>>>>>>>>>>>>>> sync option), following results were made:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback
>>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc.
>>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to
>>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that
>>>>>>>>>>>>>> to replace the writeback window.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>>>>>>>>>>>>>> same without oflag there and in the following samples)
>>>>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>>>>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>>>>>>>>>>>>>  Here the bench results, they`re almost equal on both nodes:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by
>>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or
>>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>>>>>>>>>>>>>> lot faster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is
>>>>>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>>>>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd,
>>>>>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of
>>>>>>>>>>>>>>> cores, nothing changes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>>>>> <gregory.farnum@dreamhost.com>  wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both
>>>>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>>>>>>>>>>>>>> option was set in vm` config(instead of result from
>>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>>>>>>>>>>>>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>>>>>>>>>>>>>>> the ext4 without barriers.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>>>>>>>>>>>>>> network load graph and that`s all.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Mysteriously poor write performance
       [not found]                                       ` <CABYiri9SYaTFgb7GMPi_VPT1vDWV+O=Q_P-xibsBb-xjRU1E=g@mail.gmail.com>
@ 2012-03-23 17:53                                         ` Samuel Just
  2012-03-24 19:09                                           ` Andrey Korolyov
  0 siblings, 1 reply; 16+ messages in thread
From: Samuel Just @ 2012-03-23 17:53 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

(CCing the list)

Actually, could you re-do the rados bench run with 'debug journal
= 20' along with the other debugging?  That should give us better
information.
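
Something like this in the [osd] section should cover it (a sketch
using the same debug levels we've been using so far):

[osd]
        debug osd = 20
        debug filestore = 20
        debug journal = 20
        debug ms = 1

Then restart the osds and re-run the same bench, e.g.:

rados bench 60 write -p data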

-Sam

On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
> Hi Sam,
>
> Can you please suggest where to start profiling the osd? If the
> bottleneck were related to something as simple as direct I/O speed,
> I`m sure I would have caught it long ago, even just by cross-checking
> the results of other benchmarks on the host system. I`ve just tried
> tmpfs under both journals; it gives a small boost, as expected,
> because of the near-zero I/O delay. Maybe the chunk distribution
> mechanism does not work well on such a small number of nodes, but right
> now I don`t have enough hardware nodes to prove or disprove that.
>
> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>>
>>
>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>> Our journal writes are actually sequential.  Could you send FIO
>>> results for sequential 4k writes osd.0's journal and osd.1's journal?
>>> -Sam
>>>
>>> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>> FIO output for journal partition, directio enabled, seems good(same
>>>> results for ext4 on other single sata disks).
>>>>
>>>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>>>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>>>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>     issued r/w: total=0/40960, short=0/0
>>>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>>>     lat (msec): 500=0.04%
>>>>
>>>>
>>>>
>>>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>> (CCing the list)
>>>>>
>>>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>>>> we write the operation to the journal.  In this case, that operation
>>>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>>>> only allow a limited number of ops in flight at a time, so this
>>>>> latency is killing your throughput.  For comparison, the latency for
>>>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>>>> latency for writes to your osd.1 journal file?
>>>>> -Sam
>>>>>
>>>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>>>>> not Megabits.
>>>>>>
>>>>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>> [global]
>>>>>>>       log dir = /ceph/out
>>>>>>>       log_file = ""
>>>>>>>       logger dir = /ceph/log
>>>>>>>       pid file = /ceph/out/$type$id.pid
>>>>>>> [mds]
>>>>>>>       pid file = /ceph/out/$name.pid
>>>>>>>       lockdep = 1
>>>>>>>       mds log max segments = 2
>>>>>>> [osd]
>>>>>>>       lockdep = 1
>>>>>>>       filestore_xattr_use_omap = 1
>>>>>>>       osd data = /ceph/dev/osd$id
>>>>>>>       osd journal = /ceph/meta/journal
>>>>>>>       osd journal size = 100
>>>>>>> [mon]
>>>>>>>       lockdep = 1
>>>>>>>       mon data = /ceph/dev/mon$id
>>>>>>> [mon.0]
>>>>>>>       host = 172.20.1.32
>>>>>>>       mon addr = 172.20.1.32:6789
>>>>>>> [mon.1]
>>>>>>>       host = 172.20.1.33
>>>>>>>       mon addr = 172.20.1.33:6789
>>>>>>> [mon.2]
>>>>>>>       host = 172.20.1.35
>>>>>>>       mon addr = 172.20.1.35:6789
>>>>>>> [osd.0]
>>>>>>>       host = 172.20.1.32
>>>>>>> [osd.1]
>>>>>>>       host = 172.20.1.33
>>>>>>> [mds.a]
>>>>>>>       host = 172.20.1.32
>>>>>>>
>>>>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>>>>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>>>>>>> Simple performance tests on those fs shows ~133Mb/s for /ceph and
>>>>>>> metadata/. Also both machines do not hold anything else which may
>>>>>>> impact osd.
>>>>>>>
>>>>>>> Also please note of following:
>>>>>>>
>>>>>>> http://i.imgur.com/ZgFdO.png
>>>>>>>
>>>>>>> First two peaks are related to running rados bench, then goes cluster
>>>>>>> recreation, automated debian install and final peaks are dd test.
>>>>>>> Surely I can have more precise graphs, but current one probably enough
>>>>>>> to state a situation - rbd utilizing about a quarter of possible
>>>>>>> bandwidth(if we can count rados bench as 100%).
>>>>>>>
>>>>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>>>>>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>>>>>>> with the osd.1 journal disk?
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>> Oh, sorry - they probably inherited rights from log files, fixed.
>>>>>>>>>
>>>>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>>>>>>
>>>>>>>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>>>>>>>>>>> - at osd1.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>>>>>>>>>>> post osd.1's logs?
>>>>>>>>>>>> -Sam
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug
>>>>>>>>>>>>> output disabled and log_file set to the empty value, hope it`s okay.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run
>>>>>>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>>>>>>> -Sam Just
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>>>>>>> <skip>
>>>>>>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>>>>>>> Total writes made:     989
>>>>>>>>>>>>>>> Write size:            4194304
>>>>>>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here a snip from osd log, seems write size is okay.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
>>>>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
>>>>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
>>>>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>>>>>>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting
>>>>>>>>>>>>>>>>> previous message with no success - both 8M and this value cause a
>>>>>>>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can
>>>>>>>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with
>>>>>>>>>>>>>>>>> sync option), following results were made:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback
>>>>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc.
>>>>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to
>>>>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that
>>>>>>>>>>>>>>>> to replace the writeback window.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>>>>>>>>>>>>>>>> same without oflag there and in the following samples)
>>>>>>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>>>>>>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>>>>>>>>>>>>>>>  Here the bench results, they`re almost equal on both nodes:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by
>>>>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or
>>>>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>>>>>>>>>>>>>>>> lot faster.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>>>>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is
>>>>>>>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>>>>>>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd,
>>>>>>>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of
>>>>>>>>>>>>>>>>> cores, nothing changes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>>>>>>> <gregory.farnum@dreamhost.com>  wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>>>>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both
>>>>>>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>>>>>>>>>>>>>>>> option was set in vm` config(instead of result from
>>>>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>>>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>>>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>>>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>>>>>>>>>>>>>>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>>>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>>>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>>>>>>>>>>>>>>>>> the ext4 without barriers.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>>>>>>>>>>>>>>>> network load graph and that`s all.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Mysteriously poor write performance
  2012-03-23 17:53                                         ` Samuel Just
@ 2012-03-24 19:09                                           ` Andrey Korolyov
  2012-03-27 16:39                                             ` Samuel Just
  0 siblings, 1 reply; 16+ messages in thread
From: Andrey Korolyov @ 2012-03-24 19:09 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-devel

http://xdel.ru/downloads/ceph-logs-dbg/

On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just <sam.just@dreamhost.com> wrote:
> (CCing the list)
>
> Actually, could you re-do the rados bench run with 'debug journal
> = 20' along with the other debugging?  That should give us better
> information.
>
> -Sam
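(Assuming the conventions of the ceph.conf quoted further down, the combined
debug settings for that run would look roughly like:

    [osd]
            debug osd = 20
            debug filestore = 20
            debug journal = 20
            debug ms = 1

with an osd restart afterwards so they take effect.)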
>
> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> Hi Sam,
>>
>> Can you please suggest on where to start profiling osd? If the
>> bottleneck has related to such non-complex things as directio speed,
>> I`m sure that I was able to catch it long ago, even crossing around by
>> results of other types of benchmarks at host system. I`ve just tried
>> tmpfs under both journals, it has a small boost effect, as expected
>> because of near-zero i/o delay. May be chunk distribution mechanism
>> does not work well on such small amount of nodes but right now I don`t
>> have necessary amount of hardware nodes to prove or disprove that.
>>
>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>> Starting 1 process
>>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>     issued r/w: total=0/40960, short=0/0
>>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>>>
>>>
>>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>> Our journal writes are actually sequential.  Could you send FIO
>>>> results for sequential 4k writes osd.0's journal and osd.1's journal?
>>>> -Sam
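(An invocation along these lines should match the numbers quoted above -- the
journal path is taken from the ceph.conf later in the thread, and writing to
it destroys the journal, so stop the osd first or aim at a scratch file on the
same device:

    fio --name=seq4k --rw=write --bs=4k --ioengine=sync --iodepth=2 \
        --direct=1 --size=160m --filename=/ceph/meta/journal
)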
>>>>
>>>> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>> FIO output for journal partition, directio enabled, seems good(same
>>>>> results for ext4 on other single sata disks).
>>>>>
>>>>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>>>> Starting 1 process
>>>>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>>>>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>>>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>>>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>>>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>>>>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>     issued r/w: total=0/40960, short=0/0
>>>>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>>>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>>>>     lat (msec): 500=0.04%
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>> (CCing the list)
>>>>>>
>>>>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>>>>> we write the operation to the journal.  In this case, that operation
>>>>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>>>>> only allow a limited number of ops in flight at a time, so this
>>>>>> latency is killing your throughput.  For comparison, the latency for
>>>>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>>>>> latency for writes to your osd.1 journal file?
>>>>>> -Sam
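(A rough way to measure that directly, assuming a scratch file on the same
filesystem as the osd.1 journal rather than the live journal itself:

    dd if=/dev/zero of=/ceph/meta/dd-latency-test bs=4k count=1024 oflag=direct,dsync
    rm /ceph/meta/dd-latency-test

With oflag=direct,dsync every 4k block is written synchronously, so the
reported throughput is essentially 4096 bytes divided by the per-write
latency.)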
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>>>>>> not Megabits.
>>>>>>>
>>>>>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>> [global]
>>>>>>>>       log dir = /ceph/out
>>>>>>>>       log_file = ""
>>>>>>>>       logger dir = /ceph/log
>>>>>>>>       pid file = /ceph/out/$type$id.pid
>>>>>>>> [mds]
>>>>>>>>       pid file = /ceph/out/$name.pid
>>>>>>>>       lockdep = 1
>>>>>>>>       mds log max segments = 2
>>>>>>>> [osd]
>>>>>>>>       lockdep = 1
>>>>>>>>       filestore_xattr_use_omap = 1
>>>>>>>>       osd data = /ceph/dev/osd$id
>>>>>>>>       osd journal = /ceph/meta/journal
>>>>>>>>       osd journal size = 100
>>>>>>>> [mon]
>>>>>>>>       lockdep = 1
>>>>>>>>       mon data = /ceph/dev/mon$id
>>>>>>>> [mon.0]
>>>>>>>>       host = 172.20.1.32
>>>>>>>>       mon addr = 172.20.1.32:6789
>>>>>>>> [mon.1]
>>>>>>>>       host = 172.20.1.33
>>>>>>>>       mon addr = 172.20.1.33:6789
>>>>>>>> [mon.2]
>>>>>>>>       host = 172.20.1.35
>>>>>>>>       mon addr = 172.20.1.35:6789
>>>>>>>> [osd.0]
>>>>>>>>       host = 172.20.1.32
>>>>>>>> [osd.1]
>>>>>>>>       host = 172.20.1.33
>>>>>>>> [mds.a]
>>>>>>>>       host = 172.20.1.32
>>>>>>>>
>>>>>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>>>>>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>>>>>>>> Simple performance tests on those fs shows ~133Mb/s for /ceph and
>>>>>>>> metadata/. Also both machines do not hold anything else which may
>>>>>>>> impact osd.
>>>>>>>>
>>>>>>>> Also please note of following:
>>>>>>>>
>>>>>>>> http://i.imgur.com/ZgFdO.png
>>>>>>>>
>>>>>>>> First two peaks are related to running rados bench, then goes cluster
>>>>>>>> recreation, automated debian install and final peaks are dd test.
>>>>>>>> Surely I can have more precise graphs, but current one probably enough
>>>>>>>> to state a situation - rbd utilizing about a quarter of possible
>>>>>>>> bandwidth(if we can count rados bench as 100%).
>>>>>>>>
>>>>>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>>>>>>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>>>>>>>> with the osd.1 journal disk?
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>> Oh, sorry - they probably inherited rights from log files, fixed.
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>>>>>>>
>>>>>>>>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>>>>>>>>>>>> - at osd1.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>>>>>>>>>>>> post osd.1's logs?
>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug
>>>>>>>>>>>>>> output disabled and log_file set to the empty value, hope it`s okay.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run
>>>>>>>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>>>>>>>> -Sam Just
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>>>>>>>> <skip>
>>>>>>>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>>>>>>>> Total writes made:     989
>>>>>>>>>>>>>>>> Write size:            4194304
>>>>>>>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here a snip from osd log, seems write size is okay.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
>>>>>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
>>>>>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
>>>>>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>>>>>>>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting
>>>>>>>>>>>>>>>>>> previous message with no success - both 8M and this value cause a
>>>>>>>>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can
>>>>>>>>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with
>>>>>>>>>>>>>>>>>> sync option), following results were made:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback
>>>>>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc.
>>>>>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to
>>>>>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that
>>>>>>>>>>>>>>>>> to replace the writeback window.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>>>>>>>>>>>>>>>>> same without oflag there and in the following samples)
>>>>>>>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>>>>>>>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>>>>>>>>>>>>>>>>  Here the bench results, they`re almost equal on both nodes:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by
>>>>>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or
>>>>>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>>>>>>>>>>>>>>>>> lot faster.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>>>>>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is
>>>>>>>>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>>>>>>>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd,
>>>>>>>>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of
>>>>>>>>>>>>>>>>>> cores, nothing changes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>>>>>>>> <gregory.farnum@dreamhost.com>  wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>>>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>>>>>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both
>>>>>>>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>>>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>>>>>>>>>>>>>>>>> option was set in vm` config(instead of result from
>>>>>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>>>>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>>>>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>>>>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>>>>>>>>>>>>>>>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>>>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>>>>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>>>>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>>>>>>>>>>>>>>>>>> the ext4 without barriers.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>>>>>>>>>>>>>>>>> network load graph and that`s all.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Mysteriously poor write performance
  2012-03-24 19:09                                           ` Andrey Korolyov
@ 2012-03-27 16:39                                             ` Samuel Just
  0 siblings, 0 replies; 16+ messages in thread
From: Samuel Just @ 2012-03-27 16:39 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

Sorry for the delayed reply... I've been tracking some issues that
cause high latency on our test machines, and they may be responsible
for your problems as well.  Could you retry those runs with the same
debugging and 'journal dio' set to false?

Thanks for your patience,
-Sam
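For reference, a minimal sketch of that change against the [osd] section
quoted earlier in the thread -- the osds need to be restarted for it to take
effect:

    [osd]
            lockdep = 1
            filestore_xattr_use_omap = 1
            osd data = /ceph/dev/osd$id
            osd journal = /ceph/meta/journal
            osd journal size = 100
            journal dio = false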

On Sat, Mar 24, 2012 at 12:09 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> http://xdel.ru/downloads/ceph-logs-dbg/
>
> On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>> (CCing the list)
>>
>> Actually, could you re-do the rados bench run with 'debug journal
>> = 20' along with the other debugging?  That should give us better
>> information.
>>
>> -Sam
>>
>> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>> Hi Sam,
>>>
>>> Can you please suggest on where to start profiling osd? If the
>>> bottleneck has related to such non-complex things as directio speed,
>>> I`m sure that I was able to catch it long ago, even crossing around by
>>> results of other types of benchmarks at host system. I`ve just tried
>>> tmpfs under both journals, it has a small boost effect, as expected
>>> because of near-zero i/o delay. May be chunk distribution mechanism
>>> does not work well on such small amount of nodes but right now I don`t
>>> have necessary amount of hardware nodes to prove or disprove that.
>>>
>>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>>>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>     issued r/w: total=0/40960, short=0/0
>>>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>>>>
>>>>
>>>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>> Our journal writes are actually sequential.  Could you send FIO
>>>>> results for sequential 4k writes osd.0's journal and osd.1's journal?
>>>>> -Sam
>>>>>
>>>>> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>> FIO output for journal partition, directio enabled, seems good(same
>>>>>> results for ext4 on other single sata disks).
>>>>>>
>>>>>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>>>>> Starting 1 process
>>>>>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>>>>>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>>>>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>>>>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>>>>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>>>>>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>>>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>     issued r/w: total=0/40960, short=0/0
>>>>>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>>>>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>>>>>     lat (msec): 500=0.04%
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>> (CCing the list)
>>>>>>>
>>>>>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>>>>>> we write the operation to the journal.  In this case, that operation
>>>>>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>>>>>> only allow a limited number of ops in flight at a time, so this
>>>>>>> latency is killing your throughput.  For comparison, the latency for
>>>>>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>>>>>> latency for writes to your osd.1 journal file?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s,
>>>>>>>> not Megabits.
>>>>>>>>
>>>>>>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>> [global]
>>>>>>>>>       log dir = /ceph/out
>>>>>>>>>       log_file = ""
>>>>>>>>>       logger dir = /ceph/log
>>>>>>>>>       pid file = /ceph/out/$type$id.pid
>>>>>>>>> [mds]
>>>>>>>>>       pid file = /ceph/out/$name.pid
>>>>>>>>>       lockdep = 1
>>>>>>>>>       mds log max segments = 2
>>>>>>>>> [osd]
>>>>>>>>>       lockdep = 1
>>>>>>>>>       filestore_xattr_use_omap = 1
>>>>>>>>>       osd data = /ceph/dev/osd$id
>>>>>>>>>       osd journal = /ceph/meta/journal
>>>>>>>>>       osd journal size = 100
>>>>>>>>> [mon]
>>>>>>>>>       lockdep = 1
>>>>>>>>>       mon data = /ceph/dev/mon$id
>>>>>>>>> [mon.0]
>>>>>>>>>       host = 172.20.1.32
>>>>>>>>>       mon addr = 172.20.1.32:6789
>>>>>>>>> [mon.1]
>>>>>>>>>       host = 172.20.1.33
>>>>>>>>>       mon addr = 172.20.1.33:6789
>>>>>>>>> [mon.2]
>>>>>>>>>       host = 172.20.1.35
>>>>>>>>>       mon addr = 172.20.1.35:6789
>>>>>>>>> [osd.0]
>>>>>>>>>       host = 172.20.1.32
>>>>>>>>> [osd.1]
>>>>>>>>>       host = 172.20.1.33
>>>>>>>>> [mds.a]
>>>>>>>>>       host = 172.20.1.32
>>>>>>>>>
>>>>>>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>>>>>>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>>>>>>>>> Simple performance tests on those fs shows ~133Mb/s for /ceph and
>>>>>>>>> metadata/. Also both machines do not hold anything else which may
>>>>>>>>> impact osd.
>>>>>>>>>
>>>>>>>>> Also please note of following:
>>>>>>>>>
>>>>>>>>> http://i.imgur.com/ZgFdO.png
>>>>>>>>>
>>>>>>>>> First two peaks are related to running rados bench, then goes cluster
>>>>>>>>> recreation, automated debian install and final peaks are dd test.
>>>>>>>>> Surely I can have more precise graphs, but current one probably enough
>>>>>>>>> to state a situation - rbd utilizing about a quarter of possible
>>>>>>>>> bandwidth(if we can count rados bench as 100%).
>>>>>>>>>
>>>>>>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>>>>>>>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>>>>>>>>> with the osd.1 journal disk?
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>> Oh, sorry - they probably inherited rights from log files, fixed.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>>>>>>>>> -Sam
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/
>>>>>>>>>>>>> - at osd1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>>>>>>>>>>>>> post osd.1's logs?
>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug
>>>>>>>>>>>>>>> output disabled and log_file set to the empty value, hope it`s okay.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@dreamhost.com> wrote:
>>>>>>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run
>>>>>>>>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>>>>>>>>> -Sam Just
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>>>>>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>>>>>>>>> <skip>
>>>>>>>>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>>>>>>>>> Total writes made:     989
>>>>>>>>>>>>>>>>> Write size:            4194304
>>>>>>>>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here a snip from osd log, seems write size is okay.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
>>>>>>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
>>>>>>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
>>>>>>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage
>>>>>>>>>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting
>>>>>>>>>>>>>>>>>>> previous message with no success - both 8M and this value cause a
>>>>>>>>>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can
>>>>>>>>>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with
>>>>>>>>>>>>>>>>>>> sync option), following results were made:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback
>>>>>>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc.
>>>>>>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to
>>>>>>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that
>>>>>>>>>>>>>>>>>> to replace the writeback window.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>>>>>>>>>>>>>>>>>> same without oflag there and in the following samples)
>>>>>>>>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse
>>>>>>>>>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before.
>>>>>>>>>>>>>>>>>>>  Here the bench results, they`re almost equal on both nodes:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by
>>>>>>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or
>>>>>>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a
>>>>>>>>>>>>>>>>>> lot faster.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>>>>>>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is
>>>>>>>>>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it
>>>>>>>>>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd,
>>>>>>>>>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of
>>>>>>>>>>>>>>>>>>> cores, nothing changes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>>>>>>>>> <gregory.farnum@dreamhost.com>  wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>>>>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>>>>>>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both
>>>>>>>>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>>>>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this
>>>>>>>>>>>>>>>>>>>>> option was set in vm` config(instead of result from
>>>>>>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>>>>>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been
>>>>>>>>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>>>>>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@newdream.net
>>>>>>>>>>>>>>>>>>>>> (mailto:sage@newdream.net)>  wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>>>>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three
>>>>>>>>>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth
>>>>>>>>>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on
>>>>>>>>>>>>>>>>>>>>>>> the ext4 without barriers.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>>>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>>>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>>>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>>>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>>>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>>>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>>>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>>>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>>>>>>>>>>>>>>>>>> network load graph and that`s all.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>>>> (mailto:majordomo@vger.kernel.org)
>>>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-03-27 16:39 UTC | newest]

Thread overview: 16+ messages
2012-03-17 11:35 Mysteriously poor write performance Andrey Korolyov
2012-03-18 18:22 ` Sage Weil
2012-03-19 13:46   ` Andrey Korolyov
2012-03-19 16:59     ` Greg Farnum
2012-03-19 18:13       ` Andrey Korolyov
2012-03-19 18:25         ` Greg Farnum
2012-03-19 18:40         ` Josh Durgin
2012-03-19 19:30           ` Andrey Korolyov
2012-03-20 20:37           ` Andrey Korolyov
2012-03-20 22:36             ` Samuel Just
     [not found]               ` <CABYiri9An0sYP6pP1xU_Xjz7yhXdv1eF-4q-DqtqygYH76rMHw@mail.gmail.com>
     [not found]                 ` <CACLRD_0BK0WVX6=-n3doF680efibNVjq_qB_DwxF73MXNWQ_LA@mail.gmail.com>
     [not found]                   ` <CABYiri--y7zyS_+GvNqCayQd6PXyU8R0GEiBC0BD885DK6w7Rw@mail.gmail.com>
     [not found]                     ` <CACLRD_1Ndcoz3+aOzn4O-7DLL-ePV8ujeGyaN9AeHXTxsO2AeA@mail.gmail.com>
     [not found]                       ` <CABYiri-J-+SsqXOq=ircP-v80rfiXjoFX20wLa0Pfc7KvyA0SQ@mail.gmail.com>
     [not found]                         ` <CACLRD_1PWntwwyWTWij1OuC+LaSU=jz5gittQ5vwqGSXQnyzeQ@mail.gmail.com>
     [not found]                           ` <CABYiri_bjt9Equj4JcoVsN5AEMxxX_qaNNBYZzCfi2rirsTCkA@mail.gmail.com>
     [not found]                             ` <CABYiri8TfJXC7j3L5QXXFO2nmtwiyoP=YSGCZse0VrsY+_zbLw@mail.gmail.com>
2012-03-21 21:20                               ` Samuel Just
     [not found]                                 ` <CABYiri8qHCv6=dFUc-8tFo9bxEtUTQhV1cE9K=CK8hbhW3u10A@mail.gmail.com>
2012-03-22 17:26                                   ` Samuel Just
2012-03-22 18:40                                     ` Andrey Korolyov
     [not found]                                       ` <CABYiri9SYaTFgb7GMPi_VPT1vDWV+O=Q_P-xibsBb-xjRU1E=g@mail.gmail.com>
2012-03-23 17:53                                         ` Samuel Just
2012-03-24 19:09                                           ` Andrey Korolyov
2012-03-27 16:39                                             ` Samuel Just
