* speedup ceph / scaling / find the bottleneck
@ 2012-06-29 10:46 Stefan Priebe - Profihost AG
2012-06-29 11:32 ` Alexandre DERUMIER
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-29 10:46 UTC (permalink / raw)
To: ceph-devel
Hello list,
I've done some further testing and have the problem that Ceph doesn't
scale for me. I added a 4th OSD server to my existing 3-node OSD
cluster. I also reformatted everything so I could start from a clean system.
While doing random 4k writes from two VMs I see about 8% idle on the OSD
servers (single Intel Xeon E5, 8 cores, 3.6 GHz). I believe this is
the limiting factor and also the reason why I don't see any improvement
from adding OSD servers.
3 nodes: 2 VMs: 7000 IOPS 4k writes, OSDs: 7-15% idle
4 nodes: 2 VMs: 7500 IOPS 4k writes, OSDs: 7-15% idle
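(For reference, the 4k random-write workload described here roughly corresponds to an fio job along the following lines; the exact job file was not posted, so the device name, queue depth and size are assumptions:)

```ini
; hypothetical fio job approximating the benchmark above
[randwrite-4k]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=32
size=1G
filename=/dev/vdb   ; the RBD-backed disk inside the VM
```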
Even if the CPU is not the limiting factor, I think it would be really
important to lower the CPU usage during 4k writes. The CPU is used
only by the ceph-osd process; I see nearly no usage by other processes
(only 5% by kworker and 5% by flush).
Could somebody recommend a way to debug this, so we know where all
this CPU usage goes?
Stefan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 10:46 speedup ceph / scaling / find the bottleneck Stefan Priebe - Profihost AG
@ 2012-06-29 11:32 ` Alexandre DERUMIER
2012-06-29 11:49 ` Mark Nelson
2012-06-29 12:33 ` Stefan Priebe - Profihost AG
2 siblings, 0 replies; 32+ messages in thread
From: Alexandre DERUMIER @ 2012-06-29 11:32 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG; +Cc: ceph-devel
I see something strange in my tests:
3 nodes (8 cores, E5420 @ 2.50GHz), 5 OSDs (xfs) per node with 15k drives, journal on tmpfs
KVM guest, with cache=writeback or cache=none (same result):
random write test with 4k blocks: 5000 IOPS, CPU 20% idle
sequential write test with 4k blocks: 20000 IOPS, CPU 80% idle (I'm saturating my gigabit link)
So what's the difference in the OSD between random and sequential writes if the blocks are the same size?
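(A back-of-envelope check on these numbers, my arithmetic rather than anything stated in the thread: 20000 sequential 4k writes/s is roughly 82 MB/s at the client, and if each write is replicated twice the cluster-side traffic is about 164 MB/s, beyond a single gigabit link's 125 MB/s line rate:)

```python
# Rough throughput arithmetic for the sequential 4k test (assumes 2x replication).
iops = 20_000
block_size = 4096                      # bytes

client_mb_s = iops * block_size / 1e6  # data leaving the client
gigabit_mb_s = 1e9 / 8 / 1e6           # raw line rate of a 1 Gbit/s link
cluster_mb_s = client_mb_s * 2         # total written if replication = 2

print(round(client_mb_s, 1))
print(round(gigabit_mb_s, 1))
print(round(cluster_mb_s, 1))
```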
----- Original Message -----
From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: ceph-devel@vger.kernel.org
Sent: Friday, June 29, 2012 12:46:42
Subject: speedup ceph / scaling / find the bottleneck
Hello list,
I've done some further testing and have the problem that Ceph doesn't
scale for me. I added a 4th OSD server to my existing 3-node OSD
cluster. I also reformatted everything so I could start from a clean system.
While doing random 4k writes from two VMs I see about 8% idle on the OSD
servers (single Intel Xeon E5, 8 cores, 3.6 GHz). I believe this is
the limiting factor and also the reason why I don't see any improvement
from adding OSD servers.
3 nodes: 2 VMs: 7000 IOPS 4k writes, OSDs: 7-15% idle
4 nodes: 2 VMs: 7500 IOPS 4k writes, OSDs: 7-15% idle
Even if the CPU is not the limiting factor, I think it would be really
important to lower the CPU usage during 4k writes. The CPU is used
only by the ceph-osd process; I see nearly no usage by other processes
(only 5% by kworker and 5% by flush).
Could somebody recommend a way to debug this, so we know where all
this CPU usage goes?
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Alexandre Derumier
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 10:46 speedup ceph / scaling / find the bottleneck Stefan Priebe - Profihost AG
2012-06-29 11:32 ` Alexandre DERUMIER
@ 2012-06-29 11:49 ` Mark Nelson
2012-06-29 13:02 ` Stefan Priebe - Profihost AG
2012-06-29 12:33 ` Stefan Priebe - Profihost AG
2 siblings, 1 reply; 32+ messages in thread
From: Mark Nelson @ 2012-06-29 11:49 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG; +Cc: ceph-devel
On 6/29/12 5:46 AM, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> i've made some further testing and have the problem that ceph doesn't
> scale for me. I added a 4th osd server to my existing 3 node osd
> cluster. I also reformated all to be able to start with a clean system.
>
> While doing random 4k writes from two VMs i see about 8% idle on the osd
> servers (Single Intel Xeon E5 8 cores 3,6Ghz). I believe that this is
> the limiting factor and also the reason why i don't see any improvement
> by adding osd servers.
>
> 3 nodes: 2VMS: 7000 IOp/s 4k writes osds: 7-15% idle
> 4 nodes: 2VMS: 7500 IOp/s 4k writes osds: 7-15% idle
>
> Even the cpu is not the limiting factor i think it would be really
> important to lower the CPU usage while doing 4k writes. The CPU is only
> used by the ceph-osd process. I see nearly no usage by other processes
> (only 5% by kworker and 5% flush).
>
> Could somebody recommand me a way to debug this? So we know where all
> this CPU usage goes?
Hi Stefan,
I'll try to replicate your findings in house. I've got some other
things I have to do today, but hopefully I can take a look next week.
If I recall correctly, in the other thread you said that sequential
writes are using much less CPU time on your systems? Do you see better
scaling in that case?
To figure out where the CPU time is being used, you could try various options:
oprofile, perf, valgrind, strace. Each has its own advantages.
Here's how you can create a simple callgraph with perf:
http://lwn.net/Articles/340010/
A more general tutorial is here:
https://perf.wiki.kernel.org/index.php/Tutorial
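(In addition to the profilers Mark lists, a crude first cut is to sample utime/stime from /proc for the ceph-osd PID. The sketch below is my illustration, not a Ceph tool; perf would then attribute the time to actual functions:)

```python
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second, usually 100

def cpu_seconds(pid):
    # Fields 14 and 15 of /proc/<pid>/stat are utime and stime, in ticks.
    with open(f"/proc/{pid}/stat") as f:
        rest = f.read().rsplit(")", 1)[1].split()
    return (int(rest[11]) + int(rest[12])) / CLK_TCK

def cpu_percent(pid, interval=5.0):
    # Percentage of one core consumed by <pid> over <interval> seconds.
    before = cpu_seconds(pid)
    time.sleep(interval)
    return 100.0 * (cpu_seconds(pid) - before) / interval

print(cpu_percent(os.getpid(), 0.5))  # sample this process itself
```

Run it against each ceph-osd PID while the 4k random-write test is going to see how much of a core each daemon burns.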
Mark
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 10:46 speedup ceph / scaling / find the bottleneck Stefan Priebe - Profihost AG
2012-06-29 11:32 ` Alexandre DERUMIER
2012-06-29 11:49 ` Mark Nelson
@ 2012-06-29 12:33 ` Stefan Priebe - Profihost AG
2 siblings, 0 replies; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-29 12:33 UTC (permalink / raw)
To: ceph-devel
Some more testing / results:
===== lowering CPU cores =====
1.) Disabling CPUs via
echo 0 >/sys/devices/system/cpu/cpuX/online
for cores 4-7 does not change anything.
2.) When only 50% of the CPUs are available, each ceph-osd process takes
only half of the CPU load it uses when all are usable.
3.) Even then, iops stay at 14k.
===== changing replication level =====
1.) Even changing the replication level from 2 to 1 still results in 14k
iops.
===== changing random I/O to sequential I/O =====
1.) When I do a write test with 4k blocks instead of randwrite,
I get values jumping between 13k and 30k; the average is 18k.
2.) The interesting thing here is that the ceph-osd processes take just
1% CPU load.
===== direct I/O to disk =====
1.) When I write directly to the OSD disk from the system itself, I
achieve around 25000 iops.
2.) Since with Ceph the load should spread across several disks, I would
expect higher, not lower, iops, even with the network involved.
Stefan
On 29.06.2012 12:46, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> i've made some further testing and have the problem that ceph doesn't
> scale for me. I added a 4th osd server to my existing 3 node osd
> cluster. I also reformated all to be able to start with a clean system.
>
> While doing random 4k writes from two VMs i see about 8% idle on the osd
> servers (Single Intel Xeon E5 8 cores 3,6Ghz). I believe that this is
> the limiting factor and also the reason why i don't see any improvement
> by adding osd servers.
>
> 3 nodes: 2VMS: 7000 IOp/s 4k writes osds: 7-15% idle
> 4 nodes: 2VMS: 7500 IOp/s 4k writes osds: 7-15% idle
>
> Even the cpu is not the limiting factor i think it would be really
> important to lower the CPU usage while doing 4k writes. The CPU is only
> used by the ceph-osd process. I see nearly no usage by other processes
> (only 5% by kworker and 5% flush).
>
> Could somebody recommand me a way to debug this? So we know where all
> this CPU usage goes?
>
> Stefan
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 11:49 ` Mark Nelson
@ 2012-06-29 13:02 ` Stefan Priebe - Profihost AG
2012-06-29 13:11 ` Stefan Priebe - Profihost AG
2012-06-29 15:28 ` Sage Weil
0 siblings, 2 replies; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-29 13:02 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel
On 29.06.2012 13:49, Mark Nelson wrote:
> I'll try to replicate your findings in house. I've got some other
> things I have to do today, but hopefully I can take a look next week. If
> I recall correctly, in the other thread you said that sequential writes
> are using much less CPU time on your systems?
Random 4k writes: 10% idle
Seq 4k writes: !! 99.7% !! idle
Seq 4M writes: 90% idle
> Do you see better scaling in that case?
3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops
Seq 4k writes: 19900 iops
2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops each VM 1
Seq 4k writes: 18500 iops each VM 2
4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops
Seq 4k writes: 19000 iops
2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each
> To figure out where CPU is being used, you could try various options:
> oprofile, perf, valgrind, strace. Each has it's own advantages.
>
> Here's how you can create a simple callgraph with perf:
>
> http://lwn.net/Articles/340010/
10s perf data output while doing random 4k writes:
https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt
Stefan
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 13:02 ` Stefan Priebe - Profihost AG
@ 2012-06-29 13:11 ` Stefan Priebe - Profihost AG
2012-06-29 13:16 ` Stefan Priebe - Profihost AG
2012-06-29 15:28 ` Sage Weil
1 sibling, 1 reply; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-29 13:11 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel, Alexandre DERUMIER
Another BIG hint.
While doing random 4k I/O from one VM I achieve 14k IOPS. That is
around 54 MB/s. But EACH ceph-osd machine is writing between 500 MB/s and
750 MB/s. What do they write?!
Just an idea:
do they completely rewrite EACH 4MB block for each 4k write?
Stefan
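(Some context on the 4MB question: RBD stripes an image over fixed-size objects, 4 MiB by default, and a 4k write is sent only to the object covering that offset, not to the whole image. The mapping can be sketched like this; this is my illustration, not librbd code:)

```python
OBJECT_SIZE = 4 << 20  # default RBD object size: 4 MiB

def objects_for(offset, length=4096):
    """Return the indices of the objects a write to [offset, offset+length) touches."""
    first = offset // OBJECT_SIZE
    last = (offset + length - 1) // OBJECT_SIZE
    return list(range(first, last + 1))

print(objects_for(1 << 30))             # a 4k write at 1 GiB lands in one object
print(objects_for(OBJECT_SIZE - 2048))  # only boundary-straddling writes touch two
```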
On 29.06.2012 15:02, Stefan Priebe - Profihost AG wrote:
> On 29.06.2012 13:49, Mark Nelson wrote:
>> I'll try to replicate your findings in house. I've got some other
>> things I have to do today, but hopefully I can take a look next week. If
>> I recall correctly, in the other thread you said that sequential writes
>> are using much less CPU time on your systems?
>
> Random 4k writes: 10% idle
> Seq 4k writes: !! 99,7% !! idle
> Seq 4M writes: 90% idle
>
>
> > Do you see better scaling in that case?
>
> 3 osd nodes:
> 1 VM:
> Rand 4k writes: 7000 iops
> Seq 4k writes: 19900 iops
>
> 2 VMs:
> Rand 4k writes: 6000 iops each
> Seq 4k writes: 4000 iops each VM 1
> Seq 4k writes: 18500 iops each VM 2
>
>
> 4 osd nodes:
> 1 VM:
> Rand 4k writes: 14400 iops
> Seq 4k writes: 19000 iops
>
> 2 VMs:
> Rand 4k writes: 7000 iops each
> Seq 4k writes: 18000 iops each
>
>
>
>> To figure out where CPU is being used, you could try various options:
>> oprofile, perf, valgrind, strace. Each has it's own advantages.
>>
>> Here's how you can create a simple callgraph with perf:
>>
>> http://lwn.net/Articles/340010/
> 10s perf data output while doing random 4k writes:
> https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt
>
>
> Stefan
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 13:11 ` Stefan Priebe - Profihost AG
@ 2012-06-29 13:16 ` Stefan Priebe - Profihost AG
2012-06-29 13:22 ` Stefan Priebe - Profihost AG
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-29 13:16 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel, Alexandre DERUMIER
A big sorry: Ceph was scrubbing during my last test. I didn't notice this.
When I redo the test, I see writes between 20 MB/s and 100 MB/s. That is
OK. Sorry.
Stefan
On 29.06.2012 15:11, Stefan Priebe - Profihost AG wrote:
> Another BIG hint.
>
> While doing random 4k I/O from one VM i archieve 14k I/Os. This is
> around 54MB/s. But EACH ceph-osd machine is writing between 500MB/s and
> 750MB/s. What do they write?!?!
>
> Just an idea?:
> Do they completely rewrite EACH 4MB block for each 4k write?
>
> Stefan
>
> On 29.06.2012 15:02, Stefan Priebe - Profihost AG wrote:
>> On 29.06.2012 13:49, Mark Nelson wrote:
>>> I'll try to replicate your findings in house. I've got some other
>>> things I have to do today, but hopefully I can take a look next week. If
>>> I recall correctly, in the other thread you said that sequential writes
>>> are using much less CPU time on your systems?
>>
>> Random 4k writes: 10% idle
>> Seq 4k writes: !! 99,7% !! idle
>> Seq 4M writes: 90% idle
>>
>>
>> > Do you see better scaling in that case?
>>
>> 3 osd nodes:
>> 1 VM:
>> Rand 4k writes: 7000 iops
>> Seq 4k writes: 19900 iops
>>
>> 2 VMs:
>> Rand 4k writes: 6000 iops each
>> Seq 4k writes: 4000 iops each VM 1
>> Seq 4k writes: 18500 iops each VM 2
>>
>>
>> 4 osd nodes:
>> 1 VM:
>> Rand 4k writes: 14400 iops
>> Seq 4k writes: 19000 iops
>>
>> 2 VMs:
>> Rand 4k writes: 7000 iops each
>> Seq 4k writes: 18000 iops each
>>
>>
>>
>>> To figure out where CPU is being used, you could try various options:
>>> oprofile, perf, valgrind, strace. Each has it's own advantages.
>>>
>>> Here's how you can create a simple callgraph with perf:
>>>
>>> http://lwn.net/Articles/340010/
>> 10s perf data output while doing random 4k writes:
>> https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt
>>
>>
>>
>> Stefan
>
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 13:16 ` Stefan Priebe - Profihost AG
@ 2012-06-29 13:22 ` Stefan Priebe - Profihost AG
0 siblings, 0 replies; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-29 13:22 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel, Alexandre DERUMIER
iostat output via iostat -x -t 5 while doing 4k random writes:
06/29/2012 03:20:55 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          31,63   0,00    52,64     0,78    0,00  14,95
Device:  rrqm/s   wrqm/s  r/s   w/s      rsec/s  wsec/s    avgrq-sz  avgqu-sz  await  svctm  %util
sdb      0,00     690,40  0,00  3143,60  0,00    33958,80  10,80     2,68      0,85   0,08   24,08
sdc      0,00    1069,80  0,00  5151,60  0,00    54693,00  10,62     8,31      1,61   0,06   29,68
sdd      0,00     581,00  0,00  2762,80  0,00    27809,00  10,07     2,45      0,89   0,08   21,12
sde      0,00     820,00  0,00  4208,20  0,00    43457,40  10,33     4,00      0,95   0,07   28,56
sda      0,00       0,00  0,00     0,40  0,00        9,60  24,00     0,00      0,00   0,00    0,00
06/29/2012 03:21:00 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          29,68   0,00    52,89     0,98    0,00  16,45
Device:  rrqm/s   wrqm/s  r/s   w/s      rsec/s  wsec/s    avgrq-sz  avgqu-sz  await  svctm  %util
sdb      0,00    1046,60  0,00  5544,20  0,00    57938,00  10,45     6,08      1,10   0,06   32,08
sdc      0,00     115,60  0,00  3483,60  0,00    29368,00   8,43     3,45      0,99   0,06   21,36
sdd      0,00    1143,20  0,00  5991,00  0,00    62607,40  10,45     6,03      1,01   0,06   35,20
sde      0,00    1070,00  0,00  5561,60  0,00    58207,20  10,47     5,76      1,04   0,07   38,08
sda      0,00       0,00  0,00     0,00  0,00        0,00   0,00     0,00      0,00   0,00    0,00
06/29/2012 03:21:05 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          29,69   0,00    53,06     0,60    0,00  16,65
Device:  rrqm/s   wrqm/s  r/s   w/s      rsec/s  wsec/s    avgrq-sz  avgqu-sz  await  svctm  %util
sdb      0,00     199,60  0,00  4484,40  0,00    41338,20   9,22     1,96      0,44   0,07   30,56
sdc      0,00     766,60  0,00  3616,20  0,00    38829,00  10,74     3,62      1,00   0,07   25,68
sdd      0,00     149,20  0,00  5066,60  0,00    45793,60   9,04     4,48      0,89   0,06   28,48
sde      0,00     150,00  0,00  4328,80  0,00    36496,00   8,43     2,96      0,68   0,07   32,40
sda      0,00       0,00  0,00     0,40  0,00       35,20  88,00     0,00      0,00   0,00    0,00
06/29/2012 03:21:10 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          29,11   0,00    46,58     0,50    0,00  23,81
Device:  rrqm/s   wrqm/s  r/s   w/s      rsec/s  wsec/s    avgrq-sz  avgqu-sz  await  svctm  %util
sdb      0,00     881,20  0,00  3077,20  0,00    33382,80  10,85     3,44      1,12   0,06   18,16
sdc      0,00     867,60  0,00  5098,40  0,00    52056,20  10,21     5,65      1,11   0,05   24,32
sdd      0,00     864,40  0,00  2759,00  0,00    30321,60  10,99     3,39      1,23   0,06   17,36
sde      0,00     846,20  0,00  3193,40  0,00    36795,60  11,52     3,48      1,09   0,06   19,92
sda      0,00       0,00  0,00     1,40  0,00       11,20   8,00     0,01      4,57   2,29    0,32
On 29.06.2012 15:16, Stefan Priebe - Profihost AG wrote:
> Big sorry. ceph was scrubbing during my last test. Didn't recognized this.
>
> When i redo the test i see writes between 20MB/s and 100Mb/s. That is
> OK. Sorry.
>
> Stefan
>
> On 29.06.2012 15:11, Stefan Priebe - Profihost AG wrote:
>> Another BIG hint.
>>
>> While doing random 4k I/O from one VM i archieve 14k I/Os. This is
>> around 54MB/s. But EACH ceph-osd machine is writing between 500MB/s and
>> 750MB/s. What do they write?!?!
>>
>> Just an idea?:
>> Do they completely rewrite EACH 4MB block for each 4k write?
>>
>> Stefan
>>
>> On 29.06.2012 15:02, Stefan Priebe - Profihost AG wrote:
>>> On 29.06.2012 13:49, Mark Nelson wrote:
>>>> I'll try to replicate your findings in house. I've got some other
>>>> things I have to do today, but hopefully I can take a look next
>>>> week. If
>>>> I recall correctly, in the other thread you said that sequential writes
>>>> are using much less CPU time on your systems?
>>>
>>> Random 4k writes: 10% idle
>>> Seq 4k writes: !! 99,7% !! idle
>>> Seq 4M writes: 90% idle
>>>
>>>
>>> > Do you see better scaling in that case?
>>>
>>> 3 osd nodes:
>>> 1 VM:
>>> Rand 4k writes: 7000 iops
>>> Seq 4k writes: 19900 iops
>>>
>>> 2 VMs:
>>> Rand 4k writes: 6000 iops each
>>> Seq 4k writes: 4000 iops each VM 1
>>> Seq 4k writes: 18500 iops each VM 2
>>>
>>>
>>> 4 osd nodes:
>>> 1 VM:
>>> Rand 4k writes: 14400 iops
>>> Seq 4k writes: 19000 iops
>>>
>>> 2 VMs:
>>> Rand 4k writes: 7000 iops each
>>> Seq 4k writes: 18000 iops each
>>>
>>>
>>>
>>>> To figure out where CPU is being used, you could try various options:
>>>> oprofile, perf, valgrind, strace. Each has it's own advantages.
>>>>
>>>> Here's how you can create a simple callgraph with perf:
>>>>
>>>> http://lwn.net/Articles/340010/
>>> 10s perf data output while doing random 4k writes:
>>> https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt
>>>
>>>
>>>
>>>
>>> Stefan
>>
>
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 13:02 ` Stefan Priebe - Profihost AG
2012-06-29 13:11 ` Stefan Priebe - Profihost AG
@ 2012-06-29 15:28 ` Sage Weil
2012-06-29 21:18 ` Stefan Priebe
1 sibling, 1 reply; 32+ messages in thread
From: Sage Weil @ 2012-06-29 15:28 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG; +Cc: Mark Nelson, ceph-devel
On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:
> On 29.06.2012 13:49, Mark Nelson wrote:
> > I'll try to replicate your findings in house. I've got some other
> > things I have to do today, but hopefully I can take a look next week. If
> > I recall correctly, in the other thread you said that sequential writes
> > are using much less CPU time on your systems?
>
> Random 4k writes: 10% idle
> Seq 4k writes: !! 99,7% !! idle
> Seq 4M writes: 90% idle
I take it 'rbd cache = true'? It sounds like librbd (or the guest file
system) is coalescing the sequential writes into big writes. I'm a bit
surprised that the 4k ones have lower CPU utilization, but there are lots
of opportunity for noise there, so I wouldn't read too far into it yet.
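(For readers following along: the client-side cache Sage is asking about is configured in the [client] section of ceph.conf. The sizes below are illustrative defaults, not values from this thread:)

```ini
[client]
    rbd cache = true
    rbd cache size = 33554432            ; 32 MiB of write-back cache
    rbd cache max dirty = 25165824       ; writes block once 24 MiB is dirty
    rbd cache target dirty = 16777216    ; background flushing starts at 16 MiB
```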
> > Do you see better scaling in that case?
>
> 3 osd nodes:
> 1 VM:
> Rand 4k writes: 7000 iops
> Seq 4k writes: 19900 iops
>
> 2 VMs:
> Rand 4k writes: 6000 iops each
> Seq 4k writes: 4000 iops each VM 1
> Seq 4k writes: 18500 iops each VM 2
>
>
> 4 osd nodes:
> 1 VM:
> Rand 4k writes: 14400 iops <------ ????
Can you double-check this number?
> Seq 4k writes: 19000 iops
>
> 2 VMs:
> Rand 4k writes: 7000 iops each
> Seq 4k writes: 18000 iops each
With the exception of that one number above, it really sounds like the
bottleneck is in the client (VM or librbd+librados) and not in the
cluster. Performance won't improve when you add OSDs if the limiting
factor is the client's ability to dispatch/stream/sustain IOs. That also
seems consistent with the fact that limiting the # of CPUs on the OSDs
doesn't affect much.
Above, with 2 VMs, for instance, your total iops for the cluster doubled
(36000 total). Can you try with 4 VMs and see if it continues to scale in
that dimension? At some point you will start to saturate the OSDs, and at
that point adding more OSDs should show aggregate throughput going up.
I think the typical way to approach this is to first scale the client side
independently to get the iops-per-osd figure, then pick a reasonable ratio
between the two, then scale both the client and server side proportionally
to make sure the load distribution and network infrastructure scale
properly.
sage
>
>
>
> > To figure out where CPU is being used, you could try various options:
> > oprofile, perf, valgrind, strace. Each has it's own advantages.
> >
> > Here's how you can create a simple callgraph with perf:
> >
> > http://lwn.net/Articles/340010/
> 10s perf data output while doing random 4k writes:
> https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt
>
> Stefan
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 15:28 ` Sage Weil
@ 2012-06-29 21:18 ` Stefan Priebe
2012-07-01 21:01 ` Stefan Priebe
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Priebe @ 2012-06-29 21:18 UTC (permalink / raw)
To: Sage Weil; +Cc: Mark Nelson, ceph-devel
On 29.06.2012 17:28, Sage Weil wrote:
> On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:
>> On 29.06.2012 13:49, Mark Nelson wrote:
>>> I'll try to replicate your findings in house. I've got some other
>>> things I have to do today, but hopefully I can take a look next week. If
>>> I recall correctly, in the other thread you said that sequential writes
>>> are using much less CPU time on your systems?
>>
>> Random 4k writes: 10% idle
>> Seq 4k writes: !! 99,7% !! idle
>> Seq 4M writes: 90% idle
>
> I take it 'rbd cache = true'?
Yes
> It sounds like librbd (or the guest file
> system) is coalescing the sequential writes into big writes. I'm a bit
> surprised that the 4k ones have lower CPU utilization, but there are lots
> of opportunity for noise there, so I wouldn't read too far into it yet.
90% to 99.7% is OK; the ~9% goes to the flush, kworker and xfs processes.
It was the overall system load, not just ceph-osd.
>>> Do you see better scaling in that case?
>>
>> 3 osd nodes:
>> 1 VM:
>> Rand 4k writes: 7000 iops
<-- this one is WRONG! sorry it is 14100 iops
>> Seq 4k writes: 19900 iops
>>
>> 2 VMs:
>> Rand 4k writes: 6000 iops each
>> Seq 4k writes: 4000 iops VM 1
>> Seq 4k writes: 18500 iops VM 2
>>
>>
>> 4 osd nodes:
>> 1 VM:
>> Rand 4k writes: 14400 iops <------ ????
>
> Can you double-check this number?
Triple checked, BUT I see that the rand 4k write number with 3 osd nodes
was wrong. Sorry.
>> Seq 4k writes: 19000 iops
>>
>> 2 VMs:
>> Rand 4k writes: 7000 iops each
>> Seq 4k writes: 18000 iops each
>
> With the exception of that one number above, it really sounds like the
> bottleneck is in the client (VM or librbd+librados) and not in the
> cluster. Performance won't improve when you add OSDs if the limiting
> factor is the clients ability to dispatch/stream/sustatin IOs. That also
> seems concistent with the fact that limiting the # of CPUs on the OSDs
> doesn't affect much.
ACK
> Aboe, with 2 VMs, for instance, your total iops for the cluster doubled
> (36000 total). Can you try with 4 VMs and see if it continues to scale in
> that dimension? At some point you will start to saturate the OSDs, and at
> that point adding more OSDs should show aggregate throughput going up.
Where did you get that value from? It scales with VMs on some points, but
it does not scale with OSDs.
Stefan
* Re: speedup ceph / scaling / find the bottleneck
2012-06-29 21:18 ` Stefan Priebe
@ 2012-07-01 21:01 ` Stefan Priebe
2012-07-01 21:13 ` Mark Nelson
2012-07-02 13:19 ` Stefan Priebe - Profihost AG
0 siblings, 2 replies; 32+ messages in thread
From: Stefan Priebe @ 2012-07-01 21:01 UTC (permalink / raw)
To: Sage Weil; +Cc: Mark Nelson, ceph-devel
Hello list,
Hello Sage,
I've done some further tests.
Sequential 4k writes over 200GB: 300% CPU usage of the kvm process, 34712 iops
Random 4k writes over 200GB: 170% CPU usage of the kvm process, 5500 iops
When I do random 4k writes over 100MB: 450% CPU usage of the kvm process
and !! 25059 iops !!
Random 4k writes over 1GB: 380% CPU usage of the kvm process, 14387 iops
So the range in which the random I/O happens seems to be important, and
the CPU usage just seems to reflect the iops.
So I'm not sure the problem really is the client rbd driver. Mark, I
hope you can run some tests next week.
Greets
Stefan
On 29.06.2012 23:18, Stefan Priebe wrote:
> On 29.06.2012 17:28, Sage Weil wrote:
>> On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:
>>> On 29.06.2012 13:49, Mark Nelson wrote:
>>>> I'll try to replicate your findings in house. I've got some other
>>>> things I have to do today, but hopefully I can take a look next
>>>> week. If
>>>> I recall correctly, in the other thread you said that sequential writes
>>>> are using much less CPU time on your systems?
>>>
>>> Random 4k writes: 10% idle
>>> Seq 4k writes: !! 99,7% !! idle
>>> Seq 4M writes: 90% idle
>>
>> I take it 'rbd cache = true'?
> Yes
>
>> It sounds like librbd (or the guest file
>> system) is coalescing the sequential writes into big writes. I'm a bit
>> surprised that the 4k ones have lower CPU utilization, but there are lots
>> of opportunity for noise there, so I wouldn't read too far into it yet.
> 90 to 99,7 is OK the 9% goes to flush, kworker and xfs processes. It was
> the overall system load. Not just ceph-osd.
>
>>>> Do you see better scaling in that case?
>>>
>>> 3 osd nodes:
>>> 1 VM:
>>> Rand 4k writes: 7000 iops
> <-- this one is WRONG! sorry it is 14100 iops
>
>
>>> Seq 4k writes: 19900 iops
>>>
>>> 2 VMs:
>>> Rand 4k writes: 6000 iops each
>>> Seq 4k writes: 4000 iops VM 1
>>> Seq 4k writes: 18500 iops VM 2
>>>
>>>
>>> 4 osd nodes:
>>> 1 VM:
>>> Rand 4k writes: 14400 iops <------ ????
>>
>> Can you double-check this number?
> Triple checked BUT i see the the Rand 4k writes with 3 osd nodes was
> wrong. Sorry.
>
>>> Seq 4k writes: 19000 iops
>>>
>>> 2 VMs:
>>> Rand 4k writes: 7000 iops each
>>> Seq 4k writes: 18000 iops each
>>
>> With the exception of that one number above, it really sounds like the
>> bottleneck is in the client (VM or librbd+librados) and not in the
>> cluster. Performance won't improve when you add OSDs if the limiting
>> factor is the clients ability to dispatch/stream/sustatin IOs. That also
>> seems concistent with the fact that limiting the # of CPUs on the OSDs
>> doesn't affect much.
> ACK
>
>> Aboe, with 2 VMs, for instance, your total iops for the cluster doubled
>> (36000 total). Can you try with 4 VMs and see if it continues to
>> scale in
>> that dimension? At some point you will start to saturate the OSDs,
>> and at
>> that point adding more OSDs should show aggregate throughput going up.
> From where did you get that value? It scales to VMs on some points but
> it does not scale with OSDs.
>
> Stefan
* Re: speedup ceph / scaling / find the bottleneck
2012-07-01 21:01 ` Stefan Priebe
@ 2012-07-01 21:13 ` Mark Nelson
2012-07-01 21:27 ` Stefan Priebe
2012-07-02 13:19 ` Stefan Priebe - Profihost AG
1 sibling, 1 reply; 32+ messages in thread
From: Mark Nelson @ 2012-07-01 21:13 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Sage Weil, ceph-devel
On 7/1/12 4:01 PM, Stefan Priebe wrote:
> Hello list,
> Hello sage,
>
> i've made some further tests.
>
> Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops
>
> Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops
>
> When i make random 4k writes over 100MB: 450% CPU usage of kvm process
> and !! 25059 iops !!
>
When you say 100MB vs 200GB, do you mean the total amount of data that
is written for the test? Also, are these starting out on a fresh
filesystem? Recently I've been working on tracking down an issue where
small write performance is degrading as data is written. The tests I've
done have been for sequential writes, but I wonder if the problem may be
significantly worse with random writes.
> Random 4k writes over 1GB: 380% CPU usage of kvm process 14387 iops
>
> So the range where the random I/O happen seem to be important and the
> cpu usage just seem to reflect the iops.
>
> So i'm not sure if the problem is really the client rbd driver. Mark i
> hope you can make some tests next week.
I need to get perf set up on our test boxes, but once I do, I'm
hoping to follow up on this.
>
> Greets
> Stefan
>
Mark
* Re: speedup ceph / scaling / find the bottleneck
2012-07-01 21:13 ` Mark Nelson
@ 2012-07-01 21:27 ` Stefan Priebe
2012-07-02 5:02 ` Alexandre DERUMIER
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Priebe @ 2012-07-01 21:27 UTC (permalink / raw)
To: Mark Nelson; +Cc: Sage Weil, ceph-devel
On 01.07.2012 23:13, Mark Nelson wrote:
> On 7/1/12 4:01 PM, Stefan Priebe wrote:
>> Hello list,
>> Hello sage,
>>
>> i've made some further tests.
>>
>> Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops
>>
>> Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops
>>
>> When i make random 4k writes over 100MB: 450% CPU usage of kvm process
>> and !! 25059 iops !!
>>
> When you say 100MB vs 200GB, do you mean the total amount of data that
> is written for the test?
Yes and no: it is the maximum amount of data written, but for random I/O
it is also the range, i.e. the random block device position between 0
and X at which to write the 4k block.
> Also, are these starting out on a fresh
> filesystem?
Yes, five minutes old in this case ;-)
Stefan
* Re: speedup ceph / scaling / find the bottleneck
2012-07-01 21:27 ` Stefan Priebe
@ 2012-07-02 5:02 ` Alexandre DERUMIER
2012-07-02 6:12 ` Stefan Priebe - Profihost AG
0 siblings, 1 reply; 32+ messages in thread
From: Alexandre DERUMIER @ 2012-07-02 5:02 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Sage Weil, ceph-devel, Mark Nelson
Hi,
my 2 cents:
maybe with a lower range (like 100MB) of random I/O,
you have a better chance of them being aggregated into 4MB blocks?
I'll do some tests today with my 15k drives.
----- Original Message -----
From: "Stefan Priebe" <s.priebe@profihost.ag>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Sage Weil" <sage@inktank.com>, ceph-devel@vger.kernel.org
Sent: Sunday, July 1, 2012 23:27:30
Subject: Re: speedup ceph / scaling / find the bottleneck
On 01.07.2012 23:13, Mark Nelson wrote:
> On 7/1/12 4:01 PM, Stefan Priebe wrote:
>> Hello list,
>> Hello sage,
>>
>> i've made some further tests.
>>
>> Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops
>>
>> Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops
>>
>> When i make random 4k writes over 100MB: 450% CPU usage of kvm process
>> and !! 25059 iops !!
>>
> When you say 100MB vs 200GB, do you mean the total amount of data that
> is written for the test?
Yes and no: it is the maximum amount of data written, but for random I/O it
is also the range, i.e. the random block device position between 0 and X
where the 4K block is written.
> Also, are these starting out on a fresh
> filesystem?
Yes, 5 Min old in this case ;-)
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Alexandre D e rumier
Ingénieur Systèmes et Réseaux
Fixe : 03 20 68 88 85
Fax : 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-02 5:02 ` Alexandre DERUMIER
@ 2012-07-02 6:12 ` Stefan Priebe - Profihost AG
2012-07-02 16:51 ` Gregory Farnum
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-07-02 6:12 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: Sage Weil, ceph-devel, Mark Nelson
On 02.07.2012 07:02, Alexandre DERUMIER wrote:
> Hi,
> my 2cent,
> maybe with a lower range (like 100MB) of random I/O,
> you have a better chance of aggregating them into 4MB blocks?
Yes maybe. If you have just a range of 100MB the chance you'll hit the
same 4MB block again is very high.
@sage / mark
How does the aggregation work? Does it work 4MB blockwise or target node
based?
Greets
Stefan
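To put a number on that chance, here is a back-of-the-envelope sketch. It assumes writes are spread uniformly over 4MB RADOS objects; the figures are illustrative only, not measurements:

```shell
# number of 4MB objects covered by each test range
objs_100mb=$(( (100 * 1024 * 1024) / (4 * 1024 * 1024) ))        # 25 objects
objs_200gb=$(( (200 * 1024 * 1024 * 1024) / (4 * 1024 * 1024) )) # 51200 objects

# chance that two independent random 4k writes land in the same 4MB object
awk -v n="$objs_100mb" 'BEGIN { printf "100MB range: %d objects, p = %.4f\n", n, 1/n }'
awk -v n="$objs_200gb" 'BEGIN { printf "200GB range: %d objects, p = %.6f\n", n, 1/n }'
# -> 100MB range: 25 objects, p = 0.0400
# -> 200GB range: 51200 objects, p = 0.000020
```

So within a 100MB range, repeated hits on the same object (and thus aggregation in the cache) are about 2000 times more likely than over 200GB, which is consistent with the iops jump reported above.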
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-01 21:01 ` Stefan Priebe
2012-07-01 21:13 ` Mark Nelson
@ 2012-07-02 13:19 ` Stefan Priebe - Profihost AG
1 sibling, 0 replies; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-07-02 13:19 UTC (permalink / raw)
To: Sage Weil; +Cc: Mark Nelson, ceph-devel
Hello,
i just want to report back some test results.
Just some results from a sheepdog test using the same hardware.
Sheepdog:
1 VM:
write: io=12544MB, bw=142678KB/s, iops=35669, runt= 90025msec
read : io=14519MB, bw=165186KB/s, iops=41296, runt= 90003msec
write: io=16520MB, bw=185842KB/s, iops=45, runt= 91026msec
read : io=102936MB, bw=1135MB/s, iops=283, runt= 90684msec
2 VMs:
write: io=7042MB, bw=80062KB/s, iops=20015, runt= 90062msec
read : io=8672MB, bw=98661KB/s, iops=24665, runt= 90004msec
write: io=14008MB, bw=157443KB/s, iops=38, runt= 91107msec
read : io=43924MB, bw=498462KB/s, iops=121, runt= 90234msec
write: io=6048MB, bw=68772KB/s, iops=17192, runt= 90055msec
read : io=9151MB, bw=104107KB/s, iops=26026, runt= 90006msec
write: io=12716MB, bw=142693KB/s, iops=34, runt= 91253msec
read : io=59616MB, bw=675648KB/s, iops=164, runt= 90353msec
Ceph:
2 VMs:
write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec
write: io=2222MB, bw=25275KB/s, iops=6318, runt= 90011msec
read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec
So ceph has pretty good values for sequential workloads, but for random I/O
it would be really cool to improve it.
Right now my test system has a theoretical 4k random I/O bandwidth of
350,000 iops - 14 disks with 25,000 iops each (also tested with fio).
Greets
Stefan
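The quoted ceiling is just the per-disk figure multiplied out (ignoring replication and journal overhead):

```shell
disks=14
iops_per_disk=25000   # per-disk 4k random write iops, measured locally with fio
echo $(( disks * iops_per_disk ))   # theoretical aggregate: 350000
```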
On 01.07.2012 23:01, Stefan Priebe wrote:
> Hello list,
> Hello sage,
>
> i've made some further tests.
>
> Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops
>
> Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops
>
> When i make random 4k writes over 100MB: 450% CPU usage of kvm process
> and !! 25059 iops !!
>
> Random 4k writes over 1GB: 380% CPU usage of kvm process 14387 iops
>
> So the range where the random I/O happens seems to be important, and the
> cpu usage just seems to reflect the iops.
>
> So i'm not sure if the problem is really the client rbd driver. Mark i
> hope you can make some tests next week.
>
> Greets
> Stefan
>
>
> On 29.06.2012 23:18, Stefan Priebe wrote:
>> On 29.06.2012 17:28, Sage Weil wrote:
>>> On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:
>>>> On 29.06.2012 13:49, Mark Nelson wrote:
>>>>> I'll try to replicate your findings in house. I've got some other
>>>>> things I have to do today, but hopefully I can take a look next
>>>>> week. If
>>>>> I recall correctly, in the other thread you said that sequential
>>>>> writes
>>>>> are using much less CPU time on your systems?
>>>>
>>>> Random 4k writes: 10% idle
>>>> Seq 4k writes: !! 99,7% !! idle
>>>> Seq 4M writes: 90% idle
>>>
>>> I take it 'rbd cache = true'?
>> Yes
>>
>>> It sounds like librbd (or the guest file
>>> system) is coalescing the sequential writes into big writes. I'm a bit
>>> surprised that the 4k ones have lower CPU utilization, but there are
>>> lots of opportunity for noise there, so I wouldn't read too far into it yet.
>> 90 to 99,7 is OK; the 9% goes to flush, kworker and xfs processes. It was
>> the overall system load, not just ceph-osd.
>>
>>>>> Do you see better scaling in that case?
>>>>
>>>> 3 osd nodes:
>>>> 1 VM:
>>>> Rand 4k writes: 7000 iops
>> <-- this one is WRONG! sorry it is 14100 iops
>>
>>
>>>> Seq 4k writes: 19900 iops
>>>>
>>>> 2 VMs:
>>>> Rand 4k writes: 6000 iops each
>>>> Seq 4k writes: 4000 iops VM 1
>>>> Seq 4k writes: 18500 iops VM 2
>>>>
>>>>
>>>> 4 osd nodes:
>>>> 1 VM:
>>>> Rand 4k writes: 14400 iops <------ ????
>>>
>>> Can you double-check this number?
>> Triple checked, BUT i see that the Rand 4k writes with 3 osd nodes was
>> wrong. Sorry.
>>
>>>> Seq 4k writes: 19000 iops
>>>>
>>>> 2 VMs:
>>>> Rand 4k writes: 7000 iops each
>>>> Seq 4k writes: 18000 iops each
>>>
>>> With the exception of that one number above, it really sounds like the
>>> bottleneck is in the client (VM or librbd+librados) and not in the
>>> cluster. Performance won't improve when you add OSDs if the limiting
>>> factor is the client's ability to dispatch/stream/sustain IOs. That also
>>> seems consistent with the fact that limiting the # of CPUs on the OSDs
>>> doesn't affect much.
>> ACK
>>
>>> Above, with 2 VMs, for instance, your total iops for the cluster doubled
>>> (36000 total). Can you try with 4 VMs and see if it continues to
>>> scale in
>>> that dimension? At some point you will start to saturate the OSDs,
>>> and at
>>> that point adding more OSDs should show aggregate throughput going up.
>> From where did you get that value? It scales with VMs on some points but
>> it does not scale with OSDs.
>>
>> Stefan
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-02 6:12 ` Stefan Priebe - Profihost AG
@ 2012-07-02 16:51 ` Gregory Farnum
2012-07-02 19:22 ` Stefan Priebe
0 siblings, 1 reply; 32+ messages in thread
From: Gregory Farnum @ 2012-07-02 16:51 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG
Cc: Alexandre DERUMIER, Sage Weil, ceph-devel, Mark Nelson
On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> On 02.07.2012 07:02, Alexandre DERUMIER wrote:
>
>> Hi,
>> my 2cent,
>> maybe with a lower range (like 100MB) of random I/O,
>> you have a better chance of aggregating them into 4MB blocks?
>
>
> Yes maybe. If you have just a range of 100MB the chance you'll hit the same
> 4MB block again is very high.
>
> @sage / mark
> How does the aggregation work? Does it work 4MB blockwise or target node
> based?
Aggregation is based on the 4MB blocks, and if you've got caching
enabled then it's also not going to flush them out to disk very often
if you're continuously updating the block — I don't remember all the
conditions, but essentially, you'll run into dirty limits and it will
asynchronously flush out the data based on a combination of how old it
is, and how long it's been since some version of it was stable on
disk.
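A toy model of the flush conditions described above (the thresholds and function names here are invented for illustration; librbd's actual logic is more involved):

```shell
# flush when dirty data exceeds a byte limit OR the oldest dirty data is too old
DIRTY_LIMIT_BYTES=$((24 * 1024 * 1024))   # assumed dirty-bytes threshold
MAX_DIRTY_AGE_SEC=2                       # assumed max age before forced flush

should_flush() {  # usage: should_flush <dirty_bytes> <oldest_dirty_age_sec>
    [ "$1" -ge "$DIRTY_LIMIT_BYTES" ] || [ "$2" -ge "$MAX_DIRTY_AGE_SEC" ]
}

should_flush $((32 * 1024 * 1024)) 0 && echo "flush: dirty limit hit"
should_flush 4096 3                  && echo "flush: data too old"
should_flush 4096 0                  || echo "no flush yet"
```

The point of the model: a small, constantly rewritten working set (like the 100MB range) can stay dirty in cache between flushes, so many 4k writes collapse into one object write.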
On Mon, Jul 2, 2012 at 6:19 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Hello,
>
> i just want to report back some test results.
>
> Just some results from a sheepdog test using the same hardware.
>
> Sheepdog:
>
> 1 VM:
> write: io=12544MB, bw=142678KB/s, iops=35669, runt= 90025msec
> read : io=14519MB, bw=165186KB/s, iops=41296, runt= 90003msec
> write: io=16520MB, bw=185842KB/s, iops=45, runt= 91026msec
> read : io=102936MB, bw=1135MB/s, iops=283, runt= 90684msec
>
> 2 VMs:
> write: io=7042MB, bw=80062KB/s, iops=20015, runt= 90062msec
> read : io=8672MB, bw=98661KB/s, iops=24665, runt= 90004msec
> write: io=14008MB, bw=157443KB/s, iops=38, runt= 91107msec
> read : io=43924MB, bw=498462KB/s, iops=121, runt= 90234msec
>
> write: io=6048MB, bw=68772KB/s, iops=17192, runt= 90055msec
> read : io=9151MB, bw=104107KB/s, iops=26026, runt= 90006msec
> write: io=12716MB, bw=142693KB/s, iops=34, runt= 91253msec
> read : io=59616MB, bw=675648KB/s, iops=164, runt= 90353msec
>
>
> Ceph:
> 2 VMs:
> write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
> read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
> write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
> read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec
>
> write: io=2222MB, bw=25275KB/s, iops=6318, runt= 90011msec
> read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
> write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
> read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec
I can't quite tell what's going on here, can you describe the test in
more detail?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-02 16:51 ` Gregory Farnum
@ 2012-07-02 19:22 ` Stefan Priebe
2012-07-02 20:30 ` Josh Durgin
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Priebe @ 2012-07-02 19:22 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Alexandre DERUMIER, Sage Weil, ceph-devel, Mark Nelson
On 02.07.2012 18:51, Gregory Farnum wrote:
> On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> @sage / mark
>> How does the aggregation work? Does it work 4MB blockwise or target node
>> based?
> Aggregation is based on the 4MB blocks, and if you've got caching
> enabled then it's also not going to flush them out to disk very often
> if you're continuously updating the block — I don't remember all the
> conditions, but essentially, you'll run into dirty limits and it will
> asynchronously flush out the data based on a combination of how old it
> is, and how long it's been since some version of it was stable on
> disk.
Is there any way to check if rbd caching works correctly? For me the I/O
values do not change if i switch writeback on or off, and it also doesn't
matter how large i set the cache size.
...
>> Ceph:
>> 2 VMs:
>> write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
>> read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
>> write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
>> read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec
>>
>> write: io=2222MB, bw=25275KB/s, iops=6318, runt= 90011msec
>> read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
>> write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
>> read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec
>
> I can't quite tell what's going on here, can you describe the test in
> more detail?
I've network booted my VM and then run the following command:
export DISK=/dev/vda; (fio --filename=$DISK --direct=1 --rw=randwrite
--bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting
--name=file1;fio --filename=$DISK --direct=1 --rw=randread --bs=4k
--size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1;fio
--filename=$DISK --direct=1 --rw=write --bs=4M --size=200G --numjobs=50
--runtime=90 --group_reporting --name=file1;fio --filename=$DISK
--direct=1 --rw=read --bs=4M --size=200G --numjobs=50 --runtime=90
--group_reporting --name=file1 )|egrep " read| write"
- write random 4k I/O
- read random 4k I/O
- write seq 4M I/O
- read seq 4M I/O
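Unrolled, the four runs collapse into one loop (shown here with echo so the generated commands are visible; drop the echo to actually run it, which requires fio and will overwrite $DISK):

```shell
DISK=/dev/vda
for spec in randwrite:4k randread:4k write:4M read:4M; do
    rw=${spec%%:*}   # fio workload type
    bs=${spec#*:}    # block size
    echo fio --filename="$DISK" --direct=1 --rw="$rw" --bs="$bs" --size=200G \
         --numjobs=50 --runtime=90 --group_reporting --name=file1
done
```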
Stefan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-02 19:22 ` Stefan Priebe
@ 2012-07-02 20:30 ` Josh Durgin
2012-07-03 4:42 ` Alexandre DERUMIER
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Josh Durgin @ 2012-07-02 20:30 UTC (permalink / raw)
To: Stefan Priebe
Cc: Gregory Farnum, Alexandre DERUMIER, Sage Weil, ceph-devel, Mark Nelson
On 07/02/2012 12:22 PM, Stefan Priebe wrote:
> On 02.07.2012 18:51, Gregory Farnum wrote:
>> On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG
>> <s.priebe@profihost.ag> wrote:
>>> @sage / mark
>>> How does the aggregation work? Does it work 4MB blockwise or target node
>>> based?
>> Aggregation is based on the 4MB blocks, and if you've got caching
>> enabled then it's also not going to flush them out to disk very often
>> if you're continuously updating the block — I don't remember all the
>> conditions, but essentially, you'll run into dirty limits and it will
>> asynchronously flush out the data based on a combination of how old it
>> is, and how long it's been since some version of it was stable on
>> disk.
> Is there any way to check if rbd caching works correctly? For me the I/O
> values do not change if i switch writeback on or off, and it also doesn't
> matter how large i set the cache size.
>
> ...
If you add admin_socket=/path/to/admin_socket for your client running
qemu (in that client's ceph.conf section or manually in the qemu
command line) you can check that caching is enabled:
ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
And see statistics it generates (look for cache) with:
ceph --admin-daemon /path/to/admin_socket perfcounters_dump
Josh
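For reference, the client side of ceph.conf might then contain something like the following (the socket path is only an example; it must be unique per process and writable by the qemu user):

```ini
[client]
    # example path - one socket per client process
    admin socket = /var/run/ceph/$name.sock
    rbd cache = true
```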
>>> Ceph:
>>> 2 VMs:
>>> write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
>>> read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
>>> write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
>>> read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec
>>>
>>> write: io=2222MB, bw=25275KB/s, iops=6318, runt= 90011msec
>>> read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
>>> write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
>>> read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec
>>
>> I can't quite tell what's going on here, can you describe the test in
>> more detail?
>
> I've network booted my VM and then run the following command:
> export DISK=/dev/vda; (fio --filename=$DISK --direct=1 --rw=randwrite
> --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting
> --name=file1;fio --filename=$DISK --direct=1 --rw=randread --bs=4k
> --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1;fio
> --filename=$DISK --direct=1 --rw=write --bs=4M --size=200G --numjobs=50
> --runtime=90 --group_reporting --name=file1;fio --filename=$DISK
> --direct=1 --rw=read --bs=4M --size=200G --numjobs=50 --runtime=90
> --group_reporting --name=file1 )|egrep " read| write"
>
> - write random 4k I/O
> - read random 4k I/O
> - write seq 4M I/O
> - read seq 4M I/O
>
> Stefan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-02 20:30 ` Josh Durgin
@ 2012-07-03 4:42 ` Alexandre DERUMIER
2012-07-03 4:42 ` Alexandre DERUMIER
2012-07-03 7:49 ` Stefan Priebe - Profihost AG
2 siblings, 0 replies; 32+ messages in thread
From: Alexandre DERUMIER @ 2012-07-03 4:42 UTC (permalink / raw)
To: Josh Durgin
Cc: Gregory Farnum, Sage Weil, ceph-devel, Mark Nelson, Stefan Priebe
Stefan,
As the fio benchmark uses direct I/O (--direct), maybe the writeback cache is not working?
perfcounters should give us the answer.
----- Original message -----
From: "Josh Durgin" <josh.durgin@inktank.com>
To: "Stefan Priebe" <s.priebe@profihost.ag>
Cc: "Gregory Farnum" <greg@inktank.com>, "Alexandre DERUMIER" <aderumier@odiso.com>, "Sage Weil" <sage@inktank.com>, ceph-devel@vger.kernel.org, "Mark Nelson" <mark.nelson@inktank.com>
Sent: Monday, July 2, 2012 22:30:19
Subject: Re: speedup ceph / scaling / find the bottleneck
On 07/02/2012 12:22 PM, Stefan Priebe wrote:
> On 02.07.2012 18:51, Gregory Farnum wrote:
>> On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG
>> <s.priebe@profihost.ag> wrote:
>>> @sage / mark
>>> How does the aggregation work? Does it work 4MB blockwise or target node
>>> based?
>> Aggregation is based on the 4MB blocks, and if you've got caching
>> enabled then it's also not going to flush them out to disk very often
>> if you're continuously updating the block — I don't remember all the
>> conditions, but essentially, you'll run into dirty limits and it will
>> asynchronously flush out the data based on a combination of how old it
>> is, and how long it's been since some version of it was stable on
>> disk.
> Is there any way to check if rbd caching works correctly? For me the I/O
> values do not change if i switch writeback on or off, and it also doesn't
> matter how large i set the cache size.
>
> ...
If you add admin_socket=/path/to/admin_socket for your client running
qemu (in that client's ceph.conf section or manually in the qemu
command line) you can check that caching is enabled:
ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
And see statistics it generates (look for cache) with:
ceph --admin-daemon /path/to/admin_socket perfcounters_dump
Josh
>>> Ceph:
>>> 2 VMs:
>>> write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
>>> read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
>>> write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
>>> read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec
>>>
>>> write: io=2222MB, bw=25275KB/s, iops=6318, runt= 90011msec
>>> read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
>>> write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
>>> read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec
>>
>> I can't quite tell what's going on here, can you describe the test in
>> more detail?
>
> I've network booted my VM and then run the following command:
> export DISK=/dev/vda; (fio --filename=$DISK --direct=1 --rw=randwrite
> --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting
> --name=file1;fio --filename=$DISK --direct=1 --rw=randread --bs=4k
> --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1;fio
> --filename=$DISK --direct=1 --rw=write --bs=4M --size=200G --numjobs=50
> --runtime=90 --group_reporting --name=file1;fio --filename=$DISK
> --direct=1 --rw=read --bs=4M --size=200G --numjobs=50 --runtime=90
> --group_reporting --name=file1 )|egrep " read| write"
>
> - write random 4k I/O
> - read random 4k I/O
> - write seq 4M I/O
> - read seq 4M I/O
>
> Stefan
--
Alexandre D e rumier
Ingénieur Systèmes et Réseaux
Fixe : 03 20 68 88 85
Fax : 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-02 20:30 ` Josh Durgin
2012-07-03 4:42 ` Alexandre DERUMIER
2012-07-03 4:42 ` Alexandre DERUMIER
@ 2012-07-03 7:49 ` Stefan Priebe - Profihost AG
2012-07-03 15:31 ` Sage Weil
2 siblings, 1 reply; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-07-03 7:49 UTC (permalink / raw)
To: Josh Durgin
Cc: Gregory Farnum, Alexandre DERUMIER, Sage Weil, ceph-devel, Mark Nelson
Hello,
On 02.07.2012 22:30, Josh Durgin wrote:
> If you add admin_socket=/path/to/admin_socket for your client running
> qemu (in that client's ceph.conf section or manually in the qemu
> command line) you can check that caching is enabled:
>
> ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
>
> And see statistics it generates (look for cache) with:
>
> ceph --admin-daemon /path/to/admin_socket perfcounters_dump
This doesn't work for me:
ceph --admin-daemon /var/run/ceph.sock show config
read only got 0 bytes of 4 expected for response length; invalid
command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0)
AdminSocket: request 'show config' not defined
Also perfcounters does not show anything:
# ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
{}
~]# ceph -v
ceph version 0.48argonaut-2-gb576faa
(commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
Stefan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-03 7:49 ` Stefan Priebe - Profihost AG
@ 2012-07-03 15:31 ` Sage Weil
2012-07-03 18:20 ` Stefan Priebe
2012-07-03 19:16 ` Stefan Priebe
0 siblings, 2 replies; 32+ messages in thread
From: Sage Weil @ 2012-07-03 15:31 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG
Cc: Josh Durgin, Gregory Farnum, Alexandre DERUMIER, ceph-devel, Mark Nelson
On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> On 02.07.2012 22:30, Josh Durgin wrote:
> > If you add admin_socket=/path/to/admin_socket for your client running
> > qemu (in that client's ceph.conf section or manually in the qemu
> > command line) you can check that caching is enabled:
> >
> > ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
> >
> > And see statistics it generates (look for cache) with:
> >
> > ceph --admin-daemon /path/to/admin_socket perfcounters_dump
>
> This doesn't work for me:
> ceph --admin-daemon /var/run/ceph.sock show config
> read only got 0 bytes of 4 expected for response length; invalid
> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) AdminSocket:
> request 'show config' not defined
Oh, it's 'config show'. Also, 'help' will list the supported commands.
> Also perfcounters does not show anything:
> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
> {}
There may be another daemon that tried to attach to the same socket file.
You might want to set 'admin socket = /var/run/ceph/$name.sock' or
something similar, or whatever else is necessary to make it a unique file.
> ~]# ceph -v
> ceph version 0.48argonaut-2-gb576faa
> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
Out of curiosity, what patches are you applying on top of the release?
sage
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-03 15:31 ` Sage Weil
@ 2012-07-03 18:20 ` Stefan Priebe
2012-07-05 21:33 ` Gregory Farnum
2012-07-03 19:16 ` Stefan Priebe
1 sibling, 1 reply; 32+ messages in thread
From: Stefan Priebe @ 2012-07-03 18:20 UTC (permalink / raw)
To: Sage Weil
Cc: Josh Durgin, Gregory Farnum, Alexandre DERUMIER, ceph-devel, Mark Nelson
I'm sorry, but this is the KVM host machine; there is no ceph running on
this machine.
If i change the admin socket to:
admin_socket=/var/run/ceph_$name.sock
i don't have any socket at all ;-(
On 03.07.2012 17:31, Sage Weil wrote:
> On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> On 02.07.2012 22:30, Josh Durgin wrote:
>>> If you add admin_socket=/path/to/admin_socket for your client running
>>> qemu (in that client's ceph.conf section or manually in the qemu
>>> command line) you can check that caching is enabled:
>>>
>>> ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
>>>
>>> And see statistics it generates (look for cache) with:
>>>
>>> ceph --admin-daemon /path/to/admin_socket perfcounters_dump
>>
>> This doesn't work for me:
>> ceph --admin-daemon /var/run/ceph.sock show config
>> read only got 0 bytes of 4 expected for response length; invalid
>> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) AdminSocket:
>> request 'show config' not defined
>
> Oh, it's 'config show'. Also, 'help' will list the supported commands.
>
>> Also perfcounters does not show anything:
>> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
>> {}
>
> There may be another daemon that tried to attach to the same socket file.
> You might want to set 'admin socket = /var/run/ceph/$name.sock' or
> something similar, or whatever else is necessary to make it a unique file.
>
>> ~]# ceph -v
>> ceph version 0.48argonaut-2-gb576faa
>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
>
> Out of curiosity, what patches are you applying on top of the release?
>
> sage
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-03 15:31 ` Sage Weil
2012-07-03 18:20 ` Stefan Priebe
@ 2012-07-03 19:16 ` Stefan Priebe
1 sibling, 0 replies; 32+ messages in thread
From: Stefan Priebe @ 2012-07-03 19:16 UTC (permalink / raw)
To: Sage Weil
Cc: Josh Durgin, Gregory Farnum, Alexandre DERUMIER, ceph-devel, Mark Nelson
On 03.07.2012 17:31, Sage Weil wrote:
>> ~]# ceph -v
>> ceph version 0.48argonaut-2-gb576faa
>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
>
> Out of curiosity, what patches are you applying on top of the release?
just wip-filestore-min
Stefan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-03 18:20 ` Stefan Priebe
@ 2012-07-05 21:33 ` Gregory Farnum
2012-07-06 3:50 ` Alexandre DERUMIER
0 siblings, 1 reply; 32+ messages in thread
From: Gregory Farnum @ 2012-07-05 21:33 UTC (permalink / raw)
To: Stefan Priebe; +Cc: ceph-devel, Sage Weil
Could you send over the ceph.conf on your KVM host, as well as how
you're configuring KVM to use rbd?
On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> I'm sorry, but this is the KVM host machine; there is no ceph running on
> this machine.
>
> If i change the admin socket to:
> admin_socket=/var/run/ceph_$name.sock
>
> i don't have any socket at all ;-(
>
> On 03.07.2012 17:31, Sage Weil wrote:
>
>> On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:
>>>
>>> Hello,
>>>
>>> On 02.07.2012 22:30, Josh Durgin wrote:
>>>>
>>>> If you add admin_socket=/path/to/admin_socket for your client running
>>>> qemu (in that client's ceph.conf section or manually in the qemu
>>>> command line) you can check that caching is enabled:
>>>>
>>>> ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
>>>>
>>>> And see statistics it generates (look for cache) with:
>>>>
>>>> ceph --admin-daemon /path/to/admin_socket perfcounters_dump
>>>
>>>
>>> This doesn't work for me:
>>> ceph --admin-daemon /var/run/ceph.sock show config
>>> read only got 0 bytes of 4 expected for response length; invalid
>>> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0)
>>> AdminSocket:
>>> request 'show config' not defined
>>
>>
>> Oh, it's 'config show'. Also, 'help' will list the supported commands.
>>
>>> Also perfcounters does not show anything:
>>> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
>>> {}
>>
>>
>> There may be another daemon that tried to attach to the same socket file.
>> You might want to set 'admin socket = /var/run/ceph/$name.sock' or
>> something similar, or whatever else is necessary to make it a unique file.
>>
>>> ~]# ceph -v
>>> ceph version 0.48argonaut-2-gb576faa
>>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
>>
>>
>> Out of curiousity, what patches are you applying on top of the release?
>>
>> sage
>>
>
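Sage's two corrections above (the command is 'config show', and each daemon needs a unique socket path) combine into a client-side setup like the following sketch. The section name, paths, and use of the $name/$pid metavariables are illustrative, not taken from the posters' actual configs:

```ini
# ceph.conf on the KVM host -- hypothetical [client] section
[client]
    ; $name and $pid expand per process, so qemu and the ceph CLI
    ; each get their own socket instead of colliding on one file
    admin socket = /var/run/ceph/$name.$pid.asok
    rbd cache = true
```

With a guest running, the socket the qemu process created can then be queried as described above: `ceph --admin-daemon /var/run/ceph/client.admin.<pid>.asok config show | grep rbd_cache`, and `perfcounters_dump` for the cache statistics.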
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-05 21:33 ` Gregory Farnum
@ 2012-07-06 3:50 ` Alexandre DERUMIER
2012-07-06 8:54 ` Stefan Priebe
2012-07-06 17:11 ` Gregory Farnum
0 siblings, 2 replies; 32+ messages in thread
From: Alexandre DERUMIER @ 2012-07-06 3:50 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, Sage Weil, Stefan Priebe
Hi,
Stefan is on vacation at the moment; I don't know if he can reply to you.
But I can reply for him on the KVM part (we run the same tests together in parallel).
- kvm is 1.1
- rbd 0.48
- drive option rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
- using writeback
Writeback tuning in ceph.conf on the KVM host:
rbd_cache_size = 33554432
rbd_cache_max_age = 2.0
Benchmark used in the KVM guest:
fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1
Results show at most 14000 IO/s with 1 VM, 7000 IO/s per VM with 2 VMs, ...
so it doesn't scale.
(The bench uses direct I/O, so maybe the writeback cache doesn't help.)
Hardware for ceph is 3 nodes with 4 Intel SSDs each (1 drive can handle 40000 IO/s randwrite locally).
- Alexandre
----- Mail original -----
De: "Gregory Farnum" <greg@inktank.com>
À: "Stefan Priebe" <s.priebe@profihost.ag>
Cc: ceph-devel@vger.kernel.org, "Sage Weil" <sage@inktank.com>
Envoyé: Jeudi 5 Juillet 2012 23:33:18
Objet: Re: speedup ceph / scaling / find the bottleneck
Could you send over the ceph.conf on your KVM host, as well as how
you're configuring KVM to use rbd?
On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> I'm sorry but this is the KVM Host Machine there is no ceph running on this
> machine.
>
> If i change the admin socket to:
> admin_socket=/var/run/ceph_$name.sock
>
> i don't have any socket at all ;-(
>
> Am 03.07.2012 17:31, schrieb Sage Weil:
>
>> On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:
>>>
>>> Hello,
>>>
>>> Am 02.07.2012 22:30, schrieb Josh Durgin:
>>>>
>>>> If you add admin_socket=/path/to/admin_socket for your client running
>>>> qemu (in that client's ceph.conf section or manually in the qemu
>>>> command line) you can check that caching is enabled:
>>>>
>>>> ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
>>>>
>>>> And see statistics it generates (look for cache) with:
>>>>
>>>> ceph --admin-daemon /path/to/admin_socket perfcounters_dump
>>>
>>>
>>> This doesn't work for me:
>>> ceph --admin-daemon /var/run/ceph.sock show config
>>> read only got 0 bytes of 4 expected for response length; invalid
>>> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0)
>>> AdminSocket:
>>> request 'show config' not defined
>>
>>
>> Oh, it's 'config show'. Also, 'help' will list the supported commands.
>>
>>> Also perfcounters does not show anything:
>>> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
>>> {}
>>
>>
>> There may be another daemon that tried to attach to the same socket file.
>> You might want to set 'admin socket = /var/run/ceph/$name.sock' or
>> something similar, or whatever else is necessary to make it a unique file.
>>
>>> ~]# ceph -v
>>> ceph version 0.48argonaut-2-gb576faa
>>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
>>
>>
>> Out of curiousity, what patches are you applying on top of the release?
>>
>> sage
>>
>
--
Alexandre D e rumier
Ingénieur Systèmes et Réseaux
Fixe : 03 20 68 88 85
Fax : 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-06 3:50 ` Alexandre DERUMIER
@ 2012-07-06 8:54 ` Stefan Priebe
2012-07-06 17:11 ` Gregory Farnum
1 sibling, 0 replies; 32+ messages in thread
From: Stefan Priebe @ 2012-07-06 8:54 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: Gregory Farnum, ceph-devel, Sage Weil
Am 06.07.2012 um 05:50 schrieb Alexandre DERUMIER <aderumier@odiso.com>:
> Hi,
> Stefan is on vacation for the moment,I don't know if he can reply you.
Thanks!
>
> But I can reoly for him for the kvm part (as we do same tests together in parallel).
>
> - kvm is 1.1
> - rbd 0.48
> - drive option rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
> -using writeback
>
> writeback tuning in ceph.conf on the kvm host
>
> rbd_cache_size = 33554432
> rbd_cache_max_age = 2.0
Correct
>
> benchmark use in kvm guest:
> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1
>
> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
> so it doesn't scale
Correct too
>
> (bench is with directio, so maybe writeback cache don't help)
>
> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 40000io/s randwrite locally)
It's 30000, not 40000, but still enough.
Stefan
> - Alexandre
>
> ----- Mail original -----
>
> De: "Gregory Farnum" <greg@inktank.com>
> À: "Stefan Priebe" <s.priebe@profihost.ag>
> Cc: ceph-devel@vger.kernel.org, "Sage Weil" <sage@inktank.com>
> Envoyé: Jeudi 5 Juillet 2012 23:33:18
> Objet: Re: speedup ceph / scaling / find the bottleneck
>
> Could you send over the ceph.conf on your KVM host, as well as how
> you're configuring KVM to use rbd?
>
> On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
>> I'm sorry but this is the KVM Host Machine there is no ceph running on this
>> machine.
>>
>> If i change the admin socket to:
>> admin_socket=/var/run/ceph_$name.sock
>>
>> i don't have any socket at all ;-(
>>
>> Am 03.07.2012 17:31, schrieb Sage Weil:
>>
>>> On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:
>>>>
>>>> Hello,
>>>>
>>>> Am 02.07.2012 22:30, schrieb Josh Durgin:
>>>>>
>>>>> If you add admin_socket=/path/to/admin_socket for your client running
>>>>> qemu (in that client's ceph.conf section or manually in the qemu
>>>>> command line) you can check that caching is enabled:
>>>>>
>>>>> ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
>>>>>
>>>>> And see statistics it generates (look for cache) with:
>>>>>
>>>>> ceph --admin-daemon /path/to/admin_socket perfcounters_dump
>>>>
>>>>
>>>> This doesn't work for me:
>>>> ceph --admin-daemon /var/run/ceph.sock show config
>>>> read only got 0 bytes of 4 expected for response length; invalid
>>>> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0)
>>>> AdminSocket:
>>>> request 'show config' not defined
>>>
>>>
>>> Oh, it's 'config show'. Also, 'help' will list the supported commands.
>>>
>>>> Also perfcounters does not show anything:
>>>> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
>>>> {}
>>>
>>>
>>> There may be another daemon that tried to attach to the same socket file.
>>> You might want to set 'admin socket = /var/run/ceph/$name.sock' or
>>> something similar, or whatever else is necessary to make it a unique file.
>>>
>>>> ~]# ceph -v
>>>> ceph version 0.48argonaut-2-gb576faa
>>>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
>>>
>>>
>>> Out of curiousity, what patches are you applying on top of the release?
>>>
>>> sage
>>>
>>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-06 3:50 ` Alexandre DERUMIER
2012-07-06 8:54 ` Stefan Priebe
@ 2012-07-06 17:11 ` Gregory Farnum
2012-07-06 18:09 ` Stefan Priebe - Profihost AG
1 sibling, 1 reply; 32+ messages in thread
From: Gregory Farnum @ 2012-07-06 17:11 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel, Sage Weil, Stefan Priebe
On Thu, Jul 5, 2012 at 8:50 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
> Hi,
> Stefan is on vacation for the moment,I don't know if he can reply you.
>
> But I can reoly for him for the kvm part (as we do same tests together in parallel).
>
> - kvm is 1.1
> - rbd 0.48
> - drive option rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
> -using writeback
>
> writeback tuning in ceph.conf on the kvm host
>
> rbd_cache_size = 33554432
> rbd_cache_max_age = 2.0
>
> benchmark use in kvm guest:
> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1
>
> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
> so it doesn't scale
>
> (bench is with directio, so maybe writeback cache don't help)
>
> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 40000io/s randwrite locally)
I'm interested in figuring out why we aren't getting useful data out
of the admin socket, and for that I need the actual configuration
files. It wouldn't surprise me if there are several layers to this
issue but I'd like to start at the client's endpoint. :)
Regarding the random IO, you shouldn't overestimate your storage.
Under plenty of scenarios your drives are lucky to do more than 2k
IO/s, which is about what you're seeing....
http://techreport.com/articles.x/22415/9
-Greg
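Greg's caution about per-drive limits can be turned into a back-of-the-envelope ceiling. The sketch below is hypothetical arithmetic, not a Ceph model: the replication factor of 2 and the 2x journal write amplification (journal and data on the same device) are assumptions the thread never confirms.

```python
def cluster_write_iops(drives, per_drive_iops, replication, journal_overhead=2.0):
    """Rough ceiling on client-visible random-write IOPS: each client
    write fans out to `replication` OSDs, and each OSD write hits the
    device ~journal_overhead times (journal + data)."""
    return drives * per_drive_iops / (replication * journal_overhead)

# 3 nodes x 4 Intel SSDs, ~30k random-write IOPS each (Stefan's corrected figure)
print(cluster_write_iops(12, 30000, 2))  # -> 90000.0
```

Even under these assumptions the ceiling sits far above the ~7000 IO/s per VM being observed, which is an argument for looking at the client path rather than the drives.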
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-06 17:11 ` Gregory Farnum
@ 2012-07-06 18:09 ` Stefan Priebe - Profihost AG
2012-07-06 18:17 ` Gregory Farnum
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-07-06 18:09 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Alexandre DERUMIER, ceph-devel, Sage Weil
Am 06.07.2012 um 19:11 schrieb Gregory Farnum <greg@inktank.com>:
> On Thu, Jul 5, 2012 at 8:50 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>> Hi,
>> Stefan is on vacation for the moment,I don't know if he can reply you.
>>
>> But I can reoly for him for the kvm part (as we do same tests together in parallel).
>>
>> - kvm is 1.1
>> - rbd 0.48
>> - drive option rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
>> -using writeback
>>
>> writeback tuning in ceph.conf on the kvm host
>>
>> rbd_cache_size = 33554432
>> rbd_cache_max_age = 2.0
>>
>> benchmark use in kvm guest:
>> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1
>>
>> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
>> so it doesn't scale
>>
>> (bench is with directio, so maybe writeback cache don't help)
>>
>> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 40000io/s randwrite locally)
>
> I'm interested in figuring out why we aren't getting useful data out
> of the admin socket, and for that I need the actual configuration
> files. It wouldn't surprise me if there are several layers to this
> issue but I'd like to start at the client's endpoint. :)
While I'm on holiday I can't send you my ceph.conf, but it doesn't contain anything other than the locations, 'journal dio = false' for the tmpfs journal, and the admin socket /var/run/ceph_$name.sock.
>
> Regarding the random IO, you shouldn't overestimate your storage.
> Under plenty of scenarios your drives are lucky to do more than 2k
> IO/s, which is about what you're seeing....
> http://techreport.com/articles.x/22415/9
That holds if the ceph workload is the same as the iometer file-server workload; I don't know whether it is. I've measured the raw random 4k workload. I've also tested adding another OSD and the speed still doesn't change, but with a size of 200 GB I should be hitting several OSD servers.
Stefan
> -Greg
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-06 18:09 ` Stefan Priebe - Profihost AG
@ 2012-07-06 18:17 ` Gregory Farnum
2012-07-09 18:21 ` Stefan Priebe
0 siblings, 1 reply; 32+ messages in thread
From: Gregory Farnum @ 2012-07-06 18:17 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG; +Cc: Alexandre DERUMIER, ceph-devel, Sage Weil
On Fri, Jul 6, 2012 at 11:09 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Am 06.07.2012 um 19:11 schrieb Gregory Farnum <greg@inktank.com>:
>
>> On Thu, Jul 5, 2012 at 8:50 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>> Hi,
>>> Stefan is on vacation for the moment,I don't know if he can reply you.
>>>
>>> But I can reoly for him for the kvm part (as we do same tests together in parallel).
>>>
>>> - kvm is 1.1
>>> - rbd 0.48
>>> - drive option rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
>>> -using writeback
>>>
>>> writeback tuning in ceph.conf on the kvm host
>>>
>>> rbd_cache_size = 33554432
>>> rbd_cache_max_age = 2.0
>>>
>>> benchmark use in kvm guest:
>>> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1
>>>
>>> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
>>> so it doesn't scale
>>>
>>> (bench is with directio, so maybe writeback cache don't help)
>>>
>>> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 40000io/s randwrite locally)
>>
>> I'm interested in figuring out why we aren't getting useful data out
>> of the admin socket, and for that I need the actual configuration
>> files. It wouldn't surprise me if there are several layers to this
>> issue but I'd like to start at the client's endpoint. :)
>
> While I'm on holiday I can't send you my ceph.conf but it doesn't contain anything else than the locations and journal dio false for tmpfs and /var/run/ceph_$name.sock
Is that socket in the global area? Does the KVM process have
permission to access that directory? If you enable logging can you get
any outputs that reference errors opening that file? (I realize you're
on holiday; these are just the questions we'll need answered to get it
working.)
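Greg's checklist (does the socket exist, is it really a socket, can the process reach it) can be sketched as a small diagnostic. The path below is hypothetical; substitute whatever 'admin socket' is set to in the client section of ceph.conf:

```python
import os
import stat

def check_admin_socket(path):
    """Report the usual reasons `ceph --admin-daemon <path>` finds nothing."""
    if not os.path.exists(path):
        return "missing: the daemon never created it (check the ceph.conf section and client logs)"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return "exists but is not a socket: another process may own this path"
    if not os.access(path, os.R_OK | os.W_OK):
        return "socket exists but this user lacks read/write permission"
    return "ok"

print(check_admin_socket("/var/run/ceph_client.sock"))
```

Running this as the same user the KVM process runs as would distinguish a socket that was never created from one the process simply cannot open.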
>
>>
>> Regarding the random IO, you shouldn't overestimate your storage.
>> Under plenty of scenarios your drives are lucky to do more than 2k
>> IO/s, which is about what you're seeing....
>> http://techreport.com/articles.x/22415/9
> You're fine if the ceph workload is the same as the iometer file server workload. I don't know. I've measured the raw random 4k workload. Also I've tested adding another osd and speed still doesn't change but with a size of 200gb I should hit several osd servers.
Okay — just wanted to point it out.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: speedup ceph / scaling / find the bottleneck
2012-07-06 18:17 ` Gregory Farnum
@ 2012-07-09 18:21 ` Stefan Priebe
0 siblings, 0 replies; 32+ messages in thread
From: Stefan Priebe @ 2012-07-09 18:21 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Alexandre DERUMIER, ceph-devel, Sage Weil
Am 06.07.2012 20:17, schrieb Gregory Farnum:
>> Am 06.07.2012 um 19:11 schrieb Gregory Farnum <greg@inktank.com>:
>>> I'm interested in figuring out why we aren't getting useful data out
>>> of the admin socket, and for that I need the actual configuration
>>> files. It wouldn't surprise me if there are several layers to this
>>> issue but I'd like to start at the client's endpoint. :)
>>
>> While I'm on holiday I can't send you my ceph.conf but it doesn't contain anything else than the locations and journal dio false for tmpfs and /var/run/ceph_$name.sock
>
> Is that socket in the global area?
Yes
> Does the KVM process have
> permission to access that directory?
Yes. It is also created if I skip $name and set it to /var/run/ceph.sock.
>>> Regarding the random IO, you shouldn't overestimate your storage.
>>> Under plenty of scenarios your drives are lucky to do more than 2k
>>> IO/s, which is about what you're seeing....
>>> http://techreport.com/articles.x/22415/9
>> You're fine if the ceph workload is the same as the iometer file server workload. I don't know. I've measured the raw random 4k workload. Also I've tested adding another osd and speed still doesn't change but with a size of 200gb I should hit several osd servers.
> Okay — just wanted to point it out.
Thanks. Also, with sheepdog I can get 40000 IOp/s.
Stefan
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2012-07-09 18:21 UTC | newest]
Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-29 10:46 speedup ceph / scaling / find the bottleneck Stefan Priebe - Profihost AG
2012-06-29 11:32 ` Alexandre DERUMIER
2012-06-29 11:49 ` Mark Nelson
2012-06-29 13:02 ` Stefan Priebe - Profihost AG
2012-06-29 13:11 ` Stefan Priebe - Profihost AG
2012-06-29 13:16 ` Stefan Priebe - Profihost AG
2012-06-29 13:22 ` Stefan Priebe - Profihost AG
2012-06-29 15:28 ` Sage Weil
2012-06-29 21:18 ` Stefan Priebe
2012-07-01 21:01 ` Stefan Priebe
2012-07-01 21:13 ` Mark Nelson
2012-07-01 21:27 ` Stefan Priebe
2012-07-02 5:02 ` Alexandre DERUMIER
2012-07-02 6:12 ` Stefan Priebe - Profihost AG
2012-07-02 16:51 ` Gregory Farnum
2012-07-02 19:22 ` Stefan Priebe
2012-07-02 20:30 ` Josh Durgin
2012-07-03 4:42 ` Alexandre DERUMIER
2012-07-03 4:42 ` Alexandre DERUMIER
2012-07-03 7:49 ` Stefan Priebe - Profihost AG
2012-07-03 15:31 ` Sage Weil
2012-07-03 18:20 ` Stefan Priebe
2012-07-05 21:33 ` Gregory Farnum
2012-07-06 3:50 ` Alexandre DERUMIER
2012-07-06 8:54 ` Stefan Priebe
2012-07-06 17:11 ` Gregory Farnum
2012-07-06 18:09 ` Stefan Priebe - Profihost AG
2012-07-06 18:17 ` Gregory Farnum
2012-07-09 18:21 ` Stefan Priebe
2012-07-03 19:16 ` Stefan Priebe
2012-07-02 13:19 ` Stefan Priebe - Profihost AG
2012-06-29 12:33 ` Stefan Priebe - Profihost AG