* Slow ceph fs performance
@ 2012-09-26 14:50 Bryan K. Wright
2012-09-26 15:26 ` Mark Nelson
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-26 14:50 UTC (permalink / raw)
To: ceph-devel
Hi folks,
I'm seeing reasonable performance when I run rados
benchmarks, but really slow I/O when reading or writing
from a mounted ceph filesystem. The rados benchmarks
show about 150 MB/s for both read and write, but when I
go to a client machine with a mounted ceph filesystem
and try to rsync a large (60 GB) directory tree onto
the ceph fs, I'm getting rates of only 2-5 MB/s.
The OSDs and MDSs are all running 64-bit CentOS 6.3
with the stock CentOS 2.6.32 kernel. The client is also
64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
There are four OSDs, each with a hardware RAID 5 array
and an SSD for the OSD journal. The primary network
is a gigabit network, and the OSD, MDS and MON
machines have a dedicated backend gigabit network on a
second network interface.
Locally on the OSD, "hdparm -t -T" reports read rates
of ~350 MB/s, and bonnie++ shows:
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
osd-local 23800M 1037 99 316048 92 131023 19 2272 98 312781 21 521.0 24
Latency 13103us 183ms 123ms 15316us 100ms 75899us
Version 1.96 ------Sequential Create------ --------Random Create--------
osd-local -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 16817 55 +++++ +++ 28786 77 23890 78 +++++ +++ 27128 75
Latency 21549us 105us 134us 902us 12us 104us
While rsyncing the files, the ceph logs show lots
of warnings of the form:
[WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
Snooping on traffic with wireshark shows bursts of
activity separated by long periods (30-60 sec) of idle time.
My first thought was that I was seeing a kind of
"bufferbloat". The SSDs are 120 GB, so they could easily contain
enough data to take a long time to dump. I changed to using a
journal file, limited to 1 GB, but I still see the same slow
behavior.
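A back-of-the-envelope sketch (my assumed rates, not measurements) of why shrinking the journal alone might not help: at the observed client rates, even a 1 GB journal takes minutes to empty.

```python
# Back-of-the-envelope drain times (assumed rates, not measurements):
# even a 1 GB journal takes minutes to empty at the observed MB/s.
def drain_seconds(journal_bytes, drain_rate_mb_s):
    """Seconds to flush a full journal at the given backing-store rate."""
    return journal_bytes / (drain_rate_mb_s * 1024**2)

one_gb = 1024**3
print(round(drain_seconds(one_gb, 5)))  # -> 205 (at 5 MB/s)
print(round(drain_seconds(one_gb, 2)))  # -> 512 (at 2 MB/s)
```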
Any advice about how to go about debugging this would
be appreciated.
Thanks,
Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-26 14:50 Slow ceph fs performance Bryan K. Wright
@ 2012-09-26 15:26 ` Mark Nelson
2012-09-26 20:54 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: Mark Nelson @ 2012-09-26 15:26 UTC (permalink / raw)
To: bryan; +Cc: Bryan K. Wright, ceph-devel
On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
> Hi folks,
Hi Bryan!
>
> I'm seeing reasonable performance when I run rados
> benchmarks, but really slow I/O when reading or writing
> from a mounted ceph filesystem. The rados benchmarks
> show about 150 MB/s for both read and write, but when I
> go to a client machine with a mounted ceph filesystem
> and try to rsync a large (60 GB) directory tree onto
> the ceph fs, I'm getting rates of only 2-5 MB/s.
Was the rados benchmark run from the same client machine that the
filesystem is being mounted on? Also, what object size did you use for
rados bench? Does the directory tree have a lot of small files or a few
very large ones?
>
> The OSDs and MDSs are all running 64-bit CentOS 6.3
> with the stock CentOS 2.6.32 kernel. The client is also
> 64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
> There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal. The primary network
> is a gigabit network, and the OSD, MDS and MON
> machines have a dedicated backend gigabit network on a
> second network interface.
>
> Locally on the OSD, "hdparm -t -T" reports read rates
> of ~350 MB/s, and bonnie++ shows:
>
> Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> osd-local 23800M 1037 99 316048 92 131023 19 2272 98 312781 21 521.0 24
> Latency 13103us 183ms 123ms 15316us 100ms 75899us
> Version 1.96 ------Sequential Create------ --------Random Create--------
> osd-local -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 16 16817 55 +++++ +++ 28786 77 23890 78 +++++ +++ 27128 75
> Latency 21549us 105us 134us 902us 12us 104us
>
>
> While rsyncing the files, the ceph logs show lots
> of warnings of the form:
>
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
>
> Snooping on traffic with wireshark shows bursts of
> activity separated by long periods (30-60 sec) of idle time.
>
My guess here is that if there is a lot of small IO happening, your SSD
journal is handling it well and probably writing data really quickly,
while your spinning disk raid5 probably can't sustain anywhere near the
required IOPs to keep up. So you get a burst of network traffic and the
journal writes it to the SSD quickly until it is filled up, then the OSD
stalls while it waits for the raid5 to write data out. Whenever the
journal flushes, a new burst of traffic comes in and the process repeats.
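That cycle can be sketched with a toy model (all rates assumed for illustration; a real OSD overlaps filling and flushing, so treat this as a rough bound, not an exact prediction):

```python
# Toy model of the fill/stall cycle described above. All numbers are
# assumed for illustration; a real OSD overlaps filling and flushing.
def cycle_throughput_mb_s(journal_mb, fill_mb_s, drain_mb_s):
    """Long-run client throughput if the journal alternately fills at
    SSD speed and then drains to the RAID at its small-write rate."""
    fill_t = journal_mb / fill_mb_s    # burst: client -> SSD journal
    drain_t = journal_mb / drain_mb_s  # stall: journal -> RAID 5
    return journal_mb / (fill_t + drain_t)

# e.g. a 1 GB journal, 200 MB/s into the SSD, but only 4 MB/s of small
# random writes out to the RAID: the SSD speed barely matters.
print(round(cycle_throughput_mb_s(1024, 200, 4), 2))  # -> 3.92
```

The long-run rate is pinned near the slower (drain) side no matter how fast the SSD absorbs the bursts.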
> My first thought was that I was seeing a kind of
> "bufferbloat". The SSDs are 120 GB, so they could easily contain
> enough data to take a long time to dump. I changed to using a
> journal file, limited to 1 GB, but I still see the same slow
> behavior.
>
> Any advice about how to go about debugging this would
> be appreciated.
It'd probably be useful to look at the write sizes going to disk.
Increasing debugging levels in the Ceph logs will give you that, but it
can be a lot to parse. You can also use something like iostat or
collectl to see what the per-second average write sizes are.
>
> Thanks,
> Bryan
>
Mark
* Re: Slow ceph fs performance
2012-09-26 15:26 ` Mark Nelson
@ 2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-26 20:54 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel
Hi Mark,
Thanks for your help. Some answers to your questions
are below.
mark.nelson@inktank.com said:
> On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
> Hi folks,
> Hi Bryan!
> >
> I'm seeing reasonable performance when I run rados
> benchmarks, but really slow I/O when reading or writing
> from a mounted ceph filesystem. The rados benchmarks
> show about 150 MB/s for both read and write, but when I
> go to a client machine with a mounted ceph filesystem
> and try to rsync a large (60 GB) directory tree onto
> the ceph fs, I'm getting rates of only 2-5 MB/s.
> Was the rados benchmark run from the same client machine that the filesystem
> is being mounted on? Also, what object size did you use for rados bench?
> Does the directory tree have a lot of small files or a few very large ones?
The rados benchmark was run on one of the OSD
machines. Read and write results looked like this (the
object size was just the default, which seems to be 4kB):
# rados bench -p pbench 900 write
Total time run: 900.549729
Total writes made: 33819
Write size: 4194304
Bandwidth (MB/sec): 150.215
Stddev Bandwidth: 16.2592
Max bandwidth (MB/sec): 212
Min bandwidth (MB/sec): 84
Average Latency: 0.426028
Stddev Latency: 0.24688
Max latency: 1.59936
Min latency: 0.06794
# rados bench -p pbench 900 seq
Total time run: 900.572788
Total reads made: 33676
Read size: 4194304
Bandwidth (MB/sec): 149.576
Average Latency: 0.427844
Max latency: 1.48576
Min latency: 0.015371
Regarding the rsync test, yes, the directory tree
was mostly small files.
> >
> The OSDs and MDSs are all running 64-bit CentOS 6.3
> with the stock CentOS 2.6.32 kernel. The client is also
> 64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
> There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal. The primary network
> is a gigabit network, and the OSD, MDS and MON
> machines have a dedicated backend gigabit network on a
> second network interface.
> Locally on the OSD, "hdparm -t -T" reports read rates
> of ~350 MB/s, and bonnie++ shows:
>
> Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> osd-local 23800M 1037 99 316048 92 131023 19 2272 98 312781 21 521.0 24
> Latency 13103us 183ms 123ms 15316us 100ms 75899us
> Version 1.96 ------Sequential Create------ --------Random Create--------
> osd-local -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 16 16817 55 +++++ +++ 28786 77 23890 78 +++++ +++ 27128 75
> Latency 21549us 105us 134us 902us 12us 104us
> >
> While rsyncing the files, the ceph logs show lots
> of warnings of the form:
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
> Snooping on traffic with wireshark shows bursts of
> activity separated by long periods (30-60 sec) of idle time.
> My guess here is that if there is a lot of small IO happening, your SSD
> journal is handling it well and probably writing data really quickly, while
> your spinning disk raid5 probably can't sustain anywhere near the required
> IOPs to keep up. So you get a burst of network traffic and the journal
> writes it to the SSD quickly until it is filled up, then the OSD stalls while
> it waits for the raid5 to write data out. Whenever the journal flushes, a
> new burst of traffic comes in and the process repeats.
That sure sounds reasonable. Maybe I can play some more
with the journal size and location to see how it affects the
speed and burstiness.
> My first thought was that I was seeing a kind of
> "bufferbloat". The SSDs are 120 GB, so they could easily contain
> enough data to take a long time to dump. I changed to using a
> journal file, limited to 1 GB, but I still see the same slow
> behavior.
> Any advice about how to go about debugging this would
> be appreciated.
> It'd probably be useful to look at the write sizes going to disk. Increasing
> debugging levels in the Ceph logs will give you that, but it can be a lot to
> parse. You can also use something like iostat or collectl to see what the
> per-second average write sizes are.
I'll see what I can find out. Here's a quick output
from iostat (on one of the OSD hosts) while an rsync was running:
avg-cpu: %user %nice %system %iowait %steal %idle
0.23 0.00 0.20 0.21 0.00 99.36
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdm 0.96 5.82 19.94 4523588 15495690
sdn 9.96 1.51 1080.91 1174143 839900311
sdb 0.00 0.00 0.00 2248 0
sdc 0.00 0.00 0.00 2248 0
sde 0.00 0.00 0.00 2248 0
sda 0.00 0.00 0.00 2248 0
sdf 0.00 0.00 0.00 2248 0
sdi 0.00 0.00 0.00 2248 0
sdl 0.00 0.00 0.00 2248 0
sdg 0.00 0.00 0.00 2248 0
sdj 0.00 0.00 0.00 2248 0
sdh 0.00 0.00 0.00 2248 0
sdd 0.00 0.00 0.00 2248 0
sdk 0.00 0.00 0.00 2248 0
dm-0 0.00 0.00 0.00 2616 0
dm-1 2.14 5.81 19.80 4512994 15387832
sdo 96.83 305.85 3156.74 237658672 2452896474
dm-2 0.00 0.00 0.00 800 48
The relevant lines are "sdo", which is the RAID array where
the object store lives, and "sdn", which is the journal SSD.
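Assuming iostat's default 512-byte sectors, the average write size can be estimated from the tps and Blk_wrtn/s columns (only approximate, since tps also counts reads):

```python
# Estimate average write size per request from the iostat columns above,
# assuming iostat's default 512-byte sectors. Approximate, because the
# tps column counts reads as well as writes.
def avg_write_kb(blk_wrtn_per_s, tps):
    return blk_wrtn_per_s / tps * 512 / 1024

print(round(avg_write_kb(3156.74, 96.83), 1))  # sdo (RAID):    -> 16.3
print(round(avg_write_kb(1080.91, 9.96), 1))   # sdn (journal): -> 54.3
```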
> >
> Thanks,
> Bryan
> Mark
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
* Re: Slow ceph fs performance
2012-09-26 20:54 ` Bryan K. Wright
@ 2012-09-27 15:16 ` Bryan K. Wright
2012-09-27 18:04 ` Gregory Farnum
2012-09-27 23:40 ` Mark Kirkwood
2 siblings, 0 replies; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-27 15:16 UTC (permalink / raw)
To: ceph-devel
Hi folks,
I'm still struggling to get decent performance out of
cephfs. I've played around with journal size and location,
but I/O rates to the mounted ceph filesystem always hover in
the range of 2-6 MB/sec while rsyncing a large directory tree
onto the ceph fs. In contrast, using rsync over ssh to copy
the same tree on to the same RAID array on one of the OSDs gives
a rate of about 34 MB/sec.
Here's a time/sequence plot from wireshark showing
what the traffic looks like from the client's perspective
while rsyncing onto the ceph fs:
http://ayesha.phys.virginia.edu/~bryan/time-sequence-ceph-2.png
As you can see, most of the time is spent in long
waits between bursts of packets. Using a small journal file
instead of a whole SSD seems to slightly reduce the delays,
but not by much. What other tunable parameters should I be
trying?
Looking at outgoing network rates on the client
with iptraf, I see the following while rsyncing over ssh:
Rate: ~300Mb/s, ~8k packets/s --> ~40kb/packet
While rsyncing to the ceph fs, I see:
Rate: ~50Mb/s, ~1k packets/s --> ~50kb/packet
(i.e., the average packet size is about the same, but
about eight times fewer packets are being sent per unit
time.)
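A quick check of that arithmetic (Mb and kb here are megabits and kilobits):

```python
# Sanity-check the per-packet figures above (Mb/kb = megabits/kilobits).
def kbits_per_packet(rate_mbit_s, packets_per_s):
    return rate_mbit_s * 1000.0 / packets_per_s

print(kbits_per_packet(300, 8000))  # rsync over ssh: -> 37.5
print(kbits_per_packet(50, 1000))   # rsync to cephfs: -> 50.0
```

So the per-packet size is indeed comparable in both cases; the difference is almost entirely in how many packets per second get sent.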
Looking at ops in flight on one of the OSDs,
using "ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok
dump_ops_in_flight", I see:
{ "num_ops": 3,
"ops": [
{ "description": "pg_log(0.8 epoch 12 query_epoch 12)",
"received_at": "2012-09-27 10:54:08.070493",
"age": "66.673834",
"flag_point": "delayed"},
{ "description": "pg_log(1.7 epoch 12 query_epoch 12)",
"received_at": "2012-09-27 10:54:08.070715",
"age": "66.673612",
"flag_point": "delayed"},
{ "description": "pg_log(2.6 epoch 12 query_epoch 12)",
"received_at": "2012-09-27 10:54:08.070750",
"age": "66.673577",
"flag_point": "delayed"}]}
Thanks for any advice.
Bryan
bkw1a@ayesha.phys.virginia.edu said:
> Hi folks,
> I'm seeing reasonable performance when I run rados benchmarks, but really
> slow I/O when reading or writing from a mounted ceph filesystem. The rados
> benchmarks show about 150 MB/s for both read and write, but when I go to a
> client machine with a mounted ceph filesystem and try to rsync a large (60 GB)
> directory tree onto the ceph fs, I'm getting rates of only 2-5 MB/s.
> The OSDs and MDSs are all running 64-bit CentOS 6.3 with the stock CentOS
> 2.6.32 kernel. The client is also 64-bit CentOS 6.3, but it's running the
> "elrepo" 3.5.4 kernel. There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal. The primary network is a gigabit network, and
> the OSD, MDS and MON machines have a dedicated backend gigabit network on a
> second network interface.
> Locally on the OSD, "hdparm -t -T" reports read rates of ~350 MB/s, and
> bonnie++ shows:
> Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> osd-local 23800M 1037 99 316048 92 131023 19 2272 98 312781 21 521.0 24
> Latency 13103us 183ms 123ms 15316us 100ms 75899us
> Version 1.96 ------Sequential Create------ --------Random Create--------
> osd-local -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 16 16817 55 +++++ +++ 28786 77 23890 78 +++++ +++ 27128 75
> Latency 21549us 105us 134us 902us 12us 104us
> While rsyncing the files, the ceph logs show lots of warnings of the form:
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26
> 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write
> 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
> Snooping on traffic with wireshark shows bursts of activity separated by
> long periods (30-60 sec) of idle time.
> My first thought was that I was seeing a kind of "bufferbloat". The SSDs are
> 120 GB, so they could easily contain enough data to take a long time to dump.
> I changed to using a journal file, limited to 1 GB, but I still see the same
> slow behavior.
> Any advice about how to go about debugging this would be appreciated.
> Thanks,
> Bryan
* Re: Slow ceph fs performance
2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
@ 2012-09-27 18:04 ` Gregory Farnum
2012-09-27 18:47 ` Bryan K. Wright
2012-10-01 16:47 ` Tommi Virtanen
2012-09-27 23:40 ` Mark Kirkwood
2 siblings, 2 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-09-27 18:04 UTC (permalink / raw)
To: bryan; +Cc: Mark Nelson, ceph-devel
On Wed, Sep 26, 2012 at 1:54 PM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
> Hi Mark,
>
> Thanks for your help. Some answers to your questions
> are below.
>
> mark.nelson@inktank.com said:
>> On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
>> Hi folks,
>> Hi Bryan!
>> >
>> I'm seeing reasonable performance when I run rados
>> benchmarks, but really slow I/O when reading or writing
>> from a mounted ceph filesystem. The rados benchmarks
>> show about 150 MB/s for both read and write, but when I
>> go to a client machine with a mounted ceph filesystem
>> and try to rsync a large (60 GB) directory tree onto
>> the ceph fs, I'm getting rates of only 2-5 MB/s.
>> Was the rados benchmark run from the same client machine that the filesystem
>> is being mounted on? Also, what object size did you use for rados bench?
>> Does the directory tree have a lot of small files or a few very large ones?
>
> The rados benchmark was run on one of the OSD
> machines. Read and write results looked like this (the
> object size was just the default, which seems to be 4kB):
Actually, that's 4MB. ;) Can you run
# rados bench -p pbench 900 write -t 256 -b 4096
and see what that gets? It'll run 256 simultaneous 4KB writes. (You
can also vary the number of simultaneous writes and see if that
impacts it.)
However, my suspicion is that you're limited by metadata throughput
here. How large are your files? There might be some MDS or client
tunables we can adjust, but rsync's workload is a known weak spot for
CephFS.
-Greg
* Re: Slow ceph fs performance
2012-09-27 18:04 ` Gregory Farnum
@ 2012-09-27 18:47 ` Bryan K. Wright
2012-09-27 19:47 ` Gregory Farnum
2012-10-01 16:47 ` Tommi Virtanen
1 sibling, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-27 18:47 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel
greg@inktank.com said:
> >
> The rados benchmark was run on one of the OSD
> machines. Read and write results looked like this (the
> object size was just the default, which seems to be 4kB):
> Actually, that's 4MB. ;)
Oops! My plea is that I was the victim of a
man page bug:
bench seconds mode [ -b objsize ] [ -t threads ]
Benchmark for seconds. The mode can be write or read. The
default object size is 4 KB, and the default number of simulated
threads (parallel writes) is 16.
> Can you run # rados bench -p pbench 900 write -t 256
> -b 4096 and see what that gets? It'll run 256 simultaneous 4KB writes. (You
> can also vary the number of simultaneous writes and see if that impacts it.)
Here's the new benchmark output:
Total time run: 900.880070
Total writes made: 537187
Write size: 4096
Bandwidth (MB/sec): 2.329
Stddev Bandwidth: 2.57691
Max bandwidth (MB/sec): 12.6055
Min bandwidth (MB/sec): 0
Average Latency: 0.429315
Stddev Latency: 0.891734
Max latency: 19.7647
Min latency: 0.016743
> However, my suspicion is that you're limited by metadata throughput here. How
> large are your files? There might be some MDS or client tunables we can
> adjust, but rsync's workload is a known weak spot for CephFS. -Greg
The file size is generally small. Here's the distribution:
http://ayesha.phys.virginia.edu/~bryan/filesize.png
The mean is about 2.5 MB.
Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
* Re: Slow ceph fs performance
2012-09-27 18:47 ` Bryan K. Wright
@ 2012-09-27 19:47 ` Gregory Farnum
0 siblings, 0 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-09-27 19:47 UTC (permalink / raw)
To: Bryan K. Wright; +Cc: ceph-devel
On Thu, Sep 27, 2012 at 11:47 AM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
>
> greg@inktank.com said:
>> >
>> The rados benchmark was run on one of the OSD
>> machines. Read and write results looked like this (the
>> object size was just the default, which seems to be 4kB):
>> Actually, that's 4MB. ;)
>
> Oops! My plea is that I was the victim of a
> man page bug:
>
> bench seconds mode [ -b objsize ] [ -t threads ]
> Benchmark for seconds. The mode can be write or read. The
> default object size is 4 KB, and the default number of simulated
> threads (parallel writes) is 16.
Whoops! I'd fix it but it's obfuscated somewhat now, so:
http://tracker.newdream.net/issues/3230
>
>
>> Can you run # rados bench -p pbench 900 write -t 256
>> -b 4096 and see what that gets? It'll run 256 simultaneous 4KB writes. (You
>> can also vary the number of simultaneous writes and see if that impacts it.)
>
> Here's the new benchmark output:
>
> Total time run: 900.880070
> Total writes made: 537187
> Write size: 4096
> Bandwidth (MB/sec): 2.329
>
> Stddev Bandwidth: 2.57691
> Max bandwidth (MB/sec): 12.6055
> Min bandwidth (MB/sec): 0
> Average Latency: 0.429315
> Stddev Latency: 0.891734
> Max latency: 19.7647
> Min latency: 0.016743
Hmm, that is significantly lower than I would have expected. Can you
check and see if you can get that number higher by increasing (or
decreasing) the number of in-flight ops? (-t param)
Given your size distribution, it could just be that your RAID arrays
aren't giving you the small random write throughput you expect.
>> However, my suspicion is that you're limited by metadata throughput here. How
>> large are your files? There might be some MDS or client tunables we can
>> adjust, but rsync's workload is a known weak spot for CephFS. -Greg
>
> The file size is generally small. Here's the distribution:
>
> http://ayesha.phys.virginia.edu/~bryan/filesize.png
>
> The mean is about 2.5 MB.
So that chart is measuring in KB? Anyway, it might be metadata — you
could see what the CPU usage on the MDS server looks like while
running the rsync.
--
* Re: Slow ceph fs performance
2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
2012-09-27 18:04 ` Gregory Farnum
@ 2012-09-27 23:40 ` Mark Kirkwood
2012-09-27 23:49 ` Mark Kirkwood
2 siblings, 1 reply; 23+ messages in thread
From: Mark Kirkwood @ 2012-09-27 23:40 UTC (permalink / raw)
To: bryan; +Cc: Bryan K. Wright, Mark Nelson, ceph-devel
Bryan -
Note that the default block size for the rados bench is 4MB...and
performance decreases quite dramatically with smaller block sizes (-b
option to rados bench).
On 27/09/12 08:54, Bryan K. Wright wrote:
>
> The rados benchmark was run on one of the OSD
> machines. Read and write results looked like this (the
> object size was just the default, which seems to be 4kB):
>
> # rados bench -p pbench 900 write
> Total time run: 900.549729
> Total writes made: 33819
> Write size: 4194304
> Bandwidth (MB/sec): 150.215
>
> Stddev Bandwidth: 16.2592
> Max bandwidth (MB/sec): 212
> Min bandwidth (MB/sec): 84
> Average Latency: 0.426028
> Stddev Latency: 0.24688
> Max latency: 1.59936
> Min latency: 0.06794
>
> # rados bench -p pbench 900 seq
> Total time run: 900.572788
> Total reads made: 33676
> Read size: 4194304
> Bandwidth (MB/sec): 149.576
>
> Average Latency: 0.427844
> Max latency: 1.48576
> Min latency: 0.015371
>
>
>
>
* Re: Slow ceph fs performance
2012-09-27 23:40 ` Mark Kirkwood
@ 2012-09-27 23:49 ` Mark Kirkwood
2012-09-28 12:22 ` mark seger
0 siblings, 1 reply; 23+ messages in thread
From: Mark Kirkwood @ 2012-09-27 23:49 UTC (permalink / raw)
To: bryan; +Cc: Bryan K. Wright, Mark Nelson, ceph-devel
Sorry Bryan - I should have read further down the thread and noted that
you have this figured out... nothing to see here!
On 28/09/12 11:40, Mark Kirkwood wrote:
> Bryan -
>
> Note that the default block size for the rados bench is 4MB...and
> performance decreases quite dramatically with smaller block sizes (-b
> option to rados bench).
>
* Re: Slow ceph fs performance
2012-09-27 23:49 ` Mark Kirkwood
@ 2012-09-28 12:22 ` mark seger
2012-10-01 15:41 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: mark seger @ 2012-09-28 12:22 UTC (permalink / raw)
To: ceph-devel
I realize I'm a little late to this party, but since collectl was mentioned
I thought I'd jump in. ;)
Whenever I do any file system testing I also have a copy of collectl running in
another window.
Just looking at total transfer times can end up taking you down
the wrong path. What if there are long stalls and very bursty I/O? It could be a
starved resource or a network issue that has nothing to do with the disks at all.
As for iostat, while you're certainly welcome to use it and I based the collectl
output display format on it, I'd highly recommend using iostat -x to see
wait/service times as those can be key to seeing what's happening.
Also, if you use collectl instead with "-sD --home" you'll basically see the
output in a top-like format, making it real easy to see what's happening.
Further, if you apply the right filter you can simply watch a single disk, line by
line without any pesky headers in your way.
-mark
* Re: Slow ceph fs performance
2012-09-28 12:22 ` mark seger
@ 2012-10-01 15:41 ` Bryan K. Wright
2012-10-01 16:43 ` Mark Nelson
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-01 15:41 UTC (permalink / raw)
To: ceph-devel
Hi again,
I've fiddled around a lot with journal settings, so
to make sure I'm comparing apples to apples, I went back and
systematically re-ran the benchmark tests I've been running
(and some more). A long data dump follows, but the end result
is that it does look like something fishy is going on for small
file sizes. For example, the performance difference between 4 MB
and 4 KB objects in the rados write benchmark is a factor of 25 or
more. Here are the details, with a recap of the configuration
at the end.
I started out by remaking the underlying xfs filesystems
on the OSD hosts, and then rerunning mkcephfs. The journals
are 120 GB SSDs.
First, the rsync tests again:
* Rsync of ~60 GB directory tree (mostly small files) from ceph client
to mounted cephfs goes at about 5.2 MB/s.
* I then turned off ceph (service ceph -a stop) and did the same
rsync between the same two hosts, onto the same RAID array on
one of the OSD hosts, but using ssh this time. This time it
goes at about 37 MB/s.
This implies to me that the slowdown is somewhere in ceph, not in
the RAID array or the network connectivity.
I then remade the xfs filesystems again, re-ran mkcephfs,
restarted ceph and did some rados benchmarks.
* rados bench -p pbench 900 write -t 256 -b 4096
Total time run: 900.184096
Total writes made: 1052511
Write size: 4096
Bandwidth (MB/sec): 4.567
Stddev Bandwidth: 4.34241
Max bandwidth (MB/sec): 23.1719
Min bandwidth (MB/sec): 0
Average Latency: 0.218949
Stddev Latency: 0.566181
Max latency: 9.92952
Min latency: 0.001449
* rados bench -p pbench 900 write -t 256 (default 4MB size)
Total time run: 900.816140
Total writes made: 25263
Write size: 4194304
Bandwidth (MB/sec): 112.178
Stddev Bandwidth: 27.1239
Max bandwidth (MB/sec): 840
Min bandwidth (MB/sec): 0
Average Latency: 9.08281
Stddev Latency: 0.505372
Max latency: 9.31865
Min latency: 0.818949
I repeated each of these benchmarks three times, but saw
similar results each time (a factor of 25 or more in speed between
small and large object sizes).
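Just to quantify that factor from the two runs above:

```python
# Ratio between the default (4 MB) and 4 KB rados bench write results above.
large_mb_s = 112.178  # 4 MB objects, -t 256
small_mb_s = 4.567    # 4 KB objects, -t 256
print(round(large_mb_s / small_mb_s, 1))  # -> 24.6
```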
Next, I stopped ceph and took a look at local RAID
performance as a function of file size using "iozone":
http://ayesha.phys.virginia.edu/~bryan/iozone-write-local-raid.pdf
Then I re-made the ceph filesystem and restarted ceph, and used
iozone on the ceph client to look at the mounted ceph filesystem:
http://ayesha.phys.virginia.edu/~bryan/iozone-write-cephfs.pdf
I'm not sure how to interpret the iozone performance numbers,
but the distribution certainly looks much less uniform across
different file and chunk sizes for the mounted ceph filesystem.
Finally, I took a look at the results of bonnie++
benchmarks for I/O directly to the RAID array, or to the
mounted ceph filesystem.
* Looking at RAID array from one of the OSD hosts:
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
RAID on OSD 23800M 1155 99 318264 26 132959 19 2884 99 293464 20 535.4 23
Latency 7354us 30955us 129ms 8220us 119ms 62188us
Version 1.96 ------Sequential Create------ --------Random Create--------
RAID on OSD -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 17680 58 +++++ +++ 26994 78 24715 81 +++++ +++ 26597 78
Latency 113us 105us 153us 109us 15us 94us
* Looking at the mounted ceph filesystem from the ceph client:
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cephfs, client 16G 1101 95 114623 8 45713 2 2665 98 133537 3 882.0 14
Latency 44515us 37018us 6437ms 12747us 469ms 60004us
Version 1.96 ------Sequential Create------ --------Random Create--------
cephfs, client -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 653 3 19886 9 601 3 746 3 +++++ +++ 585 2
Latency 1171ms 7467us 174ms 104ms 19us 228ms
This seems to show about a factor of 3 difference in speed between
writing to the mounted ceph filesystem and writing directly to the RAID
array.
While I was doing these, I kept an eye on the OSDs and MDSs
with collectl and atop, but I didn't see anything that looked
like an obvious problem. The MDSs didn't see very high CPU, I/O
or memory usage, for example.
Finally, to recap the configuration:
3 MDS hosts
4 OSD hosts, each with a RAID array for object storage and an SSD journal
xfs filesystems for the object stores
gigabit network on the front end, and a separate back end gigabit network for the ceph hosts.
64-bit CentOS 6.3 and ceph 0.48.2 everywhere
ceph servers running stock CentOS 2.6.32-279.9.1 kernel.
client running "elrepo" 3.5.4-1 kernel.
Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
* Re: Slow ceph fs performance
2012-10-01 15:41 ` Bryan K. Wright
@ 2012-10-01 16:43 ` Mark Nelson
0 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-10-01 16:43 UTC (permalink / raw)
To: bryan; +Cc: Bryan K. Wright, ceph-devel
On 10/01/2012 10:41 AM, Bryan K. Wright wrote:
> Hi again,
>
Hello!
> I've fiddled around a lot with journal settings, so
> to make sure I'm comparing apples to apples, I went back and
> systematically re-ran the benchmark tests I've been running
> (and some more). A long data dump follows, but the end result
> is that it does look like something fishy is going on for small
> file sizes. For example, performance difference between 4MB
> and 4KB files in the rados write benchmark is a factor of 25 or
> more. Here are the details, with a recap of the configuration
> at the end.
>
Probably one of the most important things to think about when dealing
with small IOs on spinning disks is how well the operating system / file
system combine small writes into larger ones. With spinning disks you
get so few iops to work with that your throughput is almost entirely
governed by seek behavior. There are many possible reasons for slow
performance, but this should always be something you keep in mind during
your tests.
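As a rough illustration of that seek budget (the 150 IOPS figure below is an assumed value for a typical 7200rpm spindle, not a measurement from this cluster):

```shell
# Assumed: ~150 random IOPS for a 7200rpm disk, and every 4 KB write
# costing one seek. Throughput is then IOPS * IO size:
iops=150
io_kb=4
echo "~$(( iops * io_kb )) KB/s ceiling for 4 KB random writes"   # ~600 KB/s
# At 4 MB per write the same seek budget carries 1024x the data, so the
# disk becomes bandwidth-bound (~100+ MB/s) long before it is seek-bound.
```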
> I started out by remaking the underlying xfs filesystems
> on the OSD hosts, and then rerunning mkcephfs. The journals
> are 120 GB SSDs.
>
> First, the rsync tests again:
>
> * Rsync of ~60 GB directory tree (mostly small files) from ceph client
> to mounted cephfs goes at about 5.2 MB/s.
>
When you were doing this, what kind of results did collectl give you for
average write sizes to the underlying OSD disks?
> * I then turned off ceph (service ceph -a stop) and did the same
> rsync between the same two hosts, onto the same RAID array on
> one of the OSD hosts, but using ssh this time. This time it
> goes at about 37 MB/s.
>
> This implies to me that the slowdown is somewhere in ceph, not in
> the RAID array or the network connectivity.
>
There are potentially multiple issues here. Part of it might be how
writes are coalesced by XFS in each scenario. Part of it might also be
overhead due to XFS metadata reads/writes. You could probably get a
better idea of both by running blktrace during the tests and making
seekwatcher movies of the results. You can look not only at the number
of seeks, but also at their kind (reads vs. writes) and where on the
disk they land. That, plus some of the raw blktrace data, can give you
a lot of information about what is going on and whether or not the
seeks are related to metadata.
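A minimal sketch of that blktrace/seekwatcher workflow (device and output names are placeholders; blktrace needs root and has to run on the OSD host while the benchmark is underway):

```shell
# Trace the OSD's data disk while the rsync or rados bench test runs.
blktrace -d /dev/sdb -o osd-trace &
TRACE_PID=$!
# ... run the benchmark from the client ...
kill "$TRACE_PID"

# Turn the binary trace into readable events (seek pattern, R/W mix, offsets).
blkparse -i osd-trace -o osd-trace.txt

# Plot seeks and throughput over time, or render a movie of head movement.
seekwatcher -t osd-trace -o osd-trace.png
seekwatcher -t osd-trace -o osd-trace.mpg --movie
```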
Beyond that, I do think you are correct in suspecting that there are
some Ceph limitations as well. Some things that may be interesting to try:
- 1 OSD per Disk
- Multiple OSDs on the RAID array.
- Increasing various thread counts
- Increasing various op and byte limits (such as
journal_max_write_entries and journal_max_write_bytes).
- EXT4 or BTRFS under the OSDs.
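For reference, the journal knobs mentioned above live in ceph.conf; a hypothetical fragment (values are illustrative only, and option names should be checked against the 0.48 documentation before use) might look like:

```ini
[osd]
    # batch more journal entries/bytes per journal write
    journal max write entries = 1000
    journal max write bytes = 1073741824
    # "various thread counts"
    osd op threads = 4
    filestore op threads = 4
```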
> I then remade the xfs filesystems again, re-ran mkcephfs,
> restarted ceph and did some rados benchmarks.
>
> * rados bench -p pbench 900 write -t 256 -b 4096
> Total time run: 900.184096
> Total writes made: 1052511
> Write size: 4096
> Bandwidth (MB/sec): 4.567
>
> Stddev Bandwidth: 4.34241
> Max bandwidth (MB/sec): 23.1719
> Min bandwidth (MB/sec): 0
> Average Latency: 0.218949
> Stddev Latency: 0.566181
> Max latency: 9.92952
> Min latency: 0.001449
>
XFS does pretty poorly with RADOS bench at small IO sizes from what I've
seen. EXT4 and BTRFS tend to do better, but probably not more than 2-3
times better.
>
> * rados bench -p pbench 900 write -t 256 (default 4MB size)
> Total time run: 900.816140
> Total writes made: 25263
> Write size: 4194304
> Bandwidth (MB/sec): 112.178
>
> Stddev Bandwidth: 27.1239
> Max bandwidth (MB/sec): 840
> Min bandwidth (MB/sec): 0
> Average Latency: 9.08281
> Stddev Latency: 0.505372
> Max latency: 9.31865
> Min latency: 0.818949
>
I imagine your max throughput for 4MB IOs is being limited by the
network here. You may be able to get higher aggregate performance by
running rados bench on multiple clients concurrently.
> I repeated each of these benchmarks three times, but saw
> similar results each time (a factor of 25 or more in speed between
> small and large object sizes).
>
> Next, I stopped ceph and took a look at local RAID
> performance as a function of file size using "iozone":
>
> http://ayesha.phys.virginia.edu/~bryan/iozone-write-local-raid.pdf
>
> Then I re-made the ceph filesystem and restarted ceph, and used
> iozone on the ceph client to look at the mounted ceph filesystem:
>
> http://ayesha.phys.virginia.edu/~bryan/iozone-write-cephfs.pdf
>
Do you happen to have the settings you used when you ran these tests? I
probably don't have time to try to repeat them now, but I can at least
take a quick look at them.
> I'm not sure how to interpret the iozone performance numbers,
> but the distribution certainly looks much less uniform across
> different file and chunk sizes for the mounted ceph filesystem.
>
Indeed. Some of that is to be expected just because of the increased
complexity and number of ways that things can get backed up in a
distributed system like Ceph. Having said that, the trench in the
middle of the Ceph distribution is interesting. I wouldn't mind digging
into that more.
I'm slightly confused by the labels on the graph. They can't possibly
mean that 2^16384 KB record sizes were tested. Was that just up to 16MB
records and 16GB files? That would make a lot more sense.
> Finally, I took a look at the results of bonnie++
> benchmarks for I/O directly to the RAID array, or to the
> mounted ceph filesystem.
>
> * Looking at RAID array from one of the OSD hosts:
> Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> RAID on OSD 23800M 1155 99 318264 26 132959 19 2884 99 293464 20 535.4 23
> Latency 7354us 30955us 129ms 8220us 119ms 62188us
> Version 1.96 ------Sequential Create------ --------Random Create--------
> RAID on OSD -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 16 17680 58 +++++ +++ 26994 78 24715 81 +++++ +++ 26597 78
> Latency 113us 105us 153us 109us 15us 94us
>
> * Looking at the mounted ceph filesystem from the ceph client:
> Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> cephfs, client 16G 1101 95 114623 8 45713 2 2665 98 133537 3 882.0 14
> Latency 44515us 37018us 6437ms 12747us 469ms 60004us
> Version 1.96 ------Sequential Create------ --------Random Create--------
> cephfs, client -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 16 653 3 19886 9 601 3 746 3 +++++ +++ 585 2
> Latency 1171ms 7467us 174ms 104ms 19us 228ms
>
> This seems to show about a factor of 3 difference in speed between
> writing to the mounted ceph filesystem and writing directly to the RAID
> array.
This might be a dumb question, but was the ceph version of this test on
a single client on gigabit Ethernet? If so, wouldn't that be the reason
you are maxing out at like 114MB/s?
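The arithmetic behind that ceiling, for what it's worth (the overhead percentage is a rough assumption):

```shell
# A 1 Gb/s link moves at most:
echo "$(( 1000000000 / 8 / 1000000 )) MB/s raw"   # 125 MB/s
# After Ethernet/IP/TCP framing overhead (roughly 5-6%), practical payload
# throughput lands around 110-118 MB/s, consistent with the 114623 K/sec
# block-write figure above.
```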
>
> While I was doing these, I kept an eye on the OSDs and MDSs
> with collectl and atop, but I didn't see anything that looked
> like an obvious problem. The MDSs didn't see very high CPU, I/O
> or memory usage, for example.
>
> Finally, to recap the configuration:
>
> 3 MDS hosts
> 4 OSD hosts, each with a RAID array for object storage and an SSD journal
> xfs filesystems for the object stores
> gigabit network on the front end, and a separate back end gigabit network for the ceph hosts.
> 64-bit CentOS 6.3 and ceph 0.48.2 everywhere
> ceph servers running stock CentOS 2.6.32-279.9.1 kernel.
> client running "elrepo" 3.5.4-1 kernel.
>
> Bryan
>
Mark
* Re: Slow ceph fs performance
2012-09-27 18:04 ` Gregory Farnum
2012-09-27 18:47 ` Bryan K. Wright
@ 2012-10-01 16:47 ` Tommi Virtanen
2012-10-01 17:00 ` Gregory Farnum
2012-10-01 17:03 ` Mark Nelson
1 sibling, 2 replies; 23+ messages in thread
From: Tommi Virtanen @ 2012-10-01 16:47 UTC (permalink / raw)
To: Gregory Farnum; +Cc: bryan, Mark Nelson, ceph-devel
On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum <greg@inktank.com> wrote:
> However, my suspicion is that you're limited by metadata throughput
> here. How large are your files? There might be some MDS or client
> tunables we can adjust, but rsync's workload is a known weak spot for
> CephFS.
I feel like people are missing this part of Greg's message. Everyone
is so busy benchmarking RADOS small I/O, but what if it's currently
bottlenecked by all the file-level access operations that interact
with the MDS? Rsync causes a ton of those.
If you want to benchmark just the small IO, you can't compare rsync to rsync.
If you want to benchmark just the metadata part, rsync with 0-size
files might actually be an interesting workload.
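One way to sketch that metadata-only workload (a hypothetical harness, not something from the thread; for the real test, DEST would point at the cephfs mount, e.g. /mnt/tmp, while here both sides are local temp dirs):

```shell
# Build a tree of 10,000 empty files, then rsync it. Because every file is
# 0 bytes, the copy is almost pure create/stat metadata traffic.
SRC=$(mktemp -d)
DEST=$(mktemp -d)        # point this at the cephfs mount for the real test
for d in $(seq 1 100); do
    mkdir -p "$SRC/dir$d"
    for f in $(seq 1 100); do
        : > "$SRC/dir$d/file$f"
    done
done
time rsync -a "$SRC/" "$DEST/"
find "$DEST" -type f | wc -l     # 10000 files copied
```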
* Re: Slow ceph fs performance
2012-10-01 16:47 ` Tommi Virtanen
@ 2012-10-01 17:00 ` Gregory Farnum
2012-10-03 14:55 ` Bryan K. Wright
2012-10-01 17:03 ` Mark Nelson
1 sibling, 1 reply; 23+ messages in thread
From: Gregory Farnum @ 2012-10-01 17:00 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: bryan, Mark Nelson, ceph-devel
On Mon, Oct 1, 2012 at 9:47 AM, Tommi Virtanen <tv@inktank.com> wrote:
> On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum <greg@inktank.com> wrote:
>> However, my suspicion is that you're limited by metadata throughput
>> here. How large are your files? There might be some MDS or client
>> tunables we can adjust, but rsync's workload is a known weak spot for
>> CephFS.
>
> I feel like people are missing this part of Greg's message. Everyone
> is so busy benchmarking RADOS small I/O, but what if it's currently
> bottlenecked by all the file-level access operations that interact
> with the MDS? Rsync causes a ton of those.
Yes. Bryan, you mentioned that you didn't see a lot of resource usage
— was it perhaps flatlined at (100 * 1 / num_cpus)? The MDS is
multi-threaded in theory, but in practice it has the equivalent of a
Big Kernel Lock so it's not going to get much past one cpu core of
time...
The rados bench results do indicate some pretty bad small-file write
performance as well though, so I guess it's possible your testing is
running long enough that the page cache isn't absorbing that hit. Did
performance start out higher or has it been flat?
> If you want to benchmark just the small IO, you can't compare rsync to rsync.
>
> If you want to benchmark just the metadata part, rsync with 0-size
> files might actually be an interesting workload.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Slow ceph fs performance
2012-10-01 16:47 ` Tommi Virtanen
2012-10-01 17:00 ` Gregory Farnum
@ 2012-10-01 17:03 ` Mark Nelson
1 sibling, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-10-01 17:03 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Gregory Farnum, bryan, ceph-devel
On 10/01/2012 11:47 AM, Tommi Virtanen wrote:
> On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum <greg@inktank.com> wrote:
>> However, my suspicion is that you're limited by metadata throughput
>> here. How large are your files? There might be some MDS or client
>> tunables we can adjust, but rsync's workload is a known weak spot for
>> CephFS.
>
> I feel like people are missing this part of Greg's message. Everyone
> is so busy benchmarking RADOS small I/O, but what if it's currently
> bottlenecked by all the file-level access operations that interact
> with the MDS? Rsync causes a ton of those.
>
> If you want to benchmark just the small IO, you can't compare rsync to rsync.
>
> If you want to benchmark just the metadata part, rsync with 0-size
> files might actually be an interesting workload.
I guess most of the small IO testing we've seen/done has been without
CephFS at all. It's entirely possible that the MDS is slowing things
down with an rsync workload like this on a fresh filesystem though.
Having said that, I don't like the way that our small IO performance
behaves (especially over time) when doing something like RADOS Bench.
It definitely seems like there is some pretty nasty underlying
filesystem metadata fragmentation or something going on after a while.
Mark
* Re: Slow ceph fs performance
2012-10-01 17:00 ` Gregory Farnum
@ 2012-10-03 14:55 ` Bryan K. Wright
2012-10-03 18:35 ` Gregory Farnum
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-03 14:55 UTC (permalink / raw)
To: ceph-devel
Hi again,
A few answers to questions from various people on the list
after my last e-mail:
greg@inktank.com said:
> Yes. Bryan, you mentioned that you didn't see a lot of resource usage -- was it
> perhaps flatlined at (100 * 1 / num_cpus)? The MDS is multi-threaded in
> theory, but in practice it has the equivalent of a Big Kernel Lock so it's not
> going to get much past one cpu core of time...
The CPU usage on the MDSs hovered around a few percent.
They're quad-core machines, and I didn't see it ever get as high
as 25% usage on any of the cores while watching with atop.
greg@inktank.com said:
> The rados bench results do indicate some pretty bad small-file write
> performance as well though, so I guess it's possible your testing is running
> long enough that the page cache isn't absorbing that hit. Did performance
> start out higher or has it been flat?
Looking at the details of the rados benchmark output, it does
look like performance starts out better for the first few iterations,
and then goes bad. Here's the beginning of a typical small-file run:
Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 255 3683 3428 13.3894 13.3906 0.002569 0.0696906
2 256 7561 7305 14.2661 15.1445 0.106437 0.0669534
3 256 10408 10152 13.2173 11.1211 0.002176 0.0689543
4 256 11256 11000 10.741 3.3125 0.002097 0.0846414
5 256 11256 11000 8.5928 0 - 0.0846414
6 256 11370 11114 7.23489 0.222656 0.002399 0.0962989
7 255 12480 12225 6.82126 4.33984 0.117658 0.142335
8 256 13289 13033 6.36311 3.15625 0.002574 0.151261
9 256 13737 13481 5.85051 1.75 0.120657 0.158865
10 256 14341 14085 5.50138 2.35938 0.022544 0.178298
I see the same behavior every time I repeat the small-file
rados benchmark. Here's a graph showing the first 100 "cur MB/s" values
for a short-file benchmark:
http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf
On the other hand, with 4MB files, I see results that start out like
this:
Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 49 49 0 0 0 - 0
2 76 76 0 0 0 - 0
3 105 105 0 0 0 - 0
4 133 133 0 0 0 - 0
5 159 159 0 0 0 - 0
6 188 188 0 0 0 - 0
7 218 218 0 0 0 - 0
8 246 246 0 0 0 - 0
9 256 274 18 7.99904 8 8.97759 8.66218
10 255 301 46 18.3978 112 9.1456 8.94095
11 255 330 75 27.2695 116 9.06968 9.013
12 255 358 103 34.3292 112 9.12486 9.04374
Here's a graph showing the first 100 "cur MB/s" values for a typical
4MB file benchmark:
http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf
mark.nelson@inktank.com said:
> When you were doing this, what kind of results did collectl give you for
> average write sizes to the underlying OSD disks?
The average "rwsize" reported by collectl hovered around
6 +/- a few (in whatever units collectl reports) for the RAID
array, and around 15 for the journal SSD, while doing the small-file
rados benchmark. Here's a screenshot showing atop running on
each of the MDS hosts, and collectl running on each of the OSD
hosts, while the benchmark was running:
http://ayesha.phys.virginia.edu/~bryan/collectl-atop-t256-b4096.png
Here's the same, but with collectl running on the MDSs instead of atop:
http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4096.png
Looking at the last screenshot again, it does look like the disks on
the MDSs are getting some exercise, with ~40% utilization (if I'm
interpreting the collectl output correctly).
Here's a similar snapshot for the 4MB test:
http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4MB.png
It looks like similar "pct util" on the MDS disks, but much higher
average rwsize values on the OSDs.
mark.nelson@inktank.com said:
> There's multiple issues potentially here. Part of it might be how writes are
> coalesced by XFS in each scenario. Part of it might also be overhead due to
> XFS metadata reads/writes. You could probably get a better idea of both of
> these by running blktrace during the tests and making seekwatcher movies of
> the results. You not only can look at the numbers of seeks, but also the
> kind (read/writes) and where on the disk they are going. That, and some of
> the raw blktrace data can give you a lot of information about what is going
> on and whether or not seeks are related to metadata.
I'll take a look at blktrace and see what I can find out.
mark.nelson@inktank.com said:
> Beyond that, I do think you are correct in suspecting that there are some
> Ceph limitations as well. Some things that may be interesting to try:
> - 1 OSD per Disk - Multiple OSDs on the RAID array. - Increasing various
> thread counts - Increasing various op and byte limits (such as
> journal_max_write_entries and journal_max_write_bytes). - EXT4 or BTRFS under
> the OSDs.
And I'll give some of these a try.
Regarding the iozone benchmarks:
mark.nelson@inktank.com said:
> Do you happen to have the settings you used when you ran these tests? I
> probably don't have time to try to repeat them now, but I can at least take a
> quick look at them.
> I'm slightly confused by the labels on the graph. They can't possibly mean
> that 2^16384 KB record sizes were tested. Was that just up to 16MB records
> and 16GB files? That would make a lot more sense.
I just did something like:
cd /mnt/tmp (where the cephfs was mounted)
iozone -a > /tmp/iozone.log
By default, iozone does its tests in the current working directory.
The graphs were just produced with the Generate_Graphs script
that comes with iozone. There are certainly some problems with
the axis labeling, but I think your interpretation is correct.
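For what it's worth, iozone's auto mode can be bounded so the axes come out easier to read; something like the following (flag meanings per the iozone man page; the size limits are arbitrary choices, not what was run above):

```shell
cd /mnt/tmp          # the cephfs mount; iozone works in the current directory
# -a  auto mode        -g  maximum file size    -q  maximum record size
# -R  Excel-style report   -b  write the report to a file
iozone -a -g 1G -q 16m -R -b /tmp/iozone-cephfs.xls
```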
mark.nelson@inktank.com said:
> This might be a dumb question, but was the ceph version of this test on a
> single client on gigabit Ethernet? If so, wouldn't that be the reason you
> are maxing out at like 114MB/s?
Duh. You're exactly right. I should have noticed this.
And finally:
tv@inktank.com said:
> If you want to benchmark just the metadata part, rsync with 0-size files might
> actually be an interesting workload.
I'll see if I can work out a way to do this.
Thanks to everyone for the suggestions.
Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
* Re: Slow ceph fs performance
2012-10-03 14:55 ` Bryan K. Wright
@ 2012-10-03 18:35 ` Gregory Farnum
2012-10-04 13:14 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: Gregory Farnum @ 2012-10-03 18:35 UTC (permalink / raw)
To: bryan; +Cc: ceph-devel
I think I'm with Mark now — this does indeed look like too much random
IO for the disks to handle. In particular, Ceph requires that each
write be synced to disk before it's considered complete, which rsync
definitely doesn't. In the filesystem this is generally disguised
fairly well by all the caches and such in the way, but this use case
is unfriendly to that arrangement.
However, I am particularly struck by seeing one of your OSDs at 96%
disk utilization while the others remain <50%, and I've just realized
we never saw output from ceph -s. Can you provide that, please?
-Greg
On Wed, Oct 3, 2012 at 7:55 AM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
> [Bryan's message of 2012-10-03 quoted in full; trimmed -- see the full message above]
* Re: Slow ceph fs performance
2012-10-03 18:35 ` Gregory Farnum
@ 2012-10-04 13:14 ` Bryan K. Wright
2012-10-04 15:24 ` Sage Weil
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-04 13:14 UTC (permalink / raw)
To: ceph-devel
Hi Greg,
greg@inktank.com said:
> I think I'm with Mark now -- this does indeed look like too much random IO for
> the disks to handle. In particular, Ceph requires that each write be synced to
> disk before it's considered complete, which rsync definitely doesn't. In the
> filesystem this is generally disguised fairly well by all the caches and such
> in the way, but this use case is unfriendly to that arrangement.
> However, I am particularly struck by seeing one of your OSDs at 96% disk
> utilization while the others remain <50%, and I've just realized we never saw
> output from ceph -s. Can you provide that, please?
Here's the ceph -s output:
health HEALTH_OK
monmap e1: 3 mons at {0=192.168.1.31:6789/0,1=192.168.1.32:6789/0,2=192.168.1.33:6789/0}, election epoch 2, quorum 0,1,2 0,1,2
osdmap e24: 4 osds: 4 up, 4 in
pgmap v8363: 960 pgs: 960 active+clean; 15099 MB data, 38095 MB used, 74354 GB / 74391 GB avail
mdsmap e25: 1/1/1 up {0=2=up:active}, 2 up:standby
The OSD disk utilization seems to vary a lot during these
benchmarks. My recollection is that each of the OSD hosts sometimes
sees near-100% utilization.
Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
* Re: Slow ceph fs performance
2012-10-04 13:14 ` Bryan K. Wright
@ 2012-10-04 15:24 ` Sage Weil
2012-10-04 15:54 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: Sage Weil @ 2012-10-04 15:24 UTC (permalink / raw)
To: bryan; +Cc: ceph-devel
On Thu, 4 Oct 2012, Bryan K. Wright wrote:
> Hi Greg,
>
> greg@inktank.com said:
> > I think I'm with Mark now ? this does indeed look like too much random IO for
> > the disks to handle. In particular, Ceph requires that each write be synced to
> > disk before it's considered complete, which rsync definitely doesn't. In the
> > filesystem this is generally disguised fairly well by all the caches and such
> > in the way, but this use case is unfriendly to that arrangement.
>
> > However, I am particularly struck by seeing one of your OSDs at 96% disk
> > utilization while the others remain <50%, and I've just realized we never saw
> > output from ceph -s. Can you provide that, please?
>
> Here's the ceph -s output:
> [snipped; quoted in full above]
Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump'
output, so we can make sure CRUSH is distributing things well?
Thanks!
sage
* Re: Slow ceph fs performance
2012-10-04 15:24 ` Sage Weil
@ 2012-10-04 15:54 ` Bryan K. Wright
2012-10-26 20:48 ` Gregory Farnum
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-04 15:54 UTC (permalink / raw)
To: ceph-devel
Hi Sage,
sage@inktank.com said:
> Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump'
> output? So we can make sure CRUSH is distributing things well?
Here they are:
# ceph osd tree
dumped osdmap tree epoch 24
# id weight type name up/down reweight
-1 4 pool default
-3 4 rack unknownrack
-2 1 host ceph-osd-1
1 1 osd.1 up 1
-4 1 host ceph-osd-2
2 1 osd.2 up 1
-5 1 host ceph-osd-3
3 1 osd.3 up 1
-6 1 host ceph-osd-4
4 1 osd.4 up 1
# ceph osd dump
dumped osdmap epoch 24
epoch 24
fsid 7e4e4302-4ced-439e-9786-49e6036dfda4
created 2012-09-28 13:17:40.774580
modifed 2012-09-28 16:56:02.864965
flags
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
max_osd 5
osd.1 up in weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [10,15) 192.168.1.21:6800/3702 192.168.12.21:6800/3702 192.168.12.21:6801/3702 exists,up 4ad0b4cd-cbff-4693-b8f7-667148386cf3
osd.2 up in weight 1 up_from 17 up_thru 21 down_at 16 last_clean_interval [8,15) 192.168.1.22:6800/3428 192.168.12.22:6800/3428 192.168.12.22:6801/3428 exists,up 6a829cc6-fc60-450a-ac1d-8e148b757e57
osd.3 up in weight 1 up_from 21 up_thru 21 down_at 20 last_clean_interval [9,15) 192.168.1.23:6800/3436 192.168.12.23:6800/3436 192.168.12.23:6801/3436 exists,up 387cff7a-b857-434b-af66-0e08f56fd0f7
osd.4 up in weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [9,15) 192.168.1.24:6800/3486 192.168.12.24:6800/3486 192.168.12.24:6801/3486 exists,up fe8c4bf0-ff6f-41e9-91ac-d5826672f8b5
# ceph pg dump
See http://ayesha.phys.virginia.edu/~bryan/ceph-pg-dump.txt
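Since the question is whether CRUSH is spreading load evenly, one rough check on a pg dump is to count, per OSD, how many PGs it holds and how many it serves as primary (the first OSD in each acting set). A minimal sketch, using a small synthetic sample of acting sets in place of the real dump (parsing the actual `ceph pg dump` text format is left out):

```python
from collections import Counter

# Synthetic (pgid, acting set) pairs standing in for real `ceph pg dump` rows;
# the first OSD in each acting set is the primary for that PG.
sample = [
    ("0.1", [1, 2]), ("0.2", [3, 4]), ("0.3", [2, 1]),
    ("0.4", [4, 3]), ("0.5", [1, 3]), ("0.6", [2, 4]),
]

primary_count = Counter(acting[0] for _, acting in sample)
total_count = Counter(osd for _, acting in sample for osd in acting)

print(dict(primary_count))  # PGs for which each OSD is primary
print(dict(total_count))    # PGs each OSD holds at all
```

With 960 PGs over 4 equally weighted OSDs, each OSD should land near 480 total PGs and 240 primaries; a large skew here would explain one disk saturating while the others idle.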
Bryan
* Re: Slow ceph fs performance
2012-10-04 15:54 ` Bryan K. Wright
@ 2012-10-26 20:48 ` Gregory Farnum
2012-10-29 15:08 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: Gregory Farnum @ 2012-10-26 20:48 UTC (permalink / raw)
To: bryan; +Cc: ceph-devel
On Thu, Oct 4, 2012 at 8:54 AM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
> Hi Sage,
>
> sage@inktank.com said:
>> Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump'
>> output? So we can make sure CRUSH is distributing things well?
>
> Here they are:
> [osd tree, osd dump, and pg dump output snipped; quoted in full above]
Eeek, I was going through my email backlog and came across this thread
again. Everything here does look good; the data distribution etc. is
pretty reasonable.
If you're still testing, we can at least get a rough idea of the sorts
of IO the OSD is doing by looking at the perfcounters out of the admin
socket:
ceph --admin-daemon /path/to/socket perf dump
(I believe the default path is /var/run/ceph/ceph-osd.*.asok)
* Re: Slow ceph fs performance
2012-10-26 20:48 ` Gregory Farnum
@ 2012-10-29 15:08 ` Bryan K. Wright
2012-11-03 17:55 ` Gregory Farnum
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-29 15:08 UTC (permalink / raw)
To: Gregory Farnum; +Cc: bryan, ceph-devel
greg@inktank.com said:
> [message snipped; quoted in full above]
Hi Greg,
Thanks for your help. I've been experimenting with other things,
so the cluster has a different arrangement now, but the performance
seems to be about the same. I've now broken down the RAID arrays into
JBOD disks, and I'm running one OSD per disk, recklessly ignoring
the warning about syncfs being missing. (Performance doesn't seem
any better or worse than it was before when rsyncing a large directory
of small files.) I've also added another osd node into the mix, with
a different disk controller.
For what it's worth, here are "perf dump" outputs for a
couple of OSDs running on the old and new hardware, respectively:
http://ayesha.phys.virginia.edu/~bryan/perf.osd.200.txt
http://ayesha.phys.virginia.edu/~bryan/perf.osd.100.txt
If you could take a look at them and let me know if you see
anything enlightening, I'd really appreciate it.
Thanks,
Bryan
* Re: Slow ceph fs performance
2012-10-29 15:08 ` Bryan K. Wright
@ 2012-11-03 17:55 ` Gregory Farnum
0 siblings, 0 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-11-03 17:55 UTC (permalink / raw)
To: Bryan K. Wright, Samuel Just; +Cc: bryan, ceph-devel
On Mon, Oct 29, 2012 at 4:08 PM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
> [earlier exchange snipped; quoted in full above]
> For what it's worth, here are "perf dump" outputs for a
> couple of OSDs running on the old and new hardware, respectively:
>
> http://ayesha.phys.virginia.edu/~bryan/perf.osd.200.txt
> http://ayesha.phys.virginia.edu/~bryan/perf.osd.100.txt
>
> If you could take a look at them and let me know if you see
> anything enlightening, I'd really appreciate it.
Sam, can you check these out? I notice in particular that the average
"apply_latency" is 1.44 seconds — but I don't know if I have the units
right on that or have parsed something else wrong.
-Greg
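For reference, the latency entries in perf dump output are cumulative counters of the form {"avgcount": N, "sum": S}, with the sum in seconds, so the average comes from dividing the two rather than being reported directly. A minimal sketch; the JSON fragment below is invented to illustrate the shape, not taken from Bryan's dumps:

```python
import json

# Illustrative fragment in the shape of `perf dump` output (values invented).
raw = '{"filestore": {"apply_latency": {"avgcount": 1000, "sum": 1440.0}}}'
perf = json.loads(raw)

counter = perf["filestore"]["apply_latency"]
avg_seconds = counter["sum"] / counter["avgcount"]
print(avg_seconds)  # average seconds per transaction applied to the fs
```

If the real dumps do average out near 1.44 s per apply, that alone would cap a sync-heavy small-file rsync at the few-MB/s rates seen at the start of the thread.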
Thread overview: 23+ messages
2012-09-26 14:50 Slow ceph fs performance Bryan K. Wright
2012-09-26 15:26 ` Mark Nelson
2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
2012-09-27 18:04 ` Gregory Farnum
2012-09-27 18:47 ` Bryan K. Wright
2012-09-27 19:47 ` Gregory Farnum
2012-10-01 16:47 ` Tommi Virtanen
2012-10-01 17:00 ` Gregory Farnum
2012-10-03 14:55 ` Bryan K. Wright
2012-10-03 18:35 ` Gregory Farnum
2012-10-04 13:14 ` Bryan K. Wright
2012-10-04 15:24 ` Sage Weil
2012-10-04 15:54 ` Bryan K. Wright
2012-10-26 20:48 ` Gregory Farnum
2012-10-29 15:08 ` Bryan K. Wright
2012-11-03 17:55 ` Gregory Farnum
2012-10-01 17:03 ` Mark Nelson
2012-09-27 23:40 ` Mark Kirkwood
2012-09-27 23:49 ` Mark Kirkwood
2012-09-28 12:22 ` mark seger
2012-10-01 15:41 ` Bryan K. Wright
2012-10-01 16:43 ` Mark Nelson