* Ceph performance improvement
@ 2012-08-22  8:54 Denis Fondras
  2012-08-22 10:24 ` David McBride
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Denis Fondras @ 2012-08-22  8:54 UTC (permalink / raw)
  To: ceph-devel

Hello all,

I'm currently testing Ceph. So far, HA and recovery seem very good.
The only point that keeps me from using it at datacenter scale is 
performance.

First of all, here is my setup:
- 1 OSD/MDS/MON on a Supermicro X9DR3-F (1x Intel Xeon E5-2603 - 4 
cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 
(commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB 
drive for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the 
journal and 4x 3TB drives (Western Digital WD30EZRX). Everything but 
the boot partition is Btrfs-formatted and 4K-aligned.
- 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy 
and Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
Both servers are linked over a 1Gb Ethernet switch (iperf shows about 
960Mb/s).

Here is my ceph.conf:
------cut-here------
[global]
         auth supported = cephx
         keyring = /etc/ceph/keyring
         journal dio = true
         osd op threads = 24
         osd disk threads = 24
         filestore op threads = 6
         filestore queue max ops = 24
         osd client message size cap = 14000000
         ms dispatch throttle bytes =  17500000

[mon]
         mon data = /home/mon.$id
         keyring = /etc/ceph/keyring.$name

[mon.a]
         host = ceph-osd-0
         mon addr = 192.168.0.132:6789

[mds]
         keyring = /etc/ceph/keyring.$name

[mds.a]
         host = ceph-osd-0

[osd]
         osd data = /home/osd.$id
         osd journal = /home/osd.$id.journal
         osd journal size = 1000
         keyring = /etc/ceph/keyring.$name

[osd.0]
         host = ceph-osd-0
         btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
         btrfs options = rw,noatime
------cut-here------

Here are some figures:
* Test with "dd" on the OSD server (on drive 
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0,00    0,00    0,52   41,99    0,00   57,48

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdf             247,00         0,00    125520,00          0     125520

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD 
server (on drive 
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# time tar xzf src.tar.gz
real    0m9.669s
user    0m8.405s
sys     0m4.736s

# time rm -rf *
real    0m3.647s
user    0m0.036s
sys     0m3.552s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           10,83    0,00   28,72   16,62    0,00   43,83

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdf            1369,00         0,00      9300,00          0       9300

* Test with "dd" from the client using RBD :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            4,57    0,00   30,46   27,66    0,00   37,31

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             317,00         0,00     57400,00          0      57400
sdf             237,00         0,00     88336,00          0      88336

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
client using RBD :
# time tar xzf src.tar.gz
real    0m26.955s
user    0m9.233s
sys     0m11.425s

# time rm -rf *
real    0m8.545s
user    0m0.128s
sys     0m8.297s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            4,59    0,00   24,74   30,61    0,00   40,05

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             239,00         0,00     54772,00          0      54772
sdf             441,00         0,00     50836,00          0      50836

* Test with "dd" from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            2,26    0,00   20,30   27,07    0,00   50,38

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             710,00         0,00     58836,00          0      58836
sdf             722,00         0,00     32768,00          0      32768


* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
client using CephFS :
# time tar xzf src.tar.gz
real    3m55.260s
user    0m8.721s
sys     0m11.461s

# time rm -rf *
real    9m2.319s
user    0m0.320s
sys     0m4.572s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           14,40    0,00   15,94    2,31    0,00   67,35

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             174,00         0,00     10772,00          0      10772
sdf             527,00         0,00      3636,00          0       3636

=> from top :
   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
  4070 root      20   0  992m 237m 4384 S  90,5  3,0  18:40.50 ceph-osd
  3975 root      20   0  777m 635m 4368 S  59,7  8,0   7:08.27 ceph-mds


Adding an OSD doesn't change these figures much (and when it does, it 
is always for the worse).
Neither does moving the MON+MDS to the client machine.

Are these figures expected for this kind of hardware? What could I try 
to make it a bit faster (essentially for the CephFS small-file 
workloads, like uncompressing the Linux kernel source or the OpenBSD 
sources)?

I see figures of hundreds of megabits in some mailing-list threads; I'd 
really like to see that kind of numbers :D

Thank you in advance for any pointers,
Denis


* Re: Ceph performance improvement
  2012-08-22  8:54 Ceph performance improvement Denis Fondras
@ 2012-08-22 10:24 ` David McBride
  2012-08-22 12:10   ` Denis Fondras
  2012-08-23  3:51   ` Mark Kirkwood
  2012-08-22 12:35 ` Mark Nelson
  2012-08-22 16:03 ` Tommi Virtanen
  2 siblings, 2 replies; 13+ messages in thread
From: David McBride @ 2012-08-22 10:24 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On 22/08/12 09:54, Denis Fondras wrote:

> The only point that prevents my from using it at datacenter-scale is
> performance.

> Here are some figures :
> * Test with "dd" on the OSD server (on drive
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

That looks like you're writing to a filesystem on that disk, rather than 
to the block device itself -- but let's say you've got 139MB/sec 
(1112Mbit/sec) of straight-line performance.

Note: this is already faster than your network link can go -- you can, 
at best, achieve only about 120MB/sec over your gigabit link.
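
(1Gbit/sec divided by 8 is 125MB/sec of raw bandwidth; Ethernet, IP and 
TCP framing overhead bring the practical ceiling down to roughly 
110-118MB/sec.)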

> * Test with "dd" from the client using RBD :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

Is this a dd to the RBD device directly, or is this a write to a file in 
a filesystem created on top of it?

dd will write blocks synchronously -- that is, it will write one block, 
wait for the write to complete, then write the next block, and so on. 
Because of the durability guarantees provided by ceph, this will result 
in dd doing a lot of waiting around while writes are being sent over the 
network and written out on your OSD.

(If you're using the default replication count of 2, probably twice? 
I'm not exactly sure what Ceph does when it only has one OSD to work on..?)

> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
> client using RBD :
> # time tar xzf src.tar.gz
> real    0m26.955s
> user    0m9.233s
> sys     0m11.425s

Just ignoring networking and storage for a moment, this also isn't a 
fair test: you're comparing the decompress-and-unpack time of a 139MB 
tarball on a 3GHz Pentium 4 with 1GB of RAM against a quad-core Xeon E5 
with 8GB.

Even ignoring the relative CPU difference, unless you're doing 
something clever that you haven't described, there's no guarantee that 
the files in the latter case have actually been written to disk -- you 
have enough memory on your server for it to buffer all of those writes 
in RAM.  You'd need to add a sync() call or similar at the end of your 
timing run to ensure that all of those writes have actually been 
committed to disk.
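
For example, something along these lines -- the sync has to be inside 
the timed command, or it won't be counted:

# time sh -c 'tar xzf src.tar.gz && sync'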

> * Test with "dd" from the client using CephFS :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

Again, the synchronous nature of 'dd' is probably severely affecting 
apparent performance.  I'd suggest looking at some other tools, like 
fio, bonnie++, or iozone, which might generate more representative load.

(Or, if you have a specific use-case in mind, something that generates 
an IO pattern like what you'll be using in production would be ideal!)
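
For instance, an fio run roughly like this (the path, sizes and job 
count are just illustrative) gives you a streaming-write figure that 
isn't gated on a single outstanding request:

# fio --name=seq-write --directory=/mnt/cephfs --rw=write --bs=4M \
      --size=4g --ioengine=libaio --iodepth=16 --direct=1 \
      --numjobs=2 --group_reporting

If O_DIRECT isn't supported on the mount, drop --direct=1 and add 
--end_fsync=1 instead.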

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Computing Service


* Re: Ceph performance improvement
  2012-08-22 10:24 ` David McBride
@ 2012-08-22 12:10   ` Denis Fondras
  2012-08-23  3:51   ` Mark Kirkwood
  1 sibling, 0 replies; 13+ messages in thread
From: Denis Fondras @ 2012-08-22 12:10 UTC (permalink / raw)
  To: ceph-devel

Thank you for the answer David.

>
> That looks like you're writing to a filesystem on that disk, rather than
> the block device itself -- but lets say you've got 139MB/sec
> (1112Mbit/sec) of straight-line performance.
>
> Note: this is already faster than your network link can go -- you can,
> at best, only achieve 120MB/sec over your gigabit link.
>

Yes, I am aware of that; I can't get more than the GbE link allows. 
However, I mentioned this to show that the disk should not be a 
bottleneck.

>
> Is this a dd to the RBD device directly, or is this a write to a file in
> a filesystem created on top of it?
>

The RBD device is formatted with Btrfs and mounted.

> dd will write blocks synchronously -- that is, it will write one block,
> wait for the write to complete, then write the next block, and so on.
> Because of the durability guarantees provided by ceph, this will result
> in dd doing a lot of waiting around while writes are being sent over the
> network and written out on your OSD.
>

Thank you for that information.

> (If you're using the default replication count of 2, probably twice? I'm
> not exactly sure what Ceph does when it only has one OSD to work on..?)
>

I don't know exactly how it behaves, but "ceph -s" reports the cluster 
as 50% degraded. Adding a second OSD allows Ceph to replicate.

>
> Just ignoring networking and storage for a moment, this also isn't a
> fair test: you're comparing the decompress-and-unpack time of a 139MB
> tarball on a 3GHz Pentium 4 with 1GB of RAM and a quad-core Xeon E5 that
> has 8GB.
>

That's a very good point! Comparing figures on the same host tells a 
different story (/mnt is the Ceph RBD device) :)

root@ceph-osd-1:/home# time tar xzf ../src.tar.gz && sync

real    0m43.668s
user    0m9.649s
sys     0m20.897s

root@ceph-osd-1:/mnt# time tar xzf ../src.tar.gz && sync

real    0m38.022s
user    0m9.101s
sys     0m11.265s

Thank you again,
Denis


* Re: Ceph performance improvement
  2012-08-22  8:54 Ceph performance improvement Denis Fondras
  2012-08-22 10:24 ` David McBride
@ 2012-08-22 12:35 ` Mark Nelson
  2012-08-22 12:42   ` Alexandre DERUMIER
  2012-08-24 16:41   ` Denis Fondras
  2012-08-22 16:03 ` Tommi Virtanen
  2 siblings, 2 replies; 13+ messages in thread
From: Mark Nelson @ 2012-08-22 12:35 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On 08/22/2012 03:54 AM, Denis Fondras wrote:
> Hello all,

Hello!

David had some good comments in his reply, so I'll just add in a couple 
of extra thoughts...

>
> I'm currently testing Ceph. So far it seems that HA and recovering are
> very good.
> The only point that prevents my from using it at datacenter-scale is
> performance.
>
> First of all, here is my setup :
> - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 -
> 4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49

Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine).
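
A quick way to check both (if I remember right, the syncfs() wrapper 
showed up in glibc 2.14 and the syscall in kernel 2.6.39):

# ldd --version | head -1
# uname -r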

> (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive
> for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal
> and 4x 3TB drive (Western Digital WD30EZRX). Everything but the boot
> partition is BTRFS-formated and 4K-aligned.
> - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and
> Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
> Both servers are linked over a 1Gb Ethernet switch (iperf shows about
> 960Mb/s).
>
> Here is my ceph.conf :
> ------cut-here------
> [global]
> auth supported = cephx
> keyring = /etc/ceph/keyring
> journal dio = true
> osd op threads = 24
> osd disk threads = 24
> filestore op threads = 6
> filestore queue max ops = 24
> osd client message size cap = 14000000
> ms dispatch throttle bytes = 17500000
>

Default values are quite a bit lower for most of these.  You may want to 
play with them and see if they have an effect.

> [mon]
> mon data = /home/mon.$id
> keyring = /etc/ceph/keyring.$name
>
> [mon.a]
> host = ceph-osd-0
> mon addr = 192.168.0.132:6789
>
> [mds]
> keyring = /etc/ceph/keyring.$name
>
> [mds.a]
> host = ceph-osd-0
>
> [osd]
> osd data = /home/osd.$id
> osd journal = /home/osd.$id.journal
> osd journal size = 1000
> keyring = /etc/ceph/keyring.$name
>
> [osd.0]
> host = ceph-osd-0
> btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
> btrfs options = rw,noatime

Just FYI, we are trying to get away from "btrfs devs".

> ------cut-here------
>
> Here are some figures :
> * Test with "dd" on the OSD server (on drive
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

Good job using a data file that is much bigger than main memory! That 
looks pretty accurate for a 7200rpm spinning disk.  For dd benchmarks, 
you should probably throw in conv=fdatasync at the end though.
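
Something like this, for example -- same 16GiB total, but the final 
fdatasync is included in the timing so the page cache can't flatter 
the result:

# dd if=/dev/zero of=testdd bs=4k count=4M conv=fdatasync
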

>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0,00 0,00 0,52 41,99 0,00 57,48
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdf 247,00 0,00 125520,00 0 125520
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD
> server (on drive
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # time tar xzf src.tar.gz
> real 0m9.669s
> user 0m8.405s
> sys 0m4.736s
>
> # time rm -rf *
> real 0m3.647s
> user 0m0.036s
> sys 0m3.552s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 10,83 0,00 28,72 16,62 0,00 43,83
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdf 1369,00 0,00 9300,00 0 9300
>
> * Test with "dd" from the client using RBD :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

RBD caching should definitely be enabled for a test like this.  I'd be 
surprised if you got 42MB/s without it though...
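
If you're going through librbd (qemu/kvm and the like), a minimal 
snippet would be something along these lines in the client-side 
ceph.conf -- note it does not apply to the kernel rbd driver:

[client]
        rbd cache = true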

>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4,57 0,00 30,46 27,66 0,00 37,31
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 317,00 0,00 57400,00 0 57400
> sdf 237,00 0,00 88336,00 0 88336
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
> client using RBD :
> # time tar xzf src.tar.gz
> real 0m26.955s
> user 0m9.233s
> sys 0m11.425s
>
> # time rm -rf *
> real 0m8.545s
> user 0m0.128s
> sys 0m8.297s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4,59 0,00 24,74 30,61 0,00 40,05
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 239,00 0,00 54772,00 0 54772
> sdf 441,00 0,00 50836,00 0 50836
>
> * Test with "dd" from the client using CephFS :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 2,26 0,00 20,30 27,07 0,00 50,38
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 710,00 0,00 58836,00 0 58836
> sdf 722,00 0,00 32768,00 0 32768
>
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
> client using CephFS :
> # time tar xzf src.tar.gz
> real 3m55.260s
> user 0m8.721s
> sys 0m11.461s
>

Ouch, that's taking a while!  In addition to the comments that David 
made, be aware that with CephFS you are also testing the metadata 
server.  Right now that's not getting a lot of attention, as we are 
primarily focusing on RADOS performance.  For this kind of test, 
though, distributed filesystems will never be as good as local disks...

> # time rm -rf *
> real 9m2.319s
> user 0m0.320s
> sys 0m4.572s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 14,40 0,00 15,94 2,31 0,00 67,35
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 174,00 0,00 10772,00 0 10772
> sdf 527,00 0,00 3636,00 0 3636
>
> => from top :
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4070 root 20 0 992m 237m 4384 S 90,5 3,0 18:40.50 ceph-osd
> 3975 root 20 0 777m 635m 4368 S 59,7 8,0 7:08.27 ceph-mds
>
>
> Adding an OSD doesn't change much of these figures (and it is always for
> a lower end when it does).

Are you putting both journals on the SSD when you add an OSD?  If so, 
what's the throughput your SSD can sustain?

> Neither does migrating the MON+MDS on the client machine.
>
> Are these figures right for this kind of hardware ? What could I try to
> make it a bit faster (essentially on the CephFS multiple little files
> side of things like uncompressing Linux kernel source or OpenBSD sources) ?
>
> I see figures of hundreds of megabits on some mailing-list threads, I'd
> really like to see this kind of numbers :D

With a single OSD and 1x replication on 10GbE, I can sustain about 
110MB/s with 4MB writes if the journal is on a separate disk.  I've 
also got some hardware that does much worse than that (I think due to 
RAID controller interference).  50MB/s does seem kind of low for 
CephFS in your dd test.

You may want to check how big the IOs going to disk are on the OSD 
node, and how quickly you are filling up the journal vs. writing out 
to disk.  "collectl -sD -oT" will give you a nice report.  iostat can 
probably tell you all of the same stuff with the right flags.
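
For example, something like:

# iostat -x -k sda sdf 1

and keep an eye on avgrq-sz (average request size, in 512-byte sectors) 
and avgqu-sz on both the journal SSD and the data disk.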

>
> Thank you in advance for any pointer,
> Denis

* Re: Ceph performance improvement
  2012-08-22 12:35 ` Mark Nelson
@ 2012-08-22 12:42   ` Alexandre DERUMIER
  2012-08-24 16:41   ` Denis Fondras
  1 sibling, 0 replies; 13+ messages in thread
From: Alexandre DERUMIER @ 2012-08-22 12:42 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel, Denis Fondras

>>Not sure what version of glibc Wheezy has, but try to make sure you have 
>>one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
>>should be fine). 

Hi, glibc from Wheezy doesn't have syncfs support.


* Re: Ceph performance improvement
  2012-08-22  8:54 Ceph performance improvement Denis Fondras
  2012-08-22 10:24 ` David McBride
  2012-08-22 12:35 ` Mark Nelson
@ 2012-08-22 16:03 ` Tommi Virtanen
  2012-08-22 16:23   ` Denis Fondras
  2 siblings, 1 reply; 13+ messages in thread
From: Tommi Virtanen @ 2012-08-22 16:03 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On Wed, Aug 22, 2012 at 1:54 AM, Denis Fondras <ceph@ledeuns.net> wrote:
> First of all, here is my setup :
> for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal and 4x
> 3TB drive (Western Digital WD30EZRX). Everything but the boot partition is
> BTRFS-formated and 4K-aligned.
...
> [osd]
>         osd data = /home/osd.$id
>         osd journal = /home/osd.$id.journal
>         osd journal size = 1000
>         keyring = /etc/ceph/keyring.$name
>
> [osd.0]
>         host = ceph-osd-0
>         btrfs devs =
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
>         btrfs options = rw,noatime

Are you sure your osd data and journal are on the disks you think? The
/home paths look suspicious -- especially for journal, which often
should be a block device.

Can you share output of "mount" and "ls -ld /home/osd.*"


* Re: Ceph performance improvement
  2012-08-22 16:03 ` Tommi Virtanen
@ 2012-08-22 16:23   ` Denis Fondras
  2012-08-22 16:29     ` Tommi Virtanen
  0 siblings, 1 reply; 13+ messages in thread
From: Denis Fondras @ 2012-08-22 16:23 UTC (permalink / raw)
  To: ceph-devel

>
> Are you sure your osd data and journal are on the disks you think? The
> /home paths look suspicious -- especially for journal, which often
> should be a block device.
>

I am :)

> Can you share output of "mount" and "ls -ld /home/osd.*"

Here are some details:

root@ceph-osd-0:~# ls -al /dev/disk/by-id/
lrwxrwxrwx 1 root root   9 août  21 21:19 scsi-SATA_C300-CTFDDAC06400000000104903008FE4 -> ../../sda
lrwxrwxrwx 1 root root   9 août  22 10:57 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0124762 -> ../../sdh
lrwxrwxrwx 1 root root   9 août  21 16:03 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0137898 -> ../../sdg
lrwxrwxrwx 1 root root   9 août  21 21:19 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 -> ../../sdf
lrwxrwxrwx 1 root root   9 août  21 16:03 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152562 -> ../../sdc

root@ceph-osd-0:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,relatime,size=10240k,nr_inodes=1020030,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=817216k,mode=755)
/dev/disk/by-uuid/7d95d243-1788-4c3f-9f89-166c15f880f0 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1,data=ordered)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
tmpfs on /run/shm type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
/dev/sda on /home type btrfs (rw,relatime,ssd,space_cache)
/dev/sdf on /home/osd.0 type btrfs (rw,noatime,space_cache)

root@ceph-osd-0:~# ls -ld /home/osd.*
drwxr-xr-x 1 root root        236 août  22 17:22 /home/osd.0
-rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Regards,
Denis

* Re: Ceph performance improvement
  2012-08-22 16:23   ` Denis Fondras
@ 2012-08-22 16:29     ` Tommi Virtanen
  2012-08-22 19:12       ` Ceph performance improvement / journal on block-dev Dieter Kasper (KD)
  0 siblings, 1 reply; 13+ messages in thread
From: Tommi Virtanen @ 2012-08-22 16:29 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras <ceph@ledeuns.net> wrote:
>> Are you sure your osd data and journal are on the disks you think? The
>> /home paths look suspicious -- especially for journal, which often
>> should be a block device.
> I am :)
...
> -rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Your journal is a file on a btrfs partition. That is probably a bad
idea for performance. I'd recommend partitioning the drive and using
partitions as journals directly.

* Re: Ceph performance improvement / journal on block-dev
  2012-08-22 16:29     ` Tommi Virtanen
@ 2012-08-22 19:12       ` Dieter Kasper (KD)
  2012-08-22 23:19         ` Tommi Virtanen
  0 siblings, 1 reply; 13+ messages in thread
From: Dieter Kasper (KD) @ 2012-08-22 19:12 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Denis Fondras, Dieter Kasper (KD), ceph-devel

On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote:
(...)
> 
> Your journal is a file on a btrfs partition. That is probably a bad
> idea for performance. I'd recommend partitioning the drive and using
> partitions as journals directly.

Hi Tommi,

can you please teach me the right parameter(s) to put the journal on a block device?

It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs'
(see below)

Regards,
-Dieter


e.g.
---snip---
modprobe -v brd rd_nr=6 rd_size=10000000        # 6x 10G RAM DISK

/etc/ceph/ceph.conf
--
[global]
        auth supported = none

        # set log file
        log file = /ceph/log/$name.log
        log_to_syslog = true        # uncomment this line to log to syslog

        # set up pid files
        pid file = /var/run/ceph/$name.pid

[mon]  
        mon data = /ceph/$name
	debug optracker = 0

[mon.alpha]
	host = 127.0.0.1
	mon addr = 127.0.0.1:6789

[mds]
	debug optracker = 0

[mds.0]
        host = 127.0.0.1

[osd]
	osd data = /data/$name

[osd.0]
	host = 127.0.0.1
        btrfs devs  = /dev/ram0
	osd journal = /dev/ram3

[osd.1]
	host = 127.0.0.1
        btrfs devs  = /dev/ram1
	osd journal = /dev/ram4

[osd.2]
	host = 127.0.0.1
        btrfs devs  = /dev/ram2
	osd journal = /dev/ram5
--

root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs
temp dir is /tmp/mkcephfs.wzARGSpFB6
preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de
epoch 0
fsid 40b997ea-387a-4deb-9a30-805cd076a0de
last_changed 2012-08-22 21:04:00.553972
created 2012-08-22 21:04:00.553972
0: 127.0.0.1:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 monitors)
=== osd.0 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.0: not mounted
umount: /dev/ram0: not mounted

Btrfs v0.19.1+

ATTENTION:

mkfs.btrfs is not intended to be used directly. Please use the
YaST partitioner to create and manage btrfs filesystems to be
in a supported state on SUSE Linux Enterprise systems.

fs created label (null) on /dev/ram0
	nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB
Scanning for Btrfs filesystems
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de
creating private key for osd.0 keyring /data/osd.0/keyring
creating /data/osd.0/keyring
collecting osd.0 key
=== osd.1 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.1: not mounted
(...)




* Re: Ceph performance improvement / journal on block-dev
  2012-08-22 19:12       ` Ceph performance improvement / journal on block-dev Dieter Kasper (KD)
@ 2012-08-22 23:19         ` Tommi Virtanen
  0 siblings, 0 replies; 13+ messages in thread
From: Tommi Virtanen @ 2012-08-22 23:19 UTC (permalink / raw)
  To: Dieter Kasper (KD); +Cc: Denis Fondras, ceph-devel

On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD)
<d.kasper@kabelmail.de> wrote:
>> Your journal is a file on a btrfs partition. That is probably a bad
>> idea for performance. I'd recommend partitioning the drive and using
>> partitions as journals directly.
> can you please teach me how to use the right parameter(s) to realize 'journal on block-dev' ?

Replacing the example paths, use "sudo parted /dev/sdg" or "gksu
gparted /dev/sdg" to create partitions, then set osd journal to point
to the block device of a partition:

[osd.42]
osd journal = /dev/sdg4
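
A rough sketch of the partitioning step (device name and sizes are just 
placeholders, and mklabel destroys the existing partition table):

parted -s /dev/sdg mklabel gpt
parted -s /dev/sdg mkpart osd42-journal 1MiB 1025MiB
# -> creates /dev/sdg1; point "osd journal" at that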

> It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs'
> (see below)

Try running it with -x for any chance of extracting debuggable
information from the monster.

> Scanning for Btrfs filesystems
>  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal

Based on that, my best guess would be that you're seeing a journal
from an old run -- perhaps you need to explicitly clear out the block
device contents.
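
Something along these lines should do it (double-check the device name 
first, it's destructive):

dd if=/dev/zero of=/dev/ram3 bs=1M count=10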

Frankly, you should not use btrfs devs. Any convenience you may gain
is more than doubly offset by pains exactly like these.
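
The manual setup is only a couple of commands anyway -- roughly, using 
the paths from your config:

mkfs.btrfs /dev/ram0
mkdir -p /data/osd.0
mount -o noatime /dev/ram0 /data/osd.0

then drop the "btrfs devs" lines and keep "osd data = /data/$name".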


* Re: Ceph performance improvement
  2012-08-22 10:24 ` David McBride
  2012-08-22 12:10   ` Denis Fondras
@ 2012-08-23  3:51   ` Mark Kirkwood
  1 sibling, 0 replies; 13+ messages in thread
From: Mark Kirkwood @ 2012-08-23  3:51 UTC (permalink / raw)
  To: David McBride; +Cc: Denis Fondras, ceph-devel

On 22/08/12 22:24, David McBride wrote:
> On 22/08/12 09:54, Denis Fondras wrote:
>
>> * Test with "dd" from the client using CephFS :
>> # dd if=/dev/zero of=testdd bs=4k count=4M
>> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> Again, the synchronous nature of 'dd' is probably severely affecting 
> apparent performance.  I'd suggest looking at some other tools, like 
> fio, bonnie++, or iozone, which might generate more representative load.
>
> (Or, if you have a specific use-case in mind, something that generates 
> an IO pattern like what you'll be using in production would be ideal!)
>
>

Appending conv=fsync to the dd will make the comparison fair enough. 
Looking at the ceph code, it does

sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE);

which is very fast -- way faster than fdatasync() and friends (I have 
tested this... see my previous posting on random write performance, 
with writetest.c attached).

I am not convinced that these sorts of tests are in any way 'unfair' -- 
for instance, I would like to use RBD for Postgres or MySQL data 
volumes... and many database actions involve a stream of block writes 
similar enough to doing dd (e.g. bulk row loads, appends to transaction 
log journals).

Cheers

Mark


* Re: Ceph performance improvement
  2012-08-22 12:35 ` Mark Nelson
  2012-08-22 12:42   ` Alexandre DERUMIER
@ 2012-08-24 16:41   ` Denis Fondras
  2012-08-24 17:42     ` Wido den Hollander
  1 sibling, 1 reply; 13+ messages in thread
From: Denis Fondras @ 2012-08-24 16:41 UTC (permalink / raw)
  To: ceph-devel

Hello Mark,


> Not sure what version of glibc Wheezy has, but try to make sure you have
> one that supports syncfs (you'll also need a semi-new kernel, 3.0+
> should be fine).
>

Wheezy has a fairly recent kernel:
# uname -a
Linux ceph-osd-0 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 
x86_64 GNU/Linux

>
> default values are quite a bit lower for most of these.  You may want to
> play with them and see if it has an effect.
>

I found these values on this ML. I haven't tried tweaking them, but it 
is much better than with the default values. I will try changing them.

>
> RBD caching should definitely be enabled for a test like this.  I'd be
> surprised if you got 42MB/s without it though...
>

root@ceph-osd-0:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok 
config show | grep rbd
debug_rbd = 0/5
rbd_cache = false
rbd_cache_size = 33554432
rbd_cache_max_dirty = 25165824
rbd_cache_target_dirty = 16777216
rbd_cache_max_dirty_age = 1

In my opinion, performance from the RBD client is decent.
Unfortunately I need concurrent access, and CephFS is really appealing 
in that respect.

>
> Ouch, that's taking a while!  In addition to the comments that David
> made, be aware that you are also testing the metadata server with
> cephFS.  Right now that's not getting a lot of attention as we are
> primarily focusing on RADOS performance at the moment.  For this kind of
> test though, distributed filesystems will never be as good as local
> disks...
>

Yes, it may be the MDS that is the bottleneck. Perhaps I should run 
several of them...

>
> Are you putting both journals on the SSD when you add an OSD?  If so,
> what's the throughput your SSD can sustain?
>

Both journals are on the SSD. It seems that when I run "ceph-osd -i $id 
--mkfs --mkkey", it creates the journal according to the settings in 
ceph.conf.
I did some tests and my SSD drive is somewhat broken... the Crucial 
C300 is a bit old and can only do 80MB/s of writes.

>
> You may want to check and see how big the IOs going to disk are on the
> OSD node, and how quickly you are filling up the journal vs writing out
> to disk.  "collectl -sD -oT" will give you a nice report.  Iostat can
> probably tell you all of the same stuff with the right flags.
>

Thank you for that tool.

Denis


* Re: Ceph performance improvement
  2012-08-24 16:41   ` Denis Fondras
@ 2012-08-24 17:42     ` Wido den Hollander
  0 siblings, 0 replies; 13+ messages in thread
From: Wido den Hollander @ 2012-08-24 17:42 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel



On 08/24/2012 06:41 PM, Denis Fondras wrote:
>
> In my opinions, performances from RBD client are decent.
> Unfortunately I need concurrent access and CephFS is really appealing in
> that respect.
>
>>
>> Ouch, that's taking a while!  In addition to the comments that David
>> made, be aware that you are also testing the metadata server with
>> cephFS.  Right now that's not getting a lot of attention as we are
>> primarily focusing on RADOS performance at the moment.  For this kind of
>> test though, distributed filesystems will never be as good as local
>> disks...
>>
>
> Yes, it may be the MDS that is the bottleneck. Perhaps I should have a
> lot of them...
>

Multi-MDS isn't working that great yet. In fact, CephFS hasn't gotten 
much attention lately.

Most of the work went into RADOS and RBD. In the next iterations the 
focus will shift back to CephFS, but right now it's not that well 
maintained.

>>
>> Are you putting both journals on the SSD when you add an OSD?  If so,
>> what's the throughput your SSD can sustain?
>>
>
> Both journals are on the SSD. It seems that when I do "ceph-osd -i $id
> --mkfs --mkkey" it creates the journal according to the settings in
> ceph.conf.
> I did some tests and my SSD drive is somewhat broken... Crucial C300 is
> a bit old and can only do 80MB/s writing.
>
>>
>> You may want to check and see how big the IOs going to disk are on the
>> OSD node, and how quickly you are filling up the journal vs writing out
>> to disk.  "collectl -sD -oT" will give you a nice report.  Iostat can
>> probably tell you all of the same stuff with the right flags.
>>
>
> Thank you for that tool.
>
> Denis