* slow performance even when using SSDs
@ 2012-05-10 12:09 Stefan Priebe - Profihost AG
  2012-05-10 13:09 ` Stefan Priebe - Profihost AG
  2012-05-10 13:23 ` Designing a cluster guide Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 53+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-10 12:09 UTC (permalink / raw)
  To: ceph-devel

Dear List,

I'm doing a test setup with Ceph v0.46 and wanted to know how fast Ceph is.

My test setup:
3 servers, each with an Intel Xeon X3440, a 180GB Intel 520 Series SSD,
4GB RAM and 2x 1Gbit/s LAN

All 3 are running mon.a-c and osd.0-2. Two of them are also running
mds.2 and mds.3 (those have 8GB RAM instead of 4GB).

All machines run Ceph v0.46 and a vanilla Linux kernel v3.0.30, and all of
them use btrfs on the SSD, which serves /srv/{osd,mon}.X. All of them use
eth0+eth1 bonded as bond0 (mode 6, balance-alb).
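
For reference, a minimal ceph.conf sketch of that layout (hostnames, paths
beyond /srv and the journal size are illustrative, not my literal values):

[mon.a]
        host = node1
        mon addr = 192.168.0.100:6789
        mon data = /srv/mon.a
[osd.0]
        host = node1
        osd data = /srv/osd.0
        osd journal = /srv/osd.0/journal
        osd journal size = 1000
[mds.2]
        host = node2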

This gives me:
rados -p rbd bench 60 write

...
Total time run:        61.465323
Total writes made:     776
Write size:            4194304
Bandwidth (MB/sec):    50.500

Average Latency:       1.2654
Max latency:           2.77124
Min latency:           0.170936

Shouldn't it be at least ~100MB/s? (1Gbit/s / 8 = 125MB/s)

And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
Total time run:        60.221130
Total writes made:     6401
Write size:            4096
Bandwidth (MB/sec):    0.415

Average Latency:       0.150525
Max latency:           1.12647
Min latency:           0.026599

All btrfs SSDs are also mounted with noatime.

Thanks for your help!

Greets Stefan


* Re: slow performance even when using SSDs
  2012-05-10 12:09 slow performance even when using SSDs Stefan Priebe - Profihost AG
@ 2012-05-10 13:09 ` Stefan Priebe - Profihost AG
  2012-05-10 18:24   ` Calvin Morrow
  2012-05-10 13:23 ` Designing a cluster guide Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 53+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-10 13:09 UTC (permalink / raw)
  To: ceph-devel

OK, here are some retests. I had the SSDs connected to an old RAID
controller, even though I used them as JBODs (oops).

Here are two new tests (using kernel 3.4-rc6); it would be great if
someone could tell me whether they're fine or bad.

New tests with all 3 SSDs connected to the mainboard.

#~ rados -p rbd bench 60 write
Total time run:        60.342419
Total writes made:     2021
Write size:            4194304
Bandwidth (MB/sec):    133.969

Average Latency:       0.477476
Max latency:           0.942029
Min latency:           0.109467

#~ rados -p rbd bench 60 write -b 4096
Total time run:        60.726326
Total writes made:     59026
Write size:            4096
Bandwidth (MB/sec):    3.797

Average Latency:       0.016459
Max latency:           0.874841
Min latency:           0.002392

Another test, with only the OSD data on the disk and the journal in memory / tmpfs:
#~ rados -p rbd bench 60 write
Total time run:        60.513240
Total writes made:     2555
Write size:            4194304
Bandwidth (MB/sec):    168.889

Average Latency:       0.378775
Max latency:           4.59233
Min latency:           0.055179

#~ rados -p rbd bench 60 write -b 4096
Total time run:        60.116260
Total writes made:     281903
Write size:            4096
Bandwidth (MB/sec):    18.318

Average Latency:       0.00341067
Max latency:           0.720486
Min latency:           0.000602

Another problem I have is that I'm always getting:
"2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN]
message from mon.2 was stamped 0.109244s in the future, clocks not
synchronized"

even though ntp is running fine on all systems.
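
ntpq -p on each node looks sane to me, so I assume the ~0.1s stamps are just
the monitors being strict. If the threshold can be relaxed, I guess it would
be something like this in ceph.conf (option name from memory, please correct
me):

[mon]
        mon clock drift allowed = 0.2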

Stefan

Am 10.05.2012 14:09, schrieb Stefan Priebe - Profihost AG:
> Dear List,
> 
> i'm doing a testsetup with ceph v0.46 and wanted to know how fast ceph is.
> 
> my testsetup:
> 3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
> 1Gbit/s LAN each
> 
> All 3 are running as mon a-c and osd 0-2. Two of them are also running
> as mds.2 and mds.3 (has 8GB RAM instead of 4GB).
> 
> All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all of
> them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them use
> eth0+eth1 as bond0 (mode 6).
> 
> This gives me:
> rados -p rbd bench 60 write
> 
> ...
> Total time run:        61.465323
> Total writes made:     776
> Write size:            4194304
> Bandwidth (MB/sec):    50.500
> 
> Average Latency:       1.2654
> Max latency:           2.77124
> Min latency:           0.170936
> 
> Shouldn't it be at least 100MB/s? (1Gbit/s / 8)
> 
> And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
> Total time run:        60.221130
> Total writes made:     6401
> Write size:            4096
> Bandwidth (MB/sec):    0.415
> 
> Average Latency:       0.150525
> Max latency:           1.12647
> Min latency:           0.026599
> 
> All btrfs ssds are also mounted with noatime.
> 
> Thanks for your help!
> 
> Greets Stefan


* Designing a cluster guide
  2012-05-10 12:09 slow performance even when using SSDs Stefan Priebe - Profihost AG
  2012-05-10 13:09 ` Stefan Priebe - Profihost AG
@ 2012-05-10 13:23 ` Stefan Priebe - Profihost AG
  2012-05-17 21:27   ` Gregory Farnum
  1 sibling, 1 reply; 53+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-10 13:23 UTC (permalink / raw)
  To: ceph-devel

Hi,

the "Designing a cluster guide"
http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
still leaves some questions unanswered.

It mentions, for example, a "Fast CPU" for the MDS system. What does fast
mean? Just the speed of one core, or is Ceph designed to use multiple cores?
Which is more important, more cores or higher clock speed?

The Cluster Design Recommendations mention separating all daemons onto
dedicated machines. Is this also useful for the MONs? As they're so
lightweight, why not run them on the OSD machines?

Regarding the OSDs: is it fine to use an SSD RAID 1 for the journal and
perhaps 22x SATA disks in a RAID 10 for the FS, or is this quite absurd
and you should go for 22x SSDs in a RAID 6?  Is it more useful to use a
RAID 6 HW controller or btrfs RAID?

Should we use a single-socket Xeon for the OSDs, or dual socket?

Thanks and greets
Stefan


* Re: slow performance even when using SSDs
  2012-05-10 13:09 ` Stefan Priebe - Profihost AG
@ 2012-05-10 18:24   ` Calvin Morrow
  0 siblings, 0 replies; 53+ messages in thread
From: Calvin Morrow @ 2012-05-10 18:24 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

I was getting roughly the same results as your tmpfs test using
spinning disks for OSDs with a 160GB Intel 320 SSD being used for the
journal.  Theoretically the 520 SSD should give better performance
than my 320s.

Keep in mind that even with balance-alb, multiple GigE connections
will only be used if there are multiple TCP sessions being used by
Ceph.
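
If you want to rule out client-side concurrency as the limit, rados bench
lets you raise the number of ops in flight (the default is 16 if I remember
right), and running a second bench from another client at the same time is
the easiest way to get extra TCP sessions across the bond:

rados -p rbd bench 60 write -t 32
rados -p rbd bench 60 write -b 4096 -t 64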

You don't mention it in your email, but if you're using kernel 3.4+
you'll want to make sure you create your btrfs filesystem using the
large node & leaf size (Big Metadata - I've heard recommendations of
32k instead of the default 4k) so your performance doesn't degrade over
time.
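
Something along these lines when creating the filesystem (the device is a
placeholder; double-check the flag names against your btrfs-progs version):

mkfs.btrfs -l 32768 -n 32768 /dev/sdX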

I'm curious what speed you're getting from dd in a streaming write.
You might try running a "dd if=/dev/zero of=<intel ssd partition>
bs=128k count=something" to see what the SSD will spit out without
Ceph in the picture.
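
For example (the partition name is a placeholder, and this overwrites
whatever is on it; oflag=direct bypasses the page cache so you measure the
SSD rather than RAM):

dd if=/dev/zero of=/dev/sdX2 bs=128k count=8192 oflag=direct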

Calvin

On Thu, May 10, 2012 at 7:09 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> OK, here some retests. I had the SDDs conected to an old Raid controller
> even i did used them as JBODs (oops).
>
> Here are two new Tests (using kernel 3.4-rc6) it would be great if
> someone could tell me if they're fine or bad.
>
> New tests with all 3 SSDs connected to the mainboard.
>
> #~ rados -p rbd bench 60 write
> Total time run:        60.342419
> Total writes made:     2021
> Write size:            4194304
> Bandwidth (MB/sec):    133.969
>
> Average Latency:       0.477476
> Max latency:           0.942029
> Min latency:           0.109467
>
> #~ rados -p rbd bench 60 write -b 4096
> Total time run:        60.726326
> Total writes made:     59026
> Write size:            4096
> Bandwidth (MB/sec):    3.797
>
> Average Latency:       0.016459
> Max latency:           0.874841
> Min latency:           0.002392
>
> Another test with only osd on the disk and the journal in memory / tmpfs:
> #~ rados -p rbd bench 60 write
> Total time run:        60.513240
> Total writes made:     2555
> Write size:            4194304
> Bandwidth (MB/sec):    168.889
>
> Average Latency:       0.378775
> Max latency:           4.59233
> Min latency:           0.055179
>
> #~ rados -p rbd bench 60 write -b 4096
> Total time run:        60.116260
> Total writes made:     281903
> Write size:            4096
> Bandwidth (MB/sec):    18.318
>
> Average Latency:       0.00341067
> Max latency:           0.720486
> Min latency:           0.000602
>
> Another problem i have is i'm always getting:
> "2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN]
> message from mon.2 was stamped 0.109244s in the future, clocks not
> synchronized"
>
> even on all systems ntp is running fine.
>
> Stefan
>
> Am 10.05.2012 14:09, schrieb Stefan Priebe - Profihost AG:
>> Dear List,
>>
>> i'm doing a testsetup with ceph v0.46 and wanted to know how fast ceph is.
>>
>> my testsetup:
>> 3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
>> 1Gbit/s LAN each
>>
>> All 3 are running as mon a-c and osd 0-2. Two of them are also running
>> as mds.2 and mds.3 (has 8GB RAM instead of 4GB).
>>
>> All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all of
>> them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them use
>> eth0+eth1 as bond0 (mode 6).
>>
>> This gives me:
>> rados -p rbd bench 60 write
>>
>> ...
>> Total time run:        61.465323
>> Total writes made:     776
>> Write size:            4194304
>> Bandwidth (MB/sec):    50.500
>>
>> Average Latency:       1.2654
>> Max latency:           2.77124
>> Min latency:           0.170936
>>
>> Shouldn't it be at least 100MB/s? (1Gbit/s / 8)
>>
>> And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
>> Total time run:        60.221130
>> Total writes made:     6401
>> Write size:            4096
>> Bandwidth (MB/sec):    0.415
>>
>> Average Latency:       0.150525
>> Max latency:           1.12647
>> Min latency:           0.026599
>>
>> All btrfs ssds are also mounted with noatime.
>>
>> Thanks for your help!
>>
>> Greets Stefan


* Re: Designing a cluster guide
  2012-05-10 13:23 ` Designing a cluster guide Stefan Priebe - Profihost AG
@ 2012-05-17 21:27   ` Gregory Farnum
  2012-05-19  8:37     ` Stefan Priebe
  2012-06-29 18:07     ` Gregory Farnum
  0 siblings, 2 replies; 53+ messages in thread
From: Gregory Farnum @ 2012-05-17 21:27 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

Sorry this got left for so long...

On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Hi,
>
> the "Designing a cluster guide"
> http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
> still leaves some questions unanswered.
>
> It mentions for example "Fast CPU" for the mds system. What does fast
> mean? Just the speed of one core? Or is ceph designed to use multi core?
> Is multi core or more speed important?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded, but doing most things requires grabbing a big lock.
"How fast" is a qualitative rather than quantitative assessment at this
point, though.

> The Cluster Design Recommendations mentions to seperate all Daemons on
> dedicated machines. Is this also for the MON useful? As they're so
> leightweight why not running them on the OSDs?
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns — they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel&glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)

> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
> and you should go for 22x SSD Disks in a Raid 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10
— do you really want to multiply your storage requirements by another
factor of two?
> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)

> Use single socket Xeon for the OSDs or Dual Socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
-Greg


* Re: Designing a cluster guide
  2012-05-17 21:27   ` Gregory Farnum
@ 2012-05-19  8:37     ` Stefan Priebe
  2012-05-19 16:15       ` Alexandre DERUMIER
  2012-05-21 18:13       ` Gregory Farnum
  2012-06-29 18:07     ` Gregory Farnum
  1 sibling, 2 replies; 53+ messages in thread
From: Stefan Priebe @ 2012-05-19  8:37 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

Hi Greg,

Am 17.05.2012 23:27, schrieb Gregory Farnum:
>> It mentions for example "Fast CPU" for the mds system. What does fast
>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>> Is multi core or more speed important?
> Right now, it's primarily the speed of a single core. The MDS is
> highly threaded but doing most things requires grabbing a big lock.
> How fast is a qualitative rather than quantitative assessment at this
> point, though.
So would you recommend a fast (higher GHz) Core i3 instead of a single
Xeon for this system? (The price per GHz is better.)

> It depends on what your nodes look like, and what sort of cluster
> you're running. The monitors are pretty lightweight, but they will add
> *some* load. More important is their disk access patterns — they have
> to do a lot of syncs. So if they're sharing a machine with some other
> daemon you want them to have an independent disk and to be running a
> new kernel&glibc so that they can use syncfs rather than sync. (The
> only distribution I know for sure does this is Ubuntu 12.04.)
Which kernel and which glibc version support this? I have searched
Google but haven't found an exact version. We're using Debian
Lenny/Squeeze with a custom kernel.

>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>> and you should go for 22x SSD Disks in a Raid 6?
> You'll need to do your own failure calculations on this one, I'm
> afraid. Just take note that you'll presumably be limited to the speed
> of your journaling device here.
Yeah, that's why I wanted to use a RAID 1 of SSDs for the journaling. Or
is this still too slow? Another idea was to use only a ramdisk for the
journal, back its contents up to disk while shutting down, and restore
them after boot.

> Given that Ceph is going to be doing its own replication, though, I
> wouldn't want to add in another whole layer of replication with raid10
> — do you really want to multiply your storage requirements by another
> factor of two?
OK, you're right, bad idea.

>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
> I would use the hardware controller over btrfs raid for now; it allows
> more flexibility in eg switching to xfs. :)
OK, but overall you would recommend running one OSD per disk, right? So
instead of using a RAID 6 with for example 10 disks you would run 6 OSDs
on this machine?

>> Use single socket Xeon for the OSDs or Dual Socket?
> Dual socket servers will be overkill given the setup you're
> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
> daemon. You might consider it if you decided you wanted to do an OSD
> per disk instead (that's a more common configuration, but it requires
> more CPU and RAM per disk and we don't know yet which is the better
> choice).
Is there also a rule of thumb for the memory?

My biggest problem with Ceph right now is the awfully slow speed while
doing random reads and writes.

Sequential reads and writes are at 200MB/s (that's pretty good for bonded
dual Gbit/s). But random reads and writes are only at 0.8 - 1.5 MB/s,
which is definitely too slow.

Stefan


* Re: Designing a cluster guide
  2012-05-19  8:37     ` Stefan Priebe
@ 2012-05-19 16:15       ` Alexandre DERUMIER
  2012-05-20  7:56         ` Stefan Priebe
  2012-05-21 15:07         ` Tomasz Paszkowski
  2012-05-21 18:13       ` Gregory Farnum
  1 sibling, 2 replies; 53+ messages in thread
From: Alexandre DERUMIER @ 2012-05-19 16:15 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel, Gregory Farnum

Hi,

For your journal, if you have money, you can use a
STEC ZeusRAM SSD drive (around 2000€ for 8GB, 100,000 IOPS read/write with 4k blocks).
I'm using them with a ZFS SAN; they rock for journals.
http://www.stec-inc.com/product/zeusram.php

Another interesting product is the DDRdrive:
http://www.ddrdrive.com/

----- Original Message -----

From: "Stefan Priebe" <s.priebe@profihost.ag>
To: "Gregory Farnum" <greg@inktank.com>
Cc: ceph-devel@vger.kernel.org
Sent: Saturday, 19 May 2012 10:37:01
Subject: Re: Designing a cluster guide

Hi Greg, 

Am 17.05.2012 23:27, schrieb Gregory Farnum: 
>> It mentions for example "Fast CPU" for the mds system. What does fast 
>> mean? Just the speed of one core? Or is ceph designed to use multi core? 
>> Is multi core or more speed important? 
> Right now, it's primarily the speed of a single core. The MDS is 
> highly threaded but doing most things requires grabbing a big lock. 
> How fast is a qualitative rather than quantitative assessment at this 
> point, though. 
So would you recommand a fast (more ghz) Core i3 instead of a single 
xeon for this system? (price per ghz is better). 

> It depends on what your nodes look like, and what sort of cluster 
> you're running. The monitors are pretty lightweight, but they will add 
> *some* load. More important is their disk access patterns — they have 
> to do a lot of syncs. So if they're sharing a machine with some other 
> daemon you want them to have an independent disk and to be running a 
> new kernel&glibc so that they can use syncfs rather than sync. (The 
> only distribution I know for sure does this is Ubuntu 12.04.) 
Which kernel and which glibc version supports this? I have searched 
google but haven't found an exact version. We're using debian lenny 
squeeze with a custom kernel. 

>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and 
>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd 
>> and you should go for 22x SSD Disks in a Raid 6? 
> You'll need to do your own failure calculations on this one, I'm 
> afraid. Just take note that you'll presumably be limited to the speed 
> of your journaling device here. 
Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or 
is this still too slow? Another idea was to use only a ramdisk for the 
journal and backup the files while shutting down to disk and restore 
them after boot. 

> Given that Ceph is going to be doing its own replication, though, I 
> wouldn't want to add in another whole layer of replication with raid10 
> — do you really want to multiply your storage requirements by another 
> factor of two? 
OK correct bad idea. 

>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid? 
> I would use the hardware controller over btrfs raid for now; it allows 
> more flexibility in eg switching to xfs. :) 
OK but overall you would recommand running one osd per disk right? So 
instead of using a Raid 6 with for example 10 disks you would run 6 osds 
on this machine? 

>> Use single socket Xeon for the OSDs or Dual Socket? 
> Dual socket servers will be overkill given the setup you're 
> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD 
> daemon. You might consider it if you decided you wanted to do an OSD 
> per disk instead (that's a more common configuration, but it requires 
> more CPU and RAM per disk and we don't know yet which is the better 
> choice). 
Is there also a rule of thumb for the memory? 

My biggest problem with ceph right now is the awful slow speed while 
doing random reads and writes. 

Sequential read and writes are at 200Mb/s (that's pretty good for bonded 
dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s 
which is def. too slow. 

Stefan 



-- 

-- 




	Alexandre Derumier
Ingénieur Système 
Fixe : 03 20 68 88 90 
Fax : 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 
	


* Re: Designing a cluster guide
  2012-05-19 16:15       ` Alexandre DERUMIER
@ 2012-05-20  7:56         ` Stefan Priebe
  2012-05-20  8:13           ` Alexandre DERUMIER
  2012-05-20  8:19           ` Christian Brunner
  2012-05-21 15:07         ` Tomasz Paszkowski
  1 sibling, 2 replies; 53+ messages in thread
From: Stefan Priebe @ 2012-05-20  7:56 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Gregory Farnum

Am 19.05.2012 18:15, schrieb Alexandre DERUMIER:
> Hi,
>
> For your journal , if you have money, you can use
>
> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
> I'm using them with zfs san, they rocks for journal.
> http://www.stec-inc.com/product/zeusram.php
>
> another interessesting product is ddrdrive
> http://www.ddrdrive.com/

Great products, but really expensive. The question is whether we really
need this in the case of an RBD block device.

Stefan


* Re: Designing a cluster guide
  2012-05-20  7:56         ` Stefan Priebe
@ 2012-05-20  8:13           ` Alexandre DERUMIER
  2012-05-20  8:19           ` Christian Brunner
  1 sibling, 0 replies; 53+ messages in thread
From: Alexandre DERUMIER @ 2012-05-20  8:13 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel, Gregory Farnum

I think that depends on how much random write IO you have and what latency you can accept.

(The purpose of the journal is to take random IO and then flush it sequentially to slow storage.)

Maybe some slower SSDs will fill your needs
(just be careful about performance degradation over time, TRIM, ...).




----- Original Message -----

From: "Stefan Priebe" <s.priebe@profihost.ag>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: ceph-devel@vger.kernel.org, "Gregory Farnum" <greg@inktank.com>
Sent: Sunday, 20 May 2012 09:56:21
Subject: Re: Designing a cluster guide

Am 19.05.2012 18:15, schrieb Alexandre DERUMIER: 
> Hi, 
> 
> For your journal , if you have money, you can use 
> 
> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block). 
> I'm using them with zfs san, they rocks for journal. 
> http://www.stec-inc.com/product/zeusram.php 
> 
> another interessesting product is ddrdrive 
> http://www.ddrdrive.com/ 

Great products but really expensive. The question is do we really need 
this in case of rbd block device. 

Stefan 



-- 

-- 




	Alexandre Derumier
Ingénieur Système 
Fixe : 03 20 68 88 90 
Fax : 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 
	


* Re: Designing a cluster guide
  2012-05-20  7:56         ` Stefan Priebe
  2012-05-20  8:13           ` Alexandre DERUMIER
@ 2012-05-20  8:19           ` Christian Brunner
  2012-05-20  8:27             ` Stefan Priebe
  2012-05-20  8:56             ` Tim O'Donovan
  1 sibling, 2 replies; 53+ messages in thread
From: Christian Brunner @ 2012-05-20  8:19 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Alexandre DERUMIER, ceph-devel, Gregory Farnum

2012/5/20 Stefan Priebe <s.priebe@profihost.ag>:
> Am 19.05.2012 18:15, schrieb Alexandre DERUMIER:
>
>> Hi,
>>
>> For your journal , if you have money, you can use
>>
>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with
>> 4k block).
>> I'm using them with zfs san, they rocks for journal.
>> http://www.stec-inc.com/product/zeusram.php
>>
>> another interessesting product is ddrdrive
>> http://www.ddrdrive.com/
>
>
> Great products but really expensive. The question is do we really need this
> in case of rbd block device.

I think it depends on what you are planning to do. I was calculating
different storage types for our cloud solution lately. I think that
there are three different types that make sense (at least for us):

- Cheap Object Storage (S3):

  Many 3.5'' SATA drives for the storage (probably in a RAID config)
  A small and cheap SSD for the journal

- Basic Block Storage (RBD):

  Many 2.5'' SATA drives for the storage (RAID10 and/or multiple OSDs)
  Small MaxIOPS SSDs for each OSD journal

- High performance Block Storage (RBD):

  Many large SATA SSDs for the storage (probably in a RAID5 config)
  A STEC ZeusRAM SSD drive for the journal

Regards,
Christian


* Re: Designing a cluster guide
  2012-05-20  8:19           ` Christian Brunner
@ 2012-05-20  8:27             ` Stefan Priebe
  2012-05-20  8:31               ` Christian Brunner
  2012-05-20  8:56             ` Tim O'Donovan
  1 sibling, 1 reply; 53+ messages in thread
From: Stefan Priebe @ 2012-05-20  8:27 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Alexandre DERUMIER, ceph-devel, Gregory Farnum

Am 20.05.2012 10:19, schrieb Christian Brunner:
> - Cheap Object Storage (S3):
>
>    Many 3,5'' SATA Drives for the storage (probably in a RAID config)
>    A small and cheap SSD for the journal
>
> - Basic Block Storage (RBD):
>
>    Many 2,5'' SATA Drives for the storage (RAID10 and/or mutliple OSDs)
>    Small MaxIOPS SSDs for each OSD journal
>
> - High performance Block Storage (RBD)
>
>    Many large SATA SSDs for the storage (prbably in a RAID5 config)
>    stec zeusram ssd drive for the journal
That's exactly what I thought too, but then you need a separate ceph/rbd
cluster for each type.

Which will result in a minimum of:
3x mon servers per type
4x osd servers per type
---

so you'll need a minimum of 12x OSD systems and 9x mon systems.

Regards,
Stefan


* Re: Designing a cluster guide
  2012-05-20  8:27             ` Stefan Priebe
@ 2012-05-20  8:31               ` Christian Brunner
  2012-05-21  8:22                 ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 53+ messages in thread
From: Christian Brunner @ 2012-05-20  8:31 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Alexandre DERUMIER, ceph-devel, Gregory Farnum

2012/5/20 Stefan Priebe <s.priebe@profihost.ag>:
> Am 20.05.2012 10:19, schrieb Christian Brunner:
>
>> - Cheap Object Storage (S3):
>>
>>   Many 3,5'' SATA Drives for the storage (probably in a RAID config)
>>   A small and cheap SSD for the journal
>>
>> - Basic Block Storage (RBD):
>>
>>   Many 2,5'' SATA Drives for the storage (RAID10 and/or mutliple OSDs)
>>   Small MaxIOPS SSDs for each OSD journal
>>
>> - High performance Block Storage (RBD)
>>
>>   Many large SATA SSDs for the storage (prbably in a RAID5 config)
>>   stec zeusram ssd drive for the journal
>
> That's exactly what i thought too but then you need a seperate ceph / rbd
> cluster for each type.
>
> Which will result in a minimum of:
> 3x mon servers per type
> 4x osd servers per type
> ---
>
> so you'll need a minimum of 12x osd systems and 9x mon systems.

You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.

Christian


* Re: Designing a cluster guide
  2012-05-20  8:19           ` Christian Brunner
  2012-05-20  8:27             ` Stefan Priebe
@ 2012-05-20  8:56             ` Tim O'Donovan
  2012-05-20  9:24               ` Stefan Priebe
  2012-05-21 14:59               ` Christian Brunner
  1 sibling, 2 replies; 53+ messages in thread
From: Tim O'Donovan @ 2012-05-20  8:56 UTC (permalink / raw)
  To: ceph-devel

> - High performance Block Storage (RBD)
> 
>   Many large SATA SSDs for the storage (prbably in a RAID5 config)
>   stec zeusram ssd drive for the journal

How do you think standard SATA disks would perform in comparison to
this, and is a separate journaling device really necessary?

Perhaps three servers, each with 12 x 1TB SATA disks configured in
RAID10, an osd on each server and three separate mon servers.

Would this be suitable, performance-wise, as the storage backend for a
small OpenStack cloud, for instance?


Regards,
Tim O'Donovan


* Re: Designing a cluster guide
  2012-05-20  8:56             ` Tim O'Donovan
@ 2012-05-20  9:24               ` Stefan Priebe
  2012-05-20  9:46                 ` Tim O'Donovan
  2012-05-21 14:59               ` Christian Brunner
  1 sibling, 1 reply; 53+ messages in thread
From: Stefan Priebe @ 2012-05-20  9:24 UTC (permalink / raw)
  To: Tim O'Donovan; +Cc: ceph-devel

Am 20.05.2012 um 10:56 schrieb Tim O'Donovan <tim@icukhosting.co.uk>:

>> - High performance Block Storage (RBD)
>> 
>>  Many large SATA SSDs for the storage (prbably in a RAID5 config)
>>  stec zeusram ssd drive for the journal
> 
> How do you think standard SATA disks would perform in comparison to
> this, and is a separate journaling device really necessary?
> 
> Perhaps three servers, each with 12 x 1TB SATA disks configured in
> RAID10, an osd on each server and three separate mon servers.
> 
He's talking about SSDs, not normal SATA disks.

Stefan


* Re: Designing a cluster guide
  2012-05-20  9:24               ` Stefan Priebe
@ 2012-05-20  9:46                 ` Tim O'Donovan
  2012-05-20  9:49                   ` Stefan Priebe
  0 siblings, 1 reply; 53+ messages in thread
From: Tim O'Donovan @ 2012-05-20  9:46 UTC (permalink / raw)
  To: ceph-devel

> He's talking about ssd's not normal sata disks.

I realise that. I'm looking for similar advice and have been following
this thread. It didn't seem off topic to ask here.


Regards,
Tim O'Donovan


* Re: Designing a cluster guide
  2012-05-20  9:46                 ` Tim O'Donovan
@ 2012-05-20  9:49                   ` Stefan Priebe
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Priebe @ 2012-05-20  9:49 UTC (permalink / raw)
  To: Tim O'Donovan; +Cc: ceph-devel

No, sorry, I just wanted to clarify, as you quoted the SSD part.

Stefan

Am 20.05.2012 um 11:46 schrieb Tim O'Donovan <tim@icukhosting.co.uk>:

>> He's talking about ssd's not normal sata disks.
> 
> I realise that. I'm looking for similar advice and have been following
> this thread. It didn't seem off topic to ask here.
> 
> 
> Regards,
> Tim O'Donovan


* Re: Designing a cluster guide
  2012-05-20  8:31               ` Christian Brunner
@ 2012-05-21  8:22                 ` Stefan Priebe - Profihost AG
  2012-05-21 15:03                   ` Christian Brunner
  0 siblings, 1 reply; 53+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-21  8:22 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Alexandre DERUMIER, ceph-devel, Gregory Farnum

Am 20.05.2012 10:31, schrieb Christian Brunner:
>> That's exactly what i thought too but then you need a seperate ceph / rbd
>> cluster for each type.
>>
>> Which will result in a minimum of:
>> 3x mon servers per type
>> 4x osd servers per type
>> ---
>>
>> so you'll need a minimum of 12x osd systems and 9x mon systems.
> 
> You can arrange the storage types in different pools, so that you
> don't need separate mon servers (this can be done by adjusting the
> crushmap) and you could even run multiple OSDs per server.
That sounds great. Can you give me a hint on how to set up pools? Right now
I have data, metadata and rbd => the default pools. But I wasn't able to
find any page in the wiki which describes how to set up pools.

Thanks,
Stefan


* Re: Designing a cluster guide
  2012-05-20  8:56             ` Tim O'Donovan
  2012-05-20  9:24               ` Stefan Priebe
@ 2012-05-21 14:59               ` Christian Brunner
  2012-05-21 15:05                 ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 53+ messages in thread
From: Christian Brunner @ 2012-05-21 14:59 UTC (permalink / raw)
  To: Tim O'Donovan; +Cc: ceph-devel

2012/5/20 Tim O'Donovan <tim@icukhosting.co.uk>:
>> - High performance Block Storage (RBD)
>>
>>   Many large SATA SSDs for the storage (prbably in a RAID5 config)
>>   stec zeusram ssd drive for the journal
>
> How do you think standard SATA disks would perform in comparison to
> this, and is a separate journaling device really necessary?

A journaling device improves write latency a lot, and write latency is
directly related to the throughput you get in your virtual machine. If
you have a RAID controller with a battery-backed write cache, you could
try to put the journal on a separate, small partition of your SATA disk.
I haven't tried this, but I think this could work.
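
In ceph.conf that would just be something like (the partition name is a
placeholder; I believe the journal can point straight at a block device):

[osd.0]
        osd journal = /dev/sda5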

Apart from that, you should calculate the sum of the IOPS your guests
generate. In the end everything has to be written to your backend
storage, and it has to be able to deliver those IOPS.

With the journal you might be able to compensate for short write peaks,
and there might be a gain from merging write requests on the OSDs, but for
a solid sizing I would neglect this. Read requests can be delivered from
the OSDs' cache (RAM), but again this will probably give you only a
small gain.

For a single SATA disk you can count on 100-150 IOPS (depending
on the speed of the disk). SSDs can deliver much higher IOPS values.

> Perhaps three servers, each with 12 x 1TB SATA disks configured in
> RAID10, an osd on each server and three separate mon servers.

With a replication level of two this would be 1350 IOPS:

150 IOPS per disk * 12 disks * 3 servers / 2 for the RAID10 / 2 for
ceph replication

Comments on this formula would be welcome...

> Would this be suitable for the storage backend for a small OpenStack
> cloud, performance wise, for instance?

That depends on what you are doing in your guests.

Regards,
Christian


* Re: Designing a cluster guide
  2012-05-21  8:22                 ` Stefan Priebe - Profihost AG
@ 2012-05-21 15:03                   ` Christian Brunner
  0 siblings, 0 replies; 53+ messages in thread
From: Christian Brunner @ 2012-05-21 15:03 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Alexandre DERUMIER, ceph-devel, Gregory Farnum

2012/5/21 Stefan Priebe - Profihost AG <s.priebe@profihost.ag>:
> Am 20.05.2012 10:31, schrieb Christian Brunner:
>>> That's exactly what i thought too but then you need a seperate ceph / rbd
>>> cluster for each type.
>>>
>>> Which will result in a minimum of:
>>> 3x mon servers per type
>>> 4x osd servers per type
>>> ---
>>>
>>> so you'll need a minimum of 12x osd systems and 9x mon systems.
>>
>> You can arrange the storage types in different pools, so that you
>> don't need separate mon servers (this can be done by adjusting the
>> crushmap) and you could even run multiple OSDs per server.
> That sounds great. Can you give me a hint how to setup pools? Right now
> i have data, metadata and rbd => the default pools. But i wasn't able to
> find any page in the wiki which described how to setup pools.

rados mkpool <pool-name> [123[ 4]]     create pool <pool-name>'
                                    [with auid 123[and using crush rule 4]]
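
For example (pool name and rule id are made up; the rule itself has to exist
in your crushmap first, and the second command is from memory, so check it
against your version):

rados mkpool fast-rbd
ceph osd pool set fast-rbd crush_ruleset 4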

Christian


* Re: Designing a cluster guide
  2012-05-21 14:59               ` Christian Brunner
@ 2012-05-21 15:05                 ` Stefan Priebe - Profihost AG
  2012-05-21 15:12                   ` Tomasz Paszkowski
  0 siblings, 1 reply; 53+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-21 15:05 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Tim O'Donovan, ceph-devel

Am 21.05.2012 16:59, schrieb Christian Brunner:
> Apart from that you should calculate the sum of the IOPS your guests
> genereate. In the end everything has to be written on your backend
> storage and is has to be able to deliver the IOPS.
How do I measure the IOPS of an existing physical (dedicated) system?

Stefan


* Re: Designing a cluster guide
  2012-05-19 16:15       ` Alexandre DERUMIER
  2012-05-20  7:56         ` Stefan Priebe
@ 2012-05-21 15:07         ` Tomasz Paszkowski
  2012-05-21 21:22           ` Sławomir Skowron
  1 sibling, 1 reply; 53+ messages in thread
From: Tomasz Paszkowski @ 2012-05-21 15:07 UTC (permalink / raw)
  To: ceph-devel

Another great thing that should be mentioned is:
https://github.com/facebook/flashcache/. It gives really huge
performance improvements for reads/writes (especially on FusionIO
drives) even without using librbd caching :-)



On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
> Hi,
>
> For your journal , if you have money, you can use
>
> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
> I'm using them with zfs san, they rocks for journal.
> http://www.stec-inc.com/product/zeusram.php
>
> another interessesting product is ddrdrive
> http://www.ddrdrive.com/
>
> ----- Original Message -----
>
> From: "Stefan Priebe" <s.priebe@profihost.ag>
> To: "Gregory Farnum" <greg@inktank.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Saturday, 19 May 2012 10:37:01
> Subject: Re: Designing a cluster guide
>
> Hi Greg,
>
> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>> Is multi core or more speed important?
>> Right now, it's primarily the speed of a single core. The MDS is
>> highly threaded but doing most things requires grabbing a big lock.
>> How fast is a qualitative rather than quantitative assessment at this
>> point, though.
> So would you recommand a fast (more ghz) Core i3 instead of a single
> xeon for this system? (price per ghz is better).
>
>> It depends on what your nodes look like, and what sort of cluster
>> you're running. The monitors are pretty lightweight, but they will add
>> *some* load. More important is their disk access patterns — they have
>> to do a lot of syncs. So if they're sharing a machine with some other
>> daemon you want them to have an independent disk and to be running a
>> new kernel&glibc so that they can use syncfs rather than sync. (The
>> only distribution I know for sure does this is Ubuntu 12.04.)
> Which kernel and which glibc version supports this? I have searched
> google but haven't found an exact version. We're using debian lenny
> squeeze with a custom kernel.
>
>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>> and you should go for 22x SSD Disks in a Raid 6?
>> You'll need to do your own failure calculations on this one, I'm
>> afraid. Just take note that you'll presumably be limited to the speed
>> of your journaling device here.
> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
> is this still too slow? Another idea was to use only a ramdisk for the
> journal and backup the files while shutting down to disk and restore
> them after boot.
>
>> Given that Ceph is going to be doing its own replication, though, I
>> wouldn't want to add in another whole layer of replication with raid10
>> — do you really want to multiply your storage requirements by another
>> factor of two?
> OK correct bad idea.
>
>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>> I would use the hardware controller over btrfs raid for now; it allows
>> more flexibility in eg switching to xfs. :)
> OK but overall you would recommand running one osd per disk right? So
> instead of using a Raid 6 with for example 10 disks you would run 6 osds
> on this machine?
>
>>> Use single socket Xeon for the OSDs or Dual Socket?
>> Dual socket servers will be overkill given the setup you're
>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>> daemon. You might consider it if you decided you wanted to do an OSD
>> per disk instead (that's a more common configuration, but it requires
>> more CPU and RAM per disk and we don't know yet which is the better
>> choice).
> Is there also a rule of thumb for the memory?
>
> My biggest problem with ceph right now is the awful slow speed while
> doing random reads and writes.
>
> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
> which is def. too slow.
>
> Stefan
>
>
>
> --
>
> --
>
>
>
>
>        Alexandre D erumier
> Ingénieur Système
> Fixe : 03 20 68 88 90
> Fax : 03 20 68 90 81
> 45 Bvd du Général Leclerc 59100 Roubaix - France
> 12 rue Marivaux 75002 Paris - France
>



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299


* Re: Designing a cluster guide
  2012-05-21 15:05                 ` Stefan Priebe - Profihost AG
@ 2012-05-21 15:12                   ` Tomasz Paszkowski
       [not found]                     ` <CANT588uxL7jrf1BfowUeer_AnDTfGjzkWVFhS4aNMaMSst_jyA@mail.gmail.com>
  2012-05-21 20:11                     ` Stefan Priebe
  0 siblings, 2 replies; 53+ messages in thread
From: Tomasz Paszkowski @ 2012-05-21 15:12 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Christian Brunner, Tim O'Donovan, ceph-devel

If you're using Qemu/KVM you can use the 'info blockstats' monitor
command for measuring I/O on a particular VM.
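
For libvirt-managed guests you can reach the same monitor command from the
shell (the domain name is a placeholder):

virsh qemu-monitor-command mydomain --hmp "info blockstats"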


On Mon, May 21, 2012 at 5:05 PM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Am 21.05.2012 16:59, schrieb Christian Brunner:
>> Apart from that you should calculate the sum of the IOPS your guests
>> genereate. In the end everything has to be written on your backend
>> storage and is has to be able to deliver the IOPS.
> How to measure the IOPs of a dedicated actual system?
>
> Stefan



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299


* Re: Designing a cluster guide
       [not found]                     ` <CANT588uxL7jrf1BfowUeer_AnDTfGjzkWVFhS4aNMaMSst_jyA@mail.gmail.com>
@ 2012-05-21 15:36                       ` Tomasz Paszkowski
  2012-05-21 18:15                         ` Damien Churchill
  0 siblings, 1 reply; 53+ messages in thread
From: Tomasz Paszkowski @ 2012-05-21 15:36 UTC (permalink / raw)
  To: ceph-devel

The project is indeed very interesting, but it requires patching the
kernel source. For me, using an LKM is safer ;)


On Mon, May 21, 2012 at 5:30 PM, Kiran Patil <kirantpatil@gmail.com> wrote:
> Hello,
>
> Has someone looked into bcache (http://bcache.evilpiepirate.org/) ?
>
> It seems, it is superior to flashcache.
>
> Lwn.net article: https://lwn.net/Articles/497024/
>
> Mailing list: http://news.gmane.org/gmane.linux.kernel.bcache.devel
>
> Source code: http://evilpiepirate.org/cgi-bin/cgit.cgi/linux-bcache.git/
>
> Thanks,
> Kiran Patil.
>
>
> On Mon, May 21, 2012 at 8:42 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
>>
>> If you're using Qemu/KVM you can use 'info blockstats' command for
>> measruing I/O on particular VM.
>>
>>
>> On Mon, May 21, 2012 at 5:05 PM, Stefan Priebe - Profihost AG
>> <s.priebe@profihost.ag> wrote:
>> > Am 21.05.2012 16:59, schrieb Christian Brunner:
>> >> Apart from that you should calculate the sum of the IOPS your guests
>> >> genereate. In the end everything has to be written on your backend
>> >> storage and is has to be able to deliver the IOPS.
>> > How to measure the IOPs of a dedicated actual system?
>> >
>> > Stefan
>>
>>
>>
>> --
>> Tomasz Paszkowski
>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>> +48500166299
>
>



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299


* Re: Designing a cluster guide
  2012-05-19  8:37     ` Stefan Priebe
  2012-05-19 16:15       ` Alexandre DERUMIER
@ 2012-05-21 18:13       ` Gregory Farnum
  2012-05-22  6:20         ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 53+ messages in thread
From: Gregory Farnum @ 2012-05-21 18:13 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel

On Sat, May 19, 2012 at 1:37 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Hi Greg,
>
> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>
>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>> Is multi core or more speed important?
>>
>> Right now, it's primarily the speed of a single core. The MDS is
>> highly threaded but doing most things requires grabbing a big lock.
>> How fast is a qualitative rather than quantitative assessment at this
>> point, though.
>
> So would you recommand a fast (more ghz) Core i3 instead of a single xeon
> for this system? (price per ghz is better).

If that's all the MDS is doing there, probably? (It would also depend
on cache sizes and things; I don't have a good sense for how that
impacts the MDS' performance.)

>> It depends on what your nodes look like, and what sort of cluster
>> you're running. The monitors are pretty lightweight, but they will add
>> *some* load. More important is their disk access patterns — they have
>> to do a lot of syncs. So if they're sharing a machine with some other
>> daemon you want them to have an independent disk and to be running a
>> new kernel&glibc so that they can use syncfs rather than sync. (The
>> only distribution I know for sure does this is Ubuntu 12.04.)
>
> Which kernel and which glibc version supports this? I have searched google
> but haven't found an exact version. We're using debian lenny squeeze with a
> custom kernel.

syncfs is in Linux 2.6.39; I'm not sure about glibc but from a quick
web search it looks like it might have appeared in glibc 2.15?
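
A quick check on a node would be something like (version numbers as above,
so treat them as approximate):

uname -r                  # >= 2.6.39 for the syncfs() syscall
getconf GNU_LIBC_VERSION  # a glibc new enough to ship the syncfs() wrapper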

>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>> and you should go for 22x SSD Disks in a Raid 6?
>>
>> You'll need to do your own failure calculations on this one, I'm
>> afraid. Just take note that you'll presumably be limited to the speed
>> of your journaling device here.
>
> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or is
> this still too slow? Another idea was to use only a ramdisk for the journal
> and backup the files while shutting down to disk and restore them after
> boot.

Well, RAID1 isn't going to make it any faster than just a single
SSD, which is why I pointed that out.
I wouldn't recommend using a ramdisk for the journal — that will
guarantee local data loss in the event the server doesn't shut down
properly, and if it happens to several servers at once you get a good
chance of losing client writes.

>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>
>> I would use the hardware controller over btrfs raid for now; it allows
>> more flexibility in eg switching to xfs. :)
>
> OK but overall you would recommand running one osd per disk right? So
> instead of using a Raid 6 with for example 10 disks you would run 6 osds on
> this machine?
Right now all the production systems I'm involved in are using 1 OSD
per disk, but honestly we don't know if that's the right answer or
not. It's a tradeoff — more OSDs increases cpu and memory requirements
(per storage space) but also localizes failure a bit more.

>>> Use single socket Xeon for the OSDs or Dual Socket?
>>
>> Dual socket servers will be overkill given the setup you're
>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>> daemon. You might consider it if you decided you wanted to do an OSD
>> per disk instead (that's a more common configuration, but it requires
>> more CPU and RAM per disk and we don't know yet which is the better
>> choice).
>
> Is there also a rule of thumb for the memory?
About 200MB per daemon right now, plus however much you want the page
cache to be able to use. :) This might go up a bit during peering, but
under normal operation it shouldn't be more than another couple
hundred MB.

> My biggest problem with ceph right now is the awful slow speed while doing
> random reads and writes.
>
> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s which is
> def. too slow.
Hmm. I'm not super-familiar with where our random IO performance is right
now (and lots of other people seem to have advice on journaling
devices :), but that's about in line with what you get from a hard
disk normally. Unless you've designed your application very carefully
(lots and lots of parallel IO), an individual client doing synchronous
random IO is unlikely to be able to get much faster than a regular
drive.
-Greg


* Re: Designing a cluster guide
  2012-05-21 15:36                       ` Tomasz Paszkowski
@ 2012-05-21 18:15                         ` Damien Churchill
  0 siblings, 0 replies; 53+ messages in thread
From: Damien Churchill @ 2012-05-21 18:15 UTC (permalink / raw)
  To: Tomasz Paszkowski; +Cc: ceph-devel

On 21 May 2012 16:36, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
> Project is indeed very interesting, but requires to patch a kernel
> source. For me using lkm is safer ;)
>

I believe bcache is actually in the process of being mainlined and
moved to a device mapper target, although I could be wrong about one or
more of those things.


* Re: Designing a cluster guide
  2012-05-21 15:12                   ` Tomasz Paszkowski
       [not found]                     ` <CANT588uxL7jrf1BfowUeer_AnDTfGjzkWVFhS4aNMaMSst_jyA@mail.gmail.com>
@ 2012-05-21 20:11                     ` Stefan Priebe
  2012-05-21 20:13                       ` Tomasz Paszkowski
  1 sibling, 1 reply; 53+ messages in thread
From: Stefan Priebe @ 2012-05-21 20:11 UTC (permalink / raw)
  To: Tomasz Paszkowski; +Cc: Christian Brunner, Tim O'Donovan, ceph-devel

Am 21.05.2012 17:12, schrieb Tomasz Paszkowski:
> If you're using Qemu/KVM you can use 'info blockstats' command for
> measruing I/O on particular VM.

I want to migrate physical servers to KVM. Any idea how to measure their I/O?

Stefan


* Re: Designing a cluster guide
  2012-05-21 20:11                     ` Stefan Priebe
@ 2012-05-21 20:13                       ` Tomasz Paszkowski
  2012-05-21 20:14                         ` Stefan Priebe
  0 siblings, 1 reply; 53+ messages in thread
From: Tomasz Paszkowski @ 2012-05-21 20:13 UTC (permalink / raw)
  To: ceph-devel

Just to clarify: you'd like to measure I/O on those systems which are
currently running on physical machines?


On Mon, May 21, 2012 at 10:11 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Am 21.05.2012 17:12, schrieb Tomasz Paszkowski:
>
>> If you're using Qemu/KVM you can use 'info blockstats' command for
>> measruing I/O on particular VM.
>
>
> I want to migrate physical servers to KVM. Any idea for that?
>
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-21 20:13                       ` Tomasz Paszkowski
@ 2012-05-21 20:14                         ` Stefan Priebe
  2012-05-21 20:19                           ` Tomasz Paszkowski
  0 siblings, 1 reply; 53+ messages in thread
From: Stefan Priebe @ 2012-05-21 20:14 UTC (permalink / raw)
  To: Tomasz Paszkowski; +Cc: ceph-devel

Am 21.05.2012 22:13, schrieb Tomasz Paszkowski:
> Just to clarify. You'd like to measure I/O on those system which are
> currently running on physical machines ?
IOPS, not just I/O.

Stefan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-21 20:14                         ` Stefan Priebe
@ 2012-05-21 20:19                           ` Tomasz Paszkowski
  0 siblings, 0 replies; 53+ messages in thread
From: Tomasz Paszkowski @ 2012-05-21 20:19 UTC (permalink / raw)
  To: ceph-devel

On Linux boxes you can take the output of iostat -x /dev/sda and feed
it into any monitoring system like zabbix or cacti :-)
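
A rough sketch of that, assuming the sysstat package is installed and sda
is the disk behind the workload you want to size:

#~ iostat -dxk 5 /dev/sda

The r/s and w/s columns are the read and write IOPS per interval; graph
those (or just note their peaks) in zabbix/cacti to get the numbers you
need for sizing the Ceph cluster.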


On Mon, May 21, 2012 at 10:14 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Am 21.05.2012 22:13, schrieb Tomasz Paszkowski:
>
>> Just to clarify. You'd like to measure I/O on those system which are
>> currently running on physical machines ?
>
> IOPs not just I/O.
>
> Stefan



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-21 15:07         ` Tomasz Paszkowski
@ 2012-05-21 21:22           ` Sławomir Skowron
  2012-05-21 23:52             ` Quenten Grasso
  2012-05-22  6:30             ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 53+ messages in thread
From: Sławomir Skowron @ 2012-05-21 21:22 UTC (permalink / raw)
  To: ceph-devel; +Cc: Tomasz Paszkowski

Maybe a good option for the journal would be two cheap MLC Intel drives on
SandForce (320/520), 120GB or 240GB, with the HPA reduced to 20-30GB, used
only for separate journaling partitions with hardware RAID1.

I'd like to test a setup like this, but maybe someone has some real-life info??
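
A minimal ceph.conf sketch of what such a journal setup could look like
(the /dev/md0p1 device path and the size are assumptions, not a tested
configuration):

[osd.0]
        ; journal on a small partition of the SSD RAID1,
        ; data stays on the normal OSD filesystem
        osd journal = /dev/md0p1
        osd journal size = 10240    ; in MB, i.e. a 10GB journal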

On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
> Another great thing that should be mentioned is:
> https://github.com/facebook/flashcache/. It gives really huge
> performance improvements for reads/writes (especialy on FunsionIO
> drives) event without using librbd caching :-)
>
>
>
> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>> Hi,
>>
>> For your journal , if you have money, you can use
>>
>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
>> I'm using them with zfs san, they rocks for journal.
>> http://www.stec-inc.com/product/zeusram.php
>>
>> another interessesting product is ddrdrive
>> http://www.ddrdrive.com/
>>
>> ----- Mail original -----
>>
>> De: "Stefan Priebe" <s.priebe@profihost.ag>
>> À: "Gregory Farnum" <greg@inktank.com>
>> Cc: ceph-devel@vger.kernel.org
>> Envoyé: Samedi 19 Mai 2012 10:37:01
>> Objet: Re: Designing a cluster guide
>>
>> Hi Greg,
>>
>> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>>> Is multi core or more speed important?
>>> Right now, it's primarily the speed of a single core. The MDS is
>>> highly threaded but doing most things requires grabbing a big lock.
>>> How fast is a qualitative rather than quantitative assessment at this
>>> point, though.
>> So would you recommand a fast (more ghz) Core i3 instead of a single
>> xeon for this system? (price per ghz is better).
>>
>>> It depends on what your nodes look like, and what sort of cluster
>>> you're running. The monitors are pretty lightweight, but they will add
>>> *some* load. More important is their disk access patterns — they have
>>> to do a lot of syncs. So if they're sharing a machine with some other
>>> daemon you want them to have an independent disk and to be running a
>>> new kernel&glibc so that they can use syncfs rather than sync. (The
>>> only distribution I know for sure does this is Ubuntu 12.04.)
>> Which kernel and which glibc version supports this? I have searched
>> google but haven't found an exact version. We're using debian lenny
>> squeeze with a custom kernel.
>>
>>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>>> and you should go for 22x SSD Disks in a Raid 6?
>>> You'll need to do your own failure calculations on this one, I'm
>>> afraid. Just take note that you'll presumably be limited to the speed
>>> of your journaling device here.
>> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
>> is this still too slow? Another idea was to use only a ramdisk for the
>> journal and backup the files while shutting down to disk and restore
>> them after boot.
>>
>>> Given that Ceph is going to be doing its own replication, though, I
>>> wouldn't want to add in another whole layer of replication with raid10
>>> — do you really want to multiply your storage requirements by another
>>> factor of two?
>> OK correct bad idea.
>>
>>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>> I would use the hardware controller over btrfs raid for now; it allows
>>> more flexibility in eg switching to xfs. :)
>> OK but overall you would recommand running one osd per disk right? So
>> instead of using a Raid 6 with for example 10 disks you would run 6 osds
>> on this machine?
>>
>>>> Use single socket Xeon for the OSDs or Dual Socket?
>>> Dual socket servers will be overkill given the setup you're
>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>> daemon. You might consider it if you decided you wanted to do an OSD
>>> per disk instead (that's a more common configuration, but it requires
>>> more CPU and RAM per disk and we don't know yet which is the better
>>> choice).
>> Is there also a rule of thumb for the memory?
>>
>> My biggest problem with ceph right now is the awful slow speed while
>> doing random reads and writes.
>>
>> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
>> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
>> which is def. too slow.
>>
>> Stefan
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>>
>> --
>>
>>
>>
>>
>>        Alexandre D erumier
>> Ingénieur Système
>> Fixe : 03 20 68 88 90
>> Fax : 03 20 68 90 81
>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>> 12 rue Marivaux 75002 Paris - France
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Tomasz Paszkowski
> SS7, Asterisk, SAN, Datacenter, Cloud Computing
> +48500166299
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
-----
Pozdrawiam

Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Designing a cluster guide
  2012-05-21 21:22           ` Sławomir Skowron
@ 2012-05-21 23:52             ` Quenten Grasso
  2012-05-22  0:30               ` Gregory Farnum
  2012-05-22  6:30             ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 53+ messages in thread
From: Quenten Grasso @ 2012-05-21 23:52 UTC (permalink / raw)
  To: ceph-devel

Hi All,


I've been thinking about this issue myself for the past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB disks
in RAID 10 inside a 2U server, with JBODs attached to the server for the actual storage.

Can someone help clarify which of these is the case:

Is the data written to the journal disk, then read back from the journal disk and written to the storage disk, and the write only considered successful by the client once all of that completes?
Or
Is the write considered successful by the client as soon as the data is written to the journal disk?
Or
Is the data written to the journal disk and to the storage disk at the same time, with the write considered successful by the client once both complete? (If this is the case, SSDs may not be so useful.)


Pros
Quite fast write throughput to the journal disks,
No write wear-out of SSDs,
RAID 10 with a 1GB cache controller also helps improve things (if really keen you could use CacheCade as well)


Cons
Not as fast as SSDs,
More rackspace required per server.


Regards,
Quenten

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Slawomir Skowron
Sent: Tuesday, 22 May 2012 7:22 AM
To: ceph-devel@vger.kernel.org
Cc: Tomasz Paszkowski
Subject: Re: Designing a cluster guide

Maybe good for journal will be two cheap MLC Intel drives on Sandforce
(320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
separate journaling partitions with hardware RAID1.

I like to test setup like this, but maybe someone have any real life info ??

On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
> Another great thing that should be mentioned is:
> https://github.com/facebook/flashcache/. It gives really huge
> performance improvements for reads/writes (especialy on FunsionIO
> drives) event without using librbd caching :-)
>
>
>
> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>> Hi,
>>
>> For your journal , if you have money, you can use
>>
>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
>> I'm using them with zfs san, they rocks for journal.
>> http://www.stec-inc.com/product/zeusram.php
>>
>> another interessesting product is ddrdrive
>> http://www.ddrdrive.com/
>>
>> ----- Mail original -----
>>
>> De: "Stefan Priebe" <s.priebe@profihost.ag>
>> À: "Gregory Farnum" <greg@inktank.com>
>> Cc: ceph-devel@vger.kernel.org
>> Envoyé: Samedi 19 Mai 2012 10:37:01
>> Objet: Re: Designing a cluster guide
>>
>> Hi Greg,
>>
>> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>>> Is multi core or more speed important?
>>> Right now, it's primarily the speed of a single core. The MDS is
>>> highly threaded but doing most things requires grabbing a big lock.
>>> How fast is a qualitative rather than quantitative assessment at this
>>> point, though.
>> So would you recommand a fast (more ghz) Core i3 instead of a single
>> xeon for this system? (price per ghz is better).
>>
>>> It depends on what your nodes look like, and what sort of cluster
>>> you're running. The monitors are pretty lightweight, but they will add
>>> *some* load. More important is their disk access patterns — they have
>>> to do a lot of syncs. So if they're sharing a machine with some other
>>> daemon you want them to have an independent disk and to be running a
>>> new kernel&glibc so that they can use syncfs rather than sync. (The
>>> only distribution I know for sure does this is Ubuntu 12.04.)
>> Which kernel and which glibc version supports this? I have searched
>> google but haven't found an exact version. We're using debian lenny
>> squeeze with a custom kernel.
>>
>>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>>> and you should go for 22x SSD Disks in a Raid 6?
>>> You'll need to do your own failure calculations on this one, I'm
>>> afraid. Just take note that you'll presumably be limited to the speed
>>> of your journaling device here.
>> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
>> is this still too slow? Another idea was to use only a ramdisk for the
>> journal and backup the files while shutting down to disk and restore
>> them after boot.
>>
>>> Given that Ceph is going to be doing its own replication, though, I
>>> wouldn't want to add in another whole layer of replication with raid10
>>> — do you really want to multiply your storage requirements by another
>>> factor of two?
>> OK correct bad idea.
>>
>>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>> I would use the hardware controller over btrfs raid for now; it allows
>>> more flexibility in eg switching to xfs. :)
>> OK but overall you would recommand running one osd per disk right? So
>> instead of using a Raid 6 with for example 10 disks you would run 6 osds
>> on this machine?
>>
>>>> Use single socket Xeon for the OSDs or Dual Socket?
>>> Dual socket servers will be overkill given the setup you're
>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>> daemon. You might consider it if you decided you wanted to do an OSD
>>> per disk instead (that's a more common configuration, but it requires
>>> more CPU and RAM per disk and we don't know yet which is the better
>>> choice).
>> Is there also a rule of thumb for the memory?
>>
>> My biggest problem with ceph right now is the awful slow speed while
>> doing random reads and writes.
>>
>> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
>> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
>> which is def. too slow.
>>
>> Stefan
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>>
>> --
>>
>>
>>
>>
>>        Alexandre D erumier
>> Ingénieur Système
>> Fixe : 03 20 68 88 90
>> Fax : 03 20 68 90 81
>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>> 12 rue Marivaux 75002 Paris - France
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Tomasz Paszkowski
> SS7, Asterisk, SAN, Datacenter, Cloud Computing
> +48500166299
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
-----
Pozdrawiam

Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-21 23:52             ` Quenten Grasso
@ 2012-05-22  0:30               ` Gregory Farnum
  2012-05-22  0:42                 ` Quenten Grasso
  2012-05-22  9:04                 ` Jerker Nyberg
  0 siblings, 2 replies; 53+ messages in thread
From: Gregory Farnum @ 2012-05-22  0:30 UTC (permalink / raw)
  To: Quenten Grasso; +Cc: ceph-devel

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso <QGrasso@onq.com.au> wrote:
> Hi All,
>
>
> I've been thinking about this issue myself past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
> in raid 10 inside a 2U Server with JBOD's attached to the server for actual storage.
>
> Can someone help clarify this one,
>
> Once the data is written to the (journal disk) and then read from the (journal disk) then written to the (storage disk) once this is complete this is considered a successful write by the client?
> Or
> Once the data is written to the (journal disk) is this considered successful by the client?
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?
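
The number of OSDs involved is simply the pool's replication factor; for
example (exact command syntax varies a bit between Ceph releases):

#~ ceph osd pool set rbd size 2

keeps two copies of every object, so a write is reported as safe once both
of those OSDs have it on their disks.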

> Or
> Once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete this is considered a successful write by the client? (if this is the case SSD's may not be so useful)
>
>
> Pros
> Quite fast Write throughput to the journal disks,
> No write wareout of SSD's
> RAID 10 with 1GB Cache Controller also helps improve things (if really keen you could use a cachecade as well)
>
>
> Cons
> Not as fast as SSD's
> More rackspace required per server.
>
>
> Regards,
> Quenten
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Slawomir Skowron
> Sent: Tuesday, 22 May 2012 7:22 AM
> To: ceph-devel@vger.kernel.org
> Cc: Tomasz Paszkowski
> Subject: Re: Designing a cluster guide
>
> Maybe good for journal will be two cheap MLC Intel drives on Sandforce
> (320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
> separate journaling partitions with hardware RAID1.
>
> I like to test setup like this, but maybe someone have any real life info ??
>
> On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
>> Another great thing that should be mentioned is:
>> https://github.com/facebook/flashcache/. It gives really huge
>> performance improvements for reads/writes (especialy on FunsionIO
>> drives) event without using librbd caching :-)
>>
>>
>>
>> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>> Hi,
>>>
>>> For your journal , if you have money, you can use
>>>
>>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
>>> I'm using them with zfs san, they rocks for journal.
>>> http://www.stec-inc.com/product/zeusram.php
>>>
>>> another interessesting product is ddrdrive
>>> http://www.ddrdrive.com/
>>>
>>> ----- Mail original -----
>>>
>>> De: "Stefan Priebe" <s.priebe@profihost.ag>
>>> À: "Gregory Farnum" <greg@inktank.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Envoyé: Samedi 19 Mai 2012 10:37:01
>>> Objet: Re: Designing a cluster guide
>>>
>>> Hi Greg,
>>>
>>> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>>>> Is multi core or more speed important?
>>>> Right now, it's primarily the speed of a single core. The MDS is
>>>> highly threaded but doing most things requires grabbing a big lock.
>>>> How fast is a qualitative rather than quantitative assessment at this
>>>> point, though.
>>> So would you recommand a fast (more ghz) Core i3 instead of a single
>>> xeon for this system? (price per ghz is better).
>>>
>>>> It depends on what your nodes look like, and what sort of cluster
>>>> you're running. The monitors are pretty lightweight, but they will add
>>>> *some* load. More important is their disk access patterns — they have
>>>> to do a lot of syncs. So if they're sharing a machine with some other
>>>> daemon you want them to have an independent disk and to be running a
>>>> new kernel&glibc so that they can use syncfs rather than sync. (The
>>>> only distribution I know for sure does this is Ubuntu 12.04.)
>>> Which kernel and which glibc version supports this? I have searched
>>> google but haven't found an exact version. We're using debian lenny
>>> squeeze with a custom kernel.
>>>
>>>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>>>> and you should go for 22x SSD Disks in a Raid 6?
>>>> You'll need to do your own failure calculations on this one, I'm
>>>> afraid. Just take note that you'll presumably be limited to the speed
>>>> of your journaling device here.
>>> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
>>> is this still too slow? Another idea was to use only a ramdisk for the
>>> journal and backup the files while shutting down to disk and restore
>>> them after boot.
>>>
>>>> Given that Ceph is going to be doing its own replication, though, I
>>>> wouldn't want to add in another whole layer of replication with raid10
>>>> — do you really want to multiply your storage requirements by another
>>>> factor of two?
>>> OK correct bad idea.
>>>
>>>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>>> I would use the hardware controller over btrfs raid for now; it allows
>>>> more flexibility in eg switching to xfs. :)
>>> OK but overall you would recommand running one osd per disk right? So
>>> instead of using a Raid 6 with for example 10 disks you would run 6 osds
>>> on this machine?
>>>
>>>>> Use single socket Xeon for the OSDs or Dual Socket?
>>>> Dual socket servers will be overkill given the setup you're
>>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>>> daemon. You might consider it if you decided you wanted to do an OSD
>>>> per disk instead (that's a more common configuration, but it requires
>>>> more CPU and RAM per disk and we don't know yet which is the better
>>>> choice).
>>> Is there also a rule of thumb for the memory?
>>>
>>> My biggest problem with ceph right now is the awful slow speed while
>>> doing random reads and writes.
>>>
>>> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
>>> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
>>> which is def. too slow.
>>>
>>> Stefan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>>
>>> --
>>>
>>>
>>>
>>>
>>>        Alexandre D erumier
>>> Ingénieur Système
>>> Fixe : 03 20 68 88 90
>>> Fax : 03 20 68 90 81
>>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>>> 12 rue Marivaux 75002 Paris - France
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Tomasz Paszkowski
>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>> +48500166299
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> -----
> Pozdrawiam
>
> Sławek "sZiBis" Skowron
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Designing a cluster guide
  2012-05-22  0:30               ` Gregory Farnum
@ 2012-05-22  0:42                 ` Quenten Grasso
  2012-05-22  0:46                   ` Quenten Grasso
  2012-05-22  9:04                 ` Jerker Nyberg
  1 sibling, 1 reply; 53+ messages in thread
From: Quenten Grasso @ 2012-05-22  0:42 UTC (permalink / raw)
  To: 'Gregory Farnum'; +Cc: ceph-devel

Hi Greg,

I'm only talking about journal disks, not storage. :)



Regards,
Quenten 


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso <QGrasso@onq.com.au> wrote:
> Hi All,
>
>
> I've been thinking about this issue myself past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
> in raid 10 inside a 2U Server with JBOD's attached to the server for actual storage.
>
> Can someone help clarify this one,
>
> Once the data is written to the (journal disk) and then read from the (journal disk) then written to the (storage disk) once this is complete this is considered a successful write by the client?
> Or
> Once the data is written to the (journal disk) is this considered successful by the client?
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?

> Or
> Once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete this is considered a successful write by the client? (if this is the case SSD's may not be so useful)
>
>
> Pros
> Quite fast Write throughput to the journal disks,
> No write wareout of SSD's
> RAID 10 with 1GB Cache Controller also helps improve things (if really keen you could use a cachecade as well)
>
>
> Cons
> Not as fast as SSD's
> More rackspace required per server.
>
>
> Regards,
> Quenten
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Slawomir Skowron
> Sent: Tuesday, 22 May 2012 7:22 AM
> To: ceph-devel@vger.kernel.org
> Cc: Tomasz Paszkowski
> Subject: Re: Designing a cluster guide
>
> Maybe good for journal will be two cheap MLC Intel drives on Sandforce
> (320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
> separate journaling partitions with hardware RAID1.
>
> I like to test setup like this, but maybe someone have any real life info ??
>
> On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
>> Another great thing that should be mentioned is:
>> https://github.com/facebook/flashcache/. It gives really huge
>> performance improvements for reads/writes (especialy on FunsionIO
>> drives) event without using librbd caching :-)
>>
>>
>>
>> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>> Hi,
>>>
>>> For your journal , if you have money, you can use
>>>
>>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
>>> I'm using them with zfs san, they rocks for journal.
>>> http://www.stec-inc.com/product/zeusram.php
>>>
>>> another interessesting product is ddrdrive
>>> http://www.ddrdrive.com/
>>>
>>> ----- Mail original -----
>>>
>>> De: "Stefan Priebe" <s.priebe@profihost.ag>
>>> À: "Gregory Farnum" <greg@inktank.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Envoyé: Samedi 19 Mai 2012 10:37:01
>>> Objet: Re: Designing a cluster guide
>>>
>>> Hi Greg,
>>>
>>> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>>>> Is multi core or more speed important?
>>>> Right now, it's primarily the speed of a single core. The MDS is
>>>> highly threaded but doing most things requires grabbing a big lock.
>>>> How fast is a qualitative rather than quantitative assessment at this
>>>> point, though.
>>> So would you recommand a fast (more ghz) Core i3 instead of a single
>>> xeon for this system? (price per ghz is better).
>>>
>>>> It depends on what your nodes look like, and what sort of cluster
>>>> you're running. The monitors are pretty lightweight, but they will add
>>>> *some* load. More important is their disk access patterns — they have
>>>> to do a lot of syncs. So if they're sharing a machine with some other
>>>> daemon you want them to have an independent disk and to be running a
>>>> new kernel&glibc so that they can use syncfs rather than sync. (The
>>>> only distribution I know for sure does this is Ubuntu 12.04.)
>>> Which kernel and which glibc version supports this? I have searched
>>> google but haven't found an exact version. We're using debian lenny
>>> squeeze with a custom kernel.
>>>
>>>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>>>> and you should go for 22x SSD Disks in a Raid 6?
>>>> You'll need to do your own failure calculations on this one, I'm
>>>> afraid. Just take note that you'll presumably be limited to the speed
>>>> of your journaling device here.
>>> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
>>> is this still too slow? Another idea was to use only a ramdisk for the
>>> journal and backup the files while shutting down to disk and restore
>>> them after boot.
>>>
>>>> Given that Ceph is going to be doing its own replication, though, I
>>>> wouldn't want to add in another whole layer of replication with raid10
>>>> — do you really want to multiply your storage requirements by another
>>>> factor of two?
>>> OK correct bad idea.
>>>
>>>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>>> I would use the hardware controller over btrfs raid for now; it allows
>>>> more flexibility in eg switching to xfs. :)
>>> OK but overall you would recommand running one osd per disk right? So
>>> instead of using a Raid 6 with for example 10 disks you would run 6 osds
>>> on this machine?
>>>
>>>>> Use single socket Xeon for the OSDs or Dual Socket?
>>>> Dual socket servers will be overkill given the setup you're
>>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>>> daemon. You might consider it if you decided you wanted to do an OSD
>>>> per disk instead (that's a more common configuration, but it requires
>>>> more CPU and RAM per disk and we don't know yet which is the better
>>>> choice).
>>> Is there also a rule of thumb for the memory?
>>>
>>> My biggest problem with ceph right now is the awful slow speed while
>>> doing random reads and writes.
>>>
>>> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
>>> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
>>> which is def. too slow.
>>>
>>> Stefan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>>
>>> --
>>>
>>>
>>>
>>>
>>>        Alexandre D erumier
>>> Ingénieur Système
>>> Fixe : 03 20 68 88 90
>>> Fax : 03 20 68 90 81
>>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>>> 12 rue Marivaux 75002 Paris - France
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Tomasz Paszkowski
>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>> +48500166299
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> -----
> Pozdrawiam
>
> Sławek "sZiBis" Skowron
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Designing a cluster guide
  2012-05-22  0:42                 ` Quenten Grasso
@ 2012-05-22  0:46                   ` Quenten Grasso
  2012-05-22  5:51                     ` Sławomir Skowron
  0 siblings, 1 reply; 53+ messages in thread
From: Quenten Grasso @ 2012-05-22  0:46 UTC (permalink / raw)
  To: 'Gregory Farnum'; +Cc: ceph-devel

I should have added: for storage I'm considering something like enterprise nearline SAS 3TB disks, run as individual disks (not RAIDed) with a replication level of 2, as suggested :)


Regards,
Quenten 


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Quenten Grasso
Sent: Tuesday, 22 May 2012 10:43 AM
To: 'Gregory Farnum'
Cc: ceph-devel@vger.kernel.org
Subject: RE: Designing a cluster guide

Hi Greg,

I'm only talking about journal disks not storage. :)



Regards,
Quenten 


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso <QGrasso@onq.com.au> wrote:
> Hi All,
>
>
> I've been thinking about this issue myself past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
> in raid 10 inside a 2U Server with JBOD's attached to the server for actual storage.
>
> Can someone help clarify this one,
>
> Once the data is written to the (journal disk) and then read from the (journal disk) then written to the (storage disk) once this is complete this is considered a successful write by the client?
> Or
> Once the data is written to the (journal disk) is this considered successful by the client?
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?

> Or
> Once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete this is considered a successful write by the client? (if this is the case SSD's may not be so useful)
>
>
> Pros
> Quite fast Write throughput to the journal disks,
> No write wareout of SSD's
> RAID 10 with 1GB Cache Controller also helps improve things (if really keen you could use a cachecade as well)
>
>
> Cons
> Not as fast as SSD's
> More rackspace required per server.
>
>
> Regards,
> Quenten
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Slawomir Skowron
> Sent: Tuesday, 22 May 2012 7:22 AM
> To: ceph-devel@vger.kernel.org
> Cc: Tomasz Paszkowski
> Subject: Re: Designing a cluster guide
>
> Maybe good for journal will be two cheap MLC Intel drives on Sandforce
> (320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
> separate journaling partitions with hardware RAID1.
>
> I like to test setup like this, but maybe someone have any real life info ??
>
> On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
>> Another great thing that should be mentioned is:
>> https://github.com/facebook/flashcache/. It gives really huge
>> performance improvements for reads/writes (especialy on FunsionIO
>> drives) event without using librbd caching :-)
>>
>>
>>
>> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>> Hi,
>>>
>>> For your journal , if you have money, you can use
>>>
>>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
>>> I'm using them with zfs san, they rocks for journal.
>>> http://www.stec-inc.com/product/zeusram.php
>>>
>>> another interessesting product is ddrdrive
>>> http://www.ddrdrive.com/
>>>
>>> ----- Mail original -----
>>>
>>> De: "Stefan Priebe" <s.priebe@profihost.ag>
>>> À: "Gregory Farnum" <greg@inktank.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Envoyé: Samedi 19 Mai 2012 10:37:01
>>> Objet: Re: Designing a cluster guide
>>>
>>> Hi Greg,
>>>
>>> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>>>> Is multi core or more speed important?
>>>> Right now, it's primarily the speed of a single core. The MDS is
>>>> highly threaded but doing most things requires grabbing a big lock.
>>>> How fast is a qualitative rather than quantitative assessment at this
>>>> point, though.
>>> So would you recommand a fast (more ghz) Core i3 instead of a single
>>> xeon for this system? (price per ghz is better).
>>>
>>>> It depends on what your nodes look like, and what sort of cluster
>>>> you're running. The monitors are pretty lightweight, but they will add
>>>> *some* load. More important is their disk access patterns — they have
>>>> to do a lot of syncs. So if they're sharing a machine with some other
>>>> daemon you want them to have an independent disk and to be running a
>>>> new kernel&glibc so that they can use syncfs rather than sync. (The
>>>> only distribution I know for sure does this is Ubuntu 12.04.)
>>> Which kernel and which glibc version supports this? I have searched
>>> google but haven't found an exact version. We're using debian lenny
>>> squeeze with a custom kernel.
>>>
>>>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>>>> and you should go for 22x SSD Disks in a Raid 6?
>>>> You'll need to do your own failure calculations on this one, I'm
>>>> afraid. Just take note that you'll presumably be limited to the speed
>>>> of your journaling device here.
>>> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
>>> is this still too slow? Another idea was to use only a ramdisk for the
>>> journal and backup the files while shutting down to disk and restore
>>> them after boot.
>>>
>>>> Given that Ceph is going to be doing its own replication, though, I
>>>> wouldn't want to add in another whole layer of replication with raid10
>>>> — do you really want to multiply your storage requirements by another
>>>> factor of two?
>>> OK correct bad idea.
>>>
>>>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>>> I would use the hardware controller over btrfs raid for now; it allows
>>>> more flexibility in eg switching to xfs. :)
>>> OK but overall you would recommand running one osd per disk right? So
>>> instead of using a Raid 6 with for example 10 disks you would run 6 osds
>>> on this machine?
>>>
>>>>> Use single socket Xeon for the OSDs or Dual Socket?
>>>> Dual socket servers will be overkill given the setup you're
>>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>>> daemon. You might consider it if you decided you wanted to do an OSD
>>>> per disk instead (that's a more common configuration, but it requires
>>>> more CPU and RAM per disk and we don't know yet which is the better
>>>> choice).
>>> Is there also a rule of thumb for the memory?
>>>
>>> My biggest problem with ceph right now is the awful slow speed while
>>> doing random reads and writes.
>>>
>>> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
>>> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
>>> which is def. too slow.
>>>
>>> Stefan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>>
>>> --
>>>
>>>
>>>
>>>
>>>        Alexandre D erumier
>>> Ingénieur Système
>>> Fixe : 03 20 68 88 90
>>> Fax : 03 20 68 90 81
>>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>>> 12 rue Marivaux 75002 Paris - France
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Tomasz Paszkowski
>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>> +48500166299
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> -----
> Pozdrawiam
>
> Sławek "sZiBis" Skowron
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-22  0:46                   ` Quenten Grasso
@ 2012-05-22  5:51                     ` Sławomir Skowron
  2012-05-29  7:25                       ` Quenten Grasso
  0 siblings, 1 reply; 53+ messages in thread
From: Sławomir Skowron @ 2012-05-22  5:51 UTC (permalink / raw)
  To: Quenten Grasso; +Cc: Gregory Farnum, ceph-devel

I get around 320MB/s from the rbd cluster in a VM on a 3
node cluster, but that is with 10GbE and with 26 2.5" SAS drives used in every
machine, so it's still not everything that it could be.
Every OSD drive is a single-drive RAID0 behind the battery-backed NVRAM
cache of the hardware RAID controller.
Every OSD takes a lot of RAM for caching.

That's why I'm thinking about swapping 2 drives for SSDs in RAID1,
with the HPA tuned down to increase the durability of the journaling drive - but
only if this works ;)

The newest drives can theoretically reach 500MB/s with a long queue
depth. This means that in theory I can improve the bandwidth numbers,
get lower latency, and handle multiple concurrent writes from
many hosts better.
Reads are cached in RAM by the OSD daemon, by the VFS in the kernel, by the
NVRAM in the controller,
and in the near future also by the cache in KVM (I need to test that -
it should improve performance).

But if the SSD drive slows down, it can drag the whole write performance
down. It is very delicate.

Pozdrawiam

iSS

Dnia 22 maj 2012 o godz. 02:47 Quenten Grasso <QGrasso@onq.com.au> napisał(a):

> I Should have added For storage I'm considering something like Enterprise nearline SAS 3TB disks running individual disks not raided with rep level of 2 as suggested :)
>
>
> Regards,
> Quenten
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Quenten Grasso
> Sent: Tuesday, 22 May 2012 10:43 AM
> To: 'Gregory Farnum'
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: Designing a cluster guide
>
> Hi Greg,
>
> I'm only talking about journal disks not storage. :)
>
>
>
> Regards,
> Quenten
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Gregory Farnum
> Sent: Tuesday, 22 May 2012 10:30 AM
> To: Quenten Grasso
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Designing a cluster guide
>
> On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso <QGrasso@onq.com.au> wrote:
>> Hi All,
>>
>>
>> I've been thinking about this issue myself past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
>> in raid 10 inside a 2U Server with JBOD's attached to the server for actual storage.
>>
>> Can someone help clarify this one,
>>
>> Once the data is written to the (journal disk) and then read from the (journal disk) then written to the (storage disk) once this is complete this is considered a successful write by the client?
>> Or
>> Once the data is written to the (journal disk) is this considered successful by the client?
> This one — the write is considered "safe" once it is on-disk on all
> OSDs currently responsible for hosting the object.
>
> Every time anybody mentions RAID10 I have to remind them of the
> storage amplification that entails, though. Are you sure you want that
> on top of (well, underneath, really) Ceph's own replication?
>
>> Or
>> Once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete this is considered a successful write by the client? (if this is the case SSD's may not be so useful)
>>
>>
>> Pros
>> Quite fast Write throughput to the journal disks,
>> No write wareout of SSD's
>> RAID 10 with 1GB Cache Controller also helps improve things (if really keen you could use a cachecade as well)
>>
>>
>> Cons
>> Not as fast as SSD's
>> More rackspace required per server.
>>
>>
>> Regards,
>> Quenten
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Slawomir Skowron
>> Sent: Tuesday, 22 May 2012 7:22 AM
>> To: ceph-devel@vger.kernel.org
>> Cc: Tomasz Paszkowski
>> Subject: Re: Designing a cluster guide
>>
>> Maybe good for journal will be two cheap MLC Intel drives on Sandforce
>> (320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
>> separate journaling partitions with hardware RAID1.
>>
>> I like to test setup like this, but maybe someone have any real life info ??
>>
>> On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
>>> Another great thing that should be mentioned is:
>>> https://github.com/facebook/flashcache/. It gives really huge
>>> performance improvements for reads/writes (especialy on FunsionIO
>>> drives) event without using librbd caching :-)
>>>
>>>
>>>
>>> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>> Hi,
>>>>
>>>> For your journal , if you have money, you can use
>>>>
>>>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
>>>> I'm using them with zfs san, they rocks for journal.
>>>> http://www.stec-inc.com/product/zeusram.php
>>>>
>>>> another interessesting product is ddrdrive
>>>> http://www.ddrdrive.com/
>>>>
>>>> ----- Mail original -----
>>>>
>>>> De: "Stefan Priebe" <s.priebe@profihost.ag>
>>>> À: "Gregory Farnum" <greg@inktank.com>
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Envoyé: Samedi 19 Mai 2012 10:37:01
>>>> Objet: Re: Designing a cluster guide
>>>>
>>>> Hi Greg,
>>>>
>>>> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>>>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>>>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>>>>> Is multi core or more speed important?
>>>>> Right now, it's primarily the speed of a single core. The MDS is
>>>>> highly threaded but doing most things requires grabbing a big lock.
>>>>> How fast is a qualitative rather than quantitative assessment at this
>>>>> point, though.
>>>> So would you recommand a fast (more ghz) Core i3 instead of a single
>>>> xeon for this system? (price per ghz is better).
>>>>
>>>>> It depends on what your nodes look like, and what sort of cluster
>>>>> you're running. The monitors are pretty lightweight, but they will add
>>>>> *some* load. More important is their disk access patterns — they have
>>>>> to do a lot of syncs. So if they're sharing a machine with some other
>>>>> daemon you want them to have an independent disk and to be running a
>>>>> new kernel&glibc so that they can use syncfs rather than sync. (The
>>>>> only distribution I know for sure does this is Ubuntu 12.04.)
>>>> Which kernel and which glibc version supports this? I have searched
>>>> google but haven't found an exact version. We're using debian lenny
>>>> squeeze with a custom kernel.
>>>>
>>>>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>>>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>>>>> and you should go for 22x SSD Disks in a Raid 6?
>>>>> You'll need to do your own failure calculations on this one, I'm
>>>>> afraid. Just take note that you'll presumably be limited to the speed
>>>>> of your journaling device here.
>>>> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
>>>> is this still too slow? Another idea was to use only a ramdisk for the
>>>> journal and backup the files while shutting down to disk and restore
>>>> them after boot.
>>>>
>>>>> Given that Ceph is going to be doing its own replication, though, I
>>>>> wouldn't want to add in another whole layer of replication with raid10
>>>>> — do you really want to multiply your storage requirements by another
>>>>> factor of two?
>>>> OK correct bad idea.
>>>>
>>>>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>>>> I would use the hardware controller over btrfs raid for now; it allows
>>>>> more flexibility in eg switching to xfs. :)
>>>> OK but overall you would recommand running one osd per disk right? So
>>>> instead of using a Raid 6 with for example 10 disks you would run 6 osds
>>>> on this machine?
>>>>
>>>>>> Use single socket Xeon for the OSDs or Dual Socket?
>>>>> Dual socket servers will be overkill given the setup you're
>>>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>>>> daemon. You might consider it if you decided you wanted to do an OSD
>>>>> per disk instead (that's a more common configuration, but it requires
>>>>> more CPU and RAM per disk and we don't know yet which is the better
>>>>> choice).
>>>> Is there also a rule of thumb for the memory?
>>>>
>>>> My biggest problem with ceph right now is the awful slow speed while
>>>> doing random reads and writes.
>>>>
>>>> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
>>>> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
>>>> which is def. too slow.
>>>>
>>>> Stefan
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>>        Alexandre D erumier
>>>> Ingénieur Système
>>>> Fixe : 03 20 68 88 90
>>>> Fax : 03 20 68 90 81
>>>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>>>> 12 rue Marivaux 75002 Paris - France
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Tomasz Paszkowski
>>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>>> +48500166299
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> -----
>> Pozdrawiam
>>
>> Sławek "sZiBis" Skowron
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-21 18:13       ` Gregory Farnum
@ 2012-05-22  6:20         ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-22  6:20 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

Am 21.05.2012 20:13, schrieb Gregory Farnum:
> On Sat, May 19, 2012 at 1:37 AM, Stefan Priebe <s.priebe@profihost.ag> wrote:
>> So would you recommand a fast (more ghz) Core i3 instead of a single xeon
>> for this system? (price per ghz is better).
> 
> If that's all the MDS is doing there, probably? (It would also depend
> on cache sizes and things; I don't have a good sense for how that
> impacts the MDS' performance.)
As I'm only using KVM/rbd I don't have any MDS.

> Well, RAID1 isn't going to make it any faster than just the single
> SSD, is why I pointed that out.
>
> I wouldn't recommend using a ramdisk for the journal — that will
> guarantee local data loss in the event the server doesn't shut down
> properly, and if it happens to several servers at once you get a good
> chance of losing client writes.
Sure, but isn't it just the same when NOT using a RAID 1 for the journal?

Stefan
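
For anyone who wants to benchmark the ramdisk-journal idea anyway, a
throwaway sketch (paths and sizes are assumptions; as discussed above the
journal contents are gone after an unclean shutdown, so this is for
testing only):

#~ mount -t tmpfs -o size=2G tmpfs /srv/journal

and then in ceph.conf for that OSD:

        osd journal = /srv/journal/osd.0.journal
        osd journal size = 1024     ; MB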
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-21 21:22           ` Sławomir Skowron
  2012-05-21 23:52             ` Quenten Grasso
@ 2012-05-22  6:30             ` Stefan Priebe - Profihost AG
  2012-05-22  6:59               ` Sławomir Skowron
  1 sibling, 1 reply; 53+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-22  6:30 UTC (permalink / raw)
  To: Sławomir Skowron; +Cc: ceph-devel, Tomasz Paszkowski

Am 21.05.2012 23:22, schrieb Sławomir Skowron:
> Maybe good for journal will be two cheap MLC Intel drives on Sandforce
> (320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
> separate journaling partitions with hardware RAID1.
> I like to test setup like this, but maybe someone have any real life
> info ??

HPA?

That was also my idea, but most of the people here still claim that
they're too slow and that you need something MORE powerful, like:

zeus ram: http://www.stec-inc.com/product/zeusram.php
fusion io: http://www.fusionio.com/platforms/iodrive2/

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-22  6:30             ` Stefan Priebe - Profihost AG
@ 2012-05-22  6:59               ` Sławomir Skowron
  0 siblings, 0 replies; 53+ messages in thread
From: Sławomir Skowron @ 2012-05-22  6:59 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel, Tomasz Paszkowski

http://en.wikipedia.org/wiki/Host_protected_area
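
A rough sketch of how the HPA change can be done with hdparm (the sector
count below assumes 512-byte sectors and a ~20GB visible area; the set
form is destructive and persists across reboots, so check the hdparm man
page before running it on a drive you care about):

#~ hdparm -N /dev/sdb              # show current and native max sector count
#~ hdparm -N p41943040 /dev/sdb    # expose only ~20GB; the untouched rest
                                   # acts as extra over-provisioning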

On Tue, May 22, 2012 at 8:30 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Am 21.05.2012 23:22, schrieb Sławomir Skowron:
>> Maybe good for journal will be two cheap MLC Intel drives on Sandforce
>> (320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
>> separate journaling partitions with hardware RAID1.
>> I like to test setup like this, but maybe someone have any real life
>> info ??
>
> HPA?
>
> That was also my idea but most of the people here still claim that
> they're too slow and you need something MORE powerful like.
>
> zeus ram: http://www.stec-inc.com/product/zeusram.php
> fusion io: http://www.fusionio.com/platforms/iodrive2/

But with commodity hardware - cheap servers, or even mid-range
machines - the cost of a PCIe flash/RAM card is too high, even in a small
cluster.

>
> Stefan



-- 
-----
Pozdrawiam

Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-22  0:30               ` Gregory Farnum
  2012-05-22  0:42                 ` Quenten Grasso
@ 2012-05-22  9:04                 ` Jerker Nyberg
  2012-05-23  5:31                   ` Gregory Farnum
  1 sibling, 1 reply; 53+ messages in thread
From: Jerker Nyberg @ 2012-05-22  9:04 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 524 bytes --]

On Mon, 21 May 2012, Gregory Farnum wrote:

> This one -- the write is considered "safe" once it is on-disk on all
> OSDs currently responsible for hosting the object.

Is it possible to configure the client to consider the write successful 
when the data is hitting RAM on all the OSDs but not yet committed to 
disk?

Also, the IBM zFS research file system is talking about cooperative cache 
and Lustre about a collaborative cache. Do you have any thoughts of this 
regarding Ceph?

Regards,
Jerker Nyberg, Uppsala, Sweden.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-22  9:04                 ` Jerker Nyberg
@ 2012-05-23  5:31                   ` Gregory Farnum
  2012-05-23 19:47                     ` Jerker Nyberg
  0 siblings, 1 reply; 53+ messages in thread
From: Gregory Farnum @ 2012-05-23  5:31 UTC (permalink / raw)
  To: Jerker Nyberg; +Cc: ceph-devel

On Tuesday, May 22, 2012 at 2:04 AM, Jerker Nyberg wrote:
> On Mon, 21 May 2012, Gregory Farnum wrote:
>  
> > This one -- the write is considered "safe" once it is on-disk on all
> > OSDs currently responsible for hosting the object.
>  
>  
>  
> Is it possible to configure the client to consider the write successful  
> when the data is hitting RAM on all the OSDs but not yet committed to  
> disk?

Direct users of the RADOS object store (i.e., librados) can do all kinds of things with the integrity guarantee options. But I don't believe there's currently a way to make the filesystem do so — among other things, you're running through the page cache and other writeback caches anyway, so it generally wouldn't be useful except when running an fsync or similar. And at that point you probably really want to not be lying to the application that's asking for it.
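
For anyone who wants to poke at those options, here is a minimal librados C
sketch (a sketch only: it assumes a reachable test cluster, a pool named
"rbd", and /etc/ceph/ceph.conf on the client, and it drops all error
handling; the object name is arbitrary). It waits on the two acknowledgement
points separately: "complete" fires roughly when the write has been
acknowledged in memory by the OSDs, while "safe" fires only once it has been
committed to disk on all of them:

/* build: gcc -o ack_vs_safe ack_vs_safe.c -lrados
 * Illustrative only -- pool and object name are arbitrary. */
#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t c;
    const char *buf = "hello";

    rados_create(&cluster, NULL);                  /* connect as client.admin */
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);
    rados_ioctx_create(cluster, "rbd", &io);

    /* no callbacks registered; we just block on the two events below */
    rados_aio_create_completion(NULL, NULL, NULL, &c);
    rados_aio_write(io, "ack-vs-safe-test", c, buf, strlen(buf), 0);

    rados_aio_wait_for_complete(c);   /* "ack": replicas have it in memory */
    puts("acked");

    rados_aio_wait_for_safe(c);       /* "commit": on stable storage everywhere */
    puts("committed");

    rados_aio_release(c);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}

An application that stops at rados_aio_wait_for_complete() is taking exactly
the risk discussed here: a correlated failure before the commit point can
lose that write.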

> Also, the IBM zFS research file system is talking about cooperative cache  
> and Lustre about a collaborative cache. Do you have any thoughts of this  
> regarding Ceph?

I haven't heard of this before, but assuming I'm understanding my brief read correctly, this isn't on the current Ceph roadmap. I sort of see how it's useful, but I think it's less useful for a system like Ceph — we're more scale-out in terms of CPU and memory correlating with added disk space compared to something like Lustre where the object storage (OST) and the object handlers (OSS) are divorced, and we stripe files across more servers than I believe Lustre tends to do.
But perhaps I'm missing something — do you have a use case on Ceph?
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-23  5:31                   ` Gregory Farnum
@ 2012-05-23 19:47                     ` Jerker Nyberg
  2012-05-23 21:47                       ` Gregory Farnum
  0 siblings, 1 reply; 53+ messages in thread
From: Jerker Nyberg @ 2012-05-23 19:47 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1004 bytes --]

On Tue, 22 May 2012, Gregory Farnum wrote:

> Direct users of the RADOS object store (i.e., librados) can do all kinds 
> of things with the integrity guarantee options. But I don't believe 
> there's currently a way to make the filesystem do so -- among other 
> things, you're running through the page cache and other writeback caches 
> anyway, so it generally wouldn't be useful except when running an fsync 
> or similar. And at that point you probably really want to not be lying 
> to the application that's asking for it.

I am comparing with in-memory databases. If replication and failovers are 
used, couldn't in-memory in some cases be good enough? And faster.

> do you have a use case on Ceph?

Currently of interest:

  * Scratch file system for HPC. (kernel client)
  * Scratch file system for research groups. (SMB, NFS, SSH)
  * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
  * Metropolitan cluster.
  * VDI backend. KVM with RBD.

Regards,
Jerker Nyberg, Uppsala, Sweden.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-23 19:47                     ` Jerker Nyberg
@ 2012-05-23 21:47                       ` Gregory Farnum
  2012-05-24  8:33                         ` Jerker Nyberg
  0 siblings, 1 reply; 53+ messages in thread
From: Gregory Farnum @ 2012-05-23 21:47 UTC (permalink / raw)
  To: Jerker Nyberg; +Cc: ceph-devel

On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg <jerker@update.uu.se> wrote:
> On Tue, 22 May 2012, Gregory Farnum wrote:
>
>> Direct users of the RADOS object store (i.e., librados) can do all kinds
>> of things with the integrity guarantee options. But I don't believe there's
>> currently a way to make the filesystem do so -- among other things, you're
>> running through the page cache and other writeback caches anyway, so it
>> generally wouldn't be useful except when running an fsync or similar. And at
>> that point you probably really want to not be lying to the application
>> that's asking for it.
>
>
> I am comparing with in-memory databases. If replication and failovers are
> used, couldn't in-memory in some cases be good enough? And faster.
>
>
>> do you have a use case on Ceph?
>
>
> Currently of interest:
>
>  * Scratch file system for HPC. (kernel client)
>  * Scratch file system for research groups. (SMB, NFS, SSH)
>  * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
>  * Metropolitan cluster.
>  * VDI backend. KVM with RBD.
Hmm. Sounds to me like scratch filesystems would get a lot out of not
having to hit disk on the commit, but not much out of having separate
caching locations versus just letting the OSD page cache handle it. :)
The others, I don't really see collaborative caching helping much either.

So basically it sounds like you want to be able to toggle off Ceph's
data safety requirements. That would have to be done in the clients;
it wouldn't even be hard in ceph-fuse (although I'm not sure about the
kernel client). It's probably a pretty easy way to jump into the code
base.... :)
Anyway, make a bug for it in the tracker (I don't think one exists
yet, though I could be wrong) and someday when we start work on the
filesystem again we should be able to get to it. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-23 21:47                       ` Gregory Farnum
@ 2012-05-24  8:33                         ` Jerker Nyberg
  0 siblings, 0 replies; 53+ messages in thread
From: Jerker Nyberg @ 2012-05-24  8:33 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1412 bytes --]

On Wed, 23 May 2012, Gregory Farnum wrote:

> On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg <jerker@update.uu.se> wrote:
>
>>  * Scratch file system for HPC. (kernel client)
>>  * Scratch file system for research groups. (SMB, NFS, SSH)
>>  * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
>>  * Metropolitan cluster.
>>  * VDI backend. KVM with RBD.
>
> Hmm. Sounds to me like scratch filesystems would get a lot out of not
> having to hit disk on the commit, but not much out of having separate
> caching locations versus just letting the OSD page cache handle it. :)
> The others, I don't really see collaborative caching helping much either.

Oh, sorry, those were my use cases for ceph in general. Yes, scratch is 
mostly of interest, but also fast backup. Currently IOPS is limiting our 
backup speed on a small cluster with many files but not much data. I have 
problems scanning through and backing up all changed files every night. 
Currently I am backing up to ZFS, but Ceph might help with scaling up 
performance and size. Another option is going for SSDs instead of 
mechanical drives.

> Anyway, make a bug for it in the tracker (I don't think one exists
> yet, though I could be wrong) and someday when we start work on the
> filesystem again we should be able to get to it. :)

Thank you for your thoughts on this. I hope to be able to do that soon.

Regards,
Jerker Nyberg, Uppsala, Sweden.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Designing a cluster guide
  2012-05-22  5:51                     ` Sławomir Skowron
@ 2012-05-29  7:25                       ` Quenten Grasso
  2012-05-29 16:50                         ` Tommi Virtanen
  0 siblings, 1 reply; 53+ messages in thread
From: Quenten Grasso @ 2012-05-29  7:25 UTC (permalink / raw)
  To: 'Slawomir Skowron'; +Cc: ceph-devel

Interesting. I've been thinking about this, and I think most Ceph installations could benefit from more nodes and fewer disks per node.

For example 

We have a replica level of 2 and an RBD block size of 4 MB. You start writing a 10 GB file, which is effectively divided into 4 MB chunks.

The first chunk goes to node 1 and node 2 (at the same time, I assume), where it is written to a journal and then replayed to the data file system.

The second chunk might be sent to nodes 2 and 3 at the same time, written to a journal, then replayed (we now have overlap with chunk 1).

The third chunk might be sent to nodes 1 and 3 (more overlap with chunks 1 and 2), and as you can see this quickly becomes an issue.

So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see better write and read performance, as you would have less "overlap".
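
To make the fan-out effect concrete, here is a toy C simulation (purely
illustrative: it picks random OSD pairs, whereas Ceph places objects with
CRUSH, and the image size and OSD counts are just the ones from this
example). It splits a 10 GB image into 4 MB chunks, places each chunk on two
of N OSDs, and reports how much of the image the busiest OSD ends up holding:

/* build: gcc -std=c99 -o fanout fanout.c
 * Toy model, not CRUSH: spread the 4 MB chunks of a 10 GB image over N OSDs
 * with 2 replicas each and report the load on the busiest OSD. */
#include <stdio.h>
#include <stdlib.h>

#define IMAGE_BYTES  (10ULL * 1024 * 1024 * 1024)   /* 10 GB image      */
#define OBJECT_BYTES (4ULL * 1024 * 1024)           /* 4 MB RBD objects */

static void simulate(int num_osds)
{
    unsigned long long objects = IMAGE_BYTES / OBJECT_BYTES;   /* 2560 */
    unsigned long long *per_osd = calloc(num_osds, sizeof *per_osd);
    unsigned long long max = 0;

    srand(42);   /* fixed seed so the two runs are comparable */
    for (unsigned long long i = 0; i < objects; i++) {
        int primary = rand() % num_osds;
        int replica = (primary + 1 + rand() % (num_osds - 1)) % num_osds;
        per_osd[primary]++;
        per_osd[replica]++;
    }

    for (int o = 0; o < num_osds; o++)
        if (per_osd[o] > max)
            max = per_osd[o];

    printf("%2d OSDs: busiest OSD holds %llu of %llu chunks (%.0f%%)\n",
           num_osds, max, objects, 100.0 * max / objects);
    free(per_osd);
}

int main(void)
{
    simulate(3);    /* 3 nodes, one OSD each   */
    simulate(10);   /* 10 OSDs for the same data */
    return 0;
}

With 3 OSDs every disk ends up holding roughly two thirds of the chunks,
while with 10 OSDs the busiest one holds around a fifth, so the same write
stream is spread over far more journals and spindles.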

Now we take BTRFS into the picture: as I understand it, journals are not necessary due to the way it writes/snapshots and reads data, and this alone would be a major performance increase on a BTRFS RAID level (like ZFS RAIDZ).

Side note: this may sound crazy, but the more I read about SSDs the less I wish to use/rely on them, and RAM SSDs are crazily priced imo. =)

Regards,
Quenten


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Slawomir Skowron
Sent: Tuesday, 22 May 2012 3:52 PM
To: Quenten Grasso
Cc: Gregory Farnum; ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

I get performance of near 320MB/s on a VM from a 3 node rbd cluster,
but that is with 10GE and with 26 2.5" SAS drives used on every
machine, and it's still not everything that can be achieved.
Every osd drive is a raid0 of one drive behind battery-backed nvram
cache in a hardware raid ctrl.
Every osd takes a lot of ram for caching.

That's why I'm thinking about changing 2 drives for SSDs in raid1,
with HPA tuned to increase drive durability for journaling - but only
if this will work ;)

With the newest drives I can theoretically get 500MB/s with a long queue
depth. This means that in theory I can improve the bandwidth score, get
lower latency, and handle multiple IO writes from many hosts better.
Reads are cached in RAM by the OSD daemon, by the VFS in the kernel, and
by the nvram in the ctrl, and in the near future will improve with the
cache in kvm (I need to test that - this will improve performance).

But if the SSD drive slows down, it can drag the whole write performance
down. It is very delicate.

Pozdrawiam

iSS

Dnia 22 maj 2012 o godz. 02:47 Quenten Grasso <QGrasso@onq.com.au> napisał(a):

> I Should have added For storage I'm considering something like Enterprise nearline SAS 3TB disks running individual disks not raided with rep level of 2 as suggested :)
>
>
> Regards,
> Quenten
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Quenten Grasso
> Sent: Tuesday, 22 May 2012 10:43 AM
> To: 'Gregory Farnum'
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: Designing a cluster guide
>
> Hi Greg,
>
> I'm only talking about journal disks not storage. :)
>
>
>
> Regards,
> Quenten
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Gregory Farnum
> Sent: Tuesday, 22 May 2012 10:30 AM
> To: Quenten Grasso
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Designing a cluster guide
>
> On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso <QGrasso@onq.com.au> wrote:
>> Hi All,
>>
>>
>> I've been thinking about this issue myself past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB Disks,
>> in raid 10 inside a 2U Server with JBOD's attached to the server for actual storage.
>>
>> Can someone help clarify this one,
>>
>> Once the data is written to the (journal disk) and then read from the (journal disk) then written to the (storage disk) once this is complete this is considered a successful write by the client?
>> Or
>> Once the data is written to the (journal disk) is this considered successful by the client?
> This one — the write is considered "safe" once it is on-disk on all
> OSDs currently responsible for hosting the object.
>
> Every time anybody mentions RAID10 I have to remind them of the
> storage amplification that entails, though. Are you sure you want that
> on top of (well, underneath, really) Ceph's own replication?
>
>> Or
>> Once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete this is considered a successful write by the client? (if this is the case SSD's may not be so useful)
>>
>>
>> Pros
>> Quite fast Write throughput to the journal disks,
>> No write wareout of SSD's
>> RAID 10 with 1GB Cache Controller also helps improve things (if really keen you could use a cachecade as well)
>>
>>
>> Cons
>> Not as fast as SSD's
>> More rackspace required per server.
>>
>>
>> Regards,
>> Quenten
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Slawomir Skowron
>> Sent: Tuesday, 22 May 2012 7:22 AM
>> To: ceph-devel@vger.kernel.org
>> Cc: Tomasz Paszkowski
>> Subject: Re: Designing a cluster guide
>>
>> Maybe good for journal will be two cheap MLC Intel drives on Sandforce
>> (320/520), 120GB or 240GB, and HPA changed to 20-30GB only for
>> separate journaling partitions with hardware RAID1.
>>
>> I like to test setup like this, but maybe someone have any real life info ??
>>
>> On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@gmail.com> wrote:
>>> Another great thing that should be mentioned is:
>>> https://github.com/facebook/flashcache/. It gives really huge
>>> performance improvements for reads/writes (especialy on FunsionIO
>>> drives) event without using librbd caching :-)
>>>
>>>
>>>
>>> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>> Hi,
>>>>
>>>> For your journal , if you have money, you can use
>>>>
>>>> stec zeusram ssd drive. (around 2000€ /8GB / 100000 iops read/write with 4k block).
>>>> I'm using them with zfs san, they rocks for journal.
>>>> http://www.stec-inc.com/product/zeusram.php
>>>>
>>>> another interessesting product is ddrdrive
>>>> http://www.ddrdrive.com/
>>>>
>>>> ----- Mail original -----
>>>>
>>>> De: "Stefan Priebe" <s.priebe@profihost.ag>
>>>> À: "Gregory Farnum" <greg@inktank.com>
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Envoyé: Samedi 19 Mai 2012 10:37:01
>>>> Objet: Re: Designing a cluster guide
>>>>
>>>> Hi Greg,
>>>>
>>>> Am 17.05.2012 23:27, schrieb Gregory Farnum:
>>>>>> It mentions for example "Fast CPU" for the mds system. What does fast
>>>>>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>>>>>> Is multi core or more speed important?
>>>>> Right now, it's primarily the speed of a single core. The MDS is
>>>>> highly threaded but doing most things requires grabbing a big lock.
>>>>> How fast is a qualitative rather than quantitative assessment at this
>>>>> point, though.
>>>> So would you recommand a fast (more ghz) Core i3 instead of a single
>>>> xeon for this system? (price per ghz is better).
>>>>
>>>>> It depends on what your nodes look like, and what sort of cluster
>>>>> you're running. The monitors are pretty lightweight, but they will add
>>>>> *some* load. More important is their disk access patterns — they have
>>>>> to do a lot of syncs. So if they're sharing a machine with some other
>>>>> daemon you want them to have an independent disk and to be running a
>>>>> new kernel&glibc so that they can use syncfs rather than sync. (The
>>>>> only distribution I know for sure does this is Ubuntu 12.04.)
>>>> Which kernel and which glibc version supports this? I have searched
>>>> google but haven't found an exact version. We're using debian lenny
>>>> squeeze with a custom kernel.
>>>>
>>>>>> Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
>>>>>> perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
>>>>>> and you should go for 22x SSD Disks in a Raid 6?
>>>>> You'll need to do your own failure calculations on this one, I'm
>>>>> afraid. Just take note that you'll presumably be limited to the speed
>>>>> of your journaling device here.
>>>> Yeah that's why i wanted to use a Raid 1 of SSDs for the journaling. Or
>>>> is this still too slow? Another idea was to use only a ramdisk for the
>>>> journal and backup the files while shutting down to disk and restore
>>>> them after boot.
>>>>
>>>>> Given that Ceph is going to be doing its own replication, though, I
>>>>> wouldn't want to add in another whole layer of replication with raid10
>>>>> — do you really want to multiply your storage requirements by another
>>>>> factor of two?
>>>> OK correct bad idea.
>>>>
>>>>>> Is it more useful the use a Raid 6 HW Controller or the btrfs raid?
>>>>> I would use the hardware controller over btrfs raid for now; it allows
>>>>> more flexibility in eg switching to xfs. :)
>>>> OK but overall you would recommand running one osd per disk right? So
>>>> instead of using a Raid 6 with for example 10 disks you would run 6 osds
>>>> on this machine?
>>>>
>>>>>> Use single socket Xeon for the OSDs or Dual Socket?
>>>>> Dual socket servers will be overkill given the setup you're
>>>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>>>> daemon. You might consider it if you decided you wanted to do an OSD
>>>>> per disk instead (that's a more common configuration, but it requires
>>>>> more CPU and RAM per disk and we don't know yet which is the better
>>>>> choice).
>>>> Is there also a rule of thumb for the memory?
>>>>
>>>> My biggest problem with ceph right now is the awful slow speed while
>>>> doing random reads and writes.
>>>>
>>>> Sequential read and writes are at 200Mb/s (that's pretty good for bonded
>>>> dual Gbit/s). But random reads and write are only at 0,8 - 1,5 Mb/s
>>>> which is def. too slow.
>>>>
>>>> Stefan
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>>        Alexandre D erumier
>>>> Ingénieur Système
>>>> Fixe : 03 20 68 88 90
>>>> Fax : 03 20 68 90 81
>>>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>>>> 12 rue Marivaux 75002 Paris - France
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Tomasz Paszkowski
>>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>>> +48500166299
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> -----
>> Pozdrawiam
>>
>> Sławek "sZiBis" Skowron
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-29  7:25                       ` Quenten Grasso
@ 2012-05-29 16:50                         ` Tommi Virtanen
  0 siblings, 0 replies; 53+ messages in thread
From: Tommi Virtanen @ 2012-05-29 16:50 UTC (permalink / raw)
  To: Quenten Grasso; +Cc: Slawomir Skowron, ceph-devel

On Tue, May 29, 2012 at 12:25 AM, Quenten Grasso <QGrasso@onq.com.au> wrote:
> So if we have 10 nodes vs. 3 nodes with the same mount of disks we should see better write and read performance as you would have less "overlap".

First of all, a typical way to run Ceph is with say 8-12 disks per
node, and an OSD per disk. That means your 3-10 node clusters actually
have 24-120 OSDs on them. The number of physical machines is not
really a factor, number of OSDs is what matters.

Secondly, 10-node or 3-node clusters are fairly uninteresting for
Ceph. The real challenge is at the hundreds, thousands and above
range.

> Now we take BTRFS into the picture as I understand journals are not necessary due to the nature of the way it writes/snapshots and reads data this alone would be a major performance increase on a BTRFS Raid level (like ZFS RAIDZ).

A journal is still needed on btrfs; snapshots just enable us to write
to the journal in parallel with the real write, instead of needing to
journal first.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-05-17 21:27   ` Gregory Farnum
  2012-05-19  8:37     ` Stefan Priebe
@ 2012-06-29 18:07     ` Gregory Farnum
  2012-06-29 18:42       ` Brian Edmonds
  1 sibling, 1 reply; 53+ messages in thread
From: Gregory Farnum @ 2012-06-29 18:07 UTC (permalink / raw)
  To: ceph-devel

On Thu, May 17, 2012 at 2:27 PM, Gregory Farnum <greg@inktank.com> wrote:
> Sorry this got left for so long...
>
> On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> Hi,
>>
>> the "Designing a cluster guide"
>> http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
>> still leaves some questions unanswered.
>>
>> It mentions for example "Fast CPU" for the mds system. What does fast
>> mean? Just the speed of one core? Or is ceph designed to use multi core?
>> Is multi core or more speed important?
> Right now, it's primarily the speed of a single core. The MDS is
> highly threaded but doing most things requires grabbing a big lock.
> How fast is a qualitative rather than quantitative assessment at this
> point, though.
>
>> The Cluster Design Recommendations mentions to seperate all Daemons on
>> dedicated machines. Is this also for the MON useful? As they're so
>> leightweight why not running them on the OSDs?
> It depends on what your nodes look like, and what sort of cluster
> you're running. The monitors are pretty lightweight, but they will add
> *some* load. More important is their disk access patterns — they have
> to do a lot of syncs. So if they're sharing a machine with some other
> daemon you want them to have an independent disk and to be running a
> new kernel&glibc so that they can use syncfs rather than sync. (The
> only distribution I know for sure does this is Ubuntu 12.04.)

I just had it pointed out to me that I rather overstated the
importance of syncfs if you were going to do this. The monitor mostly
does fsync, not sync/syncfs(), so that's not so important. What is
important is that it has highly seeky disk behavior, so you don't want
a ceph-osd and ceph-mon daemon to be sharing a disk. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-06-29 18:07     ` Gregory Farnum
@ 2012-06-29 18:42       ` Brian Edmonds
  2012-06-29 18:50         ` Gregory Farnum
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Edmonds @ 2012-06-29 18:42 UTC (permalink / raw)
  To: ceph-devel

On Fri, Jun 29, 2012 at 11:07 AM, Gregory Farnum <greg@inktank.com> wrote:
>>> the "Designing a cluster guide"
>>> http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
>>> still leaves some questions unanswered.

Oh, thank you.  I've been poking through the Ceph docs, but somehow
had not managed to turn up the wiki yet.

What are the likely and worst case scenarios if the OSD journal were
to simply be on a garden variety ramdisk, no battery backing?  In the
case of a single node losing power, and thus losing some data, surely
Ceph can recognize this, and handle it through normal redundancy?  I
could see it being an issue if the whole cluster lost power at once.
Anything I'm missing?

Brian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-06-29 18:42       ` Brian Edmonds
@ 2012-06-29 18:50         ` Gregory Farnum
  2012-06-29 20:59           ` Brian Edmonds
  0 siblings, 1 reply; 53+ messages in thread
From: Gregory Farnum @ 2012-06-29 18:50 UTC (permalink / raw)
  To: Brian Edmonds; +Cc: ceph-devel

On Fri, Jun 29, 2012 at 11:42 AM, Brian Edmonds <mornir@gmail.com> wrote:
> What are the likely and worst case scenarios if the OSD journal were
> to simply be on a garden variety ramdisk, no battery backing?  In the
> case of a single node losing power, and thus losing some data, surely
> Ceph can recognize this, and handle it through normal redundancy?  I
> could see it being an issue if the whole cluster lost power at once.
> Anything I'm missing?

If you lose a journal, you lose the OSD. The end. We could potentially
recover much of the data through developer-driven manual data
inspection, but I suspect it's roughly equivalent to what a lot of
data forensics services offer — expensive for everybody and not
something to rely on.
Ceph can certainly handle losing *one* OSD, but if you have a
correlated failure of more than one, you're almost certain to lose
some amount of data (how much depends on how many OSDs you have,
and how you've replicated that data). If that's an acceptable tradeoff
for you, go for it...but I doubt that it is when you come down to it.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-06-29 18:50         ` Gregory Farnum
@ 2012-06-29 20:59           ` Brian Edmonds
  2012-06-29 21:11             ` Gregory Farnum
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Edmonds @ 2012-06-29 20:59 UTC (permalink / raw)
  To: ceph-devel

On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum <greg@inktank.com> wrote:
> If you lose a journal, you lose the OSD.

Really?  Everything?  Not just recent commits?  I would have hoped it
would just come back up in an old state.  Replication should have
already been taking care of regaining redundancy for the stuff that
was on it, particularly the newest stuff that wouldn't return with it
and say "Hi, I'm back."

I suppose it makes the design easier though. =)

Brian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-06-29 20:59           ` Brian Edmonds
@ 2012-06-29 21:11             ` Gregory Farnum
  2012-06-29 21:18               ` Brian Edmonds
  0 siblings, 1 reply; 53+ messages in thread
From: Gregory Farnum @ 2012-06-29 21:11 UTC (permalink / raw)
  To: Brian Edmonds; +Cc: ceph-devel

On Fri, Jun 29, 2012 at 1:59 PM, Brian Edmonds <mornir@gmail.com> wrote:
> On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum <greg@inktank.com> wrote:
>> If you lose a journal, you lose the OSD.
>
> Really?  Everything?  Not just recent commits?  I would have hoped it
> would just come back up in an old state.  Replication should have
> already been taking care of regaining redundancy for the stuff that
> was on it, particularly the newest stuff that wouldn't return with it
> and say "Hi, I'm back."
>
> I suppose it makes the design easier though. =)

Well, actually this depends on the filesystem you're using. With
btrfs, the OSD will roll back to a consistent state, but you don't
know how out-of-date that state is. (Practically speaking, it's pretty
new, but if you were doing any writes it is going to be data loss.)
With xfs/ext4/other, the OSD can't create consistency points the same
way it can with btrfs, and so the loss of a journal means that it
can't repair itself.

Sorry for not mentioning the distinction earlier; I didn't think we'd
implemented the rollback on btrfs. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-06-29 21:11             ` Gregory Farnum
@ 2012-06-29 21:18               ` Brian Edmonds
  2012-06-29 21:30                 ` Gregory Farnum
  2012-06-29 21:33                 ` Sage Weil
  0 siblings, 2 replies; 53+ messages in thread
From: Brian Edmonds @ 2012-06-29 21:18 UTC (permalink / raw)
  To: ceph-devel

On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum <greg@inktank.com> wrote:
> Well, actually this depends on the filesystem you're using. With
> btrfs, the OSD will roll back to a consistent state, but you don't
> know how out-of-date that state is.

Ok, so assuming btrfs, then a single machine failure with a ramdisk
journal should not result in any data loss, assuming replication is
working?  The cluster would then be at risk of data loss primarily
from a full power outage.  (In practice I'd expect either one machine
to die, or a power loss to take out all of them, and smaller but
non-unitary losses would be uncommon.)

Something to play with, perhaps.

Brian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-06-29 21:18               ` Brian Edmonds
@ 2012-06-29 21:30                 ` Gregory Farnum
  2012-06-29 21:33                 ` Sage Weil
  1 sibling, 0 replies; 53+ messages in thread
From: Gregory Farnum @ 2012-06-29 21:30 UTC (permalink / raw)
  To: Brian Edmonds; +Cc: ceph-devel

On Fri, Jun 29, 2012 at 2:18 PM, Brian Edmonds <mornir@gmail.com> wrote:
> On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum <greg@inktank.com> wrote:
>> Well, actually this depends on the filesystem you're using. With
>> btrfs, the OSD will roll back to a consistent state, but you don't
>> know how out-of-date that state is.
>
> Ok, so assuming btrfs, then a single machine failure with a ramdisk
> journal should not result in any data loss, assuming replication is
> working?  The cluster would then be at risk of data loss primarily
> from a full power outage.  (In practice I'd expect either one machine
> to die, or a power loss to take out all of them, and smaller but
> non-unitary losses would be uncommon.)

That's correct. And replication will be working — it's all
synchronous, so if the replication isn't working, you won't be able to
write. :) There are some edge cases here — if an OSD is "down" but not
"out" then you might not have the same number of data copies as
normal, but that's all configurable.

>
> Something to play with, perhaps.
>
> Brian.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Designing a cluster guide
  2012-06-29 21:18               ` Brian Edmonds
  2012-06-29 21:30                 ` Gregory Farnum
@ 2012-06-29 21:33                 ` Sage Weil
  1 sibling, 0 replies; 53+ messages in thread
From: Sage Weil @ 2012-06-29 21:33 UTC (permalink / raw)
  To: Brian Edmonds; +Cc: ceph-devel

On Fri, 29 Jun 2012, Brian Edmonds wrote:
> On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum <greg@inktank.com> wrote:
> > Well, actually this depends on the filesystem you're using. With
> > btrfs, the OSD will roll back to a consistent state, but you don't
> > know how out-of-date that state is.
> 
> Ok, so assuming btrfs, then a single machine failure with a ramdisk
> journal should not result in any data loss, assuming replication is
> working?  The cluster would then be at risk of data loss primarily
> from a full power outage.  (In practice I'd expect either one machine
> to die, or a power loss to take out all of them, and smaller but
> non-unitary losses would be uncommon.)

Right.  From a data-safety perspective ("the cluster said my writes were 
safe.. are they?") consider journal loss an OSD failure.  If there aren't 
other surviving replicas, something may be lost.

From a recovery perspective, it is a partial failure; not everything was 
lost, and recovery will be quick (only recent objects get copied around).  
Maybe your application can tolerate that, maybe it can't.

sage


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2012-06-29 21:33 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-10 12:09 slow performance even when using SSDs Stefan Priebe - Profihost AG
2012-05-10 13:09 ` Stefan Priebe - Profihost AG
2012-05-10 18:24   ` Calvin Morrow
2012-05-10 13:23 ` Designing a cluster guide Stefan Priebe - Profihost AG
2012-05-17 21:27   ` Gregory Farnum
2012-05-19  8:37     ` Stefan Priebe
2012-05-19 16:15       ` Alexandre DERUMIER
2012-05-20  7:56         ` Stefan Priebe
2012-05-20  8:13           ` Alexandre DERUMIER
2012-05-20  8:19           ` Christian Brunner
2012-05-20  8:27             ` Stefan Priebe
2012-05-20  8:31               ` Christian Brunner
2012-05-21  8:22                 ` Stefan Priebe - Profihost AG
2012-05-21 15:03                   ` Christian Brunner
2012-05-20  8:56             ` Tim O'Donovan
2012-05-20  9:24               ` Stefan Priebe
2012-05-20  9:46                 ` Tim O'Donovan
2012-05-20  9:49                   ` Stefan Priebe
2012-05-21 14:59               ` Christian Brunner
2012-05-21 15:05                 ` Stefan Priebe - Profihost AG
2012-05-21 15:12                   ` Tomasz Paszkowski
     [not found]                     ` <CANT588uxL7jrf1BfowUeer_AnDTfGjzkWVFhS4aNMaMSst_jyA@mail.gmail.com>
2012-05-21 15:36                       ` Tomasz Paszkowski
2012-05-21 18:15                         ` Damien Churchill
2012-05-21 20:11                     ` Stefan Priebe
2012-05-21 20:13                       ` Tomasz Paszkowski
2012-05-21 20:14                         ` Stefan Priebe
2012-05-21 20:19                           ` Tomasz Paszkowski
2012-05-21 15:07         ` Tomasz Paszkowski
2012-05-21 21:22           ` Sławomir Skowron
2012-05-21 23:52             ` Quenten Grasso
2012-05-22  0:30               ` Gregory Farnum
2012-05-22  0:42                 ` Quenten Grasso
2012-05-22  0:46                   ` Quenten Grasso
2012-05-22  5:51                     ` Sławomir Skowron
2012-05-29  7:25                       ` Quenten Grasso
2012-05-29 16:50                         ` Tommi Virtanen
2012-05-22  9:04                 ` Jerker Nyberg
2012-05-23  5:31                   ` Gregory Farnum
2012-05-23 19:47                     ` Jerker Nyberg
2012-05-23 21:47                       ` Gregory Farnum
2012-05-24  8:33                         ` Jerker Nyberg
2012-05-22  6:30             ` Stefan Priebe - Profihost AG
2012-05-22  6:59               ` Sławomir Skowron
2012-05-21 18:13       ` Gregory Farnum
2012-05-22  6:20         ` Stefan Priebe - Profihost AG
2012-06-29 18:07     ` Gregory Farnum
2012-06-29 18:42       ` Brian Edmonds
2012-06-29 18:50         ` Gregory Farnum
2012-06-29 20:59           ` Brian Edmonds
2012-06-29 21:11             ` Gregory Farnum
2012-06-29 21:18               ` Brian Edmonds
2012-06-29 21:30                 ` Gregory Farnum
2012-06-29 21:33                 ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.