* Hardware-config suggestions for HDD-based OSD node?
@ 2010-03-28 22:36 Craig Dunwoody
  2010-03-29  0:29 ` Martin Millnert
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Craig Dunwoody @ 2010-03-28 22:36 UTC (permalink / raw)
  To: ceph-devel; +Cc: cdunwoody


I'd be interested to hear from anyone who has suggestions about
optimizing the hardware config of an HDD-based OSD node for Ceph, using
currently available COTS hardware components.

More specifically, I'm interested in how one might try for an efficient
balance among key hardware resources including:

    CPU cores
    Main-memory throughput and capacity
    HDD controllers
    HDDs
    SSDs for journaling, if any
    NICs

Some reasonable answers I expect might include:

-   It's very early days for Ceph, no one really knows yet, and the only
    way to find out is to experiment with real hardware and
    applications, which is expensive

-   Answer depends a lot on many factors, including:
    -   Cost/performance tradeoff choices for a particular application
    -   Details of workloads for a particular application
    -   Details of hardware-component performance characteristics

Seems to me that one of many possible approaches would be to choose a
particular HDD type (e.g. 2TB 3.5" 7200RPM SAS-6G), and then work toward
the following goals, recognizing that there are tensions/conflicts among
these goals:

    Goal G1
        Maximize the incremental improvement in overall FS access
        performance that results from each incremental addition of a
        single HDD.

    Goal G2
        Minimize physical space used per bit of total FS capacity.

    Goal G3
        Minimize total hardware cost per bit of total FS capacity.

I would expect to be able to do well on G1 by stacking up nodes, each
with a single HDD, single cosd instance, and one or more GigE ports.
However, I would expect to do better on G2 and G3 by increasing #HDDs
per node.

Based on currently available server components that are relatively
inexpensive and convenient to deploy, I can imagine that for some
applications it might be attractive to stack up 1RU-rackmount nodes,
each with four HDDs, four cosd instances, and two or more GigE ports.

Beyond that, I'm wondering if it would be possible to serve some
applications better with a fatter OSD node config.  In particular, could
I improve space-efficiency (G2) and maybe also cost-per-bit (G3) by
increasing the #HDDs per node until incremental performance contribution
of each additional HDD (G1) just starts to drop below what I would get
with only a single HDD per node?

As one really extreme example, at a cost that might be acceptable for
some applications I could build a single max-configuration node with:
    
     2 CPU sockets
    24 CPU threads (2 x 6core x 2thread, or 2 x 12core x 1thread)
    12 DIMMs (currently up to 96GB capacity, up to 85 GByte/sec peak)
     3 8port SAS6G HBAs (aggregate 14.4GByte/sec peak to HDDs)
     5 2port 10GigE NICs (aggregate 12.5GByte/sec peak to network)

Using appropriate chassis, I could attach a pretty large number of 2TB
3.5" 7200RPM SAS-6G HDDs to this node, even hundreds if I wanted to (but
I wouldn't).
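
As a quick sanity check, the aggregate peak figures above are just
per-port arithmetic (nominal line rates, not measured throughput):

    # 3 HBAs x 8 SAS-6G ports x ~600 MByte/sec per port
    sas_peak = 3 * 8 * 600 / 1000.0    # 14.4 GByte/sec to HDDs
    # 5 NICs x 2 10GigE ports x 1.25 GByte/sec per port
    nic_peak = 5 * 2 * 1250 / 1000.0   # 12.5 GByte/sec to network
    print(sas_peak, nic_peak)          # 14.4 12.5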

I'm wondering how large I could push the number of attached HDDs, before
the incremental performance contribution of each HDD starts to drop off.

As number of attached HDDs increases, I would expect to hit a number of
hardware and software resource limitations in the node.  Certainly the
achievable sustained throughput of the lowest-level hardware interfaces
would be only a fraction of the aggregate-peak numbers that I listed
above.

As one very crude calculation, ignoring many other constraints, if I
thought that I could get all HDDs streaming simultaneously to Ethernet
at a sustained 100MByte/sec each (I can't), and I thought that I could
sustain 50% of wire-speed across the ten 10GigE ports, then I'd limit
myself to about 62 HDDs (6.25 GByte/sec) to avoid worrying about the
Ethernet interfaces throttling the aggregate streaming throughput of the
HDDs.

I expect that a more-realistic assumption about max aggregate streaming
throughput under Ceph would lead to a higher limit on #HDDs based on
this one consideration.
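
For what it's worth, here is the back-of-envelope arithmetic behind the
62-HDD figure above, with every input being one of the stated
assumptions rather than a measurement:

    hdd_stream_mb_s = 100.0        # optimistic per-HDD streaming rate
    wire_peak_mb_s = 10 * 1250.0   # ten 10GigE ports at line rate
    sustained_fraction = 0.5       # assume 50% of wire speed is sustainable
    usable_mb_s = wire_peak_mb_s * sustained_fraction     # 6250 MByte/sec
    print(int(usable_mb_s / hdd_stream_mb_s))             # ~62 HDDs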

I would expect that long before reaching 62 HDDs, many other constraints
would cause the per-HDD performance contribution to drop below the
single-HDD-per-server level, including:

-   Limitations in CPU throughput
-   Limitations in main-memory throughput and capacity
-   Various Linux limitations
-   Various Ceph limitations

62 HDDs and 62 cosd instances would be 2.6 cosd instances per CPU
thread, which seems to me like a lot.  I would not be surprised at all
to receive a recommendation to limit to less than 1.0 cosd instance per
CPU thread.

I can imagine reducing the number of cosd instances by running each atop
a multi-HDD btrfs-level stripe, but I expect that might have various
disadvantages, and I do like the simplicity of one cosd instance per
btrfs filesystem per HDD.

Realistically, I expect that there might be a sweet-spot at a much more
moderate number of HDDs per node, with a node hardware config that is
much less extreme than the example I described above.

I also wonder if perhaps the sweet-spot for #HDDs per OSD node might be
able to increase over time, as Ceph matures and more tuning is done.

Thanks in advance to anyone for any thoughts/comments on this topic.
Would appreciate any suggestions on better ways to analyze the
tradeoffs, and corrections of any fundamental misunderstandings that I
might have about how Ceph works and how to configure it.

-- 
Craig Dunwoody
GraphStream Incorporated


* Re: Hardware-config suggestions for HDD-based OSD node?
  2010-03-28 22:36 Hardware-config suggestions for HDD-based OSD node? Craig Dunwoody
@ 2010-03-29  0:29 ` Martin Millnert
  2010-03-29  1:15 ` Gregory Farnum
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Martin Millnert @ 2010-03-29  0:29 UTC (permalink / raw)
  To: Craig Dunwoody; +Cc: ceph-devel



On Sun, 2010-03-28 at 15:36 -0700, Craig Dunwoody wrote:
> I'd be interested to hear from anyone who has suggestions about
> optimizing the hardware config of an HDD-based OSD node for Ceph, using
> currently available COTS hardware components.

Craig, list,

while this does not match your G1, G2 or G3, there is a G4 absolutely
worth considering IMO:

  Maximize storage area and transfer speed divided by hardware
investment + MRU.

Then G5:
  Optimize for performance / node.

or G6:
  Optimize for performance of the storage network.

matters too.  And both must be weighed against not only the hardware
investment, but also the MRU due to space, cooling and power
consumption.
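
Purely as a sketch of how a G4-style figure of merit might be scored
(reading "MRU" as a recurring monthly cost is only an assumption for
this sketch):

    # Toy G4-style score: capacity * transfer speed per unit of total cost.
    # Plug in your own capacity, throughput, hardware cost and recurring cost.
    def g4_score(capacity_tb, throughput_mb_s, hw_cost, recurring_cost):
        return (capacity_tb * throughput_mb_s) / (hw_cost + recurring_cost)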

I've done some raw calculations for G4, and what I found was that if
you don't mind installing COTS hardware that isn't exactly your standard
data-center make and model, you stand to gain a lot by simply deploying
many quite low-power devices with 4-5 SATA ports each, IMO. But it
wholly depends on what you are after.  I believe it is very interesting
for a data-warehousing application of Ceph.

Potentially, I must add.  I haven't tried it.  :)
But for any sizable installation, I believe the storage network itself
will not, as you scale up, let you get sufficient performance.  I.e.,
you might hit some ceiling of the storage network's performance soon
enough anyway -- at least if you're using front ends to interface to it.

Unresolved in the above equation are MDS/OSD performance (and ratio) and
actual per-OSD performance.  Power consumption is quite easy to get a
ball-park max/min/avg figure on.


I think you have to figure out what it is you need done for your
specific application, and back-track from there. Because there is no
single optimal configuration of a distributed file system such as Ceph,
for all applications.

Cheers,
-- 
Martin Millnert <martin@millnert.se>


* Re: Hardware-config suggestions for HDD-based OSD node?
  2010-03-28 22:36 Hardware-config suggestions for HDD-based OSD node? Craig Dunwoody
  2010-03-29  0:29 ` Martin Millnert
@ 2010-03-29  1:15 ` Gregory Farnum
  2010-03-29  1:48   ` Craig Dunwoody
  2010-03-29  5:18 ` ales-76
  2010-03-29 21:26 ` Sage Weil
  3 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2010-03-29  1:15 UTC (permalink / raw)
  To: Craig Dunwoody; +Cc: ceph-devel

Craig:
I expect that Sage will have a lot more to offer you in this area, but
for now I have a few responses I can offer off the top of my head. :)
1) It's early days for Ceph. We're going to be offering a public beta
of the object store Real Soon Now that I expect will give us a better
idea of how different hardware scales, but it hasn't been run
long-term on anything larger than some old single- and dual-core
systems since Sage's thesis research.
2) The OSD code will happily eat all the memory you can give it to use
as cache; though the useful cache size/drive will of course depend on
your application. ;)
3) All the failure recovery code right now operates at the cosd process
level. You can design the CRUSH layout map in such a way that it won't
put any replicas on the same physical box, but you will need to be
much more careful of such things than if you're running one
process/box. This will also mean that a failure will impact your
network more dramatically -- each box which replicates/leads the failed
box will need to send data to p times as many other processes as if
they were running one process/box (p being the number of processes per
box). On the upside, that means recovery may be done faster (a rough
sketch of this fan-out follows below).
4) The less data you store per-process, the more your maintenance
overhead will be. If we've done our jobs right this won't be a problem
at all, but it would mean that any scaling issues appear to you faster
than to others.
5) The OSD supports different directories for the object store and for
the journal. SSDs will give you lots better journaling and thus lower
write latency, though if your applications are happy to do async IO I
don't think this should impact bandwidth.
-Greg
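
A rough sketch of that fan-out, where every figure is an illustrative
assumption (disks per box, peers per cosd, per-peer push rate) rather
than a Ceph measurement:

    # More cosd processes per box -> wider recovery fan-out -> (roughly)
    # shorter re-replication time, until the network or disks saturate.
    disks_per_box = 4          # assumed disks lost with the box
    tb_per_disk = 2.0
    peers_per_cosd = 10        # assumed replication peers per failed cosd
    peer_rate_mb_s = 50.0      # assumed sustained push rate per peer
    box_data_mb = disks_per_box * tb_per_disk * 1e6
    for p in (1, 4):           # cosd processes per box
        peers = p * peers_per_cosd
        hours = box_data_mb / (peers * peer_rate_mb_s) / 3600
        print(p, "cosd/box:", peers, "peers, ~%.1f h to re-replicate" % hours)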

On Sun, Mar 28, 2010 at 3:36 PM, Craig Dunwoody
<cdunwoody@graphstream.com> wrote:
> [original message quoted in full above -- snip]


* Re: Hardware-config suggestions for HDD-based OSD node?
  2010-03-29  1:15 ` Gregory Farnum
@ 2010-03-29  1:48   ` Craig Dunwoody
  0 siblings, 0 replies; 11+ messages in thread
From: Craig Dunwoody @ 2010-03-29  1:48 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: cdunwoody, ceph-devel


Hello Greg,

-   Thanks very much for your comments.

-   I will look forward to learning more about this as your team
    and others start to test Ceph on a wider range of hardware
    configs.

-   I can see how, for some applications, the amount of main-memory
    capacity available per HDD for caching might become a significant
    constraint on the max #HDDs that can be supported cost-efficiently
    per OSD node.

-   One thing I could conclude is that at least until more is known, there
    might be extra benefit from configuring nodes to allow for extra
    flexibility in the quantity of installed hardware resources (CPU,
    memory, HBA, NIC, HDD, SSD, etc.), such that these could be adjusted
    appropriately in response to measurements of how specific
    applications perform.

-- 
Craig Dunwoody
GraphStream Incorporated

greg writes:
>I expect that Sage will have a lot more to offer you in this area, but
>for now I have a few responses I can offer off the top of my head. :)
>1) It's early days for Ceph. We're going to be offering a public beta
>of the object store Real Soon Now that I expect will give us a better
>idea of how different hardware scales, but it hasn't been run
>long-term on anything larger than some old single- and dual-core
>systems since Sage's thesis research.
>2) The OSD code will happily eat all the memory you can give it to use
>as cache; though the useful cache size/drive will of course depend on
>your application. ;)
>3) All the failure recovery code right now operates at the cosd process
>level. You can design the CRUSH layout map in such a way that it won't
>put any replicas on the same physical box, but you will need to be
>much more careful of such things than if you're running one
>process/box. This will also mean that a failure will impact your
>network more dramatically-- each box which replicates/leads the failed
>box will need to send data to p times as many other processes as if
>they were running one process/box. (p being the number of
>processes/box) On the upside, that means recovery may be done faster.
>4) The less data you store per-process, the more your maintenance
>overhead will be. If we've done our jobs right this won't be a problem
>at all, but it would mean that any scaling issues appear to you faster
>than to others.
>5) The OSD supports different directories for the object store and for
>the journal. SSDs will give you lots better journaling and thus lower
>write latency, though if your applications are happy to do async IO I
>don't think this should impact bandwidth.


* Re: Hardware-config suggestions for HDD-based OSD node?
  2010-03-28 22:36 Hardware-config suggestions for HDD-based OSD node? Craig Dunwoody
  2010-03-29  0:29 ` Martin Millnert
  2010-03-29  1:15 ` Gregory Farnum
@ 2010-03-29  5:18 ` ales-76
  2010-03-29 13:00   ` Craig Dunwoody
  2010-03-29 21:26 ` Sage Weil
  3 siblings, 1 reply; 11+ messages in thread
From: ales-76 @ 2010-03-29  5:18 UTC (permalink / raw)
  To: ceph-devel; +Cc: Craig Dunwoody

> ------------ Original message ------------
> From: Craig Dunwoody <cdunwoody@graphstream.com>
> Subject: [ceph-devel] Hardware-config suggestions for HDD-based OSD node?
> Date: 29.3.2010 00:37:56

> As number of attached HDDs increases, I would expect to hit a number of
> hardware and software resource limitations in the node.  Certainly the
> achievable sustained throughput of the lowest-level hardware interfaces
> would be only a fraction of the aggregate-peak numbers that I listed
> above.

As to the number of disks, a Sun Fire X4500 with 48 SATA disks showed
real-world performance around 800-1000 MB/s (benchmarks on the web --
iSCSI, no Ceph). Since it is amd64-based, I guess you can get similar
I/O rates from any Intel box on the market today, provided you have
enough SATA ports and/or PCI-E slots. One 10 Gbps NIC should be enough
(these are usually two-ported anyway). I would say Linux will not be
the bottleneck here even if you use software RAID. Usually it is the
network or the protocol that limits the performance.

Regards

Aleš Bláha


* Re: Hardware-config suggestions for HDD-based OSD node?
  2010-03-29  5:18 ` ales-76
@ 2010-03-29 13:00   ` Craig Dunwoody
  2010-03-29 15:46     ` Aleš Bláha
  0 siblings, 1 reply; 11+ messages in thread
From: Craig Dunwoody @ 2010-03-29 13:00 UTC (permalink / raw)
  To: ales-76; +Cc: cdunwoody, ceph-devel


Hello Ales,

Thanks for your comments.  Because it is relatively simple to scale up
various hardware resources in an OSD node, even up to the extreme levels
I described, I look forward to being able to do experiments that add
various resources to an OSD node until reaching a point of diminishing
returns.

Unfortunately, this starts to get expensive for experiments with many
OSD nodes.  Also, any such results will depend on the characteristics
of the application workload.

Craig Dunwoody
GraphStream Incorporated

ales writes:
>As to the number of disks, a Sun Fire X4500 with 48 SATA disks showed
>real-world performance around 800-1000 MB/s (benchmarks on the web --
>iSCSI, no Ceph). Since it is amd64-based, I guess you can get similar
>I/O rates from any Intel box on the market today, provided you have
>enough SATA ports and/or PCI-E slots. One 10 Gbps NIC should be enough
>(these are usually two-ported anyway). I would say Linux will not be
>the bottleneck here even if you use software RAID. Usually it is the
>network or the protocol that limits the performance.


* Re: Hardware-config suggestions for HDD-based OSD node?
  2010-03-29 13:00   ` Craig Dunwoody
@ 2010-03-29 15:46     ` Aleš Bláha
  2010-03-29 22:05       ` [OLD ceph-devel] " Craig Dunwoody
  0 siblings, 1 reply; 11+ messages in thread
From: Aleš Bláha @ 2010-03-29 15:46 UTC (permalink / raw)
  To: Craig Dunwoody; +Cc: ceph-devel

On Mon, 29 Mar 2010 15:00:25 +0200 (CEST)
Craig Dunwoody <cdunwoody@graphstream.com> wrote:

Hello Craig,

Sure, raw numbers for hardware are often very impressive, but
it is up to the software to squeeze the performance out of it.
Experimenting is always the best way to find out, but you can search
the web to get the picture. I suggest taking a look at Lustre
setups -- the architecture is very similar, and they also use mostly
off-the-shelf components. Apparently the common Lustre OSD is rather
fat -- plenty of disks divided into several RAID groups, very much like
the X4500. Then again, HPC people are mostly focused on streaming
writes, so if your application differs then your point of diminishing
returns might be elsewhere.

Ales Blaha

> 
> Hello Ales,
> 
> Thanks for your comments.  Because it is relatively simple to scale-up
> various hardware resources in an OSD node, even up to extreme levels as
> I described, I look forward to being able to do experiments of adding
> various resources to an OSD node until a point of diminishing returns.
> 
> Unfortunately, this starts to get expensive for experiments with many
> OSD nodes. Also, any such results will also depend on characteristics of
> application workload.
> 
> Craig Dunwoody
> GraphStream Incorporated
> 
> ales writes:
> >As to the number of disks, Sun Fire X4500 with 48 SATA disks showed
> >real-world performance around 800-1000 MB/s (benchmarks on the web -
> >iSCSI, no Ceph). Since it is a amd64 based I guess you can get similar
> >I/O rates from any Intel box on the market today, provided you have
> >enough SATA ports and/or PCI-E slots. One 10gbps NIC should be enough
> >(these are usually two ported anyway). I would say the Linux will not be
> >the bottleneck here even if you use software RAID. Usually it is the
> >network or the protocol that limit the performance.


* Re: [OLD ceph-devel] Hardware-config suggestions for HDD-based OSD node?
  2010-03-28 22:36 Hardware-config suggestions for HDD-based OSD node? Craig Dunwoody
                   ` (2 preceding siblings ...)
  2010-03-29  5:18 ` ales-76
@ 2010-03-29 21:26 ` Sage Weil
  2010-03-29 22:54   ` Craig Dunwoody
  3 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2010-03-29 21:26 UTC (permalink / raw)
  To: Craig Dunwoody; +Cc: ceph-devel, ceph-devel

Hi Craig,

On Sun, 28 Mar 2010, Craig Dunwoody wrote:
> -   It's very early days for Ceph, no one really knows yet, and the only
>     way to find out is to experiment with real hardware and
>     applications, which is expensive

Generally, this is where we are at.  :) I have just a few other things to 
add to this thread.

First, the cosd daemon is pretty heavily multithreaded, at various levels.  
There are threads for handling network IO, for processing reads, for 
serializing and preparing writes, and for journaling and applying writes 
to disk.  A single properly tuned cosd daemon can probably keep all your 
cores busy.  The thread pool sizes are all configurable, though, so you 
can also run multiple daemons per machine.

I would be inclined to pool multiple raw disks together with btrfs for 
each cosd instance as that will let btrfs replicate its metadata 
(and/or data) and recover from disk errors.  That will generally be faster 
than having ceph rereplicate all of the node's data.

The other performance consideration is the osd journaling.  If low latency 
writes are a concern, the journal can be placed on a separate device.  
That can be a raw disk device (dedicated spindle), although you generally 
pay the full rotational latency each time cosd sends accumulated items to 
the disk (some smarts that tries to time the disk rotation and adjusts the 
next write accordingly could probably improve this).  It also wastes an 
entire disk for a journal that doesn't need to get that big.
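
As a rough feel for that rotational-latency cost (generic 7200RPM
figures and an assumed SSD write latency, not measurements):

    # Journal flush latency: dedicated 7200RPM spindle vs. a typical SSD.
    full_rotation_ms = 60.0 / 7200 * 1000    # ~8.3 ms per flush, worst case
    ssd_write_ms = 0.2                       # assumed SSD write latency
    print("spindle: %.1f ms -> ~%d flushes/sec max"
          % (full_rotation_ms, 1000 / full_rotation_ms))
    print("ssd:     %.1f ms -> ~%d flushes/sec max"
          % (ssd_write_ms, 1000 / ssd_write_ms))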

The journal can also be put on an NVRAM device, like a Micro Memory card.  
These are a bit hard to come by, but they're fast.

An SSD is probably the most cost effective option.  Cheap models are 
probably fine, too, since all writes are sequential and probably won't 
work the firmware very hard.

My suspicion is that the most frequent performance limiter is going to be 
the network.  Any node with 2 or more disks can outrun a GigE link with 
streaming io, and 10GigE deployments aren't all that common yet.  Cheap 
switches with narrow backplanes also tend to be a bottleneck.

In the end, it's really going to come down to the workload.  How hot/cold 
is the data?  Maybe 1 Gbps per 100TB osd is fine, maybe it's not.  How much 
RAM is going to be a cost/benefit question, which depends on how skewed 
the access distribution is (more skewed = more effective caching, whereas 
caches will be almost useless with a truly flat access distribution).
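
To put a rough number on that skew point -- an idealized cache and a
Zipf-like popularity curve are assumptions here, not workload data:

    # Ideal cache holding the k hottest of N objects:
    # Zipf(1)-like access vs. a truly flat access distribution.
    def harmonic(n):
        return sum(1.0 / i for i in range(1, n + 1))

    N = 1000000
    H_N = harmonic(N)
    for frac in (0.01, 0.05, 0.10):
        k = int(N * frac)
        print("cache %3.0f%% of data: zipf hit ~%2.0f%%, flat hit %2.0f%%"
              % (frac * 100, 100 * harmonic(k) / H_N, 100 * frac))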

> I would expect that long before reaching 62 HDDs, many other constraints
> would cause the per-HDD performance contribution to drop below the
> single-HDD-per-server level, including:
> 
> -   Limitations in CPU throughput
> -   Limitations in main-memory throughput and capacity
> -   Various Linux limitations
> -   Various Ceph limitations

The last thing I'll mention is that the cosd code isn't very well 
optimized at this point.  At the very least, writes are fully buffered 
(which means at least one memory copy into the page cache).  And there is 
a lot of other stuff going on in preparing and staging writes that could 
be improved performance-wise.

Cheers-
sage


* Re: [OLD ceph-devel] Hardware-config suggestions for HDD-based OSD node?
  2010-03-29 15:46     ` Aleš Bláha
@ 2010-03-29 22:05       ` Craig Dunwoody
  0 siblings, 0 replies; 11+ messages in thread
From: Craig Dunwoody @ 2010-03-29 22:05 UTC (permalink / raw)
  To: Aleš Bláha; +Cc: cdunwoody, ceph-devel


Hello Ales, 

Thank you for the additional suggestions.  It will be interesting
to see the similarities and differences between hardware configs that
work best for Lustre, and those that end up working best for Ceph.

Craig Dunwoody
GraphStream Incorporated

ales writes:
>Sure raw numbers for a hardware are often very impressive, but
>it is up to the software to squeeze the performance out of it.
>Experimenting is always the best way to find out, but you can search
>the web to get the picture. I suggest taking a look at Lustre
>setups - the architecture is very similar, they also use mostly
>off-the-shelf components. Apparently the common Lustre OSD is rather fat
>- plenty of disks divided into several RAID groups, very much like the
>X4500. Then again, HPC people are mostly focused on
>streaming writes, so if your application differs then your point of
>diminishing returns might be elsewhere.


* Re: [OLD ceph-devel] Hardware-config suggestions for HDD-based OSD node?
  2010-03-29 21:26 ` Sage Weil
@ 2010-03-29 22:54   ` Craig Dunwoody
  0 siblings, 0 replies; 11+ messages in thread
From: Craig Dunwoody @ 2010-03-29 22:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: cdunwoody, ceph-devel, ceph-devel


Hello Sage,

Thanks very much for your comments.

I can see it making sense to set up btrfs to stripe across multiple HDDs
and replicate metadata.  I'd be more reluctant to replicate file data
inside the fault domain of a single OSD node, because I feel I would get
more benefit from that (relatively expensive) redundancy by putting it
in a separate node.

In general, I expect that Ceph will end up getting used across a very
diverse set of applications, and accordingly there might be a lot of
variation among applications in terms of specific performance
limitations that people run up against.

I would agree that when designing a storage system for any set of
applications, it's certainly quite a challenge to characterize the
expected workload (which might very well also change over time), and
then come up with a particular configuration of currently-available
off-the-shelf hardware and software building-blocks that one hopes could
support that workload more efficiently than other currently-available
alternatives.

Almost all of the storage systems that my company currently builds end
up in HPC-type setups that already natively have relatively fat network
pipes for cluster-interconnect (e.g. 32Gbps InfiniBand and/or 10Gbps
Ethernet), and there is a natural desire to try to fill up fatter pipes
using fatter storage bricks.

That said, I can imagine that even for this kind of situation, it might
sometimes be more efficient to fan out to a larger number of thinner
network pipes and thinner storage bricks.

Perhaps some future Ceph optimizations will be most helpful for thinner
OSD-brick setups, and others will be more important for fatter OSD
bricks that are trying to fill fatter network pipes.  Fortunately, it
appears that there are still plenty of optimizations left to do in Ceph
that are likely to benefit a very wide range of applications and
hardware configs.

Craig Dunwoody
GraphStream Incorporated

sage writes:
>Generally, this is where we are at.  :) I have just a few other things to 
>add to this thread.
>
>First, the cosd daemon is pretty heavily multithreaded, at various levels.  
>There are threads for handling network IO, for processing reads, for 
>serializing and preparing writes, and for journaling and applying writes 
>to disk.  A single properly tuned cosd daemon can probably keep all your 
>cores busy.  The thread pool sizes are all configurable, though, so you 
>can also run multiple daemons per machine.
>
>I would be inclined to pool multiple raw disks together with btrfs for 
>each cosd instance as that will let btrfs replicate its metadata 
>(and/or data) and recover from disk errors.  That will generally be faster 
>than having ceph rereplicate all of the node's data.
>
>The other performance consideration is the osd journaling.  If low latency 
>writes are a concern, the journal can be placed on a separate device.  
>That can be a raw disk device (dedicated spindle), although you generally 
>pay the full rotational latency each time cosd sends accumulated items to 
>the disk (some smarts that tries to time the disk rotation and adjusts the 
>next write accordingly could probably improve this).  It also wastes an 
>entire disk for a journal that doesn't need to get that big.
>
>The journal can also be put on an NVRAM device, like a Micro Memory card.  
>These are a bit hard to come by, but they're fast.
>
>An SSD is probably the most cost effective option.  Cheap models are 
>probably fine, too, since all writes are sequential and probably won't 
>work the firmware very hard.
>
>My suspicion is that the most frequent performance limiter is going to be 
>the network.  Any node with 2 or more disks can outrun a GigE link with 
>streaming io, and 10GigE deployments aren't all that common yet.  Cheap 
>switches with narrow backplanes also tend to be a bottleneck.
>
>In the end, it's really going to come down to the workload.  How hot/cold 
>is the data?  Maybe 1 Gbps per 100TB osd is fine, maybe it's not.  How much 
>RAM is going to be a cost/benefit question, which depends on how skewed 
>the access distribution is (more skewed = more effective caching, whereas 
>caches will be almost useless with a truly flat access distribution).
>
>The last thing I'll mention is that the cosd code isn't very well 
>optimized at this point.  At the very least, writes are fully buffered 
>(which means at least one memory copy into the page cache).  And there is 
>a lot of other stuff going on in preparing and staging writes that could 
>be improved performance-wise.


* Re: Hardware-config suggestions for HDD-based OSD node?
@ 2010-03-29  0:58 Craig Dunwoody
  0 siblings, 0 replies; 11+ messages in thread
From: Craig Dunwoody @ 2010-03-29  0:58 UTC (permalink / raw)
  To: Martin Millnert; +Cc: cdunwoody, ceph-devel


Hello Martin,

martin writes:
>while this does not match your G1, G2 or G3, there is a G4 absolutely
>worth considering IMO:
>  Maximize storage area and transfer speed divided by hardware
>investment + MRU.
...
>I think you have to figure out what it is you need done for your
>specific application, and back-track from there. Because there is no
>single optimal configuration of a distributed file system such as Ceph,
>for all applications.

Thanks very much for your comments.

Sorry that I wasn't familiar with your use of the term "MRU" -- please
help me understand.

I agree completely that the details of Ceph node and cluster configs
that are most efficient for a specific application will depend on many
details of that application's requirements.

Craig Dunwoody
cdunwoody@graphstream.com


Thread overview: 11+ messages
2010-03-28 22:36 Hardware-config suggestions for HDD-based OSD node? Craig Dunwoody
2010-03-29  0:29 ` Martin Millnert
2010-03-29  1:15 ` Gregory Farnum
2010-03-29  1:48   ` Craig Dunwoody
2010-03-29  5:18 ` ales-76
2010-03-29 13:00   ` Craig Dunwoody
2010-03-29 15:46     ` Aleš Bláha
2010-03-29 22:05       ` [OLD ceph-devel] " Craig Dunwoody
2010-03-29 21:26 ` Sage Weil
2010-03-29 22:54   ` Craig Dunwoody
2010-03-29  0:58 Craig Dunwoody
