* Hardware-config suggestions for HDD-based OSD node?
@ 2010-03-28 22:36 Craig Dunwoody
From: Craig Dunwoody @ 2010-03-28 22:36 UTC (permalink / raw)
  To: ceph-devel; +Cc: cdunwoody


I'd be interested to hear from anyone who has suggestions about
optimizing the hardware config of an HDD-based OSD node for Ceph, using
currently available COTS hardware components.

More specifically, I'm interested in how one might try for an efficient
balance among key hardware resources including:

    CPU cores
    Main-memory throughput and capacity
    HDD controllers
    HDDs
    SSDs for journaling, if any
    NICs

Some reasonable answers I expect might include:

-   It's very early days for Ceph, no one really knows yet, and the only
    way to find out is to experiment with real hardware and
    applications, which is expensive

-   Answer depends a lot on many factors, including:
    -   Cost/performance tradeoff choices for a particular application
    -   Details of workloads for a particular application
    -   Details of hardware-component performance characteristics

Seems to me that one of many possible approaches would be to choose a
particular HDD type (e.g. 2TB 3.5" 7200RPM SAS-6G), and then work toward
the following goals, recognizing that there are tensions/conflicts among
these goals:

    Goal G1
        Maximize the incremental improvement in overall FS access
        performance that results from each incremental addition of a
        single HDD.

    Goal G2
        Minimize physical space used per bit of total FS capacity.

    Goal G3
        Minimize total hardware cost per bit of total FS capacity.

I would expect to be able to do well on G1 by stacking up nodes, each
with a single HDD, single cosd instance, and one or more GigE ports.
However, I would expect to do better on G2 and G3 by increasing #HDDs
per node.

Based on currently available server components that are relatively
inexpensive and convenient to deploy, I can imagine that for some
applications it might be attractive to stack up 1RU-rackmount nodes,
each with four HDDs, four cosd instances, and two or more GigE ports.

Beyond that, I'm wondering if it would be possible to serve some
applications better with a fatter OSD node config.  In particular, could
I improve space-efficiency (G2) and maybe also cost-per-bit (G3) by
increasing the #HDDs per node until incremental performance contribution
of each additional HDD (G1) just starts to drop below what I would get
with only a single HDD per node?

As one really extreme example, at a cost that might be acceptable for
some applications, I could build a single max-configuration node with:
    
     2 CPU sockets
    24 CPU threads (2 x 6-core x 2-thread, or 2 x 12-core x 1-thread)
    12 DIMMs (currently up to 96GB capacity, up to 85 GByte/sec peak)
     3 8-port SAS-6G HBAs (aggregate 14.4 GByte/sec peak to HDDs)
     5 2-port 10GigE NICs (aggregate 12.5 GByte/sec peak to network)

Using appropriate chassis, I could attach a pretty large number of 2TB
3.5" 7200RPM SAS-6G HDDs to this node, even hundreds if I wanted to (but
I wouldn't).
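
To show where those aggregate-peak numbers come from, here is a quick
back-of-the-envelope sketch in Python; the 600 MByte/sec usable per
SAS-6G lane (after 8b/10b encoding) and 1.25 GByte/sec raw per 10GigE
port are my assumptions:

    # Aggregate peak bandwidths for the example node above.
    SAS_LANE_GBYTES_SEC = 0.6      # usable per SAS-6G lane (assumed)
    TENGIG_PORT_GBYTES_SEC = 1.25  # raw per 10GigE port (assumed)

    hba_peak = 3 * 8 * SAS_LANE_GBYTES_SEC     # 3 HBAs x 8 ports
    nic_peak = 5 * 2 * TENGIG_PORT_GBYTES_SEC  # 5 NICs x 2 ports

    print("HDD-side peak:     %.1f GByte/sec" % hba_peak)  # 14.4
    print("network-side peak: %.1f GByte/sec" % nic_peak)  # 12.5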

I'm wondering how large I could push the number of attached HDDs, before
the incremental performance contribution of each HDD starts to drop off.

As number of attached HDDs increases, I would expect to hit a number of
hardware and software resource limitations in the node.  Certainly the
achievable sustained throughput of the lowest-level hardware interfaces
would be only a fraction of the aggregate-peak numbers that I listed
above.

As one very crude calculation, ignoring many other constraints, if I
thought that I could get all HDDs streaming simultaneously to Ethernet
at a sustained 100MByte/sec each (I can't), and I thought that I could
sustain 50% of wire-speed across the ten 10GigE ports, then I'd limit
myself to about 62 HDDs (6.25 GByte/sec) to avoid worrying about the
Ethernet interfaces throttling the aggregate streaming throughput of the
HDDs.

I expect that a more-realistic (i.e. lower) assumption about the
per-HDD streaming throughput achievable under Ceph would lead to a
higher limit on #HDDs based on this one consideration.
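
Here is that crude calculation as a small Python sketch, so the
assumptions are easy to vary; the 60 MByte/sec value in the last line
is just an illustrative guess at a lower per-HDD rate:

    # Network-limited HDD count, ignoring all other constraints.
    def max_hdds(per_hdd_mbytes_sec, ports=10, port_gbytes_sec=1.25,
                 wire_efficiency=0.5):
        usable_mbytes_sec = ports * port_gbytes_sec * wire_efficiency * 1000
        return int(usable_mbytes_sec / per_hdd_mbytes_sec)

    print(max_hdds(100))  # ~62 HDDs at a generous 100 MByte/sec per HDD
    print(max_hdds(60))   # ~104 HDDs at a lower assumed per-HDD rate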

I would expect that long before reaching 62 HDDs, many other constraints
would cause the per-HDD performance contribution to drop below the
single-HDD-per-server level, including:

-   Limitations in CPU throughput
-   Limitations in main-memory throughput and capacity
-   Various Linux limitations
-   Various Ceph limitations

62 HDDs and 62 cosd instances would be 2.6 cosd instances per CPU
thread, which seems to me like a lot.  I would not be surprised at all
to receive a recommendation to limit to less than 1.0 cosd instance per
CPU thread.

I can imagine reducing the number of cosd instances by running each atop
a multi-HDD btrfs-level stripe, but I expect that might have various
disadvantages, and I do like the simplicity of one cosd instance per
btrfs filesystem per HDD.
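
For concreteness, here is roughly what I have in mind for the
one-cosd-per-HDD layout, as a ceph.conf fragment in the style of the
wiki examples (section and option names are from memory, and the host
and device names are just placeholders, so please correct me if I have
the syntax wrong):

    [osd]
            osd data = /data/osd$id
            osd journal = /data/osd$id/journal

    [osd0]
            host = fatnode0
            btrfs devs = /dev/sda

    [osd1]
            host = fatnode0
            btrfs devs = /dev/sdb

    ; ...one [osdN] section per HDD, so that each cosd instance gets
    ; its own btrfs filesystem on its own drive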

Realistically, I expect that there might be a sweet-spot at a much more
moderate number of HDDs per node, with a node hardware config that is
much less extreme than the example I described above.

I also wonder if perhaps the sweet-spot for #HDDs per OSD node might be
able to increase over time, as Ceph matures and more tuning is done.

Thanks in advance to anyone for any thoughts/comments on this topic.
Would appreciate any suggestions on better ways to analyze the
tradeoffs, and corrections of any fundamental misunderstandings that I
might have about how Ceph works and how to configure it.

-- 
Craig Dunwoody
GraphStream Incorporated


* Re: Hardware-config suggestions for HDD-based OSD node?
@ 2010-03-29  0:58 Craig Dunwoody
From: Craig Dunwoody @ 2010-03-29  0:58 UTC (permalink / raw)
  To: Martin Millnert; +Cc: cdunwoody, ceph-devel


Hello Martin,

Martin writes:
>while this does not match your G1, G2 or G3, there is a G4 absolutely
>worth considering IMO:
>  Maximize storage area and transfer speed divided by hardware
>investment + MRU.
...
>I think you have to figure out what it is you need done for your
>specific application, and back-track from there. Because there is no
>single optimal configuration of a distributed file system such as Ceph,
>for all applications.

Thanks very much for your comments.

Sorry that I wasn't familiar with your use of the term "MRU" -- please
help me understand.

I agree completely that the details of Ceph node and cluster configs
that are most efficient for a specific application will depend on many
details of that application's requirements.

Craig Dunwoody
cdunwoody@graphstream.com


Thread overview: 11+ messages
2010-03-28 22:36 Hardware-config suggestions for HDD-based OSD node? Craig Dunwoody
2010-03-29  0:29 ` Martin Millnert
2010-03-29  1:15 ` Gregory Farnum
2010-03-29  1:48   ` Craig Dunwoody
2010-03-29  5:18 ` ales-76
2010-03-29 13:00   ` Craig Dunwoody
2010-03-29 15:46     ` Aleš Bláha
2010-03-29 22:05       ` [OLD ceph-devel] " Craig Dunwoody
2010-03-29 21:26 ` Sage Weil
2010-03-29 22:54   ` Craig Dunwoody
2010-03-29  0:58 Craig Dunwoody
