From: Craig Dunwoody
Subject: Hardware-config suggestions for HDD-based OSD node?
Date: Sun, 28 Mar 2010 15:36:57 -0700
Message-ID: <23450.1269815817@n20.hq.graphstream.com>
To: ceph-devel@lists.sourceforge.net
Cc: cdunwoody@graphstream.com

I'd be interested to hear from anyone who has suggestions about
optimizing the hardware config of an HDD-based OSD node for Ceph, using
currently available COTS hardware components. More specifically, I'm
interested in how one might try for an efficient balance among key
hardware resources, including:

    CPU cores
    Main-memory throughput and capacity
    HDD controllers
    HDDs
    SSDs for journaling, if any
    NICs

Some reasonable answers I expect might include:

- It's very early days for Ceph, no one really knows yet, and the only
  way to find out is to experiment with real hardware and applications,
  which is expensive
- The answer depends a lot on many factors, including:
  - Cost/performance tradeoff choices for a particular application
  - Details of workloads for a particular application
  - Details of hardware-component performance characteristics

Seems to me that one of many possible approaches would be to choose a
particular HDD type (e.g. 2TB 3.5" 7200RPM SAS-6G), and then work
toward the following goals, recognizing that there are
tensions/conflicts among them:

Goal G1: Maximize the incremental improvement in overall FS access
         performance that results from each incremental addition of a
         single HDD.

Goal G2: Minimize physical space used per bit of total FS capacity.

Goal G3: Minimize total hardware cost per bit of total FS capacity.

I would expect to be able to do well on G1 by stacking up nodes, each
with a single HDD, a single cosd instance, and one or more GigE ports.
However, I would expect to do better on G2 and G3 by increasing #HDDs
per node.

Based on currently available server components that are relatively
inexpensive and convenient to deploy, I can imagine that for some
applications it might be attractive to stack up 1RU-rackmount nodes,
each with four HDDs, four cosd instances, and two or more GigE ports.

Beyond that, I'm wondering if it would be possible to serve some
applications better with a fatter OSD node config. In particular, could
I improve space-efficiency (G2) and maybe also cost-per-bit (G3) by
increasing the #HDDs per node until the incremental performance
contribution of each additional HDD (G1) just starts to drop below what
I would get with only a single HDD per node?

As one really extreme example, at a cost that might be acceptable for
some applications I could build a single max-configuration node with:

    2 CPU sockets
    24 CPU threads (2 x 6core x 2thread, or 2 x 12core x 1thread)
    12 DIMMs (currently up to 96GB capacity, up to 85 GByte/sec peak)
    3 8port SAS-6G HBAs (aggregate 14.4 GByte/sec peak to HDDs)
    5 2port 10GigE NICs (aggregate 12.5 GByte/sec peak to network)

Using an appropriate chassis, I could attach a pretty large number of
2TB 3.5" 7200RPM SAS-6G HDDs to this node, even hundreds if I wanted to
(but I wouldn't). I'm wondering how large I could push the number of
attached HDDs before the incremental performance contribution of each
HDD starts to drop off.
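For anyone who wants to check or adjust those aggregate-peak figures,
here's a trivial back-of-envelope sketch in Python (nothing
Ceph-specific; the ~600 MByte/sec usable per SAS-6G lane and the
1.25 GByte/sec raw per 10GigE port are my own assumptions):

    # Recompute the aggregate-peak figures quoted above for the
    # hypothetical max-configuration node. Real sustained numbers
    # will be lower than these peaks.
    SAS_LANE_MBPS    = 600     # 6 Gbit/s lane, 8b/10b encoding -> ~600 MByte/sec usable
    TENGIG_PORT_MBPS = 1250    # 10 Gbit/s / 8 = 1.25 GByte/sec raw per port

    hbas, ports_per_hba = 3, 8
    nics, ports_per_nic = 5, 2

    sas_peak_gbps = hbas * ports_per_hba * SAS_LANE_MBPS / 1000.0      # 14.4
    net_peak_gbps = nics * ports_per_nic * TENGIG_PORT_MBPS / 1000.0   # 12.5

    print("SAS aggregate peak:    %.1f GByte/sec" % sas_peak_gbps)
    print("10GigE aggregate peak: %.1f GByte/sec" % net_peak_gbps)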
As the number of attached HDDs increases, I would expect to hit a
number of hardware and software resource limitations in the node.
Certainly the achievable sustained throughput of the lowest-level
hardware interfaces would be only a fraction of the aggregate-peak
numbers that I listed above.

As one very crude calculation, ignoring many other constraints: if I
thought that I could get all HDDs streaming simultaneously to Ethernet
at a sustained 100 MByte/sec each (I can't), and I thought that I could
sustain 50% of wire speed across the ten 10GigE ports, then I'd limit
myself to about 62 HDDs (6.25 GByte/sec) to avoid worrying about the
Ethernet interfaces throttling the aggregate streaming throughput of
the HDDs. I expect that a more-realistic assumption about max aggregate
streaming throughput under Ceph would lead to a higher limit on #HDDs
based on this one consideration.

I would expect that long before reaching 62 HDDs, many other
constraints would cause the per-HDD performance contribution to drop
below the single-HDD-per-server level, including:

- Limitations in CPU throughput
- Limitations in main-memory throughput and capacity
- Various Linux limitations
- Various Ceph limitations

62 HDDs and 62 cosd instances would be about 2.6 cosd instances per CPU
thread, which seems to me like a lot. I would not be surprised at all
to receive a recommendation to limit this to less than 1.0 cosd
instance per CPU thread.

I can imagine reducing the number of cosd instances by running each
atop a multi-HDD btrfs-level stripe, but I expect that might have
various disadvantages, and I do like the simplicity of one cosd
instance per btrfs filesystem per HDD.

Realistically, I expect that there might be a sweet spot at a much more
moderate number of HDDs per node, with a node hardware config that is
much less extreme than the example I described above. I also wonder
whether the sweet spot for #HDDs per OSD node might increase over time,
as Ceph matures and more tuning is done.

Thanks in advance to anyone for any thoughts/comments on this topic. I
would appreciate any suggestions on better ways to analyze the
tradeoffs, and corrections of any fundamental misunderstandings that I
might have about how Ceph works and how to configure it.

--
Craig Dunwoody
GraphStream Incorporated
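P.S. In case anyone wants to plug in their own numbers, here's the
crude wire-speed arithmetic above as a small Python sketch; the
100 MByte/sec per-HDD streaming rate and the 50%-of-wire-speed factor
are just the guesses stated above, not measured Ceph figures:

    # How many streaming HDDs before the Ethernet side becomes the
    # bottleneck? Assumptions are the guesses from the text above.
    hdd_stream_mbps = 100.0    # optimistic sustained streaming per HDD
    ports           = 10       # 5 x 2port 10GigE NICs
    port_raw_mbps   = 1250.0   # 10 Gbit/s / 8
    wire_efficiency = 0.5      # assume 50% of wire speed sustained

    net_budget_mbps = ports * port_raw_mbps * wire_efficiency    # 6250
    max_hdds = int(net_budget_mbps // hdd_stream_mbps)           # 62

    print("Sustained network budget: %.2f GByte/sec" % (net_budget_mbps / 1000.0))
    print("HDDs before Ethernet throttles streaming: %d" % max_hdds)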