From: Sage Weil
Subject: Re: [OLD ceph-devel] Hardware-config suggestions for HDD-based OSD node?
Date: Mon, 29 Mar 2010 14:26:16 -0700 (PDT)
To: Craig Dunwoody
Cc: ceph-devel@vger.kernel.org, ceph-devel@lists.sourceforge.net
In-Reply-To: <23450.1269815817@n20.hq.graphstream.com>

Hi Craig,

On Sun, 28 Mar 2010, Craig Dunwoody wrote:
> - It's very early days for Ceph, no one really knows yet, and the only
>   way to find out is to experiment with real hardware and
>   applications, which is expensive

Generally, this is where we are at. :)  I have just a few other things
to add to this thread.

First, the cosd daemon is pretty heavily multithreaded, at various
levels: there are threads for handling network IO, for processing
reads, for serializing and preparing writes, and for journaling and
applying writes to disk. A single properly tuned cosd daemon can
probably keep all your cores busy. The thread pool sizes are all
configurable, though, so you can also run multiple daemons per machine.

I would be inclined to pool multiple raw disks together with btrfs for
each cosd instance, as that lets btrfs replicate its metadata (and/or
data) and recover from disk errors. That will generally be faster than
having Ceph re-replicate all of the node's data.

The other performance consideration is the OSD journal. If low-latency
writes are a concern, the journal can be placed on a separate device.
That can be a raw disk device (a dedicated spindle), although you
generally pay the full rotational latency each time cosd sends
accumulated items to the disk (some smarts that try to time the disk
rotation and adjust the next write accordingly could probably improve
this). It also wastes an entire disk on a journal that doesn't need to
be that big.

The journal can also be put on an NVRAM device, like a Micro Memory
card. These are a bit hard to come by, but they're fast. An SSD is
probably the most cost-effective option. Cheap models are probably
fine, too, since all writes are sequential and probably won't work the
firmware very hard.
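To make the disk-pooling idea concrete, here is a rough sketch of
putting two drives into one btrfs filesystem for a single cosd
instance. The device names and mount point are made up, and the exact
tool invocations vary a bit across btrfs-progs versions:

    # Mirror metadata across both drives so btrfs can survive bad
    # sectors; stripe data for throughput (use -d raid1 to mirror
    # data too, at the cost of half the usable capacity).
    mkfs.btrfs -m raid1 -d raid0 /dev/sdb /dev/sdc

    # Let the kernel discover the multi-device filesystem (older
    # tools: btrfsctl -a), then mount via any member device.
    btrfs device scan
    mount /dev/sdb /data/osd0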
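Likewise, here is a hypothetical ceph.conf fragment running two cosd
daemons on one machine, each with its journal on its own small SSD
partition. Section and option names follow the current config syntax
and may differ in your version; the paths, sizes, and thread counts
are placeholders, not recommendations:

    [osd]
        osd journal size = 512   ; in MB; the journal doesn't need to be big
        ; bump these if a single daemon can't keep the cores busy
        osd op threads = 4
        osd disk threads = 2

    [osd.0]
        host = node1
        osd data = /data/osd0    ; the btrfs pool from above
        osd journal = /dev/sda5  ; hypothetical SSD partition

    [osd.1]
        host = node1
        osd data = /data/osd1
        osd journal = /dev/sda6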
My suspicion is that the most frequent performance limiter is going to
be the network. Any node with two or more disks can outrun a GigE link
with streaming IO (two drives streaming at 60-80 MB/s apiece already
exceed the ~120 MB/s a GigE link can carry), and 10GigE deployments
aren't all that common yet. Cheap switches with narrow backplanes also
tend to be a bottleneck.

In the end, it's really going to come down to the workload. How
hot/cold is the data? Maybe 1 Gbps per 100 TB OSD is fine, maybe it's
not. How much RAM to buy is going to be a cost/benefit question, which
depends on how skewed the access distribution is (more skewed = more
effective caching, whereas caches will be almost useless with a truly
flat access distribution).

> I would expect that long before reaching 62 HDDs, many other constraints
> would cause the per-HDD performance contribution to drop below the
> single-HDD-per-server level, including:
>
> - Limitations in CPU throughput
> - Limitations in main-memory throughput and capacity
> - Various Linux limitations
> - Various Ceph limitations

The last thing I'll mention is that the cosd code isn't very well
optimized at this point. At the very least, writes are fully buffered
(which means at least one memory copy into the page cache), and there
is a lot of other stuff going on in preparing and staging writes that
could be improved performance-wise.

Cheers-
sage