From mboxrd@z Thu Jan 1 00:00:00 1970
To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: 1 week to rebuild 4x 3TB raid10 is a long time!
Date: Tue, 22 Jul 2014 02:51:13 +0000 (UTC)
References: <53CC1553.1020908@shiftmail.org> <20140721013609.6d99c399@natsu> <37e3a8cf8b7439d5cd2745b5efb9d37f.squirrel@webmail.wanet.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8

ronnie sahlberg posted on Mon, 21 Jul 2014 09:46:07 -0700 as excerpted:

> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12 ms average seek time (normal for 7200 RPM SATA
>>> drives), an 8.3 ms rotational latency (half a rotation), an average
>>> 64 KB write and a 100 MB/s streaming write speed, each write comes in
>>> at ~21 ms, which gives us ~47 IOPS. With the 64 KB write size, this
>>> comes out to ~3 MB/s, DISK LIMITED.
>>>
>>> The 5 MB/s that TM is seeing is fine, considering the small files he
>>> says he has.
>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the
> filesystem layer.

If we were talking about a normal raid, yes.
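For what it's worth, the quoted per-write arithmetic does check out on its own terms. A minimal sketch, using only the figures quoted above (12 ms seek, 8.3 ms rotational latency, 64 KB writes, 100 MB/s streaming):

```python
# Sanity check of the quoted per-write cost estimate. All figures come
# from the thread: 12 ms seek, 8.3 ms rotational latency, 64 KB average
# write, 100 MB/s streaming write speed.
SEEK_MS = 12.0       # average seek time, 7200 RPM SATA
ROT_MS = 8.3         # rotational latency (half a rotation)
WRITE_KB = 64        # average write size
STREAM_MB_S = 100.0  # streaming write speed

transfer_ms = WRITE_KB / 1024 / STREAM_MB_S * 1000  # time to move 64 KB
per_write_ms = SEEK_MS + ROT_MS + transfer_ms       # ~21 ms per write
iops = 1000 / per_write_ms                          # ~47 IOPS
mb_s = iops * WRITE_KB / 1024                       # ~3 MB/s, disk limited

print(f"{per_write_ms:.1f} ms/write, {iops:.0f} IOPS, {mb_s:.1f} MB/s")
```

So for small random writes the drive itself really is the ~3 MB/s bottleneck; the disagreement below is only over whether a rebuild has to behave like small random writes at all.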
But we're talking about btrFS, note the FS for filesystem, so indeed it
*IS* the filesystem layer. Now this particular "filesystem" /does/ happen
to have raid properties as well, but it's definitely filesystem level...

> It does not matter at all what the average file size is.

... and the file size /does/ matter.

> Raid rebuild is really only limited by disk i/o speed when performing a
> linear read of the whole spindle using huge i/o sizes,
> or, if you have multiple spindles on the same bus, the bus saturation
> speed.

Makes sense... if you're dealing at the raid level. If we were talking
about dmraid or mdraid it would hold, and they're both much more mature
and optimized as well, so 50 MiB/sec per spindle, in parallel, would
indeed be a reasonable expectation for them.

But (barring bugs, which can and do happen at this stage of development)
btrfs both makes far better data-validity guarantees and does a lot more
complex processing, what with COW and snapshotting, etc., in addition to
the normal filesystem-level stuff AND the raid-level stuff it does.

> Thus it is perfectly reasonable to expect ~50 MByte/second, per
> spindle, when doing a raid rebuild.

... And perfectly reasonable, at least at this point, to expect ~5 MiB/
sec total throughput, one spindle at a time, for btrfs.

> That is for the naive rebuild that rebuilds every single stripe. A
> smarter rebuild that knows which stripes are unused can skip the unused
> stripes and thus become even faster than that.
>
> Now, that the rebuild is off by an order of magnitude is by design but
> should be fixed at some stage, but with the current state of btrfs it
> is probably better to focus on other more urgent areas first.

Because of all the extra work it does, btrfs may never reach full
streaming speed across all spindles at once. But it can and will
certainly get much better than it is, once the focus moves to
optimization.
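Those two rates bracket the rebuild times fairly well, too. A quick back-of-the-envelope sketch for a single 3 TB drive, assuming the ~50 MiB/s and ~5 MiB/s figures discussed above:

```python
# Back-of-the-envelope rebuild time for one 3 TB drive at the two
# throughput figures from the thread: ~50 MiB/s per spindle for a mature
# raid layer, vs the ~5 MiB/s total currently seen with btrfs.
DRIVE_BYTES = 3 * 10**12  # one 3 TB (decimal) drive
MIB = 2**20               # bytes per MiB

def rebuild_days(mib_per_sec):
    """Days to rewrite the whole drive at the given sustained rate."""
    return DRIVE_BYTES / (mib_per_sec * MIB) / 86400

mdraid_days = rebuild_days(50)  # mature raid layer, streaming
btrfs_days = rebuild_days(5)    # btrfs today, one spindle at a time

print(f"~50 MiB/s: {mdraid_days:.1f} days; ~5 MiB/s: {btrfs_days:.1f} days")
```

At ~5 MiB/s a single 3 TB spindle alone works out to nearly a week, which is just about what the subject line of this thread reports.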
*AND*, because btrfs /does/ know which areas of the device are actually
in use, once it is optimized it's quite likely that, despite the slower
raw speed, rebuild times will match or beat the raid-layer-only
technologies. Those must rebuild the entire device, because they do
/not/ know which areas are unused, while btrfs can skip the unused area
entirely, at least on the typically 20-60%-unused filesystems most
people run.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman