From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757255Ab2HPLFa (ORCPT ); Thu, 16 Aug 2012 07:05:30 -0400
Received: from mo-65-41-216-221.sta.embarqhsd.net ([65.41.216.221]:51605
	"EHLO greer.hardwarefreak.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1755302Ab2HPLF3 (ORCPT );
	Thu, 16 Aug 2012 07:05:29 -0400
Message-ID: <502CD3F8.70001@hardwarefreak.com>
Date: Thu, 16 Aug 2012 06:05:28 -0500
From: Stan Hoeppner
Reply-To: stan@hardwarefreak.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20120713 Thunderbird/14.0
MIME-Version: 1.0
To: Miquel van Smoorenburg
CC: linux-kernel@vger.kernel.org
Subject: Re: O_DIRECT to md raid 6 is slow
References: <502B8D1F.7030706@anonymous.org.uk> <201208152307.q7FN7hMR008630@xs8.xs4all.nl>
In-Reply-To: <201208152307.q7FN7hMR008630@xs8.xs4all.nl>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
> In article you write:
>> It's time to blow away the array and start over. You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
>> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata. Yes, insane.
>
> Ehrm no.
> If you modify, say, a 4K block on a RAID5 array, you just have
> to read that 4K block, and the corresponding 4K block on the
> parity drive, recalculate parity, and write back 4K of data and 4K
> of parity. (read|read) modify (write|write). You do not have to
> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

See: http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior. What
I stated above is pretty much exactly what Dave stated, but for the fact
I got the RMW read bytes wrong--should be 2MB/3MB for a 6 drive md/RAID6
and 5MB/6MB for 12 drives.

>> Parity RAID sucks in general because of RMW, but it is orders of
>> magnitude worse when one chooses to use an insane chunk size to boot,
>> and especially so with a large drive count.

[snip]

> Also, 256K or 512K isn't all that big nowadays, there's not much
> latency difference between reading 32K or 512K..

You're forgetting 3 very important things:

1. All filesystems have metadata
2. All (worth using) filesystems have a metadata journal
3. All workloads include some, if not major, metadata operations

When writing journal and directory metadata there is a huge difference
between a 32KB and 512KB chunk especially as the drive count in the
array increases. Rarely does a filesystem pack enough journal
operations into a single writeout to fill a 512KB stripe, let alone a
4MB stripe. With a 32KB chunk you see full stripe width journal writes
frequently, minimizing the number of RMW writes to the journal, even up
to 16 data spindle parity arrays (18 drive RAID6). Using a 512KB chunk
will cause most journal writes to be partial stripe writes, triggering
RMW for most journal writes. The same is true for directory metadata
writes.

Everyone knows that parity RAID sucks for anything but purely streaming
workloads with little metadata.
With most/all other workloads, using a large chunk size, such as the md
metadata 1.2 default of 512KB, with parity RAID, simply makes it much
worse, whether the RMW cycle affects all disks or just one data disk and
one parity disk.

>> Recreate your array, partition aligned, and manually specify a sane
>> chunk size of something like 32KB. You'll be much happier with real
>> workloads.
>
> Aligning is a good idea,

Understatement of the century. Just as critical, if not more so, FS
stripe alignment is mandatory with parity RAID lest full stripe writeout
can/will trigger RMW.

> and on modern distributions partitions,
> LVM lv's etc are generally created with 1MB alignment. But using
> a small chunksize like 32K? That depends on the workload, but
> in most cases I'd advise against it.

People should ignore your advice in this regard. A small chunk size is
optimal for nearly all workloads on a parity array for the reasons I
stated above. It's the large chunk that is extremely workload
dependent, as again, it only fits well with low metadata streaming
workloads.

--
Stan
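
The stripe arithmetic argued over in this thread can be checked with a
short sketch. This is only an illustration in Python, not a description
of what md actually does internally. The function names are made up
here, sizes are treated as binary (1MB = 1024KB), and the second
function simply encodes the "(read|read) modify (write|write)" pattern
quoted above, assuming one parity block is touched per parity drive.

```python
def full_stripe_kib(drives, chunk_kib, parity_drives=2):
    """Return (data_kib, total_kib) for one full stripe.

    data_kib  = chunk size times the number of data drives
    total_kib = chunk size times all drives (data plus parity)
    """
    data_drives = drives - parity_drives
    if data_drives < 1:
        raise ValueError("need at least one data drive")
    return data_drives * chunk_kib, drives * chunk_kib


def sub_stripe_rmw_kib(write_kib, parity_drives=1):
    """Return (read_kib, written_kib) for a sub-stripe read-modify-write.

    Assumption: only the modified data block and the matching block on
    each parity drive are read and then written back.
    """
    touched = write_kib * (1 + parity_drives)
    return touched, touched


if __name__ == "__main__":
    # RAID6 layouts mentioned in the thread: (drive count, chunk in KiB)
    for drives, chunk in [(6, 512), (12, 512), (18, 32)]:
        data, total = full_stripe_kib(drives, chunk)
        print(f"{drives} drives, {chunk} KiB chunk: "
              f"{data} KiB data, {total} KiB with parity")

    # The 4K RAID5 example from the quoted reply
    print(sub_stripe_rmw_kib(4))
```

For 6 and 12 drives at 512KB this gives 2MB/3MB and 5MB/6MB, matching
the corrected figures above, and an 18 drive RAID6 at 32KB has a 512KB
data stripe. The 4K RAID5 case reads 8K and writes 8K.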