From: Stefan Ring
Date: Wed, 25 Jul 2012 11:29:58 +0200
Subject: Re: A little RAID experiment
To: Linux fs XFS <xfs@oss.sgi.com>

There appears to be a bit of tension in this thread, and I suspect it is a case of mismatched expectations. The sole purpose of my activity here over the last months has been to present some findings which I thought would be interesting to XFS developers. If I were working on XFS, I would be interested. From most of the answers, though, I get the impression that I am perceived as looking for help tuning my XFS setup, which is not the case at all. In fact, I'm quite happy with it.

Let me recap, just to give this thread the intended tone:

This episode of my journey with XFS started when I read that there had been significant recent improvements to XFS's metadata performance. Having tried XFS every couple of years or so before, always with the same verdict -- horribly slow -- I was curious whether it had finally become usable. A new server machine arriving just at the right time served as the perfect testbed. I threw some workloads at it which I hoped would resemble my typical workload, and I focused especially on the areas which bother me the most on our current development server running ext3.

Everything worked more or less satisfactorily, except for un-tarring a metadata-heavy tarball in the presence of considerable free-space fragmentation. In this particular case, performance was conspicuously poor, and after some digging with blktrace and seekwatcher, I identified the cause of the slowness to be a write pattern that looked like this (in block numbers), where the step width (arbitrarily shown as 10000 here for illustration) was 1/4 of the size of the volume, clearly because the volume had 4 allocation groups (the default). Of course it was not entirely regular, but overall it was very similar to this:

10001
20001
30001
40001
10002
20002
30002
40002
10003
20003
...

I tuned and tweaked everything I could think of -- elevator settings, readahead, su/sw, barriers, RAID hardware cache -- but the behavior was always the same. It just so happens that the RAID controller in this machine (HP SmartArray P400) doesn't cope very well with a write pattern like this. To it, the sequence appears to be random, and it performs even worse than it would if the writes were actually random.
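For illustration, the pattern can be replayed against a test device with something like the following rough sketch (not the actual test setup; the device path, block size, region count and round count are placeholders):

/*
 * Rough sketch: replay the round-robin "one block per allocation group"
 * write pattern against a test file or block device.  DEVICE, BLOCK_SIZE,
 * NR_REGIONS and NR_ROUNDS are placeholders -- adjust to the device under
 * test.
 */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEVICE      "/path/to/testdev"  /* placeholder */
#define BLOCK_SIZE  4096
#define NR_REGIONS  4                   /* one region per allocation group */
#define NR_ROUNDS   10000L              /* blocks written into each region */

int main(void)
{
    int fd = open(DEVICE, O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Split the device into NR_REGIONS equal, block-aligned regions. */
    off_t size   = lseek(fd, 0, SEEK_END);
    off_t region = (size / NR_REGIONS) & ~((off_t)BLOCK_SIZE - 1);

    long rounds = NR_ROUNDS;
    if ((off_t)rounds * BLOCK_SIZE > region)
        rounds = region / BLOCK_SIZE;

    void *buf;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE)) return 1;
    memset(buf, 0, BLOCK_SIZE);

    /*
     * Round-robin across the regions: block i of region 0, 1, 2, 3,
     * then block i+1, and so on -- the stride that showed up in blktrace.
     */
    for (long i = 0; i < rounds; i++) {
        for (int r = 0; r < NR_REGIONS; r++) {
            off_t off = (off_t)r * region + (off_t)i * BLOCK_SIZE;
            if (pwrite(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
                perror("pwrite");
                return 1;
            }
        }
    }
    fsync(fd);
    close(fd);
    return 0;
}

Four regions here simply mirror the four allocation groups of the default layout on that volume.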
Going by what I think I know about the topic, it struck me as odd that blocks would be sent to disk in this very unfavorable order. To my mind, three entities had failed at sanitizing the write sequence: the filesystem, the block layer, and the RAID controller. My opinion is still unchanged regarding the latter two.

The strikingly bad performance of the RAID controller piqued my interest, and I went on a different journey investigating this oddity and created a minor sysbench modification that just measures performance for this particular pattern. Not many people helped with my experiment, and I was accused of wanting ponies. If I'm the only one who is curious about this, then so be it. I deemed it worthwhile to share my experience and to point out that a sequence like the one above is a death blow to all the HP gear I have gotten my hands on so far.

It has been pointed out that XFS schedules the writes like this on purpose so that they can be done in parallel, and that I should create a concatenated volume with physical devices matching the allocation groups. I actually went through this exercise, and yes, it was very beneficial, but that's not the point. I don't want to (have to) do that. And it's not always feasible, anyway. What about home usage with a single SATA disk? Is it not worthwhile to perform well on low-end devices?

You might ask, then, why even bother using XFS instead of ext4? I care about the multi-user case. The problem I have with ext is that it is unbearably unresponsive while someone writes a semi-large amount of data (a few gigs) at once -- like extracting a large-ish tarball. Just using vim, even with :set nofsync, is almost impossible during that time. I have adopted various disgusting hacks, like extracting to a ramdisk instead and rsyncing the lot over to the real disk with a very low --bwlimit, but I'm thoroughly fed up with this kind of crap, and in general, XFS works very well.

If no one cares about my findings, I will henceforth be quiet on this topic.