From: David Brown <david.brown@hesbynett.no>
To: linux-raid@vger.kernel.org
Subject: Re: potentially lost largeish raid5 array..
Date: Mon, 26 Sep 2011 22:29:25 +0200
Message-ID: <j5qnb5$c9j$1@dough.gmane.org>
In-Reply-To: <4E80D815.9080005@hardwarefreak.com>

On 26/09/11 21:52, Stan Hoeppner wrote:
> On 9/26/2011 5:51 AM, David Brown wrote:
>> On 26/09/2011 01:58, Stan Hoeppner wrote:
>>> On 9/25/2011 10:18 AM, David Brown wrote:
>>>> On 25/09/11 16:39, Stan Hoeppner wrote:
>>>>> On 9/25/2011 8:03 AM, David Brown wrote:
>>
>> (Sorry for getting so off-topic here - if it is bothering anyone, please
>> say and I will stop. Also Stan, you have been extremely helpful, but if
>> you feel you've given enough free support to an ignorant user, I fully
>> understand it. But every answer leads me to new questions, and I hope
>> that others in this mailing list will also find some of the information
>> useful.)
>
> I don't mind at all. I love 'talking shop' WRT storage architecture and
> XFS. Others might though, as we're very far OT at this point. The proper
> place for this discussion is the XFS mailing list. There are folks there
> far more knowledgeable than me who could answer your questions more
> thoroughly, and correct me if I make an error.
>

I will stop after this post (at least, I will /try/ not to continue...).
I've got all the information I was looking for now, and if I need more
details I'll take your advice and look at the XFS mailing list.  Before 
that, though, I should really try it out a little first - I don't have 
any need of a big XFS system at the moment, but it is on my list of 
"experiments" to try some quiet evening.

> <snip for brevity>
>
>>
>> To my mind, it is an unfortunate limitation that it is only top-level
>> directories that are spread across allocation groups, rather than all
>> directories. It means the directory structure needs to be changed to
>> suit the filesystem.
>
> That's because you don't yet fully understand how all this XFS goodness
> works. Recall my comments about architecting the storage stack to
> optimize the performance of a specific workload? Using an XFS+linear
> concat setup is a tradeoff, just like anything else. To get maximum
> performance you may need to trade some directory layout complexity for
> that performance. If you don't want that complexity, simply go with a
> plain striped array and use any directory layout you wish.
>

I understand this - tradeoffs are inevitable.  It's a shame that it is a 
necessary tradeoff here.  I can well see that in some cases (such as a 
big dovecot server) the benefits of XFS + linear concat outweigh the 
(real or perceived) benefits of a domain/user directory structure.  But 
that doesn't stop me wanting both!
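
Just to check that I've understood the directory side of it, here is the
sort of layout I imagine for a mail store on XFS + linear concat.  The
paths, the count of four, and the hashing are purely my own illustration
(and assume, as you describe, that XFS places each new top-level
directory in a different allocation group):

  # four top-level directories, one per leg/AG of the concat
  mkdir -p /srv/mail/ag0 /srv/mail/ag1 /srv/mail/ag2 /srv/mail/ag3

  # a mailbox is then hashed into one of them, instead of living
  # under a single /srv/mail/<domain>/<user> tree
  user="alice@example.com"
  n=$(( $(printf '%s' "$user" | cksum | cut -d' ' -f1) % 4 ))
  mkdir -p "/srv/mail/ag$n/$user"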

> Striped arrays don't rely on directory or AG placement for performance
> as does a linear concat array. However, because of the nature of a
> striped array, you'll simply get less performance with the specific
> workloads I've mentioned. This is because you will often generate many
> physical IOs to the spindles per filesystem operation. With the linear
> concat each filesystem IO generates one physical IO to one spindle. Thus
> with a highly concurrent workload you get more real file IOPS than with
> a striped array before the disks hit their head seek limit. There are
> other factors as well, such as latency. Block latency will usually be
> lower with a linear concat than with a striped array.
>
> I think what you're failing to fully understand is the serious level of
> flexibility that XFS provides, and the resulting complexity of
> understanding required by the sysop. Other Linux filesystems offer zero
> flexibility WRT optimizing for the underlying hardware layout. Because
> of XFS' architecture one can tailor its performance characteristics to
> many different physical storage architectures, including standard
> striped arrays, linear concats, a combination of the two, etc, and
> specific workloads. Again, an XFS+linear concat is a specific
> configuration of XFS and the underlying storage, tailored to a specific
> type of workload.
>
>> In some cases, such as a dovecot mail server,
>> that's not a big issue. But in other cases it could be - it is a
>> somewhat artificial constraint on the way you organise your directories.
>
> No, it's not a limitation, but a unique capability. See above.
>

Well, let me rephrase - it is a unique capability, but it is limited to 
situations where you can spread your load among many top-level directories.
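
Thinking aloud: if one really wanted to keep a domain/user view as well,
I suppose the physical spread could live in the top-level directories,
with the "nice" layout rebuilt on top using symlinks or bind mounts -
something like this, the paths again being just my own illustration:

  # physical placement: top-level directories spread across AGs
  mkdir -p /srv/mail/ag2/example.com/alice

  # presentation: a per-domain view assembled from symlinks
  mkdir -p /srv/mailview/example.com
  ln -s /srv/mail/ag2/example.com/alice /srv/mailview/example.com/alice

  # or a bind mount per directory, if symlinks are awkward
  # mount --bind /srv/mail/ag2/example.com /srv/mailview/example.com

Whether the extra indirection is worth it is another question entirely.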

>> Of course, scaling across top-level directories is much better than no
>> scaling at all - and I'm sure the XFS developers have good reason for
>> organising the allocation groups in this way.
>
>> You have certainly answered my question now - many thanks. Now I am
>> clear how I need to organise directories in order to take advantage of
>> allocation groups.
>
> Again, this directory layout strategy only applies when using a linear
> concat. It's not necessary with XFS atop a striped array. And it's only
> a good fit for high concurrency high IOPS workloads.
>

Yes, I understand that.

>> Even though I don't have any filesystems planned that
>> will be big enough to justify linear concats,
>
> A linear concat can be as small as 2 disks, even 2 partitions, 4 with
> redundancy (2 mirror pairs). Maybe you meant workload here instead of
> filesystem?
>

Yes, I meant workload :-)
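
For my own notes, I take it the small end of that would look roughly
like the following - device names are just examples, and the agcount is
only my reading of "one AG per leg of the concat, or a multiple of
that", not anything you have prescribed:

  # two RAID1 pairs...
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

  # ...concatenated into a single linear array
  mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/md1 /dev/md2

  # XFS on top, with the AG count a multiple of the number of legs
  mkfs.xfs -d agcount=4 /dev/md0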


>> spreading data across
>> allocation groups will spread the load across kernel threads and
>> therefore across processor cores, so it is important to understand it.
>
> While this is true, and great engineering, it's only relevant on systems
> doing large concurrent/continuous IO, as in multiple GB/s, given the
> power of today's CPUs.
>
> The XFS allocation strategy is brilliant, and simply beats the stuffing
> out of all the other current Linux filesystems. It's time for me to stop
> answering your questions, and time for you to read:
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html
>

These will keep me busy for a little while.

> If you have further questions after digesting these valuable resources,
> please post them on the xfs mailing list:
>
> http://oss.sgi.com/mailman/listinfo/xfs
>
> Myself and others would be happy to respond.
>
>>> The large file case is transactional database specific, and careful
>>> planning and layout of the disks and filesystem are needed. In this case
>>> we span a single large database file over multiple small allocation
>>> groups. Transactional DB systems typically write only a few hundred
>>> bytes per record. Consider a large retailer point of sale application.
>>> With a striped array you would suffer the read-modify-write penalty when
>>> updating records. With a linear concat you simply directly update a
>>> single 4KB block.
>
>> When you are doing that, you would then use a large number of allocation
>> groups - is that correct?
>
> Not necessarily. It's a balancing act. And it's a rather complicated
> setup. To thoroughly answer this question will take far more list space
> and time than I have available. And given your questions the maildir
> example prompted, you'll have far more if I try to explain this setup.
>
> Please read the docs I mentioned above. They won't directly answer this
> question, but will allow you to answer it yourself after you digest the
> information.
>

Fair enough - thanks for those links.  And if I have more questions, 
I'll try the XFS list - if nothing else, it will give you a break!

>> References I have seen on the internet seem to be in two minds about
>> whether you should have many or a few allocation groups. On the one
>> hand, multiple groups let you do more things in parallel - on the other
>
> More parallelism only to an extent. Disks are very slow. Once you have
> enough AGs for your workload to saturate your drive head actuators,
> additional AGs simply create a drag on performance due to excess head
> seeking amongst all your AGs. Again, it's a balancing act.
>
>> hand, each group means more memory and overhead needed to keep track of
>> inode tables, etc.
>
> This is irrelevant. The impact of these things is vanishingly small
> compared to the physical disk overhead caused by too many AGs.
>

OK.

One of the problems with reading stuff on the net is that it is often 
out of date, and there is no one checking the correctness of what is 
published.

>> Certainly I see the point of having an allocation
>> group per part of the linear concat (or a multiple of the number of
>> parts), and I can see the point of having at least as many groups as you
>> have processor cores, but is there any point in having more groups than
>> that?
>
> You should be realizing about now why most people call tuning XFS a
> "Black art". ;) Read the docs about allocation groups.
>
>> I have read on the net about a size limitation of 4GB per group,
>
> You've read in the wrong place, or read old docs. The current AG size limit
> is 1TB, has been for quite some time. It will be bumped up some time in
> the future as disk sizes increase. The next limit will likely be 4TB.
>
>> which would mean using more groups on a big system, but I get the
>> impression that this was a 32-bit limitation and that on a 64-bit system
>
> The AG size limit has nothing to do with the system instruction width.
> It is an 'arbitrary' fixed size.
>

OK.
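
As far as I know, the easy way to see what a given filesystem actually
ended up with is xfs_info - the agcount and agsize fields in its output
show how the space was carved into allocation groups (the mount point
here is just an example):

  xfs_info /srv/mail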

>> the limit is 1 TB per group. Assuming a workload with lots of parallel
>> IO rather than large streams, are there any guidelines as to ideal
>> numbers of groups? Or is it better just to say that if you want the last
>> 10% out of a big system, you need to test it and benchmark it yourself
>> with a realistic test workload?
>
> There are no general guidelines here, other than the mkfs.xfs defaults.
> Coincidentally, recent versions of mkfs.xfs will read the mdadm config
> and build the filesystem correctly, automatically, on top of striped md
> raid arrays.
>

Yes, I have read about that - very convenient.
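
And as I understand it, if the automatic detection doesn't apply - a
hardware RAID controller, say - the stripe geometry can still be given
by hand.  The numbers below are only an example for a 6-disk RAID5 with
512 KiB chunks, i.e. 5 data disks:

  # su = stripe unit (the chunk size), sw = stripe width in data disks
  mkfs.xfs -d su=512k,sw=5 /dev/sdX1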

> Other than that, there are no general guidelines, and especially none
> for a linear concat. The reason is that all storage hardware acts a
> little bit differently and each host/storage combo may require different
> XFS optimizations for peak performance. Pre-production testing is
> *always* a good idea, and not just for XFS. :)
>
> Unless or until one finds that the mkfs.xfs defaults aren't yielding the
> required performance, it's best not to peek under the hood, as you're
> going to get dirty once you dive in to tune the engine. ;)
>

I usually find that when I get a new server to play with, I start poking 
around, trying different fancy combinations of filesystems and disk 
arrangements, trying benchmarks, etc.  Then I realise time is running 
out before it all has to be in place, and I set up something reasonable 
with default settings.  Unlike a car engine, it's easy to put the system 
back to factory condition with a reformat!

>>> XFS is extremely flexible and powerful. It can be tailored to yield
>>> maximum performance for just about any workload with sufficient
>>> concurrency.
>
>> I have also read that JFS uses allocation groups - have you any idea how
>> these compare to XFS, and whether it scales in the same way?
>
> I've never used JFS. AIUI it staggers along like a zombie, with one dev
> barely maintaining it today. It seems there hasn't been real active
> Linux JFS code work for about 7 years, since 2004, only a handful of
> commits, all bug fixes IIRC. The tools package appears to have received
> slightly more attention.
>

That's the impression I got too.

> XFS sees regular commits to both experimental and stable trees, both bug
> fixes and new features, with at least a dozen or so devs banging on it
> at a given time. I believe there is at least one Red Hat employee
> working on XFS full time, or nearly so. Christoph is a kernel dev who
> works on XFS, and could give you a more accurate head count. Christoph?
>
> BTW, this is my last post on this subject. It must move to the XFS list,
> or die.
>

Fair enough.

Many thanks for your help and your patience.

