* I/O hang, possibly XFS, possibly general
@ 2011-06-02 14:42 Paul Anderson
  2011-06-02 16:17 ` Stan Hoeppner
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Paul Anderson @ 2011-06-02 14:42 UTC (permalink / raw)
  To: xfs-oss

This morning, I had a symptom of an I/O throughput problem in which
dirty pages appeared to be taking a long time to write to disk.

The system is a large x64 192GiB dell 810 server running 2.6.38.5 from
kernel.org - the basic workload was data intensive - concurrent large
NFS (with high metadata/low filesize), rsync/lftp (with low
metadata/high file size) all working in a 200TiB XFS volume on a
software MD raid0 on top of 7 software MD raid6, each w/18 drives.  I
had mounted the filesystem with inode64,largeio,logbufs=8,noatime.

The specific symptom was that 'sync' hung, a dpkg command hung
(presumably trying to issue fsync), and experimenting with "killall
-STOP" or "kill -STOP" of the workload jobs didn't let the system
drain I/O enough to finish the sync.  I probably did not wait long
enough, however.

So here's what I did to diagnose: when all workloads were stopped,
there was still low rate I/O from kflush->md array jobs.  No CPU
starvation, but the I/O rate was low - 5-30MiB/second (the array can
readily do >1000MiB/second for big I/O).  Mind you, one "md5sum
--check" job was able to run at >200MiB/second without trouble - turn
it off or on and the aggregate I/O load shoots right up or down along
with it, so I'm fairly confident in the underlying physical arrays as
well as XFS large data I/O.

I did "echo 3 > /proc/sys/vm/drop_caches" repeatedly and noticed that
according to top, the total amount of cached data would drop down
rapidly (first time had the big drop), but still be stuck at around
8-10Gigabytes.  While continuing to do this, I noticed finally that
the cached data value was in fact dropping slowly (at the rate of
5-30MiB/second), and in fact finally dropped down to approximately
60Megabytes at which point the stuck dpkg command finished, and I was
again able to issue sync commands that finished instantly.

My guess is that I've done something to fill the buffer pool with slow
to flush metadata - and prior to rebooting the machine a few minutes
ago, I removed the largeio option in /etc/fstab.

I can't say this is an XFS bug specifically, but more likely how I am
using it - are there other tools I can use to better diagnose what is
going on?  I do know it will happen again, since we will have 5 of
these machines running at very high rates soon.  Also, any suggestions
for better metadata or log management are very welcome.

This particular machine is probably our worst, since it has the widest
variation in offered file I/O load (tens of millions of small files,
thousands of >1GB  files).  If this workload is pushing XFS too hard,
I can deploy new hardware to split the workload across different
filesystems.

Thanks very much for any thoughts or suggestions,

Paul Anderson


* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 14:42 I/O hang, possibly XFS, possibly general Paul Anderson
@ 2011-06-02 16:17 ` Stan Hoeppner
  2011-06-02 18:56 ` Peter Grandi
  2011-06-03  0:42 ` Christoph Hellwig
  2 siblings, 0 replies; 25+ messages in thread
From: Stan Hoeppner @ 2011-06-02 16:17 UTC (permalink / raw)
  To: Paul Anderson; +Cc: xfs-oss

On 6/2/2011 9:42 AM, Paul Anderson wrote:

> had mounted the filesystem with inode64,largeio,logbufs=8,noatime.

I don't see 'delaylog' in your mount options nor an external log device
specified.  Delayed logging will dramatically decrease IOPS to the log
device by cleverly discarding duplicate metadata write operations and
other tricks.  Enabling it may solve your problem given your high
metadata workload.  Delayed logging design document:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt

Delaylog was an optional mount option from 2.6.35 to 2.6.38.  In 2.6.39
and up it is the default.  Give it a go.
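
For example, something along these lines in /etc/fstab should do it on
2.6.38 (device and mount point here are only placeholders):

  /dev/md0  /data  xfs  inode64,largeio,logbufs=8,noatime,delaylog  0 0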

-- 
Stan


* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 14:42 I/O hang, possibly XFS, possibly general Paul Anderson
  2011-06-02 16:17 ` Stan Hoeppner
@ 2011-06-02 18:56 ` Peter Grandi
  2011-06-02 21:24   ` Paul Anderson
  2011-06-03  0:06   ` Phil Karn
  2011-06-03  0:42 ` Christoph Hellwig
  2 siblings, 2 replies; 25+ messages in thread
From: Peter Grandi @ 2011-06-02 18:56 UTC (permalink / raw)
  To: Linux fs XFS

> This morning, I had a symptom of an I/O throughput problem in which
> dirty pages appeared to be taking a long time to write to disk.

That can happen because of a lot of reasons, like elevator
issues (CFQ has serious problems) and even CPU scheduler issues,
RAID HA firmware problems (if you are using one, and you seem to
be using MD, but then you may be using several in JBOD mode to
handle all the disks), or problems with the Linux page cache
(read ahead, the abominable plugger) or the flusher (the
defaults are not so hot). Sometimes there are odd resonances
between the page cache and multiple layers of MD or LVM too.

Lots of people have been burned even with much simpler setups
than the one you describe below:

> The system is a large x64 192GiB dell 810 server running
> 2.6.38.5 from kernel.org - the basic workload was data
> intensive - concurrent large NFS (with high metadata/low
> filesize),

Very imaginative. :-)

> rsync/lftp (with low metadata/high file size)

More suitable, but insignificant compared to this:

> all working in a 200TiB XFS volume on a software MD raid0 on
> top of 7 software MD raid6, each w/18 drives.

That's rather more than imaginative :-). But this is a family
oriented mailing list so I can't use appropriate euphemisms,
because they no longer look like euphemisms.

> [ ... ] (the array can readily do >1000MiB/second for big
> I/O). [ ... ]

In a very specific narrow case, and you can get that with a lot
less disks. You have 126 drives that can each do 130MB/s (outer
tracks), so you should be getting 10GB/s :-).

Also, your 1000MiB/s set probably is not full yet, so that's
outer tracks only, and when it fills up, data gets into the
inner tracks, and gets a bit churned, then the real performance
will "shine" through.

> I did "echo 3 > /proc/sys/vm/drop_caches" repeatedly and
> noticed that according to top, the total amount of cached data
> would drop down rapidly (first time had the big drop), but
> still be stuck at around 8-10Gigabytes.

You have to watch '/proc/meminfo' to check the dirty pages in
the cache. But you seem to have 8-10GiB of dirty pages in your
192GiB system. Extraordinarily imaginative.
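
As a sketch, something like this will show whether what remains is
actually dirty/writeback or just clean cached data (assuming the usual
/proc/meminfo field names):

  watch -n1 'grep -E "^(Dirty|Writeback|NFS_Unstable):" /proc/meminfo'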

> While continuing to do this, I noticed finally that the cached
> data value was in fact dropping slowly (at the rate of
> 5-30MiB/second), and in fact finally dropped down to
> approximately 60Megabytes at which point the stuck dpkg
> command finished, and I was again able to issue sync commands
> that finished instantly.

Fantastic stuff, is that cached data or cached and dirty data?
Guessing that it is cached and dirty (also because of the
"Subject" line), do you really want to have several GiB of
cached dirty pages?

Do you want these to be zillions of little metadata transactions
scattered at random all over the place?  How "good" (I hesitate
to use the very word in the context) is this more than imaginative
RAID60 set at writing widely scattered small transactions?

> [ ... ]  since we will have 5 of these machines running at
> very high rates soon.

Look forward to that :-).

> Also, any suggestions for better metadata

Use some kind of low overhead database if you need a database,
else pray :-)

> or log management are very welcome.

Separate drives/flash SSD/RAM SSD. As previously revealed by a
question I asked, Linux MD does full-width stripe updates with
RAID6. The wider, the better of course :-).

> This particular machine is probably our worst, since it has
> the widest variation in offered file I/O load (tens of
> millions of small files, thousands of >1GB files).

Wide variation is not the problem, and neither is the machine,
it is the approach.

> If this workload is pushing XFS too hard,

XFS is a very good design within a fairly well defined envelope,
and often the problems are more with Linux or application
issues, but you may be a bit outside that envelope (euphemism
alert), and you need to work on the grain of the storage system
(understatement of the week).

> I can deploy new hardware to split the workload across
> different filesystems.

My usual recommendation is to default (unless you have
extraordinarily good arguments otherwise, and almost nobody
does) to use RAID10 sets of at most 10 pairs (of "enterprise"
drives of no more than 1TB each), with XFS or JFS depending on
workload, as many servers as needed (if at all possible located
topologically near to their users to avoid some potentially
nasty network syndromes like incast), and forget about having a
single large storage pool. Other details as to the flusher
(every 1-2 seconds), elevator (deadline or noop), ... can matter
a great deal.
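
As a rough sketch of the shape I mean only (drive names and chunk size
are placeholders, not tuned for your hardware), a 10-pair MD RAID10
would look like:

  mdadm --create /dev/md0 --level=10 --raid-devices=20 \
        --chunk=256 /dev/sd[b-u]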

If you do need a single large storage pool almost the only
reasonable way currently (even if I have great hopes for
GlusterFS) is Lustre or one of its forks (or much simpler
imitators like DPM), and that has its own downsides (it takes a
lot of work), but a single large storage pool is almost never
needed, at most a single large namespace, and that can be
instantiated with an automounter (and Lustre/DPM/.... is in
effect a more sophisticated automounter).

If you know better go ahead and build 200TB XFS filesystems on
top of a 7x(16+2) drive RAID60 and put lots of small files in
them (or whatever) and don't even think about 'fsck' because you
"know" it will never happen. And what about backing up one of
those storage sets to another one? That can happen in the
"background" of course, with no extra load :-).

Just realized another imaginative detail: a 126-drive RAID60 set
delivering 200TB; it looks like you are using 2TB drives. Why
am I not surprised? It would be just picture-perfect if they
were low cost "eco" drives, and only a bit less so if they were
ordinary drives without ERC. Indeed cost conscious budget heroes
can only suggest using 2TB drives in a 126-drive RAID60 set even
for a small-file metadata intensive workload, because IOPS and
concurrent RW are obsolete concepts in many parts of the world.

Disclaimer: some smart people I know built knowingly a similar
and fortunately much smaller collection of RAID6 sets because
that was the least worst option for them, and since they know
that it will not fill up before they can replace it, they are
effectively short-stroking all those 2TB drives (I still would
have bought ERC ones if possible) so it's cooler than it looks.

> Thanks very much for any thoughts or suggestions,

* Don't expect to slap together a lot of stuff at random and it
  working just like that. But then if you didn't expect that you
  wouldn't have done any of the above.

* "My usual recommendation" above is freely given yet often
  worth more than months/years of very expensive consultants.

* This mailing list is continuing proof that the "let's bang it
  together, it will just work" club is large.


* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 18:56 ` Peter Grandi
@ 2011-06-02 21:24   ` Paul Anderson
  2011-06-02 23:59     ` Phil Karn
  2011-06-03 22:19     ` Peter Grandi
  2011-06-03  0:06   ` Phil Karn
  1 sibling, 2 replies; 25+ messages in thread
From: Paul Anderson @ 2011-06-02 21:24 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS

Hi Peter - I appreciate the feedback!

The background for this is that we live in an extreme corner case of
the world - our use case is dealing with 1GiB to 100GiB files at
present, and in the future probably to 500GiB files (aggregated data
from multiple deep sequencing runs).

The data itself has very odd lifecycle behavior, as well - since it is
research, the different stages are still being sorted out, but some
stages are essentially write once, read once, maybe keep, maybe
discard, depending on the research scenario.

Parenthetically, I will note there are numerous other issues and
problems that impose constraints beyond what is noted here -
conventional work flow, research problems, budgets, rack space, rack
power, time and more.

On Thu, Jun 2, 2011 at 2:56 PM, Peter Grandi <pg_xf2@xf2.for.sabi.co.uk> wrote:
>> This morning, I had a symptom of an I/O throughput problem in which
>> dirty pages appeared to be taking a long time to write to disk.
>
> That can happen because of a lot of reasons, like elevator
> issues (CFQ has serious problems) and even CPU scheduler issues,
> RAID HA firmware problems (if you are using one, and you seem to
> be using MD, but then you may be using several in JBOD mode to
> handle all the disks), or problems with the Linux page cache
> (read ahead, the abominable plugger) or the flusher (the
> defaults are not so hot). Sometimes there are odd resonances
> between the page cache and multiple layers od MD or LVM too.

All JBOD chassis (SuperMicro SC 847's)... been experimenting with the
flusher, will look at the others.

>
> Lots of people have been burned even with much simpler setups
> than the one you describe below:

No doubt.

>
>> The system is a large x64 192GiB dell 810 server running
>> 2.6.38.5 from kernel.org - the basic workload was data
>> intensive - concurrent large NFS (with high metadata/low
>> filesize),
>
> Very imaginative. :-)
>
>> rsync/lftp (with low metadata/high file size)
>
> More suitable, but insignificant compared to this:

The rsync job currently appears to be causing the issue - it was
rsyncing around 250,000 files.  If the copy had already been done, the
rsync is fast (i.e. stat is fast, despite the numbers), but when it
starts moving data, the IOPS pegs and seems to be the limiting factor.

>
>> all working in a 200TiB XFS volume on a software MD raid0 on
>> top of 7 software MD raid6, each w/18 drives.
>
> That's rather more than imaginative :-). But this is a family
> oriented mailing list so I can't use appropriate euphemisms,
> because they no longer look like euphemisms.

We most likely live in different worlds - this is a pure research
group with "different" constraints than those you're probably used to.
 Not my choice, but 4-10X the cost per unit of storage is currently
not an option.

>> [ ... ] (the array can readily do >1000MiB/second for big
>> I/O). [ ... ]
>
> In a very specific narrow case, and you can get that with a lot
> less disks. You have 126 drives that can each do 130MB/s (outer
> tracks), so you should be getting 10GB/s :-).

The raw hardware will do about 5GiB/sec - near as I can tell, this is
saturating the PCIe bus (maybe main memory).

With XFS freshly installed, it was doing around 1400MiB/sec write, and
around 1900MiB/sec read - 10 parallel high-throughput processes reading
or writing as fast as possible (which actually is our use case).

> Also, your 1000MiB/s set probably is not full yet, so that's
> outer tracks only, and when it fills up, data gets into the
> inner tracks, and get a bit churned, then the real performances
> will "shine" through.

Yeah - overall, I expect it to drop - perhaps 50%?  I dunno.  The
particular filesystem being discussed is 80% full at the moment.

>> I did "echo 3 > /proc/sys/vm/drop_caches" repeatedly and
>> noticed that according to top, the total amount of cached data
>> would drop down rapidly (first time had the big drop), but
>> still be stuck at around 8-10Gigabytes.
>
> You have to watch '/proc/meminfo' to check the dirty pages in
> the cache. But you seem to have 8-10GiB of dirty pages in your
> 192GiB system. Extraordinarily imaginative.

Will watch that - yes, too many dirty pages in RAM - defaults are far
from optimal here.

>
>> While continuing to do this, I noticed finally that the cached
>> data value was in fact dropping slowly (at the rate of
>> 5-30MiB/second), and in fact finally dropped down to
>> approximately 60Megabytes at which point the stuck dpkg
>> command finished, and I was again able to issue sync commands
>> that finished instantly.
>
> Fantastic stuff, is that cached data or cached and dirty data?
> Guessing that it is cached and dirty (also because of the
> "Subject" line), do you really want to have several GiB of
> cached dirty pages?

After watching it reach steady state at around 60M, it appears not to
be dirty, as a sync command returned immediately and had no effect on
that value.

No, I do not want lots of dirty pages, however, I'm also aware that if
those are just data pages, it represents a few seconds of system
operation.

> Do you want these to be zillions of little metadata transactions
> scattered at random all over the place?  How "good" (I hesitate
> to use the very word in the context) is this more than imaginative
> RAID60 set at writing widely scattered small transactions?



>> [ ... ]  since we will have 5 of these machines running at
>> very high rates soon.
>
> Look forward to that :-).

We are, actually - it is a tremendous improvement over what we've been using.

>
>> Also, any suggestions for better metadata
>
> Use some kind of low overhead database if you need a database,
> else pray :-)

No database will work that I'm aware of, at least for the end data storage.

>
>> or log management are very welcome.
>
> Separate drives/flash SSD/RAM SSD. As previously revealed by a
> question I asked, Linux MD does full-width stripe updates with
> RAID6. The wider, the better of course :-).
>
>> This particular machine is probably our worst, since it has
>> the widest variation in offered file I/O load (tens of
>> millions of small files, thousands of >1GB files).
>
> Wide variation is not the problem, and neither is the machine,
> it is the approach.

All other approaches I am aware of cost more.  I favor Lustre, but the
infrastructure costs alone for a 2-5PB system will tend to be
exceptional.  Not that we may have much choice - the system we have is
well beyond the limits of what we should really be doing - however,
the constraints are also exceptional.

>> If this workload is pushing XFS too hard,
>
> XFS is a very good design within a fairly well defined envelope,
> and often the problems are more with Linux or application
> issues, but you may be a bit outside that envelope (euphemism
> alert), and you need to work on the grain of the storage system
> (understatement of the week).

>
>> I can deploy new hardware to split the workload across
>> different filesystems.
>
> My usual recommendation is to default (unless you have
> extraordinarily good arguments otherwise, and almost nobody
> does) to use RAID10 sets of at most 10 pairs (of "enterprise"
> drives of no more than 1TB each), with XFS or JFS depending on
> workload, as many servers as needed (if at all possible located
> topologically near to their users to avoid some potentially
> nasty network syndromes like incast), and forget about having a
> single large storage pool. Other details as to the flusher
> (every 1-2 seconds), elevator (deadline or noop), ... can matter
> a great deal.

re RAID10 specifically, I'd love to do something better - however the
process is currently severely cost and space constrained.

> If you do need a single large storage pool almost the only
> reasonable way currently (even if I have great hopes for
> GlusterFS) is Lustre or one of its forks (or much simpler
> imitators like DPM), and that has its own downsides (it takes a
> lot of work), but a single large storage pool is almost never
> needed, at most a single large namespace, and that can be
> instantiated with an automounter (and Lustre/DPM/.... is in
> effect a more sophisticated automounter).

"It takes a lot of work" is another reason we aren't readily able to
go to other architectures, despite their many advantages.

>
> If you know better go ahead and build 200TB XFS filesystems on
> top of a 7x(16+2) drive RAID60 and put lots of small files in
> them (or whatever) and don't even think about 'fsck' because you
> "know" it will never happen. And what about backing up one of
> those storage sets to another one? That can happen in the
> "background" of course, with no extra load :-).

fsck happens in less than a day, likewise rebuilding all RAIDs...
backups are interesting - it was impossible in the old scenario (our
prior generation storage) - possible now due to higher disk and
network bandwidth.  Keep in mind our ultimate backup is tissue
samples.

> Just realized another imaginative detail: a 126 drive RAID60 set
> delivering 200TB, looks like that you are using 2TB drives. Why
> am I not surprised? It would be just picture-perfect if they
> were low cost "eco" drives, and only a bit less so if they were
> ordinary drives without ERC. Indeed cost conscious budget heroes
> can only suggest using 2TB drives in a 126-drive RAID60 set even
> for a small-file metadata intensive workload, because IOPS and
> concurrent RW are obsolete concepts in many parts of the world.

We fortunately were able to afford reasonably good enterprise drives.

2TB drives are mandatory - there simply isn't enough available space
in the data center otherwise.

The bulk of the work is not small-file - almost all is large files.

> Disclaimer: some smart people I know built knowingly a similar
> and fortunately much smaller collection of RAID6 sets because
> that was the least worst option for them, and since they know
> that it will not fill up before they can replace it, they are
> effectively short-stroking all those 2TB drives (I still would
> have bought ERC ones if possible) so it's cooler than it looks.

That is precisely the situation here - it is the "least worst" option.

>
>> Thanks very much for any thoughts or suggestions,
>
> * Don't expect to slap together a lot of stuff at random and it
>  working just like that. But then if you didn't expect that you
>  wouldn't have done any of the above.
>
> * "My usual recommendation" above is freely given yet often
>  worth more than months/years of very expensive consultants.
>
> * This mailing list is continuing proof that the "let's bang it
>  together, it will just work" club is large.

Research is research - not my choice of how it is done, either.

Paul




* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 21:24   ` Paul Anderson
@ 2011-06-02 23:59     ` Phil Karn
  2011-06-03  0:39       ` Dave Chinner
  2011-06-03 22:19     ` Peter Grandi
  1 sibling, 1 reply; 25+ messages in thread
From: Phil Karn @ 2011-06-02 23:59 UTC (permalink / raw)
  To: Paul Anderson; +Cc: Linux fs XFS

On 6/2/11 2:24 PM, Paul Anderson wrote:

> The data itself has very odd lifecycle behavior, as well - since it is
> research, the different stages are still being sorted out, but some
> stages are essentially write once, read once, maybe keep, maybe
> discard, depending on the research scenario.
...
> The bulk of the work is not small-file - almost all is large files.

Out of curiosity, do your writers use the fallocate() call? If not, how
fragmented do your filesystems get?

Even if most of your data isn't read very often, it seems like a good
idea to minimize its fragmentation because that also reduces
fragmentation of the free list, which makes it easier to keep contiguous
other files that *are* heavily read. Also, fewer extents per file means
less metadata per file, ergo less metadata and log I/O, etc.

When a writer knows in advance how big a file will be, I can't see any
downside to having it call fallocate() to let the file system know. Since
soon after I switched to XFS six months ago I've been running locally patched
versions of rsync/tar/cp and so on, and they really do minimize
fragmentation with very little effort.
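
The same idea can be sketched with stock tools (names here are
placeholders; the real change is just a small patch in each tool's copy
path):

  size=$(stat -c%s "$src")
  fallocate -l "$size" "$dst"                # reserve the full size up front
  dd if="$src" of="$dst" bs=1M conv=notrunc  # copy without truncating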


* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 18:56 ` Peter Grandi
  2011-06-02 21:24   ` Paul Anderson
@ 2011-06-03  0:06   ` Phil Karn
  1 sibling, 0 replies; 25+ messages in thread
From: Phil Karn @ 2011-06-03  0:06 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS

On 6/2/11 11:56 AM, Peter Grandi wrote:

> Disclaimer: some smart people I know built knowingly a similar
> and fortunately much smaller collection of RAID6 sets because
> that was the least worst option for them, and since they know
> that it will not fill up before they can replace it, they are
> effectively short-stroking all those 2TB drives (I still would
> have bought ERC ones if possible) so it's cooler than it looks.

What do you mean by "short stroking"? That the data (and head motions)
stay in one part of the disk? I haven't been using XFS that long and I'm
no expert on it, but I've noticed that it seems to distribute files
pretty evenly across an entire disk. Even without the inode64 option,
only the inodes are kept at the beginning; the data can be anywhere.

The only way I can think of to confine the activity on a lightly-loaded
XFS file system to one part of a disk (e.g., to reduce average seek
times and to stay in the faster outer area of the drive) is to create
partitions that initially span only part of the disk, then grow them
later as needed. Is that what you mean?


* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 23:59     ` Phil Karn
@ 2011-06-03  0:39       ` Dave Chinner
  2011-06-03  2:11         ` Phil Karn
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2011-06-03  0:39 UTC (permalink / raw)
  To: Phil Karn; +Cc: Paul Anderson, Linux fs XFS

On Thu, Jun 02, 2011 at 04:59:25PM -0700, Phil Karn wrote:
> On 6/2/11 2:24 PM, Paul Anderson wrote:
> 
> > The data itself has very odd lifecycle behavior, as well - since it is
> > research, the different stages are still being sorted out, but some
> > stages are essentially write once, read once, maybe keep, maybe
> > discard, depending on the research scenario.
> ...
> > The bulk of the work is not small-file - almost all is large files.
> 
> Out of curiosity, do your writers use the fallocate() call? If not, how
> fragmented do your filesystems get?
> 
> Even if most of your data isn't read very often, it seems like a good
> idea to minimize its fragmentation because that also reduces
> fragmentation of the free list, which makes it easier to keep contiguous
> other files that *are* heavily read. Also, fewer extents per file means
> less metadata per file, ergo less metadata and log I/O, etc.
> 
> When a writer knows in advance how big a file will be, I can't see any
> downside to having it call fallocate() to let the file system know.

You're ignoring the fact that delayed allocation effectively does
this for you without needing to physically allocate the blocks.
So when you have files that are short lived, you don't actually do
any allocation at all.  Further, delayed allocation results in
allocation order according to writeback order rather than write()
order, so I/O patterns are much nicer when using delayed allocation.

Basically you are removing one of the major IO optimisation
capabilities of XFS by preallocating everything like this.

> Soon
> after I switched to XFS six months ago I've been running locally patched
> versions of rsync/tar/cp and so on, and they really do minimize
> fragmentation with very little effort.

So you don't have any idea of how well XFS minimises fragmentation
without needing to use preallocation? Sounds like you have a classic
case of premature optimisation. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 14:42 I/O hang, possibly XFS, possibly general Paul Anderson
  2011-06-02 16:17 ` Stan Hoeppner
  2011-06-02 18:56 ` Peter Grandi
@ 2011-06-03  0:42 ` Christoph Hellwig
  2011-06-03  1:39   ` Dave Chinner
  2 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2011-06-03  0:42 UTC (permalink / raw)
  To: Paul Anderson; +Cc: xfs-oss

On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote:
> This morning, I had a symptom of an I/O throughput problem in which
> dirty pages appeared to be taking a long time to write to disk.
> 
> The system is a large x64 192GiB dell 810 server running 2.6.38.5 from
> kernel.org - the basic workload was data intensive - concurrent large
> NFS (with high metadata/low filesize), rsync/lftp (with low
> metadata/high file size) all working in a 200TiB XFS volume on a
> software MD raid0 on top of 7 software MD raid6, each w/18 drives.  I
> had mounted the filesystem with inode64,largeio,logbufs=8,noatime.

A few comments on the setup before trying to analyze what's going on in
detail.  I'd absolutely recommend an external log device for this setup,
that is, buy another two fast but small disks, or take two existing ones
and use a RAID 1 for the external log device.  This will speed up
anything log intensive, which both the NFS and rsync workloads very
much are.
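
Roughly like this, for illustration only (device names are placeholders,
and an existing filesystem has to be recreated to move its log):

  mdadm --create /dev/md8 --level=1 --raid-devices=2 /dev/sdy /dev/sdz
  mkfs.xfs -l logdev=/dev/md8,size=128m /dev/md0
  mount -o logdev=/dev/md8,inode64,noatime /dev/md0 /data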

Second, if you can, split the workloads into multiple volumes since
you have two such different workloads, so that they don't interfere
with each other.

Third, a RAID0 on top of RAID6 volumes sounds like pretty much a worst
case for almost any type of I/O.  You end up doing even relatively
small I/O to all of the disks in the worst case.  I think you'd be much
better off with a simple linear concatenation of the RAID6 devices,
even if you can split them into multiple filesystems.
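
For illustration (md device names are placeholders):

  mdadm --create /dev/md9 --level=linear --raid-devices=7 /dev/md[1-7]
  mkfs.xfs /dev/md9

XFS then spreads its allocation groups across the member devices, so
independent streams tend to stay on separate RAID6 legs instead of
every I/O touching all 126 spindles.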

> The specific symptom was that 'sync' hung, a dpkg command hung
> (presumably trying to issue fsync), and experimenting with "killall
> -STOP" or "kill -STOP" of the workload jobs didn't let the system
> drain I/O enough to finish the sync.  I probably did not wait long
> enough, however.

It really sounds like you're simply killing the MD setup with a
lot of log I/O that goes to all the devices.


* Re: I/O hang, possibly XFS, possibly general
  2011-06-03  0:42 ` Christoph Hellwig
@ 2011-06-03  1:39   ` Dave Chinner
  2011-06-03 15:59     ` Paul Anderson
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2011-06-03  1:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Paul Anderson, xfs-oss

On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote:
> On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote:
> > This morning, I had a symptom of an I/O throughput problem in which
> > dirty pages appeared to be taking a long time to write to disk.
> > 
> > The system is a large x64 192GiB dell 810 server running 2.6.38.5 from
> > kernel.org - the basic workload was data intensive - concurrent large
> > NFS (with high metadata/low filesize), rsync/lftp (with low
> > metadata/high file size) all working in a 200TiB XFS volume on a
> > software MD raid0 on top of 7 software MD raid6, each w/18 drives.  I
> > had mounted the filesystem with inode64,largeio,logbufs=8,noatime.
> 
> A few comments on the setup before trying to analyze what's going on in
> detail.  I'd absolutely recommend an external log device for this setup,
> that is, buy another two fast but small disks, or take two existing ones
> and use a RAID 1 for the external log device.  This will speed up
> anything log intensive, which both the NFS and rsync workloads very
> much are.
> 
> Second, if you can, split the workloads into multiple volumes since
> you have two such different workloads, so that they don't interfere
> with each other.
> 
> Third, a RAID0 on top of RAID6 volumes sounds like pretty much a worst
> case for almost any type of I/O.  You end up doing even relatively
> small I/O to all of the disks in the worst case.  I think you'd be much
> better off with a simple linear concatenation of the RAID6 devices,
> even if you can split them into multiple filesystems.
> 
> > The specific symptom was that 'sync' hung, a dpkg command hung
> > (presumably trying to issue fsync), and experimenting with "killall
> > -STOP" or "kill -STOP" of the workload jobs didn't let the system
> > drain I/O enough to finish the sync.  I probably did not wait long
> > enough, however.
> 
> It really sounds like you're simply killing the MD setup with a
> lot of log I/O that goes to all the devices.

And this is one of the reasons why I originally suggested that
storage at this scale really should be using hardware RAID with
large amounts of BBWC to isolate the backend from such problematic
IO patterns.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: I/O hang, possibly XFS, possibly general
  2011-06-03  0:39       ` Dave Chinner
@ 2011-06-03  2:11         ` Phil Karn
  2011-06-03  2:54           ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Karn @ 2011-06-03  2:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Paul Anderson, Linux fs XFS



On Thu, Jun 2, 2011 at 5:39 PM, Dave Chinner <david@fromorbit.com> wrote:

>
> You're ignoring the fact that delayed allocation effectively does
> this for you without needing to physically allocate the blocks.
> So when you have files that are short lived, you don't actually do
> any allocation at all, Further delayed allocation results in
> allocation order according to writeback order rather than write()
> order, so I/O patterns are much nicer when using delayed allocation.
>

Oh, I'm well aware of delayed allocation. I've just noticed that, in my
experience, it doesn't seem to work nearly as well as fallocate(). And why
should it? If you know in advance how big a file you're writing, how can it
hurt to inform your file system? I suppose the FS implementer could always
ignore that information if he felt he could somehow do a better job, but
it's hard to see how. Isn't it always better to know than to guess?

I'm talking here about the genuine fallocate() system call, not the POSIX
hack that falls back to first conventionally writing zeroes over the file.
The true fallocate() call seems very fast, and if your file system doesn't
support it then it will simply fail without harm. I still can't see any
reason not to use it.

I did know that xfs can avoid the disk allocation and writes entirely when
the files are short-lived, but Paul was talking about writing large,
long-lived files so that's what I had in mind. And when I use fallocate(),
my files are not likely to be short-lived either. Like most people I write
the vast majority of my short-lived files to /tmp, which is tmpfs, not xfs.

But you do raise an interesting point -- is there any serious performance
degradation from using fallocate() on a short-lived file? The written data
still lives in the buffer cache for a while, so if you delete the file
before it gets flushed the disk writes will still be avoided. The file
system may have a little extra work to undo the unnecessary allocation but
that doesn't seem to be a big deal.

> Basically you are removing one of the major IO optimisation
> capabilities of XFS by preallocating everything like this.
>

"Remove" it? How is giving it the correct answer worse than letting it guess
-- even if it usually guesses correctly?

I still rely on preallocation to keep log files and mailboxes from getting
too badly fragmented.

> So you don't have any idea of how well XFS minimises fragmentation
> without needing to use preallocation? Sounds like you have a classic
> case of premature optimisation. ;)
>
As I said, I've tried it both ways. I found that the simple act of adding
fallocate() to rsync (which I use for practically all copying) vastly
reduces xfs fragmentation. Just as I expected it would.

Maybe I'm a little more sensitive to fragmentation than most because I've
been experimenting with storing SHA1 hashes of all my files in external
attributes. This grew out of a data deduplication tool; at first I simply
cached the hashes so I wouldn't have to recompute them on another run, but
then I just added them to every file. This lets me get a warm and fuzzy
feeling by periodically verifying that my files haven't been corrupted,
especially when I began to use SSDs with trim tools.

XFS stores both attributes and extent lists directly in the inode when
there's room, and it turns out that a default-sized xfs inode can store my
hashes provided that the extent list is small. So now when I walk through
my file system statting everything, I can read the hashes too at absolutely
no extra cost. This makes deduplication really fast.
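
My tools are homegrown, but the same thing can be illustrated with the
stock attr utilities (the file name is a placeholder):

  setfattr -n user.sha1 -v "$(sha1sum < "$f" | cut -d' ' -f1)" "$f"
  getfattr --only-values -n user.sha1 "$f"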

I haven't experimented to see how many extents a file can have before the
attributes get pushed out of the inode, but by keeping most everything
contiguous I simply avoid the problem.


* Re: I/O hang, possibly XFS, possibly general
  2011-06-03  2:11         ` Phil Karn
@ 2011-06-03  2:54           ` Dave Chinner
  2011-06-03 22:28             ` Phil Karn
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2011-06-03  2:54 UTC (permalink / raw)
  To: karn; +Cc: Paul Anderson, Linux fs XFS

On Thu, Jun 02, 2011 at 07:11:15PM -0700, Phil Karn wrote:
> On Thu, Jun 2, 2011 at 5:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> >
> > You're ignoring the fact that delayed allocation effectively does
> > this for you without needing to physically allocate the blocks.
> > So when you have files that are short lived, you don't actually do
> > any allocation at all, Further delayed allocation results in
> > allocation order according to writeback order rather than write()
> > order, so I/O patterns are much nicer when using delayed allocation.
> >
> 
> Oh, I'm well aware of delayed allocation. I've just noticed that, in my
> experience, it doesn't seem to work nearly as well as fallocate(). And why
> should it? If you know in advance how big a file you're writing, how can it
> hurt to inform your file system? I suppose the FS implementer could always
> ignore that information if he felt he could somehow do a better job, but
> it's hard to see how. Isn't it always better to know than to guess?

There are definitely cases where it helps for preventing
fragmentation, but as a sweeping generalisation it is very, very
wrong.

> I'm talking here about the genuine fallocate() system call, not the POSIX
> hack that falls back to first conventionally writing zeroes over the file.
> The true fallocate() call seems very fast, and if your file system doesn't
> support it then it will simply fail without harm. I still can't see any
> reason not to use it.
> 
> I did know that xfs can avoid the disk allocation and writes entirely when
> the files are short-lived, but Paul was talking about writing large,
> long-lived files so that's what I had in mind. And when I use fallocate(),
> my files are not likely to be short-lived either. Like most people I write
> the vast majority of my short-lived files to /tmp, which is tmpfs, not xfs.

Do you do that for temporary object files when you build <program X>
from source?

> But you do raise an interesting point -- is there any serious performance
> degradation from using fallocate() on a short-lived file?

Allocation and freeing has CPU overhead, transaction overhead, log
space overhead, can cause free space fragmentation when you have a
mix of short- and long-lived files being preallocated at the same
time, IO for long lived data does not get packed together closely so
requires more seeks to issue which leads to significantly worse IO
performance on RAID5/6 storage sub-systems, etc.

I could go on for quite some time, but the overall effect of such
behaviour is that it speeds up filesystem aging degradation
significantly. You might not notice that for 6 months or a year, but
when you do....

> The written data
> still lives in the buffer cache for a while, so if you delete the file
> before it gets flushed the disk writes will still be avoided. The file
> system may have a little extra work to undo the unnecessary allocation but
> that doesn't seem to be a big deal.
> 
> > Basically you are removing one of the major IO optimisation
> > capabilities of XFS by preallocating everything like this.
> >
> 
> "Remove" it? How is giving it the correct answer worse than letting it guess
> -- even if it usually guesses correctly?

See above.

> I still rely on preallocation to keep log files and mailboxes from getting
> too badly fragmented.
> 
> > So you don't have any idea of how well XFS minimises fragmentation
> > without needing to use preallocation? Sounds like you have a classic
> > case of premature optimisation. ;)
> >
> As I said, I've tried it both ways. I found that the simple act of adding
> fallocate() to rsync (which I use for practically all copying) vastly
> reduces xfs fragmentation. Just as I expected it would.
> 
> Maybe I'm a little more sensitive to fragmentation than most because I've
> been experimenting with storing SHA1 hashes of all my files in external
> attributes. This grew out of a data deduplication tool; at first I simply
> cached the hashes so I wouldn't have to recompute them on another run, but
> then I just added them to every file. This lets me get a warm and fuzzy
> feeling by periodically verifying that my files haven't been corrupted,
> especially when I began to use SSDs with trim tools.
> 
> XFS stores both attributes and extent lists directly in the inode when
> there's room, and it turns out that a default-sized xfs inode can store my
> hashes provided that the extent list is small. So I now when I walk through
> my file system statting everything I can read the hashes too at absolutely
> no extra cost. This makes deduplication really fast.

/me slaps his forehead.

You do realise that your "attr out of line" problem would have gone
away by simply increasing the XFS inode size at mkfs time? And that
there is almost no performance penalty for doing this?  Instead, it
seems you found a hammer named fallocate() and proceeded to treat
every tool you have like a nail. :)

Changing a single mkfs parameter is far less work than maintaining
your own forks of multiple tools....
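
e.g. (purely illustrative, the device name is a placeholder):

  mkfs.xfs -i size=512 /dev/sdX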

> I haven't experimented to see how many extents a file can have
> before the attributes get pushed out of the inode, but by keeping
> most everything contiguous I simply avoid the problem.

Until aging has degraded your filesystem until free space is
sufficiently fragmented that you can't allocate large extents any
more. Then you are completely screwed. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: I/O hang, possibly XFS, possibly general
  2011-06-03  1:39   ` Dave Chinner
@ 2011-06-03 15:59     ` Paul Anderson
  2011-06-04  3:15       ` Dave Chinner
  2011-06-04  8:14       ` Stan Hoeppner
  0 siblings, 2 replies; 25+ messages in thread
From: Paul Anderson @ 2011-06-03 15:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs-oss

On Thu, Jun 2, 2011 at 9:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote:
>> On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote:
>> > This morning, I had a symptom of an I/O throughput problem in which
>> > dirty pages appeared to be taking a long time to write to disk.
>> >
>> > The system is a large x64 192GiB dell 810 server running 2.6.38.5 from
>> > kernel.org - the basic workload was data intensive - concurrent large
>> > NFS (with high metadata/low filesize), rsync/lftp (with low
>> > metadata/high file size) all working in a 200TiB XFS volume on a
>> > software MD raid0 on top of 7 software MD raid6, each w/18 drives.  I
>> > had mounted the filesystem with inode64,largeio,logbufs=8,noatime.
>>
>> A few comments on the setup before trying to analyze what's going on in
>> detail.  I'd absolutely recommend an external log device for this setup,
>> that is, buy another two fast but small disks, or take two existing ones
>> and use a RAID 1 for the external log device.  This will speed up
>> anything log intensive, which both the NFS and rsync workloads very
>> much are.
>>
>> Second, if you can, split the workloads into multiple volumes since
>> you have two such different workloads, so that they don't interfere
>> with each other.
>>
>> Third, a RAID0 on top of RAID6 volumes sounds like pretty much a worst
>> case for almost any type of I/O.  You end up doing even relatively
>> small I/O to all of the disks in the worst case.  I think you'd be much
>> better off with a simple linear concatenation of the RAID6 devices,
>> even if you can split them into multiple filesystems.
>>
>> > The specific symptom was that 'sync' hung, a dpkg command hung
>> > (presumably trying to issue fsync), and experimenting with "killall
>> > -STOP" or "kill -STOP" of the workload jobs didn't let the system
>> > drain I/O enough to finish the sync.  I probably did not wait long
>> > enough, however.
>>
>> It really sounds like you're simply killing the MD setup with a
>> lot of log I/O that goes to all the devices.
>
> And this is one of the reasons why I originally suggested that
> storage at this scale really should be using hardware RAID with
> large amounts of BBWC to isolate the backend from such problematic
> IO patterns.

> Dave Chinner
> david@fromorbit.com
>

Good HW RAID cards are on order - seems to be backordered at least a
few weeks now at CDW.  Got the batteries immediately.

That will give more options for test and deployment.

Not sure what I can do about the log - man page says xfs_growfs
doesn't implement log moving.  I can rebuild the filesystems, but for
the one mentioned in this thread, this will take a long time.

I'm guessing we'll need to split out the workload - aside from the
differences in file size and use patterns, they also have
fundamentally different values (the high metadata dataset happens to
be high value relative to the low metadata/large file dataset).

Paul


* Re: I/O hang, possibly XFS, possibly general
  2011-06-02 21:24   ` Paul Anderson
  2011-06-02 23:59     ` Phil Karn
@ 2011-06-03 22:19     ` Peter Grandi
  2011-06-06  7:29       ` Michael Monnerie
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Grandi @ 2011-06-03 22:19 UTC (permalink / raw)
  To: Linux fs XFS

> All JBOD chassis (SuperMicro SC 847's)... been experimenting
> with the flusher, will look at the others.

I think that from the symptoms you describe the hang happens
in the first instance because the number of dirty pages has hit
'dirty_background_ratio', after which all writes become
synchronous and this really works badly, especially with XFS.

To prevent that, and in general to prevent the accumulation of
lots of dirty pages, and sudden latency killing large bursts of
IO, it is quite important to tell the flusher to sync pretty
often and constantly.

The Linux kernel by default permits the buildup of a mass of
dirty pages proportional to memory, which is a very bad idea, as
it should be proportional to write speed, with the idea that one
should not buffer more than 1 second or perhaps less of dirty
pages. In your case that's probably a few hundred MBs, and even
that is pretty bad in case of crashes.

The sw solution is to set the 'vm/dirty_*' tunables accordingly.

  vm/dirty_ratio=2
  vm/dirty_bytes=400000000

  vm/dirty_background_ratio=60
  vm/dirty_background_bytes=0

  vm/dirty_expire_centisecs=200
  vm/dirty_writeback_centisecs=400
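
These can be applied on the fly with sysctl(8) and made persistent in
/etc/sysctl.conf, e.g. (note that writing one of the *_bytes knobs
zeroes its *_ratio twin and vice versa):

  sysctl -w vm.dirty_bytes=400000000
  sysctl -w vm.dirty_expire_centisecs=200
  sysctl -w vm.dirty_writeback_centisecs=400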

The hw solution is to do that *and* use SAS/SATA host adapters
with (large) battery-backed buffers/cache (but still keeping
very few dirty pages in the Linux page cache). I would not use
them in hw RAID mode, also because so many hw RAID cards have
abominably buggy firmware, and I trust Linux MD rather more.
Unfortunately it is difficult to recommend any specific host
adapter.

> The rsync job currently appear to be causing the issue - it
> was rsyncing around 250,000 files.  If the copy had already
> been done, the rsync is fast (i.e. stat is fast, despite the
> numbers), but when it starts moving data, the IOPS pegs and
> seems to be the limiting factor.

That's probably also some effect related to writing to the
intent log, and a RAID60 makes that very painful.

[ ... ]

> We most likely live in different worlds - this is a pure
> research group with "different" constraints than those you're
> probably used to.  Not my choice, but 4-10X the cost per unit
> of storage is currently not an option.

Then lots and lots more smaller RAID5 sets, or even RAID6 if you are
sufficiently desperate.

Joined together at the namespace level, not with a RAID0. Do you
really need a single free space pool? I doubt it: you probably
are reading/generating data and storing it, so instead of
having a single 200TB storage pool, you could have 20x10TB ones
and fill one after the other.

Also ideally much smaller RAID sets: 18 wide with double parity
beckons a world of read-modify-write pain, especially if the
metadata intent log is on the same logical block device.

The MD maintainer thinks that for his much smaller needs putting
the metadata intent logs on a speedy small RAID1 is good enough,
but I think that scales a fair bit. After all the maximum log
size for XFS is not that large (fortunately) and smaller is
better.

Having multiple smaller filesystems also helps with having
multiple smaller metadata intent logs.

> With XFS freshly installed, it was doing around 1400MiB/sec
> write, and around 1900MiB/sec read - 10 parallel high
> throughput processes read or writing as fast as possible
> (which actually is our use case).

>> Also, your 1000MiB/s set probably is not full yet, so that's
>> outer tracks only, and when it fills up, data gets into the
>> inner tracks, and get a bit churned, then the real
>> performances will "shine" through.

> Yeah - overall, I expect it to drop - perhaps 50%?  I dunno.
> The particular filesystem being discussed is 80% full at the
> moment.

That's then fairly realistic, as it is getting well into the
inner tracks. Getting above 90% will cause trouble.

[ ... ]

>> But you seem to have 8-10GiB of dirty pages in your 192GiB
>> system. Extraordinarily imaginative.

> No, I do not want lots of dirty pages, however, I'm also aware
> that if those are just data pages, it represents a few seconds
> of system operation.

Only if written entirely sequentially. IOPS in random and
sequential are quite different.

> All other approaches I am aware of cost more. I favor Lustre,
> but the infrastructure costs alone for a 2-5PB system will
> tend to be exceptional.

Why? Lustre can run on your existing hw, and you need the
network anyhow (unless you compute several TB on one host and
store them on that host's disks, in which case you are lucky).

>> [ ... ] is Lustre or one of its forks (or much simpler
>> imitators like DPM), and that has its own downsides (it takes
>> a lot of work), but a single large storage pool is almost
>> never needed, at most a single large namespace, and that can
>> be instantiated with an automounter (and Lustre/DPM/.... is
>> in effect a more sophisticated automounter).

> "It takes a lot of work" is another reason we aren't readily
> able to go to other architectures, despite their many
> advantages.

Sure, creating a 200TB volume and formatting it as XFS seems a
quick thing to do now, but soon you will need to cope with the
consequences.

Setting up Lustre takes more at the beginning, but will handle
your workload a lot better, and it handles much better having a
lot of smaller independently fsck-able pools and highly parallel
network operation.

It handles small files not so well, so some kind of NFS server
with XFS or better JFS for that would be nice.

There is a high throughput genomic data system at the Sanger
Institute in Cambridge UK based on Lustre and it might inspire
you. This is a relatively old post, it has been in production
for a long time:

  http://threebit.net/mail-archive/bioclusters/msg00188.html
  http://www.slideshare.net/gcoates

Alternatively a number of smaller XFS filesystems as suggested
above, but you lose the extra integration/parallelism Lustre
gives.

[ ... ]

> fsck happens in less than a day,

It takes less than a day *if there is essentially no damage*,
otherwise it might take weeks.

> likewise rebuilding all RAIDs...

But the impact on performance will be terrifying, and if you
reduce resync speed, it will take much longer, and while it
rebuilds further failures will be far more likely, and that will
be a very long day.

Also consider that you have a 7-wide RAID0 of RAID6 sets; if one
of the RAID6 sets becomes much slower because of rebuild, odds
are this will impact *all* IO because of the RAID0.

If you are unlucky, you could end up with one of the RAID6
members of the RAID0 set being in rebuild quite a good
percentage of the time.

> backups are interesting - it is impossible in the old scenario
> (our prior generation storage) - possible now due to higher
> disk and network bandwidth.

But many people forget that a backup is often the most stressful
operation that can happen.

> Keep in mind our ultimate backup is tissue samples.

If you can regenerate the data even if expensively then avoid
RAID6. Two 8+1 RAID5 sets are better than a 16+2 RAID6 set, and
losing a bit more space, three 5+1 RAID5 sets (10TB each) are
better still.

The reason are much smaller RMW stripe width, the ability to do
non-full-width RMW updates, much nicer rebuilds (1/2 or 1/3 of
the drives would be slowed down).
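
For the shape of it only (drive names are placeholders), an 8+1 set is

  mdadm --create /dev/md1 --level=5 --raid-devices=9 /dev/sd[b-j]

and a 5+1 set is the same with --raid-devices=6.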

> 2TB drives are mandatory - there simply isn't enough available
> space in the data center otherwise.

Ah that's a pretty hard constraint then.

> The bulk of the work is not small-file - almost all is large
> files.

Then perhaps put the large file on XFS or Lustre and the small
file on JFS.


* Re: I/O hang, possibly XFS, possibly general
  2011-06-03  2:54           ` Dave Chinner
@ 2011-06-03 22:28             ` Phil Karn
  2011-06-04  3:12               ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Karn @ 2011-06-03 22:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Paul Anderson, Linux fs XFS

On 6/2/11 7:54 PM, Dave Chinner wrote:

> There are definitely cases where it helps for preventing
> fragmenting, but as a sweeping generalisation it is very, very
> wrong.

Well, if I ever see that in practice I'll change my procedures.

> Do you do that for temporary object files when you build <program X>
> from source?

No, that would involve patching gcc to use fallocate(). I could be wrong
-- I don't know much about gcc internals -- but I think most temp files
go on /tmp, which is not xfs. As I clearly said, I patched only a few
file copy programs like rsync that I use to create long-lived files. I
can't see why the upstream maintainers of those programs shouldn't
accept patches to incorporate fallocate() as long as care is taken to
avoid calling the POSIX version and no other harm is done on file
systems or OSes that don't support it.
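
(For the curious, the core of such a patch is tiny. The sketch below is
only an illustration of the idea, not the actual rsync change, and the
function name is made up; it uses the Linux-specific fallocate() rather
than posix_fallocate(), so filesystems without support simply ignore
the hint instead of glibc falling back to writing zeroes.)

    /* Best-effort preallocation of a destination file before copying,
     * so the filesystem can try to place it contiguously. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>
    #include <errno.h>

    static int try_prealloc(int out_fd, off_t total_len)
    {
        if (total_len <= 0)
            return 0;
        if (fallocate(out_fd, 0, 0, total_len) == 0)
            return 0;                   /* space reserved up front */
        if (errno == EOPNOTSUPP || errno == ENOSYS)
            return 0;                   /* no support here: just copy normally */
        return -1;                      /* e.g. ENOSPC: let the caller report it */
    }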

> Allocation and freeing has CPU overhead, transaction overhead, log
> space overhead, can cause free space fragmentation when you have a
> mix of short- and long-lived files being preallocated at the same
> time, IO for long lived data does not get packed together closely so
> requires more seeks to issue which leads to significantly worse IO
> performance on RAID5/6 storage sub-systems, etc.

I'll believe that when I see it. Like a lot of people I am moving away
from RAID 5/6.

It is hard to see how keeping files contiguous can lead to free space
fragmentation. Seems to me that when a file is severely fragmented, so
is the free space around it. Keeping a file contiguous also keeps free
space in fewer, larger pieces.

> You do realise that your "attr out of line" problem would have gone
> away by simply increasing the XFS inode size at mkfs time? And that
> there is almost no performance penalty for doing this?  Instead, it
> seems you found a hammer named fallocate() and proceeded to treat
> every tool you have like a nail. :)

You do realize that I started experimenting with attributes well *after*
I had built XFS on a 6 TB (net) RAID5 that took over a week of solid
copying to load to 50%? I had noticed the inode size parameter to
mkfs.xfs but I wasn't about to buy four more disks, mkfs a whole new
file system with bigger inodes and copy all my data (again) just to
waste more space on largely empty inodes and, more importantly, require
many more disk seeks and reads to walk through them all.

The default xfs inode is 256 bytes. That means a single 4KiB block read
fetches 16 inodes at once. Making each inode 512 bytes means reading
only 8 inodes in each 4KiB block. That's arithmetic.

And I'd still have no guarantee of keeping my attributes in the inodes
without some limit on the size of the extent list.

> Changing a single mkfs parameter is far less work than maintaining
> your own forks of multiple tools....

See above. I've since built a new RAID1 array with bigger and faster
drives and am abandoning RAID5, but I still see no reason to waste disk
space and seeks on larger data structures that are mostly empty space. A
long extent table contains overhead information that is useless -- noise
-- to me, the user. Defragmenting a file discards that information and
allows more of the disk's storage and I/O capacity to be used for user data.

The only drawback I can see to keeping a file system defragmented is
that I give up an opportunity for steganography, i.e., hiding
information in the locations and sizes of those seemingly random
sequences of extent allocations. I know this has been done.

> Until aging has degraded your filesystem til free space is
> sufficiently fragmented that you can't allocate large extents any
> more. Then you are completely screwed. :/

Once again, it is very difficult to see how keeping my long-lived files
contiguous causes free space to become more fragmented, not less. Help
me out here; it's highly counter intuitive, and more importantly I
haven't seen that problem, at least not yet.

I have a few extremely large files (many GB) that cannot be allocated a
contiguous area. That's probably because of xfs's strategy of scattering
files around disk to allow room for growth, which fragments the free
space. But that's not a big problem since I don't have very many such
files. Each extent is still pretty big, so sequential I/O is still quite
fast, and if their attributes are squeezed out of their inodes it's not
a big performance hit either.

You seem to take personal offense to my use of fallocate(), which is
hardly my intention. Did you perhaps write the xfs preallocation code
that I'm bypassing? As I said, I still rely on it for log files,
mailboxes and temporary files, and it is much appreciated.

--Phil

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-03 22:28             ` Phil Karn
@ 2011-06-04  3:12               ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2011-06-04  3:12 UTC (permalink / raw)
  To: Phil Karn; +Cc: Paul Anderson, Linux fs XFS

On Fri, Jun 03, 2011 at 03:28:54PM -0700, Phil Karn wrote:
> On 6/2/11 7:54 PM, Dave Chinner wrote:
> 
> > There are definitely cases where it helps for preventing
> > fragmenting, but as a sweeping generalisation it is very, very
> > wrong.
> 
> Well, if I ever see that in practice I'll change my procedures.
> 
> > Do you do that for temporary object files when you build <program X>
> > from source?
> 
> No, that would involve patching gcc to use fallocate(). I could be wrong
> -- I don't know much about gcc internals -- but I think most temp files
> go on /tmp, which is not xfs. As I clearly said, I patched only a few
> file copy programs like rsync that I use to create long-lived files. I
> can't see why the upstream maintainers of those programs shouldn't
> accept patches to incorporate fallocate() as long as care is taken to
> avoid calling the POSIX version and no other harm is done on file
> systems or OSes that don't support it.

They are trying, but, well, the file corruption problems seen on
2.6.38/.39 kernels that are the result of them using
fiemap/fallocate don't inspire me with confidence....

> > away by simply increasing the XFS inode size at mkfs time? And that
> > there is almost no performance penalty for doing this?  Instead, it
> > seems you found a hammer named fallocate() and proceeded to treat
> > every tool you have like a nail. :)
> 
> You do realize that I started experimenting with attributes well *after*
> I had built XFS on a 6 TB (net) RAID5 that took over a week of solid
> copying to load to 50%? I had noticed the inode size parameter to
> mkfs.xfs but I wasn't about to buy four more disks, mkfs a whole new
> file system with bigger inodes and copy all my data (again) just to
> waste more space on largely empty inodes and, more importantly, require
> many more disk seeks and reads to walk through them all.
> 
> The default xfs inode is 256 bytes. That means a single 4KiB block read
> fetches 16 inodes at once. Making each inode 512 bytes means reading
> only 8 inodes in each 4KiB block. That's arithmetic.

XFS does not do inode IO like that, so your logic is flawed.

Firstly, inodes are read and written in clusters of 8k, and
contiguous inode clusters are merged during IO by the elevator.
Metadata blocks are heavily sorted before being issued for
writeback, so we get excellent large IO patterns even for metadata
IO.  Under heavy file create workloads, I'm seeing XFS consistently
write metadata to disk in 320k IOs - the maximum IO size my storage
subsystem will allow.

e.g. a couple of instructive graphs from Chris
Mason for a parallel file create workload:

http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.png
http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.ogg

The fact that ~5000 IOPS is being sustained with only 30-100 seeks/s
indicates that the elevator is merging roughly 50-100
individual IOs together into each physical IO. This will happen
regardless of inode size, so inode/metadata writeback under these
workloads tends to be limited by bandwidth, not IOPS....

Reads might be a bit more random, but due to inodes being allocated
in larger chunks (64 inodes at a time) and temporal locality effects
due to sequential allocation by apps like rsync, reads typically
occur to localised areas as well and hit track caches or RAID
controller readahead windows.

> And I'd still have no guarantee of keeping my attributes in the inodes
> without some limit on the size of the extent list.

Going from 256 -> 512 byte inodes gives you 256 bytes more space for
attributes and extents, which in your case would be entirely for
data extents. In that space you can fit another 16 extent records,
which is more than enough for 99.9% of normal files.
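
(For reference, the knob being discussed is just the mkfs-time inode
size option; the device name here is only a placeholder:

    mkfs.xfs -i size=512 /dev/mdX

It can only be set at mkfs time, which is of course exactly the
objection above.)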

> > Changing a single mkfs parameter is far less work than maintaining
> > your own forks of multiple tools....
> 
> See above. I've since built a new RAID1 array with bigger and faster
> drives and am abandoning RAID5, but I still see no reason to waste disk
> space and seeks on larger data structures that are mostly empty space.

Well, if you think that inodes are taking too much space, then I
guess you'd be really concerned about the amount of space that
directories consume and how badly they get fragmented ;)

> > Until aging has degraded your filesystem til free space is
> > sufficiently fragmented that you can't allocate large extents any
> > more. Then you are completely screwed. :/
> 
> Once again, it is very difficult to see how keeping my long-lived files
> contiguous causes free space to become more fragmented, not less. Help
> me out here; it's highly counter intuitive, and more importantly I
> haven't seen that problem, at least not yet.

Initial allocations are done via the "allocate near" algorithm.  It
starts by finding the largest freespace extent that will hold the
allocation via a -size- match i.e. it will look for a match on the
size you are asking for. If there isn't a free space extent large
enough, it will fall back to searching for a large enough extent
near to where you are asking with an increasing search radius.

Once a free space extent is found, it then trims it for alignment to
stripe unit/stripe width. This generally leaves small, isolated
chunks of free space behind, as allocations are typically not stripe
unit/width length. Hence you end up with lots of little holes
around.
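
To make that concrete, here is a toy sketch in C of the idea - an
illustration only, with made-up names, not the actual XFS allocator
code:

    /* Toy model of "allocate near": pick a free extent big enough for
     * the request, preferring one close to the target block, then trim
     * the start for stripe alignment.  The blocks skipped by that trim
     * are the little isolated free-space holes mentioned above. */
    struct free_ext { unsigned long start, len; };      /* in fs blocks */

    static long alloc_near(const struct free_ext *fr, int n,
                           unsigned long want, unsigned long target,
                           unsigned long align)
    {
        int best = -1;
        unsigned long best_dist = ~0UL;
        unsigned long start;

        for (int i = 0; i < n; i++) {
            unsigned long dist;

            if (fr[i].len < want)
                continue;                     /* too small, keep looking */
            dist = fr[i].start > target ? fr[i].start - target
                                        : target - fr[i].start;
            if (dist < best_dist) {           /* closest large-enough extent */
                best_dist = dist;
                best = i;
            }
        }
        if (best < 0)
            return -1;                        /* free space too fragmented */

        /* round the start up to the next stripe boundary if it still fits */
        start = fr[best].start;
        if (align && start % align)
            start += align - start % align;
        if (start + want > fr[best].start + fr[best].len)
            start = fr[best].start;           /* alignment doesn't fit: unaligned */

        return (long)start;
    }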

Subsequent sequential allocations use an exact block allocation
target to try to extend the contiguous allocation each file does.
For large files, this tends to keep the files contiguous, or at
least with multiple large extents rather than lots of small extents.

Then things like unrelated metadata allocations will tend to fill
those little holes, be it inodes, btree blocks, directory blocks or
attributes. If there aren't little holes (or you aren't using
alignment), they will simply sit between data extents. When you then
free the allocated data space, you've still got that unrelated
metadata lying around, and the free space is now somewhat
fragmented. This pattern gets worse as the filesystem ages.

Delayed allocation reduces the impact of this problem because it
reduces the amount of on-disk metadata modifications that occur
during normal operations. It also allows things like directory and
inode extent allocation during creates (e.g. untaring) to avoid
interleaving with data allocations, so directory and inode extents
tend to cluster and be more contiguous and not fill holes between
data extents. This means that you are less likely to get sparse
metadata blocks fragmenting free space, metadata read and write IO
is more likely to be clustered effectively (better IO performance),
and so on. IOWs, there are many reasons why delayed allocation
reduces the effects of filesystem aging compared to up-front
preallocation....

> I have a few extremely large files (many GB) that cannot be allocated a
> contiguous area. That's probably because of xfs's strategy of scattering
> files around disk to allow room for growth, which fragments the free
> space.

I doubt it. An extent can be at most 8GB on a 4kB block size
filesystem (the on-disk extent length field is only 21 bits wide),
so that's why you see multiple extents for large files, i.e. they
require multiple allocations....

> You seem to take personal offense to my use of fallocate(), which is
> hardly my intention.

Nothing personal at all.

> Did you perhaps write the xfs preallocation code
> that I'm bypassing?

No. People much smarter than me designed and wrote all this stuff.

What I'm commenting on is your implication (sweeping generalisation)
that preallocation should be used everywhere because it seems to
work for you. I don't like to let such statements stand
unchallenged, especially when there are very good reasons why it is
likely to be wrong.

I don't do this for my benefit - and I don't really care if you
benefit from it or not - but there are a lot of XFS users on this list
who might be wondering "why isn't that done by default?".
Those people learn a lot from someone trying to explain why what one
person finds beneficial for their use case might be considered
harmful to everyone else...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-03 15:59     ` Paul Anderson
@ 2011-06-04  3:15       ` Dave Chinner
  2011-06-04  8:14       ` Stan Hoeppner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2011-06-04  3:15 UTC (permalink / raw)
  To: Paul Anderson; +Cc: Christoph Hellwig, xfs-oss

On Fri, Jun 03, 2011 at 11:59:02AM -0400, Paul Anderson wrote:
> On Thu, Jun 2, 2011 at 9:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote:
> >> On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote:
> >> > This morning, I had a symptom of a I/O throughput problem in which
> >> > dirty pages appeared to be taking a long time to write to disk.
> >> >
> >> > The system is a large x64 192GiB dell 810 server running 2.6.38.5 from
> >> > kernel.org - the basic workload was data intensive - concurrent large
> >> > NFS (with high metadata/low filesize), rsync/lftp (with low
> >> > metadata/high file size) all working in a 200TiB XFS volume on a
> >> > software MD raid0 on top of 7 software MD raid6, each w/18 drives.  I
> >> > had mounted the filesystem with inode64,largeio,logbufs=8,noatime.
> >>
> >> A few comments on the setup before trying to analyze what's going on in
> >> detail.  I'd absolutely recommend an external log device for this setup,
> >> that is, buy another two fast but small disks, or take two existing ones
> >> and use a RAID 1 for the external log device.  This will speed up
> >> anything log intensive, which both the NFS and rsync workloads are, a lot.
> >>
> >> Second thing: if you have two such different workloads, split them
> >> into multiple volumes so that they don't interfere with
> >> each other.
> >>
> >> Third, a RAID0 on top of RAID6 volumes sounds like pretty much a worst case
> >> for almost any type of I/O.  You end up doing even relatively small I/O
> >> to all of the disks in the worst case.  I think you'd be much better
> >> off with a simple linear concatenation of the RAID6 devices, even if you
> >> can split them into multiple filesystems
> >>
> >> > The specific symptom was that 'sync' hung, a dpkg command hung
> >> > (presumably trying to issue fsync), and experimenting with "killall
> >> > -STOP" or "kill -STOP" of the workload jobs didn't let the system
> >> > drain I/O enough to finish the sync.  I probably did not wait long
> >> > enough, however.
> >>
> >> It really sounds like you're simply killing the MD setup with a
> >> lot of log I/O that goes to all the devices.
> >
> > And this is one of the reasons why I originally suggested that
> > storage at this scale really should be using hardware RAID with
> > large amounts of BBWC to isolate the backend from such problematic
> > IO patterns.
> 
> > Dave Chinner
> > david@fromorbit.com
> >
> 
> Good HW RAID cards are on order - seems to be backordered at least a
> few weeks now at CDW.  Got the batteries immediately.
> 
> That will give more options for test and deployment.
> 
> Not sure what I can do about the log - man page says xfs_growfs
> doesn't implement log moving.  I can rebuild the filesystems, but for
> the one mentioned in this thread, this will take a long time.

Once you have BBWC, the log IO gets aggregated into stripe width
writes to the back end (because it is always sequential IO), so it's
generally not a significant problem for HW RAID subsystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-03 15:59     ` Paul Anderson
  2011-06-04  3:15       ` Dave Chinner
@ 2011-06-04  8:14       ` Stan Hoeppner
  2011-06-04 10:32         ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Stan Hoeppner @ 2011-06-04  8:14 UTC (permalink / raw)
  To: Paul Anderson; +Cc: Christoph Hellwig, xfs-oss

On 6/3/2011 10:59 AM, Paul Anderson wrote:

Hi Paul,

When I first replied to this thread I didn't recognize your name, thus
forgot our off-list conversation.  Sorry bout that.

> Good HW RAID cards are on order - seems to be backordered at least a
> few weeks now at CDW.  Got the batteries immediately.

As I mentioned, the 9285-8E is very new product, but I didn't realize it
was *that* new.  Sorry you're having to wait for them.

> That will give more options for test and deployment.

Others have made valid points WRT the down sides of wide stripe parity
arrays.  I've mentioned many times I loathe parity RAID due to those
reasons, and others, but it's mandatory in your case due to the reasons
you previously stated.

If such arguments are sufficiently convincing, and you can afford to
lose the capacity of 2 more disks per chassis to parity, and increase
complexity a bit, you may want to consider 3 x 7 drive RAID5 arrays per
backplane, 6 drive stripe width, 18 total arrays concatenated,  216 AGs,
12 AGs per array, 216TB usable storage per server, if my math is correct.
That instead of the concatenated 6 x 21 drive RAID6 arrays I previously
mentioned.

You'd have 3 arrays per backplane/cable and thus retain some isolation
advantages for troubleshooting, with the same spares arrangement.  Your
overall resiliency, mathematical/theoretical anyway, to drive failure
should actually increase slightly as you would have 3 drives per
backplane worth of parity instead of 2, and array rebuild time would be
~1/3rd that of the 21 drive array, somewhat negating the dual parity
advantage of RAID6 as the odds of drive failure during a rebuild tend to
increase with the duration of the rebuild.

> Not sure what I can do about the log - man page says xfs_growfs
> doesn't implement log moving.  I can rebuild the filesystems, but for
> the one mentioned in this thread, this will take a long time.

See the logdev mount option.  Using two mirrored drives was recommended,
I'd go a step further and use two quality "consumer grade", i.e. MLC
based, SSDs, such as:

http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx

Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive.
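
For reference, moving the log means recreating the filesystem with an
external log and then always mounting with the matching option, along
these lines (device names and the log size are placeholders, not a
recommendation):

    mkfs.xfs -l logdev=/dev/md10,size=512m /dev/md0
    mount -o logdev=/dev/md10,inode64,noatime /dev/md0 /storage

which is exactly the rebuild pain Paul mentioned above.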

> I'm guessing we'll need to split out the workload - aside from the
> differences in file size and use patterns, they also have
> fundamentally different values (the high metadata dataset happens to
> be high value relative to the low metadata/large file dataset).

LSI is touting significantly better parity performance for the 9265/9285
vs LSI's previous generation cards for which they claim peaks of ~2700
MB/s sequential read and ~1800 MB/s write.  The new cards have double
the cache of the previous, so I would think write performance would
increase more than read.  I'm really interested in seeing your test
results with your workloads Paul.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-04  8:14       ` Stan Hoeppner
@ 2011-06-04 10:32         ` Dave Chinner
  2011-06-04 12:11           ` Stan Hoeppner
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2011-06-04 10:32 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss

On Sat, Jun 04, 2011 at 03:14:53AM -0500, Stan Hoeppner wrote:
> On 6/3/2011 10:59 AM, Paul Anderson wrote:
> > Not sure what I can do about the log - man page says xfs_growfs
> > doesn't implement log moving.  I can rebuild the filesystems, but for
> > the one mentioned in this thread, this will take a long time.
> 
> See the logdev mount option.  Using two mirrored drives was recommended,
> I'd go a step further and use two quality "consumer grade", i.e. MLC
> based, SSDs, such as:
> 
> http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx
> 
> Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive.

If you are using delayed logging, then a pair of mirrored 7200rpm
SAS or SATA drives would be sufficient for most workloads as the log
bandwidth rarely gets above 50MB/s in normal operation.

If you have fsync heavy workloads, or are not using delayed logging,
then you really need to use the RAID5/6 device behind a BBWC because
the log is -seriously- bandwidth intensive. I can drive >500MB/s of
log throughput on metadata intensive workloads on 2.6.39 when not
using delayed logging or I'm regularly forcing the log via fsync.
You sure as hell don't want to be running a sustained long term
write load like that on consumer grade SSDs.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-04 10:32         ` Dave Chinner
@ 2011-06-04 12:11           ` Stan Hoeppner
  2011-06-04 23:10             ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Stan Hoeppner @ 2011-06-04 12:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss

On 6/4/2011 5:32 AM, Dave Chinner wrote:
> On Sat, Jun 04, 2011 at 03:14:53AM -0500, Stan Hoeppner wrote:
>> On 6/3/2011 10:59 AM, Paul Anderson wrote:
>>> Not sure what I can do about the log - man page says xfs_growfs
>>> doesn't implement log moving.  I can rebuild the filesystems, but for
>>> the one mentioned in this thread, this will take a long time.
>>
>> See the logdev mount option.  Using two mirrored drives was recommended,
>> I'd go a step further and use two quality "consumer grade", i.e. MLC
>> based, SSDs, such as:
>>
>> http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx
>>
>> Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive.
> 
> If you are using delayed logging, then a pair of mirrored 7200rpm
> SAS or SATA drives would be sufficient for most workloads as the log
> bandwidth rarely gets above 50MB/s in normal operation.

Hi Dave.  I made the first reply to Paul's post, recommending he enable
delayed logging as a possible solution to his I/O hang problem.  I
recommended this due to his mention of super heavy metadata operations
at the time on his all md raid60 on plain HBA setup.  Paul did not list
delaylog when he submitted his 2.6.38.5 mount options:

inode64,largeio,logbufs=8,noatime
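
For comparison, enabling it would just mean adding the delaylog option,
e.g. as an fstab entry (device and mount point are placeholders):

  /dev/md0  /storage  xfs  inode64,largeio,logbufs=8,noatime,delaylog  0  0

On 2.6.38 delayed logging is still opt-in; it only became the default in
later kernels.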

Since you are the author of the delayed logging code, I had expected
you to comment on this, either expounding on my recommendation, or
shooting it down and giving the reasons why.

So, would delayed logging have possibly prevented his hang problem or
no?  I always read your replies at least twice, and I don't recall you
touching on delayed logging in this thread.  If you did and I missed it,
my apologies.

Paul will have 3 of LSI's newest RAID cards with a combined 3GB BBWC to
test with, hopefully soon.  With that much cache is an external log
device still needed?  With and/or without delayed logging enabled?

> If you have fsync heavy workloads, or are not using delayed logging,
> then you really need to use the RAID5/6 device behind a BBWC because
> the log is -seriously- bandwidth intensive. I can drive >500MB/s of
> log throughput on metadata intensive workloads on 2.6.39 when not
> using delayed logging or I'm regularly forcing the log via fsync.
> You sure as hell don't want to be running a sustained long term
> write load like that on consumer grade SSDs.....

Given that the max log size is 2GB, IIRC, and that most recommendations
I've seen here are against using a log that big, I figure such MLC
drives would be fine.  AIUI, modern wear leveling will spread writes
throughout the entire flash array before going back and overwriting the
first sector.  Published MTBF rates on most MLC drives are roughly
equivalent to enterprise SRDs, 1+ million hours.

Do you believe MLC based SSDs are simply never appropriate for anything
but consumer use, and that only SLC devices should be used for real
storage applications?  AIUI SLC flash cells do have about a 10:1 greater
lifetime than MLC cells.  However, there have been a number of
articles/posts demonstrating math which shows a current generation
SandForce based MLC SSD, under a constant 100MB/s write stream, will run
for 20+ years, IIRC, before sufficient live+reserved spare cells burn
out to cause hard write errors, thus necessitating drive replacement.
Under your 500MB/s load, assuming that's constant, the drives would
theoretically last 4+ years.  If that 500MB/s load was only for 12 hours
each day, the drives would last 8+ years.  I wish I had one of those
articles bookmarked...

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-04 12:11           ` Stan Hoeppner
@ 2011-06-04 23:10             ` Dave Chinner
  2011-06-05  1:31               ` Stan Hoeppner
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2011-06-04 23:10 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss

On Sat, Jun 04, 2011 at 07:11:50AM -0500, Stan Hoeppner wrote:
> On 6/4/2011 5:32 AM, Dave Chinner wrote:
> > On Sat, Jun 04, 2011 at 03:14:53AM -0500, Stan Hoeppner wrote:
> >> On 6/3/2011 10:59 AM, Paul Anderson wrote:
> >>> Not sure what I can do about the log - man page says xfs_growfs
> >>> doesn't implement log moving.  I can rebuild the filesystems, but for
> >>> the one mentioned in this thread, this will take a long time.
> >>
> >> See the logdev mount option.  Using two mirrored drives was recommended,
> >> I'd go a step further and use two quality "consumer grade", i.e. MLC
> >> based, SSDs, such as:
> >>
> >> http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx
> >>
> >> Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive.
> > 
> > If you are using delayed logging, then a pair of mirrored 7200rpm
> > SAS or SATA drives would be sufficient for most workloads as the log
> > bandwidth rarely gets above 50MB/s in normal operation.
> 
> Hi Dave.  I made the first reply to Paul's post, recommending he enable
> delayed logging as a possible solution to his I/O hang problem.  I
> recommended this due to his mention of super heavy metadata operations
> at the time on his all md raid60 on plain HBA setup.  Paul did not list
> delaylog when he submitted his 2.6.38.5 mount options:
> 
> inode64,largeio,logbufs=8,noatime
> 
> Since you are the author of the delayed logging code, I had expected
> you to comment on this, either expounding on my recommendation, or
> shooting it down and giving the reasons why.
> 
> So, would delayed logging have possibly prevented his hang problem or
> no?  I always read your replies at least twice, and I don't recall you
> touching on delayed logging in this thread.  If you did and I missed it,
> my apologies.

It might, but delayed logging is not the solution to every problem,
and NFS servers are notoriously heavy on log forces due to COMMIT
operations during writes. So it's a good bet that delayed logging
won't fix the problem entirely.

> > If you have fsync heavy workloads, or are not using delayed logging,
> > then you really need to use the RAID5/6 device behind a BBWC because
> > the log is -seriously- bandwidth intensive. I can drive >500MB/s of
> > log throughput on metadata intensive workloads on 2.6.39 when not
> > using delayed logging or I'm regularly forcing the log via fsync.
> > You sure as hell don't want to be running a sustained long term
> > write load like that on consumer grade SSDs.....
> 
> Given that the max log size is 2GB, IIRC, and that most recommendations
> I've seen here are against using a log that big, I figure such MLC
> drives would be fine.  AIUI, modern wear leveling will spread writes
> throughout the entire flash array before going back and overwriting the
> first sector.  Published MTBF rates on most MLC drives are roughly
> equivalent to enterprise SRDs, 1+ million hours.
> 
> Do you believe MLC based SSDs are simply never appropriate for anything
> but consumer use, and that only SLC devices should be used for real
> storage applications?  AIUI SLC flash cells do have about a 10:1 greater
> lifetime than MLC cells.  However, there have been a number of
> articles/posts demonstrating math which shows a current generation
> SandForce based MLC SSD, under a constant 100MB/s write stream, will run
> for 20+ years, IIRC, before sufficient live+reserved spare cells burn
> out to cause hard write errors, thus necessitating drive replacement.
> Under your 500MB/s load, assuming that's constant, the drives would
> theoretically last 4+ years.  If that 500MB/s load was only for 12 hours
> each day, the drives would last 8+ years.  I wish I had one of those
> articles bookmarked...

That's the theory, anyway. Let's call it an expected 4 year life
cycle under this workload (which is highly optimistic, IMO). Now you
have two drives in RAID1, so that means one will fail in 2 years; and if
you need more drives to sustain the performance the log needs (*)
you might be looking at 4 or more drives, which brings the expected
time to a drive failure down to a year or less. Multiply that across
5-10 servers, and that's a drive failure every month just on the log
devices.

That failure rate would make me extremely nervous - losing the log
is a -major- filesystem corruption event - and make me want to spend
more money or change the config to reduce the risk of a double
failure causing the log device to be lost. Especially if there are
hundreds of terabytes of data at risk.

Cheers,

Dave.

(*) You have to consider that sustained workloads mean that the
drives don't get idle time to trigger background garbage collection,
which is one of the key features that current consumer level drives
rely on for maintaining performance and even wear levelling. The
"spare" area in the drives is kept small because it is assumed that
there won't be long term sustained IO so that the garbage collection
can clean up before spare area is exhausted.

Enterprise drives have a much larger relative percentage of flash in
the drive reserved as spare to avoid severe degradation in such
sustained (common enterprise) workloads.  Hence performance on
consumer MLC drives tails off much more quickly than SLC drives.

Hence performance on consumer MLC drives may not be sustainable, and
wear leveling may not be optimal, resulting in flash failure earlier
than you expect.  To maintain performance, you'll need more MLC
drives to maintain baseline performance.  And with more drives, the
chance of failure goes up...
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-04 23:10             ` Dave Chinner
@ 2011-06-05  1:31               ` Stan Hoeppner
  0 siblings, 0 replies; 25+ messages in thread
From: Stan Hoeppner @ 2011-06-05  1:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss

On 6/4/2011 6:10 PM, Dave Chinner wrote:
> On Sat, Jun 04, 2011 at 07:11:50AM -0500, Stan Hoeppner wrote:

>> So, would delayed logging have possibly prevented his hang problem or
>> no?  I always read your replies at least twice, and I don't recall you
>> touching on delayed logging in this thread.  If you did and I missed it,
>> my apologies.
> 
> It might, but delayed logging is not the solution to every problem,
> and NFS servers are notoriously heavy on log forces due to COMMIT
> operations during writes. So it's a good bet that delayed logging
> won't fix the problem entirely.

So the solution in this case will likely require a multi-pronged
approach, including the XFS optimization and the RAID card and/or RAID
level reconfiguration that have been mentioned.

>> Do you believe MLC based SSDs are simply never appropriate for anything
>> but consumer use, and that only SLC devices should be used for real
>> storage applications?  AIUI SLC flash cells do have about a 10:1 greater
>> lifetime than MLC cells.  However, there have been a number of
>> articles/posts demonstrating math which shows a current generation
>> SandForce based MLC SSD, under a constant 100MB/s write stream, will run
>> for 20+ years, IIRC, before sufficient live+reserved spare cells burn
>> out to cause hard write errors, thus necessitating drive replacement.
>> Under your 500MB/s load, assuming that's constant, the drives would
>> theoretically last 4+ years.  If that 500MB/s load was only for 12 hours
>> each day, the drives would last 8+ years.  I wish I had one of those
>> articles bookmarked...
> 
> That's the theory, anyway. Let's call it an expected 4 year life
> cycle under this workload (which is highly optimistic, IMO). Now you
> have two drives in RAID1, so that means one will fail in 2 years; and if
> you need more drives to sustain the performance the log needs (*)
> you might be looking at 4 or more drives, which brings the expected
> time to a drive failure down to a year or less. Multiply that across
> 5-10 servers, and that's a drive failure every month just on the log
> devices.

Very good point.  I was looking at single system probabilities instead
of farm scale (shame on me for that newbish oversight).

> That failure rate would make me extremely nervous - losing the log
> is a -major- filesystem corruption event - and make me want to spend
> more money or change the config to reduce the risk of a double
> failure causing the log device to be lost. Especially if there are
> hundreds of terabytes of data at risk.

> Cheers,
> 
> Dave.
> 
> (*) You have to consider that sustained workloads mean that the
> drives don't get idle time to trigger background garbage collection,
> which is one of the key features that current consumer level drives
> rely on for maintaining performance and even wear levelling. The
> "spare" area in the drives is kept small because it is assumed that
> there won't be long term sustained IO so that the garbage collection
> can clean up before spare area is exhausted.
> 
> Enterprise drives have a much larger relative percentage of flash in
> the drive reserved as spare to avoid severe degradation in such
> sustained (common enterprise) workloads.  Hence performance on
> consumer MLC drives tails off much more quickly than SLC drives.

Ahh, I didn't realize the SLC drives have much larger reserved areas.
Shame on me again.  A hardwarefreak should know such things. :(

> Hence performance on consumer MLC drives may not be sustainable, and
> wear leveling may not be optimal, resulting in flash failure earlier
> than you expect.  To maintain performance, you'll need more MLC
> drives to maintain baseline performance.  And with more drives, the
> chance of failure goes up...

Are the enterprise SLC drives able to perform garbage collection etc
while under such constant load?  If not, is it always better to use SRDs
for the log, either internal on a BBWC array, or an external mirrored pair?

I previously mentioned I always read your posts twice.  You are a deep
well of authoritative information and experience.  Keep up the great
work and contribution to the knowledge base of this list.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-03 22:19     ` Peter Grandi
@ 2011-06-06  7:29       ` Michael Monnerie
  2011-06-07 14:09         ` Peter Grandi
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Monnerie @ 2011-06-06  7:29 UTC (permalink / raw)
  To: xfs


On Saturday, 4 June 2011 Peter Grandi wrote:
>   vm/dirty_ratio=2
>   vm/dirty_bytes=400000000
> 
>   vm/dirty_background_ratio=60
>   vm/dirty_background_bytes=0
> 
>   vm/dirty_expire_centisecs=200
>   vm/dirty_writeback_centisecs=400

Why dirty_background_ratio=60? This would mean you start to write dirty 
pages only after it reaches 60% of total system memory... Setting it to 
=1 would be the thing you want I guess.
Also, setting both dirty_background_(ratio|bytes) is not supported. The 
latter wins, according to sysctl/vm.txt

Similarly, dirty_ratio and dirty_bytes belong together and exclude each 
other. Maybe you specified both to fit older and newer kernels in one 
example?

dirty_expire_centisecs to 200 means a sync every 2s, which might be good 
in this specific setup mentioned here, but not for a generic server. 
That would defeat XFS's in-memory grouping of blocks before writeout, 
and in case of many parallel (slow|ftp) uploads could lead to much more 
data fragmentation, or no?


-- 
With kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531

// House for sale: http://zmi.at/langegg/

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-06  7:29       ` Michael Monnerie
@ 2011-06-07 14:09         ` Peter Grandi
  2011-06-08  5:18           ` Dave Chinner
  2011-06-08  8:32           ` Michael Monnerie
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Grandi @ 2011-06-07 14:09 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

[ ... ]

>> vm/dirty_ratio=2
>> vm/dirty_bytes=400000000
>> 
>> vm/dirty_background_ratio=60
>> vm/dirty_background_bytes=0

> Why dirty_background_ratio=60? This would mean you start to
> write dirty pages only after it reaches 60% of total system
> memory...

Oops, invert 'dirty_background_*' and 'dirty_*', I was writing
from memory and got it the wrong way round. These are BTW my
notes in my 'sysctl.conf', with pointer to a nice discussion:

  # http://www.westnet.com/~gsmith/content/linux-pdflush.htm

  # dirty_ratio
  #   If more than this percentage of active memory is unflushed then
  #   *all* processes that are writing start writing synchronously.
  # dirty_background_ratio
  #   If more than this percentage of active memory is unflushed the
  #   system starts flushing.
  # dirty_expire_centisecs
  #   How long a page can be dirty before it gets flushed.
  # dirty_writeback_centisecs
  #   How often the flusher runs.

  # In 'mm/page-writeback.c' there is code that makes sure that in effect
  # the 'dirty_background_ratio' must be smaller (half if larger or equal)
  # than the 'dirty_ratio', and other code to put lower limits on
  # 'dirty_writeback_centisecs' and whatever.

> [ ... '*_bytes' and '*_ratio' Maybe you specified both to fit
> older and newer kernels in one example?

Yes. I had written what I thought was a much simpler/neater
change here:

  http://www.sabi.co.uk/blog/0707jul.html#070701

but I currently put in both versions and let the better one win
:-).

>> vm/dirty_expire_centisecs=200
>> vm/dirty_writeback_centisecs=400

> dirty_expire_centisecs to 200 means a sync every 2s, which
> might be good in this specific setup mentioned here,

Not quite, see above. There are times where I think the values
should be the other way round (run the flusher every 2s and
flush pages dirty for more than 4s).
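
In the notation above, that swapped variant would simply be:

  vm/dirty_expire_centisecs=400
  vm/dirty_writeback_centisecs=200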

> but not for a generic server.

Uhmmm, I am not so sure. Because I think that flushes should be
related to IO speed, and even on a smaller system 2 seconds of
IO are a lot of data. Quite a few traditional Linux (and Unix)
tunables are set to defaults from a time when hardware was much
slower. I started using UNIX when there was no 'update' daemon,
and I got into the habit, which I still have, of typing 'sync'
explicitly every now and then; when 'update' was introduced to do
'sync' every 30s, there was not a lot of data one could lose in
those 30s.

> That would defeat XFS's in-memory grouping of blocks before
> writeout, and in case of many parallel (slow|ftp) uploads
> could lead to much more data fragmentation, or no?

Well, it depends on what "fragmentation" means here. It is a
long standing item of discussion. It is nice to see a 10GB file
all in one extent, but is it *necessary*?

As long as a file is composed of fairly large contiguous extents
and they are not themselves widely scattered, things are going
to be fine. What matters is the ratio of long seeks to data
reads, and minimizing that is not the same as reducing seeks to
zero.

Now consider two common cases:

  * A file that is written out at speed, say 100-500MB/s. 2-4s
    means that there is an opportunity to allocate 200MB-2GB
    contiguous extents, and with any luck much larger ones.
    Conversely any larger interval means potentially losing
    200MB-2GB of data. Sure, if it did not want to lose the
    data the user process should be doing 'fdatasync()', but XFS
    in particular is sort of pretty good at doing a mild version
    of 'O_PONIES' where there is a balance between going as fast
    as possible (buffer a lot in memory) and offering *some*
    level of safety (as shown in the tests I did for a fair
    comparison with 'ext3').

  * A file that is written slowly in small chunks. Well,
    *nothing* will help that except preallocation or space
    reservations.

Personally I'd rather have a file system design with space
reservations (on detecting an append-like access pattern) and
truncate-on-close than delayed allocation like XFS; while
delayed allocation seems to work well enough in many cases, it
is not quite "the more the merrier".

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-07 14:09         ` Peter Grandi
@ 2011-06-08  5:18           ` Dave Chinner
  2011-06-08  8:32           ` Michael Monnerie
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2011-06-08  5:18 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Michael Monnerie, xfs

On Tue, Jun 07, 2011 at 03:09:09PM +0100, Peter Grandi wrote:
> Personally I'd rather have a file system design with space
> reservations (on detecting an append-like access pattern) and
> truncate-on-close than delayed allocation like XFS;

Welcome to the 1990s, Peter. XFS has been doing this for 15 years.
It is an optimisation used by the delayed allocation mechanism,
not a replacement for it. You might have heard the term
"speculative preallocation" before - this is what it does.

FYI, ext3 has a space reservation infrastructure to try to ensure
contiguous allocation occurs without using delayed allocation. It
doesn't work nearly as well as delayed allocation in ext4, btrfs or
XFS...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: I/O hang, possibly XFS, possibly general
  2011-06-07 14:09         ` Peter Grandi
  2011-06-08  5:18           ` Dave Chinner
@ 2011-06-08  8:32           ` Michael Monnerie
  1 sibling, 0 replies; 25+ messages in thread
From: Michael Monnerie @ 2011-06-08  8:32 UTC (permalink / raw)
  To: xfs; +Cc: Peter Grandi


On Tuesday, 7 June 2011 Peter Grandi wrote:
>   * A file that is written out at speed, say 100-500MB/s. 2-4s
>     means that there is an opportunity to allocate 200MB-2GB
>     contiguous extents, and with any luck much larger ones.
>     Conversely any larger interval means potentially losing
>     200MB-2GB of data. Sure, if it did not want to lose the
>     data the user process should be doing 'fdatasync()', but XFS
>     in particular is sort of pretty good at doing a mild version
>     of 'O_PONIES' where there is a balance between going as fast
>     as possible (buffer a lot in memory) and offering some
>     level of safety (as shown in the tests I did for a fair
>     comparison with 'ext3').

On a PC, that "loosing 2GB of data" is loosing a single file under 
normal use. It's quite seldom that people are copying data around. And 
even if, when the crash happens they usually know what they just did, 
and restart the copy after a crash.

If we speak about a server, normally there should be a HW RAID card in
it with a good cache, and then it's true you should limit the Linux
write cache and flush early and often, as the card has BBWC and
therefore data is protected once it reaches the RAID card. People tend
to forget to lower the writeback settings when using RAID controllers +
BBWC, and it's documented almost nowhere. Maybe good for a FAQ entry on
XFS, even if it's not XFS specific?
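
As a rough illustration only (the numbers are made up, not a tested
recommendation), such a box could cap the dirty page backlog with
something like:

  # /etc/sysctl.conf fragment: don't let the kernel hoard dirty pages
  # when the RAID controller's BBWC already absorbs the write bursts.
  # Start background writeback at 256MB of dirty data:
  vm.dirty_background_bytes = 268435456
  # Block writers once 1GB of dirty data has accumulated:
  vm.dirty_bytes = 1073741824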

I wonder if there is a good document for "best practise" on VMs? I've 
never seen someone testing a VMware/XEN host with 20 Linux VMs, and what 
the settings should be for vm.dirty* and net.ipv4.* values. I've seen 
crashes on VM servers, where afterwards databases in VMs were broken 
despite using a RAID card +BBWC...
 
>   * A file that is written slowly in small chunks. Well,
>     nothing will help that except preallocation or space
>     reservations.

Now for a common webserver we run, as a guideline there are about 8
parallel uploads going on all the time. Most of them are slow, as
people are on ADSL. If you sync quite often, you're lucky if XFS still
gets to do its preallocation and all that. Otherwise, you'd have chunks
of all the files scattered across the disk.

-- 
With kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531

// House for sale: http://zmi.at/langegg/

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-06-08  8:33 UTC | newest]

Thread overview: 25+ messages
2011-06-02 14:42 I/O hang, possibly XFS, possibly general Paul Anderson
2011-06-02 16:17 ` Stan Hoeppner
2011-06-02 18:56 ` Peter Grandi
2011-06-02 21:24   ` Paul Anderson
2011-06-02 23:59     ` Phil Karn
2011-06-03  0:39       ` Dave Chinner
2011-06-03  2:11         ` Phil Karn
2011-06-03  2:54           ` Dave Chinner
2011-06-03 22:28             ` Phil Karn
2011-06-04  3:12               ` Dave Chinner
2011-06-03 22:19     ` Peter Grandi
2011-06-06  7:29       ` Michael Monnerie
2011-06-07 14:09         ` Peter Grandi
2011-06-08  5:18           ` Dave Chinner
2011-06-08  8:32           ` Michael Monnerie
2011-06-03  0:06   ` Phil Karn
2011-06-03  0:42 ` Christoph Hellwig
2011-06-03  1:39   ` Dave Chinner
2011-06-03 15:59     ` Paul Anderson
2011-06-04  3:15       ` Dave Chinner
2011-06-04  8:14       ` Stan Hoeppner
2011-06-04 10:32         ` Dave Chinner
2011-06-04 12:11           ` Stan Hoeppner
2011-06-04 23:10             ` Dave Chinner
2011-06-05  1:31               ` Stan Hoeppner
