* frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
@ 2011-08-06 12:25 Marc Lehmann
  2011-08-06 14:20 ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-06 12:25 UTC (permalink / raw)
  To: xfs

I get frequent (for servers) lockups and crashes when using 2.6.39. I saw the
same problems using 3.0.0-rc4, rc5 and rc6, and I think also 2.6.38. I don't
see these lockups on 2.6.30 or 2.6.26 (all the respective latest Debian
kernels).

The symptoms differ slightly - sometimes I get thousands of backtraces
before the machine locks up, but usually I get only one, and either the
machine locks up completely, or only the processes using the filesystem in
question (presumably) lock up - all unkillable.

The backtraces look all very similar:

   http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt

This is from a desktop system - it tends to be harder to get these from
servers.

All the backtraces crash with a null pointer dereference in xfs_iget or
in xfs_trans_log_inode, and always for process xfs_fsr.

I haven't seen a crash without xfs_fsr.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-06 12:25 frequent kernel BUG and lockups - 2.6.39 + xfs_fsr Marc Lehmann
@ 2011-08-06 14:20 ` Dave Chinner
  2011-08-07  1:42   ` Marc Lehmann
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2011-08-06 14:20 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Sat, Aug 06, 2011 at 02:25:56PM +0200, Marc Lehmann wrote:
> I get frequent (for servers) lockups and crashes when using 2.6.39. I saw the
> same problems using 3.0.0-rc4, rc5 and rc6, and I think also 2.6.38. I don't
> see these lockups on 2.6.30 or 2.6.26 (all the respective latest Debian kernels).
> 
> The symptoms differ slightly - sometimes I get thousands of backtraces
> before the machine locks up, but usually I get only one, and either the
> machine locks up completely, or only the processes using the filesystem in
> question (presumably) lock up - all unkillable.
> 
> The backtraces look all very similar:
> 
>    http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt

Tainted kernel. Please reproduce without the NVidia binary drivers.

> this is from a desktop system - it tends to be harder to get these from
> servers.
> 
> all the backtraces crash with a null pointer dereference in xfs_iget, or
> in xfs_trans_log_inode, and always for process xfs_fsr.

and when you do, please record an event trace of the
xfs_swap_extent* trace points while xfs_fsr is running and triggers
a crash. That will tell me if xfs_fsr is corrupting inodes,

> I haven't seen a crash without xfs_fsr.

Then don't use xfs_fsr until we know if it is the cause of the
problem (except to reproduce the problem).

And as I always ask - why do you need to run xfs_fsr so often?  Do
you really have filesystems that get quickly fragmented (or are you
just running it from a cron-job because having on-line
defragmentation is what all the cool kids do ;)? If you are getting
fragmentation, what is the workload that is causing it?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-06 14:20 ` Dave Chinner
@ 2011-08-07  1:42   ` Marc Lehmann
  2011-08-07 10:26     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-07  1:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sun, Aug 07, 2011 at 12:20:05AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > The backtraces look all very similar:
> > 
> >    http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt
> 
> Tainted kernel. Please reproduce without the NVidia binary drivers.

This is just because it is form my desktop system. None of my other
machines have a tainted kernel, but getting backtraces from there is much
harder.

> > all the backtraces crash with a null pointer dereference in xfs_iget, or
> > in xfs_trans_log_inode, and always for process xfs_fsr.
> 
> and when you do, please record an event trace of the
> xfs_swap_extent* trace points while xfs_fsr is running and triggers
> a crash. That will tell me if xfs_fsr is corrupting inodes,

Ah - how do I do that?

> > I haven't seen a crash without xfs_fsr.
> 
> Then don't use xfs_fsr until we know if it is the cause of the
> problem (except to reproduce the problem).

Why so defensive? xfs_fsr is an advertised feature and should just work
(and does so with older kernels).

> And as I always ask - why do you need to run xfs_fsr so often?  Do

Did I say I am running it often? It typically runs once a day for an hour.

> you really have filesystems that get quickly fragmented (or are you

Yes, fragmentation with XFS is enormous - I have yet to see whether
the changes in recent kernels make a big difference, but for log files,
reading through a log file with 60000 fragments tends to be much slower
than reading through one with just a few fragments (or just one...).

Freenet and other daemons also create enormous fragmentation.

As such, XFS is much, much worse at fragmentation than ext4, but at least
it has xfs_fsr, which reduces file fragmentation.

> just running it from a cron-job because having on-line
> defragmentation is what all the cool kids do ;)?

Didn't know that, maybe I should run it more often then... Or maybe not,
now that you tell me I shouldn't, because the XFS implementation quality is
so much lower than that of other filesystems?

> If you are getting fragmentation, what is the workload that is causing
> it?

Basically, anything but the OS itself. Copying large video files while the
disk is busy with other things causes lots of fragmentation (usually 30
fragments for a 100MB file), which in turn slows things down enormously once
the disk reaches 95% full.

Freenet is also a good test case.

As are logfiles.

Or a news spool.

Or database files for databases that grow files (such as mysql myisam) -
fortunately I could move all of those to SSDs this year.

Or simply unpacking an archive.

Simple example - the www.deliantra.net gameserver writes logs to a logfile
and stdout, which is redirected to another logfile in the same directory
(which gets truncated on each restart).

Today I had to reboot the server because of buggy XFS (which prompted the
bug report, as I have been seeing this bug for a while now but so far didn't
want to rule out e.g. bad RAM or simply a corrupt filesystem), and in the 4
hours of uptime, I got a 4MB logfile with 8 fragments.

This is clearly an improvement over the 2.6.26 kernel I used before on
that machine. But over a few months this still leads to thousands of
fragments, and scanning through a few gigabytes of log file that has 60000
fragments on a disk that isn't completely idle is not exactly fast.

The webserver access log on that machine, which is a file on its own in its
own directory, is 15MB (it was restarted at the beginning of last month) and
has 1043 fragments (it doesn't get defragmented by xfs_fsr because it is in
use).

OTOH, that filesystem isn't used much and has 300GB free out of 500, so
it is surprising that I still get so many fragments (the files are only
closed when running xfs_fsr on them, which is once every few weeks).

Freenet fares much worse. The persistent blob has 1757 fragments for 13GB
(not that bad), and the download database has 22756 fragments for 600MB
(that sucks).

On my TV, the recorded video files that haven't been defragmented yet
have between 11 and 63 fragments (all smaller than 2GB), which is almost
acceptable, but I do not think that without a regular xfs_fsr the fs would
be in that good a shape after one or two years of usage.

The cool thing about xfs_fsr is not that the cool kids run it, but that,
unlike with other filesystems that also fragment a lot (ext3 is absolutely
horrible, for example), the fragmentation can mostly be fixed.

Given that XFS is clearly the lowest quality of the common filesystems
on Linux (by which I mean reiserfs and ext2/3/4 - and before you ask,
literally every time I run a filesystem check, xfs_repair crashes or hangs,
the filesystems have some issues on all my numerous machines, and the
number of bugs I have hit with XFS is easily twice the number of bugs I
hit with reiserfs and extX together, and I was an early adopter of
reiserfs, before it even had an fsck), it is important to have some
features left that make up for this general lack of quality.

Right now, these features for me are the very tunable nature of XFS (for
example, 512-byte block size for news spools), the very fast xfs_repair,
and the long-term maintainability of the filesystem - a heavily used ext3
filesystem basically becomes unusable after a year.

Another feature was the very good feedback I got from this list in the
past w.r.t. bugs and fixes (while nowadays I have to listen to "xfs is
optimised for nfs not for your use" or "then don't use it" replies to bug
reports).

All that, plus the fact that I haven't lost a single important file and
the steady improvements to performance in XFS, makes XFS currently my
filesystem of choice, especially for heavy-duty applications.

PS: I run xfs on a total of about 40TB of filesystems at the moment.

PPS: sorry for being so forcefully truthful about XFS above, but you
really need an attitude change - don't tell people not to use a feature,
or tell them they probably just want to be cool kids - the implementation
quality of XFS is far from that of reiserfs or ext3 (not sure about ext4
yet, but I do expect e2fsck not to let me down as often as xfs_repair),
there are things to do, and I contribute what little I can by testing XFS
with actual workloads.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-07  1:42   ` Marc Lehmann
@ 2011-08-07 10:26     ` Dave Chinner
  2011-08-08 19:02       ` Marc Lehmann
  2011-08-09  9:16       ` Marc Lehmann
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2011-08-07 10:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Sun, Aug 07, 2011 at 03:42:38AM +0200, Marc Lehmann wrote:
> On Sun, Aug 07, 2011 at 12:20:05AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > The backtraces look all very similar:
> > > 
> > >    http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt
> > 
> > Tainted kernel. Please reproduce without the NVidia binary drivers.
> 
> This is just because it is form my desktop system. None of my other
> machines have a tainted kernel, but getting backtraces from there is much
> harder.
> 
> > > all the backtraces crash with a null pointer dereference in xfs_iget, or
> > > in xfs_trans_log_inode, and always for process xfs_fsr.
> > 
> > and when you do, please record an event trace of the
> > xfs_swap_extent* trace points while xfs_fsr is running and triggers
> > a crash. That will tell me if xfs_fsr is corrupting inodes,
> 
> Ah - how do I do that?

Use trace-cmd or do it manually via:

# echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_before/enable
# echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_after/enable
# cat /sys/kernel/debug/tracing/trace_pipe > trace.out
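
Roughly, the trace-cmd equivalent would be something like the following
(a sketch only - it assumes trace-cmd is installed and that the xfs event
group is available on your kernel):

# record both swap-extent trace points into trace.dat while xfs_fsr runs
trace-cmd record -e xfs:xfs_swap_extent_before -e xfs:xfs_swap_extent_after
# afterwards, turn the binary trace into readable text
trace-cmd report > trace.out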

> > > I haven't seen a crash without xfs_fsr.
> > 
> > Then don't use xfs_fsr until we know if it is the cause of the
> > problem (except to reproduce the problem).
> 
> Why so defensive? xfs_fsr is an advertised feature and should just work

Defensive? Sure - to protect -your systems- from further corruption
problems until we know what the problem is.

To use a car analogy: I know the brakes on your car have a fault
that could cause a catastrophic failure, and I know you are taking a
drive over a mountain. Don't you think I should tell you not to
drive your car over the mountain, but to get the brakes looked at
first?

But it's your data, so if you want to risk catastrophic corruption
by continuing to run xfs_fsr then that's your choice.

> (and does so with older kernels).

On older kernels (2.6.34 and earlier) I can corrupt filesystems
using xfs-fsr just by crafting a file with a specific layout. It's
easy and doesn't require any special privileges to do. IOWs, xfs_fsr
on old kernels is actually dangerous and should not be used if you
have anything that stores information in attributes (like selinux).
We made quite a lot of fixes to the swap extent code to fix those
problems, along with regression tests so the problem doesn't arise
again.

It's entirely possible that a problem was introduced by these fixes.
Perhaps there's a case that I didn't fully understand and fix
properly or there's some other as yet unknown problem. Until I know
what it is, then the safest thing is not to run xfs_fsr.  Indeed, if
you get new corruptions showing up without running xfs_fsr, then
that's also something worth knowing.

> > And as I always ask - why do you need to run xfs_fsr so often?  Do
> 
> Did I say I am running it often? It typically runs once a day for an hour.

Yes, that is often. I don't run xfs_fsr at all on any of my
machines (except for the test VMs when testing xfs_fsr).

The problem with running xfs_fsr is that while it defragments files,
it fragments free space, i.e. xfs_fsr turns large contiguous free
space ranges into smaller, non-contiguous free space ranges.
IOWs, using xfs_fsr accelerates filesystem aging effects, meaning
that new files are much more likely to get fragmented as they
grow because they cannot be located in large contiguous free space
extents. Then you run xfs_fsr to reduce the number of fragments in
the file, thereby converting free space into more smaller, less
contiguous extents. It's a downward spiral....

That's why running xfs_fsr regularly out of a cron job is not
advisable. This lesson was learned on Irix more than 10 years ago, when
it was defaulted to running once a week for two hours on Sunday
night.  Running it more frequently, as is happening on your systems,
will only make things worse.
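
(For reference, that old weekly Irix-style schedule corresponds roughly to
a crontab entry like the one below - the path is just an example, and -t
limits the run time in seconds:)

# defragment for at most two hours, early on Sunday mornings
0 2 * * 0    /usr/sbin/xfs_fsr -t 7200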

FWIW, this comes up often enough that I think I need to add a FAQ
entry for it.

> > you really have filesystems that get quickly fragmented (or are you
> 
> Yes, fragmentation with XFS is enormous - I have yet to see whether
> the changes in recent kernels make a big difference, but for log files,
> reading through a log file with 60000 fragments tends to be much slower
> than reading through one with just a few fragments (or just one...).

So you've got a problem with append only workloads.

2.6.38 and more recent kernels should be much more resistant to
fragmentation under such conditions, thanks to the dynamic
speculative allocation changes that went into 2.6.38.

Alternatively, you can use the allocsize mount option, or set the
append-only inode flag, or set the preallocated flag on the inode
so that truncation of speculative allocation beyond EOF doesn't
occur every time the file is closed.
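
As a rough sketch of the first two alternatives (device, mount point and
file name below are placeholders only):

# mount with a fixed, larger speculative preallocation size
mount -o allocsize=64m /dev/sdXN /data

# mark a slowly growing log file append-only
chattr +a /data/logs/app.log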

.....

> > If you are getting fragmentation, what is the workload that is causing
> > it?
> 
> Basically, anything but the OS itself. Copying large video files while the
> disk is busy with other things causes lots of fragmentation (usually 30
> fragments for a 100mb file), which in turn slows down things enormously once
> the disk reaches 95% full.

Another oft-repeated rule of thumb - filling XFS filesystems over
85-90% full causes increased fragmentation because of the lack of
large contiguous free space extents. That's exactly the same problem
that excessive use of xfs_fsr causes.....

> Freenet is also a good test case.

Not for a filesystem developer. Running internet facing, anonymous,
encrypted peer-to-peer file storage servers anywhere is not
something I'll ever do on my network.

If you think it's a good workload that we should use, then capture a
typical directory profile and the IO/filesystem operations made on a
busy server for an hour or so. Then write a script to reproduce that
directory structure and IO pattern.....

> As are logfiles.
> 
> Or a news spool.

append only workloads.

> Or database files for databases that grow files (such as mysql myisam) -
> fortunately I could move of all those to SSDs this year.

I thought mysql was capable of preallocating regions when files grow.
Perhaps it isn't configured to do so?

> Or simply unpacking an archive.

That should not cause fragmentation unless you have already
fragmented free space...

Use xfs_db -r -c "freesp -s" <dev> to get an idea of what your
freespace situation looks like.
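
For example (the device name is just a placeholder; the -s summary prints
a histogram of free space extents by size, which is what shows whether
free space is fragmented):

# xfs_db -r -c "freesp -s" /dev/sdb1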

> Simple example - the www.deliantra.net gameserver writes logs to a logfile
> and stdout, which is redirected to another logfile in the same directory
> (which gets truncated on each restart).
> 
> Today I had to reboot the server because of buggy xfs (which prompted the
> bugreport, as I am seeing this bug for a while now, but so far didn't want
> to exclude e.g. bad ram or simply a corrupt filesystem), and in the 4
> hours uptime, I got a 4MB logfile with 8 fragments.

What kernel, and what is the xfs_bmap -vp output for the file?

> This is clearly an improvement over the 2.6.26 kernel I used before on
> that machine. But over a few months this still leads to thousands of
> fragments,

Have you seen this, or are you extrapolating from the 4MB file
you've seen above?

....

> Freenet fares much worse. The persistent blob has 1757 fragments for 13gb
> (not that bad), and the download database has 22756 for 600mb, fragments
> (that sucks).

You're still talking about how 2.6.26 kernels behave, right?

> On my tv, the recorded video files that haven't been defragmented yet
> have between 11 and 63 fragments (all smaller than 2gb), which is almost
> acceptable, but I do not think that without a regular xfs_fsr the fs would
> be in that good shape after one or two years of usage.

For old kernels, allocsize should have mostly solved that problem.
For current kernels that shouldn't even be necessary.

> The cool thing about xfs_fsr is not that the cool kids run it, but that,
> unlike other filesystems that also fragment a lot (ext3 is absolutely
> horrible for example), it can mostly be fixed.

"fixed" is not really true - all it has done is trade file
fragementation for freespace fragementation. That bites you
eventually.

> Given that xfs is clearly the lowest quality of the common filesystems
> on linux (which I mean to be reiserfs, ext2/3/4 - and before you ask,
> literally each time I run a file system check xfs_repair crashes or hangs,
> and the filesystems have some issues, on all my numerous machines, and
> the number of bugs I have hit with xfs is easily twice the amount of
> bugs I hit with reiserfs and extX together, and I was an early adopter
> of reiserfs, before it even had a fsck), it is important to have some
> features left that cancel this general lack of quality.

Quality will only improve if you report bugs and help trace their
root cause. Then we can fix them.  If you don't, we don't know about
them, can't find them and hence can't fix them.

> Right now, these features for me are the very tunable nature of xfs (for
> example, 512b block size for news spools), the very fast xfs_repair and
> the long-term maintainability of the filesystem - a heavily used ext3
> filesystem basically becomes unusable after a year.
> 
> Another feature was the very good feedback I got from this list in the
> past w.r.t. bugs and fixes (while nowadays I have to listen to "xfs is
> optimised for nfs not for your use" or "then don't use it" replies to bug
> reports).

<sigh>

Ok, now I remember you. I hope this time you'll provide me with the
information I ask you for to triage your problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-07 10:26     ` Dave Chinner
@ 2011-08-08 19:02       ` Marc Lehmann
  2011-08-09 10:10         ` Michael Monnerie
  2011-08-09  9:16       ` Marc Lehmann
  1 sibling, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-08 19:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sun, Aug 07, 2011 at 08:26:25PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> Use trace-cmd or do it manually via:
> 
> # echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_before/enable
> # echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_after/enable
> # cat /sys/kernel/debug/tracing/trace_pipe > trace.out

Thanks, I'll have a look at enabling this with a regular xfs_fsr on a few
machines.

> To use a car analogy: I know the brakes on your car have a fault
> that could cause a catastrophic failure, and I know you are taking a
> drive over a mountain. Don't you think I should tell you not to
> drive your car over the mountain, but to get the brakes looked at
> first?

To take your car analogy - if I went to my car dealer and told him my
brakes just malfunctioned, but fortunately it was uphill and I could
safely stop with my handbrake, he would most decidedly not reply with
"then don't use your car".

No, he would presumably offer me to take back the car and replace the
brakes, for free.

I am not sure what you want to say with your analogy, but it doesn't seem
to be sensible.

> > (and does so with older kernels).
> 
> On older kernels (2.6.34 and earlier) I can corrupt filesystems
> using xfs-fsr just by crafting a file with a specific layout.

Wow, and it's not mentioned anywhere in the status updates, unlike all
those nice performance upgrades, especially those dirty NFS hacks.

Yes, I am a bit sarcastic, but this corruption bug is either pretty
harmless or the XFS team is really somewhat irresponsible in not giving
out information about this harmful bug.

> easy and doesn't require any special privileges to do.

Wow, so any kernel before 2.6.34 can have its XFS corrupted by an
untrusted user?

Seriously, shouldn't this be mentioned at least in the FAQ or somewhere
else?

> IOWs, xfs_fsr on old kernels is actually dangerous and should not be
> used if you

Logic error - if I can corrupt an XFS without special privileges then
this is not a problem with xfs_fsr, but simply a kernel bug in the xfs
code. And a rather big one, one step below a remote exploit.

> The problem with running xfs_fsr is that while it defragments files,
> it fragments free space, i.e. xfs_fsr turns large contiguous free

While that is true in *some* cases, it can also be countered in userspace,
and will not happen if files get removed regularly, e.g. for a cache
partition.

However, if you have those famous append-style loads, and this causes files
to have thousands of fragments, these are most likely interleaved with other
files.

xfs_fsr can, if it manages to defragment the file completely (which is
the norm in my case), introduce at most one fragment, while, in the case
of non-static files, it will likely remove thousands of small free space
fragments.

Sure, xfs_fsr can be detrimental, but so can doing nothing, letting your
disk get full accidentally, and many other actions.

There is definitely no clear-cut "xfs_fsr causes your fs to deteriorate",
and as always, you have to know what you are doing.

> That's why running xfs_fsr regularly out of a cron job is not
> advisable. This lesson was learned on Irix more than 10 years ago, when
> it was defaulted to running once a week for two hours on Sunday
> night.  Running it more frequently, as is happening on your systems,
> will only make things worse.

Yes, I remember that change - however, running it once a week versus daily
is not a big difference. Quite obviously, the difference in workloads can
and will easily dominate any difference in effects.

And to me, it doesn't make a difference if xfs_fsr causes a crash every
week or every other month.

> FWIW, this comes up often enough that I think I need to add a FAQ
> entry for it.

Yes, that's a good idea in any case.

> > > you really have filesystems that get quickly fragmented (or are you
> > 
> > Yes, fragmentation with XFS is enormous - I have yet to see whether
> > the changes in recent kernels make a big difference, but for log files,
> > reading through a log file with 60000 fragments tends to be much slower
> > than reading through one with just a few fragments (or just one...).
> 
> So you've got a problem with append only workloads.

Basically everything is append-only on Unix, because preallocating files
isn't done except by special tools really, and the only way to
create file contents is to append (well, you can do random writes, as e.g.
vmware does, which causes havoc with XFS, but that's just a stupid way to
create files...).

> 2.6.38 and more recent kernels should be much more resistant to
> fragmentation under such conditions thanks to the dynamic
> speculative allocation changes that went into 2.6.38.

I would tend to agree.

> Alternatively, you can use the allocsize mount option, or set the

Well, not long ago somebody (you) told me that the allocsize option is
designed to eat all disk space on servers with lots of RAM because of an
NFS optimisation hack that didn't go into the NFS server but into the
filesystem.

Has this been redesigned (I would say, fixed)?

> append-only inode flag, or set the preallocated flag on the inode
> so that truncation of speculative allocation beyond EOF doesn't
> occur every time the file is closed.

Or use ext4, which fares much better without having to patch programs.

> > Basically, anything but the OS itself. Copying large video files while the
> > disk is busy with other things causes lots of fragmentation (usually 30
> > fragments for a 100mb file), which in turn slows down things enormously once
> > the disk reaches 95% full.
> 
> Another oft-repeated rule of thumb - filling XFS filesystems over
> 85-90% full causes increased fragmentation because of the lack of
> large contiguous free space extents. That's exactly the same problem
> that excessive use of xfs_fsr causes.....

On a 39% full disk (my examples)?

> > Freenet is also a good test case.
> 
> Not for a filesystem developer. Running internet facing, anonymous,
> encrypted peer-to-peer file storage servers anywhere is not
> something I'll ever do on my network.

You are entitled to your political opinions, but why poison a purely
technical discussion with them?

Based on technical merits, freenet is a very good test case, because it
causes all kinds of I/O patterns. Your personal opinions on politics or laws
or whatever don't make it a bad test case, just something _you_ don't want
to use yourself (which is OK).

Claiming it is a bad test case based on your political views is just
unprofessional.

> If you think it's a good workload that we should use, then capture a
> typical directory profile and the IO/filesystem operations made on a
> busy server for an hour or so. Then write a script to reproduce that
> directory structure and IO pattern.....

I'll consider it, but it is a major commitment of work time that I might
not be able to make.

> > Or a news spool.
> 
> append only workloads.

Or anything else that creates files, i.e. *everything*.

A news spool is extremely different from logfiles - files are static and
never appended to after they have been created. They do get deleted in
irregular order, and can cause lots of free space fragmentation.

Calling everything an "append only" workload is not very useful. If XFS is
bad at append-only workloads, which is *the* most common type of workload,
then XFS fails to be very relevant for the real world.

> > Or database files for databases that grow files (such as mysql myisam) -
> > fortunately I could move of all those to SSDs this year.
> 
> I thought mysql was capable of preallocating regions when files grow.

It's not. Maybe the effect isn't so bad on most filesystems (it certainly
isn't so bad on ext4):

-rw-rw---- 1 mysql mysql 3665891328 Aug  8 20:00 art.MYI
-rw------- 1 mysql mysql 2328898560 Aug  8 17:45 file.MYI
-rw-rw---- 1 mysql mysql 1098302464 Aug  8 17:45 image.MYI

art.MYI: 38 extents found
file.MYI: 20 extents found
image.MYI: 10 extents found

That's after about 12 months of usage, during which time the file sizes
grew by about 50%.
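
(For reference, the "N extents found" lines above are the format that
filefrag prints, so the numbers were presumably gathered with something
like the following - file names taken from the listing above:)

$ filefrag art.MYI file.MYI image.MYI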

> > Or simply unpacking an archive.
> 
> That should not cause fragmentation unless you have already
> fragmented free space...

I even get multiple fragments for lots of files when unpacking a big (>>
memory) tar on a freshly mkfs'ed filesystem. It's mostly 2-3 fragments,
affects maybe 5% of the files, and might not be a real issue, but
fragmentation it is.

> Use xfs_db -r -c "freesp -s" <dev> to get an idea of what your
> freespace situation looks like.

FWIW, this is on the disk with the 22k-fragment 650MB freenet database:

   http://ue.tst.eu/edc5324f68b98076c9419ab0267ad9d6.txt

> > Today I had to reboot the server because of buggy xfs (which prompted the
> > bugreport, as I am seeing this bug for a while now, but so far didn't want
> > to exclude e.g. bad ram or simply a corrupt filesystem), and in the 4
> > hours uptime, I got a 4MB logfile with 8 fragments.
> 
> What kernel, and what is the xfs_bmap -vp output for the file?

2.6.39-2, and the crash took it with it :/

> > This is clearly an improvement over the 2.6.26 kernel I used before on
> > that machine. But over a few months this still leads to thousands of
> > fragments,
> 
> Have you seen this, or are you extrapolating from the 4MB file
> you've seen above?

These logfiles in particular had over 60000 fragments each (60k, not 6k)
before I started to regularly xfs_fsr them. Grepping through them took
almost an hour; now it takes less than a minute.

> > Freenet fares much worse. The persistent blob has 1757 fragments for 13gb
> > (not that bad), and the download database has 22756 for 600mb, fragments
> > (that sucks).
> 
> You're still talking about how 2.6.26 kernels behave, right?

No, that's with either 3.0.0-rc4/5/6 or 2.6.39-2. I am running 3.0.0-1 now
for other reasons.

> > On my tv, the recorded video files that haven't been defragmented yet
> > have between 11 and 63 fragments (all smaller than 2gb), which is almost
> > acceptable, but I do not think that without a regular xfs_fsr the fs would
> > be in that good shape after one or two years of usage.
> 
> For old kernels, allocsize should have mostly solved that problem.
> For current kernels that shouldn't even be necessary.

Yeah, I used allocsize=64m on all those storage filesystems. It certainly
helped with the video fragmentation.

> > The cool thing about xfs_fsr is not that the cool kids run it, but that,
> > unlike other filesystems that also fragment a lot (ext3 is absolutely
> > horrible for example), it can mostly be fixed.
> 
> "fixed" is not really true - all it has done is trade file
> fragementation for freespace fragementation. That bites you
> eventually.

No, it might bite me, but that very much depends on the type of files. A
news spool mostly has two sizes of files, for example, so it would be
surprising if that bit me.

> Quality will only improve if you report bugs and help trace their
> root cause. Then we can fix them.  If you don't, we don't know about
> them, can't find them and hence can't fix them.

You are preaching to the wrong person, and this is not very
encouraging. In the past, I often sought the wisdom of this list, and
got good replies (and bugfixes).

It would have helped tremendously if the obfuscation option actually
worked - which is the main reason why I sometimes can't provide metadumps.
In this case, I can, because there is nothing problematic on those
filesystems.

> Ok, now I remember you. I hope this time you'll provide me with the
> information I ask you for to triage your problem....

Sorry, but this is not the way you get people to help. I *always* provided
all information that I could provide and was asked for.

You are now pretending that I didn't do that in the past. That's both
insulting and frustrating - to me, it means I can just stop interacting
with you - quite obviously, you are asking for the impossible.

I can understand if you dislike negative but true comments about XFS,
but that's not a reason to misrepresent my contributions to tracking down
problems.

Or to put it differently, instead of making vague accusations, what
exactly did you ask for that I could provide, but didn't? Can you back up
your statement?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-07 10:26     ` Dave Chinner
  2011-08-08 19:02       ` Marc Lehmann
@ 2011-08-09  9:16       ` Marc Lehmann
  2011-08-09 11:35         ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-09  9:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

This just in - this was on screen; xfs_fsr was active at the time, and the
kernel is tainted:

[248359.646330] CPU 1
[248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
[248359.646323] Oops: 0000 [#1] SMP
[248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
[248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
[248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-08 19:02       ` Marc Lehmann
@ 2011-08-09 10:10         ` Michael Monnerie
  2011-08-09 11:15           ` Marc Lehmann
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Monnerie @ 2011-08-09 10:10 UTC (permalink / raw)
  To: xfs; +Cc: Marc Lehmann



On Monday, 8 August 2011, Marc Lehmann wrote:

First of all, please calm down. Getting personal is not getting us
anywhere.

> > On older kernels (2.6.34 and earlier) I can corrupt filesystems
> > using xfs-fsr just by crafting a file with a specific layout.
[snip]
> > IOWs, xfs_fsr on old kernels is actually dangerous and should not
> > be used if you
> 
> Logic error - if I can corrupt an XFS without special privileges then
> this is not a problem with xfs_fsr, but simply a kernel bug in the
> xfs code. And a rather big one, one step below a remote exploit.

No, it's not a kernel bug because as long as you don't use xfs_fsr, 
nothing will ever happen.

And the rest of the mail goes into lots of details which look very
strange to me. I've double-checked with our servers, which generally
have these xfs mount options:
(rw,nodiratime,relatime,logbufs=8,logbsize=256k,attr2,barrier,largeio,swalloc)
and sometimes also
,allocsize=64m
and I can't find evidence for fragmentation that would be harmful. Yes,
they are fragmented, of course. When you write to ~500 log files at a time
via syslog (as we do on some servers), there must be some fragmentation.
The allocsize option helps a lot there. I looked at one webserver access
log; it has 640MB with 99 fragments, but that's not a lot. On our
Spamgate I see 250MB logs with 374 fragments. That's a bit more, but we
don't use the allocsize option there, which I changed now that I looked
at it ;-)

But your words
> If XFS is bad at append-only workloads, which is the most common type
> of workload, then XFS fails to be very relevant for the real world.

may be valid for your world, not mine. We have webservers, fileservers 
and database servers, all of which are not really append style, but more 
delete-and-recreate. Well, db-servers are rather exceptional here.

Append style is mostly for log files, at least on our servers.

But if the numbers for fragmentation on your servers are true, you must 
have a very good test case for fragmentation prevention. Therefore it 
could be really interesting if you could grab what Dave Chinner asked 
for:
> If you think it's a good workload that we should use, then capture a
> typical directory profile and the IO/filesystem operations made on a
> busy server for an hour or so. Then write a script to reproduce that
> directory structure and IO pattern.....

And maybe he could use it for optimizations. Is there any tool on Linux
to record such I/O patterns? It would need to capture all metadata and data
operations for a partition to be interesting.

-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [gesprochen: Prot-e-schee]
Tel: +43 660 / 415 6531

// Haus zu verkaufen: http://zmi.at/langegg/


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09 10:10         ` Michael Monnerie
@ 2011-08-09 11:15           ` Marc Lehmann
  2011-08-10  6:59             ` Michael Monnerie
  2011-08-10 14:16             ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Marc Lehmann @ 2011-08-09 11:15 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

On Tue, Aug 09, 2011 at 12:10:48PM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> First of all, please calm down. Getting personal is not bringing us 
> anywhere.

Well, it's not me who's getting personal, so...?

> > Logic error - if I can corrupt an XFS without special privileges then
> > this is not a problem with xfs_fsr, but simply a kernel bug in the
> > xfs code. And a rather big one, one step below a remote exploit.
> 
> No, it's not a kernel bug because as long as you don't use xfs_fsr, 
> nothing will ever happen.

"As long as you don't boot, it will not crash".

xfs_fsr uses syscalls, just like other applications. According to your
(wrong) logic, if an application uses chown and this causes a kernel oops,
this is also not a kernel bug.

That's of course wrong - it's the kernel that crashes when an application
performs certain access patterns.

> (rw,nodiratime,relatime,logbufs=8,logbsize=256k,attr2,barrier,largeio,swalloc)
> and sometimes also 
> ,allocsize=64m

As has been reported on this list, this option is really harmful on
current XFS - in my case, it led to XFS causing ENOSPC even when the disk
was 40% empty (~188GB).

> and I can't find evidence for fragmentation that would be harmful. Yes,

Well, define "harmful" - slow logfile reads aren't what I consider
"harmful" either. It's just very, very slow.

> The allocsize option helps a lot there. I looked at one webserver access 
> log, it has 640MB with 99 fragments, but that's not a lot. On our 
> Spamgate I see 250MB logs with 374 fragments.

Well, if it were one fragment, you could read that in 4-5 seconds; at 374
fragments, it's probably around 6-7 seconds. That's not harmful, but if you
extrapolate this to a few gigabytes and a lot of files, it becomes quite
the overhead.

> don't use the allocsize option there, which I changed now that I looked 

That allocsize option is no longer reasonable with newer kernels, as the
kernel will reserve 64MB of disk space even for 1KB files, indefinitely.

> > If XFS is bad at append-only workloads, which is the most common type
> > of workload, then XFS fails to be very relevant for the real world.
> 
> may be valid for your world, not mine. We have webservers, fileservers 
> and database servers, all of which are not really append style, but more 
> delete-and-recreate.

If you find a way of recreating files without appending to them, let me
know.

The problem with fragmentation is that it happens even with only a few
writers for "create file" workloads (which do append...).

You probably make a distinction between "writing a file fast" and "writing
a file slowly", but the distinction is not a qualitative difference. On busy
servers that create a lot of files, you get fragmentation the same way
as on less busy servers that write files more slowly. There is little to no
difference in the resulting patterns.

> Well, db-servers are rather exceptional here.

Yes, append style is what makes up the vast majority of disk writes on
a normal system - db-servers excepted, indeed.

> But if the numbers for fragmentation on your servers are true, you must 
> have a very good test case for fragmentation prevention. Therefore it 
> could be really interesting if you could grab what Dave Chinner asked 
> for:

I'll keep it in mind.

> And maybe he could use it for optimizations. Is there any tool on Linux 
> to record such I/O patterns?

I presume strace would do, but that's where the "lot of work" comes in. If
there is a ready-to-use tool, that would of course make it easy.
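
Something along these lines might be a starting point (a sketch only - the
PID is a placeholder, and it captures the syscall level, not the resulting
block I/O):

# log file- and fd-related syscalls, with timestamps, from a running daemon
strace -f -tt -T -e trace=file,desc -o /tmp/io-pattern.log -p <pid>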

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09  9:16       ` Marc Lehmann
@ 2011-08-09 11:35         ` Dave Chinner
  2011-08-09 16:35           ` Marc Lehmann
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2011-08-09 11:35 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Tue, Aug 09, 2011 at 11:16:43AM +0200, Marc Lehmann wrote:
> This just in, this was on screen, xfs_fsr was active at the time, kernel
> is tainted:
> 
> [248359.646330] CPU 1
> [248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
> [248359.646323] Oops: 0000 [#1] SMP
> [248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
> [248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
> [248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

And the event trace to go along with the xfs-fsr run?

I don't need to know the dmesg output - I need the information in
the event trace from the xfs-fsr run when the problem occurs....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09 11:35         ` Dave Chinner
@ 2011-08-09 16:35           ` Marc Lehmann
  2011-08-09 22:31             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-09 16:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> > [248359.646330] CPU 1
> > [248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
> > [248359.646323] Oops: 0000 [#1] SMP
> > [248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
> > [248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
> > [248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> 
> And the event trace to go along with the xfs-fsr run?

It wasn't enabled yet - I didn't expect it to lock up so soon - but even if
it had been, we would have to wait for those rare occurrences where the
kernel oopses without the box locking up (which can take months).

> I don't need to know the dmesg output - I need the information in
> the event trace from the xfs-fsr run when the problem occurs....

And I need an XFS that doesn't oops and take the box with it to deliver
that :)

In any case, I am confident it will happen sooner or later.

I will not send any kernel oopses then, although I had hoped that 0-ptr
dereferences in a specific part of a function could have been a good hint.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09 16:35           ` Marc Lehmann
@ 2011-08-09 22:31             ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2011-08-09 22:31 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Tue, Aug 09, 2011 at 06:35:25PM +0200, Marc Lehmann wrote:
> > > [248359.646330] CPU 1
> > > [248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
> > > [248359.646323] Oops: 0000 [#1] SMP
> > > [248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
> > > [248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
> > > [248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> > 
> > And the event trace to go along with the xfs-fsr run?
> 
> It wasn't enabled yet, I didn't expect it to lock up so soon, but even if,
> we would have to wait for those rare occurances where the kernel oopses
> without the box locking up (can take months).
> 
> > I don't need to know the dmesg output - I need the information in
> > the event trace from the xfs-fsr run when the problem occurs....
> 
> And I need an XFS that doesn't oops and takes the box with it to deliver
> that :)
> 
> In any case, I am confident it will happen sooner or later.
> 
> I will then not send any kernel oopses, although I had hoped that 0-ptr
> dereferences in a specific part of a function could have been a good hint.

They tell me where the crash occurred - they don't tell me the root
cause of the problem. Understanding the root cause and fixing that
is more important than putting a band-aid over the resultant panic
(which I'll probably do anyway at the same time).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09 11:15           ` Marc Lehmann
@ 2011-08-10  6:59             ` Michael Monnerie
  2011-08-11 22:04               ` Marc Lehmann
  2011-08-10 14:16             ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Monnerie @ 2011-08-10  6:59 UTC (permalink / raw)
  To: xfs; +Cc: Marc Lehmann



On Tuesday, 9 August 2011, Marc Lehmann wrote:
> On Tue, Aug 09, 2011 at 12:10:48PM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> > First of all, please calm down. Getting personal is not bringing us
> > anywhere.
> 
> Well, it's not me who's getting personal, so...?

A single rant from a dev shouldn't hurt one too much. He might have been
sitting in front of some code for 72 hours, his eyes already in 16:9 format
from staring at a weird bug... It's OK to strike back once, but then be
cool again and work on the problem.
 
> As has been reported on this list, this option is really harmful on
> current xfs - in my case, it led to xfs causing ENOSPC even when the
> disk was 40% empty (~188gb).

Was this the "NFS optimization" stuff? I don't like that either.
 
> Well, if it were one fragment, you could read that in 4-5 seconds, at
> 374 fragments, it's probably around 6-7 seconds. That's not harmful,
> but if you extrapolate this to a few gigabytes and a lot of files,
> it becomes quite the overhead.

True, if you have to read tons of log files all day. That's not my 
normal use case, so I didn't bother about that until now.

> That allocsize option is no longer reasonable with newer kernels, as
> the kernel will reserve 64m diskspace even for 1kb files
> indefinitely.

Just "as long as the inode is cached" or something, I remember that 
"echo 3 >drop_caches" cleans that up. Still ugly, I'd say.
 
> If you find a way of recreating files without appending to them, let
> me know.

Seems we have a different meaning of "append". For me, append is when an
existing file is re-opened and data is added just to the end of it.
 
> > And maybe he could use it for optimizations. Is there any tool on
> > Linux to record such I/O patterns?
> 
> I presume strace would do, but that's where the "lot of work" comes
> in. If there is a ready-to-use tool, that would of course make it
> easy.

It's a pity that such a generic tool doesn't exist. I can't believe
that. Doesn't anybody have such a tool at hand?

-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [gesprochen: Prot-e-schee]
Tel: +43 660 / 415 6531

// Haus zu verkaufen: http://zmi.at/langegg/


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09 11:15           ` Marc Lehmann
  2011-08-10  6:59             ` Michael Monnerie
@ 2011-08-10 14:16             ` Dave Chinner
  2011-08-11 22:07               ` Marc Lehmann
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2011-08-10 14:16 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Michael Monnerie, xfs

On Tue, Aug 09, 2011 at 01:15:27PM +0200, Marc Lehmann wrote:
> On Tue, Aug 09, 2011 at 12:10:48PM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> > (rw,nodiratime,relatime,logbufs=8,logbsize=256k,attr2,barrier,largeio,swalloc)
> > and sometimes also 
> > ,allocsize=64m
> 
> As has been reported on this list, this option is really harmful on
> current xfs - in my case, it led to xfs causing ENOSPC even when the disk
> was 40% empty (~188gb).

Seeing as you keep stating this is a problem, I'll ask again whether
commit 778e24b ("xfs: reset inode per-lifetime state when recycling
it") fixed this problem for you?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-10  6:59             ` Michael Monnerie
@ 2011-08-11 22:04               ` Marc Lehmann
  2011-08-12  4:05                 ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-11 22:04 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

On Wed, Aug 10, 2011 at 08:59:26AM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> > current xfs - in my case, it led to xfs causing ENOSPC even when the
> > disk was 40% empty (~188gb).
> 
> Was this the "NFS optimization" stuff? I don't like that either.

The NFS server apparently opens and closes files very often (probably on
every read/write or so, I don't know the details), so XFS was
benchmark-improved by keeping the preallocation as long as the inode is in
memory.

Practical example: on my box (8GB RAM), I upgraded the kernel and started a
buildroot build. When I came back 8 hours later, the disk was full (some
hundreds of gigabytes), even though df showed 300GB or so of free space.

That was caused by me setting allocsize=64m, and this causing every 3KB
object file to use 64MB of disk space (which du showed, but df didn't).
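
(You can see the effect on an individual file by comparing logical size
with allocated space - the file name here is just a hypothetical example:)

$ ls -l foo.o    # logical size: a few KB
$ du -h foo.o    # allocated space: ~64MB while the preallocation is held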

To me, that's an obvious bug and a dirty hack (you shouldn't fix the NFS
server by hacking some band-aid into XFS), but to my surprise I was told
on this list that this is important for performance, that my use case
isn't what XFS is designed for, and that XFS is designed for good NFS
server performance.

> > Well, if it were one fragment, you could read that in 4-5 seconds, at
> > 374 fragments, it's probably around 6-7 seconds. That's not harmful,
> > but if you extrapolate this to a few gigabytes and a lot of files,
> > it becomes quite the overhead.
> 
> True, if you have to read tons of log files all day. That's not my 
> normal use case, so I didn't bother about that until now.

I am well aware that there are lots of different use cases. I see that
myself because I have such diverse usage across my disks and servers
(desktop, media server, news server, web server, game server... all quite
different).

It's clear that XFS can't handle all this magically, and that this is not
a problem in XFS itself; what I do find a bit scary is this "XFS is not
made for you" attitude that I was recently confronted with.

> Just "as long as the inode is cached" or something, I remember that 
> "echo 3 >drop_caches" cleans that up. Still ugly, I'd say.

Yeah, the more RAM you have, the more disk space is lost.

> > If you find a way of recreating files without appending to them, let
> > me know.
> 
> Seems we have a different meaning of "append". For me, append is when an 
> existing file is re-opened, and data added just to the end of it.

That rules out many, if not most, log file write patterns, which are
classical examples of "append workloads" - most apps do not reopen log
files, they create/open them once and then write to them, often, but
always relatively slowly.

Syslog is a good example of something that wouldn't be an "append"
according to your definition, but typically is seen as such.

Speed is really the only differentiating factor between "append" and
"create only", and in practice a filesystem can only catch this by seeing
if something is still in RAM ("recent use, fast writes") or not, or by
keeping this information on-disk (which can be a dangerous trade-off).

And yes, your definition is valid - I don't think there is an obvious
consensus on which is used, but I think my definition (which includes log
files) is more common.

> > I presume strace would do, but that's where the "lot of work" comes
> > in. If there is a ready-to-use tool, that would of course make it
> > easy.
> 
> It's a pity that such a generic tool doesn't exist. I can't believe
> that. Doesn't anybody have such a tool at hand?

Yeah, I'm listening :) I hope it doesn't boil down to an instrumented
kernel :(

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-10 14:16             ` Dave Chinner
@ 2011-08-11 22:07               ` Marc Lehmann
  0 siblings, 0 replies; 18+ messages in thread
From: Marc Lehmann @ 2011-08-11 22:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michael Monnerie, xfs

On Thu, Aug 11, 2011 at 12:16:19AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > As has been reported on this list, this option is really harmful on
> > current xfs - in my case, it lead to xfs causing ENOSPC even when the disk
> > was 40% empty (~188gb).
> 
> Seeing you keep stating this is a problem,

I can only go by what _you_ told me earlier, namely that this works as
designed and no change is needed. If you changed your mind without telling
me, how should I find out?

If you say one thing and do another, you shouldn't be surprised when
people trust you and go by what you say.

> commit 778e24b ("xfs: reset inode per-lifetime state when recycling
> it") fixed this problem for you?

If you tell me in which kernel version this is included, I can find out
easily.
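
For reference, with a checkout of the mainline git tree this is easy
enough to answer locally (commit id as quoted above):

   $ git describe --contains 778e24b   # first tag containing the commit
   $ git tag --contains 778e24b        # all tags that contain it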

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-11 22:04               ` Marc Lehmann
@ 2011-08-12  4:05                 ` Dave Chinner
  2011-08-26  8:08                   ` Marc Lehmann
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2011-08-12  4:05 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Michael Monnerie, xfs

On Fri, Aug 12, 2011 at 12:04:19AM +0200, Marc Lehmann wrote:
> On Wed, Aug 10, 2011 at 08:59:26AM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> > > current xfs - in my case, it lead to xfs causing ENOSPC even when the
> > > disk was 40% empty (~188gb).
> > 
> > Was this the "NFS optimization" stuff? I don't like that either.
> 
> The NFS server apparently opens and closes files very often (probably on
> every read/write or so, I don't know the details), so XFS was
> benchmark-improved by keeping the preallocation as long as the inode is in
> memory.

It only does that if the pattern of writes is such that keeping the
preallocation around for longer periods of time will reduce
potential fragmentation.  Indeed, it's not an NFS-specific
optimisation, but it is one that directly benefits NFS server IO
patterns.

e.g. it can also help reduce fragmentation on slow append-only
workloads if the necessary conditions are triggered by the log
writers (which is the other problem you are complaining noisily
about). Given that inodes for log files will almost always remain in
memory as they are regularly referenced, it seems like the right
solution to that problem, too...
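
One way to poke at this behaviour directly is xfs_io, which opens, writes
and closes the file on every invocation; comparing the apparent size with
the allocated blocks afterwards gives a hint whether post-EOF
preallocation was kept (the path is a placeholder, and the numbers depend
on kernel version and mount options):

   $ xfs_io -f -c 'pwrite 0 64k' /mnt/test/append.log
   $ xfs_io -c 'pwrite 64k 64k' /mnt/test/append.log
   $ stat -c 'size=%s, allocated=%b blocks' /mnt/test/append.log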

FWIW, you make it sound like "benchmark-improved" is a bad thing.
However, I don't hear you complaining about the delayed logging
optimisations at all. I'll let you in on a dirty little secret: I
tested delayed logging on nothing but benchmarks - it is -entirely-
a "benchmark-improved" class optimisation.

But despite how delayed logging was developed and optimised, it
has significant real-world impact on performance under many
different workloads. That's because the benchmarks I use accurately
model the workloads that cause the problem that needs to be solved.

Similarly, the "NFS optimisation" in a significant and measurable
reduction in fragmentation on NFS-exported XFS filesystems across a
wide range of workloads. It's a major win in the real world - I
just wish I had of thought of it 4 or 5 years ago back when I was at
SGI when we first started seeing serious NFS related fragmentation
problems at customer sites.

Yes, there have been regressions caused by both changes (though
delayed logging had far more serious ones) - that's a
fact of life in software development. However, the existence of
regressions does not take anything away from the significant
real-world improvements that are the result of the changes.

> > > I presume strace would do, but thats where the "lot of work" comes
> > > in. If there is a ready-to-use tool, that would of course make it
> > > easy.
> > 
> > It's a pity that such a generic tool doesn't existing. I can't believe 
> > that. Doesn't anybody have such a tool at hand?
> 
> Yeah, I'm listening :) I hope it doesn't boil down to an instrumented
> kernel :(

GFGI.

http://code.google.com/p/ioapps/wiki/ioreplay

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-12  4:05                 ` Dave Chinner
@ 2011-08-26  8:08                   ` Marc Lehmann
  2011-08-31 12:45                     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-26  8:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Aug 12, 2011 at 02:05:30PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> It only does that if the pattern of writes are such that keeping the
> preallocation around for longer periods of time will reduce
> potential fragmentation.

That can only be false. Here is an example that I saw *just now*:

I have a process that takes a directory with jpg files (in this case,
all around 64kb in size) and losslessly recompresses them. This works
by reading a file, writing it under another name (single write() call)
and using rename to replace the original file *iff* it got smaller. The
typical reduction is 5%. No allocsize option is used. The kernel used
was 2.6.39.
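
A minimal sketch of that loop, with the recompression tool and the paths
as placeholders (not the actual script):

   for f in /data/pics/*.jpg; do
       recompress-jpeg "$f" > "$f.tmp" || { rm -f "$f.tmp"; continue; }
       if [ "$(stat -c %s "$f.tmp")" -lt "$(stat -c %s "$f")" ]; then
           mv "$f.tmp" "$f"     # replace only if it got smaller
       else
           rm -f "$f.tmp"
       fi
   done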

This workload would obviously benefit most from having no preallocation
anywhere, i.e. having all files tightly packed.

Here is a "du" on a big directory where this process is running, every few
minutes:

   6439892 .
   6439888 .
   6620168 .
   6633156 .
   6697588 .
   6729092 .
   6755808 .
   6852192 .
   6816632 .
   6250824 .

Instead of decreasing, the size increased, until just before the last
du. That's where I did echo 3 >drop_caches, which presumably cleared all
those inodes that had not been used for an hour and would never have been
written to again.

Since XFS obviously keeps quite a bit of preallocation here (or some other
magic, but what?), and this workload definitely does not benefit from any
preallocation (because XFS has perfect knowledge of the file size at
every point in time), what you say is simply not true: the files will not
be touched anymore, neither read nor written, so the preallocation is just
bad.

Also, bickering about extra fragmentation caused by xfs_fsr when running
it daily instead of weekly is weird - the amount of external fragmentation
caused by preallocation must be overwhelming with large amounts of RAM.
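
For what it's worth, xfs_db can put a number on how fragmented a
filesystem actually is (the device name is a placeholder; -r opens it
read-only, though on a mounted filesystem the result is only approximate):

   $ xfs_db -r -c frag /dev/sdb1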

> Indeed, it's not a NFS specific optimisation, but it is one that
> directly benefits NFS server IO patterns.

I'd say it's a grotesque deoptimisation, and definitely doesn't work the
way you describe it.

In fact, it can't work the way you describe it, because XFS would have
to be clairvoyant to make it work. How else would it know that keeping
preallocation indefinitely would be useful?

In any case, XFS treats a typical "open file, write file, close file,
never touch it again" pattern as something that somehow needs
preallocation.

I can see how that helps NFS, but in all other cases, this is simply a
bug.

> about). Given that inodes for log files will almost always remain in
> memory as they are regularly referenced, it seems like the right
> solution to that problem, too...

Given that, with enough RAM, everything stays in RAM - most of which is
not log files - this behaviour is simply broken.

> FWIW, you make it sound like "benchmark-improved" is a bad thing.

If it costs regular performance or eats disk space like mad, it's clearly
a bad thing, yes.

Benchmark performance is irrelevant; what counts is actual performance.

If the two coincide, that's great. This is clearly not the case here, of
course.

> However, I don't hear you complaining about the delayed logging
> optimisations at all.

I wouldn't be surprised if the new xfs_fsr crashes are caused by these
changes, actually. But yes, otherwise they are great - I do keep external
journals for most of my filesystems, and the write load for these has
decreased by a factor of 10-100 in some metadata-heavy cases (such as lots
of renames).

Of course, XFS is still way behind other filesystems in managing journal
devices.

> I'll let you in on a dirty little secret: I tested delayed logging on
> nothing but benchmarks - it is -entirely- a "benchmark-improved" class
> optimisation.

As a good engineer, one would expect you to actually think about whether
this optimisation is useful outside of some benchmark setup, too. I am
sure you did that - how else would you have come up with the idea in the
first place?

> But despite how delayed logging was developed and optimised, it

The difference from the new preallocation behaviour is that delayed
logging is not obviously a bad algorithm.

However, the preallocation strategy of wasting some disk space for every
file that has been opened in the last 24 hours or so (depending on RAM) is
*obviously* wrong, regardless of what your microbenchmarks say.

What it does is basically introduce big-cluster allocation, just like
with good old FAT, except that people with more RAM get punished more.

> different workloads. That's because the  benchmarks I use accurately
> model the workloads that cause the problem that needs to be solved.

That means you will optimise a single problem at the expense of any other
workload. This indeed seems to be the case here.

Good engineering would make sure that typical use cases that were not the
"problem" before wouldn't get unduly affected.

Apart from potentially helping with NFS in your benchmarks, I cannot
see any positive aspect of this change. However, I keep hitting the bad
aspects of it. It seems that with this change, XFS will degrade much
faster due to the insane amounts of useless preallocation tied to files
that have been closed and will never be written again, which is by far
*most* files.

In the example above, roughly 32kb (+-50%) of overallocation is associated
with each file. FAT, here we come :(

Don't get me wrong, it is great that XFS is now optimised for slow log
writing over NFS, and this surely is important for some people, but it
comes at an enormous cost to every other workload.

A benchmark that measures additional fragmentation introduced by all those
32kb blocks over some months would be nice.

> Similarly, the "NFS optimisation" in a significant and measurable
> reduction in fragmentation on NFS-exported XFS filesystems across a

It's the dirtiest hack I have seen in a filesystem: making an optimisation
that only helps with the extremely bad access patterns of NFS (and only
sometimes) and forcing it on even for non-NFS filesystems, where it only
causes negative effects.

It's a typical case of "a is broken, so apply some hack to b", while good
engineering dictates "a is broken, let's fix a".

Again: Your rationale is that NFS doesn't give you enough information about
whether a file is in use, because it doesn't keep it open.

This leads you to consider all files whose inode is cached in memory as
being "in use" for unlimited amounts of time.

Sure, those idiot applications such as cp or mv cannot be trusted. Surely,
when mv'ing a file, this means the file will be appended later. Because if
not, XFS wouldn't keep the preallocation.

> Yes, there have been regressions caused by both changes (though

The whole thing is a regression - slow appender processes that close a
file after each write basically don't exist. Close is an extremely good
hint that a file has been finalised, and because NFS doesn't have a
notion of close (NFSv4 has it, to some extent), it is suddenly ignored
for all applications.

This is simply a completely, utterly, totally broken algorithm.

> regressions does not take anything away from the significant
> real-world improvements that are the result of the changes.

I gave plenty of real-world examples where these changes are nothing but
bad. I have yet to see a *single* real-world example where this isn't the
case.

All you have achieved is that now every workload behaves as badly as NFS,
lots and lots of disk space is wasted, and an enormous amount of external
fragmentation is introduced. And that's just with an 8GB box. I can only
imagine how many months files will be considered "in use" just because the
box has enough RAM to cache their inodes.

> http://code.google.com/p/ioapps/wiki/ioreplay

Since "cp" and "mv" already cause problems in current versions of
XFS, I guess we are far from needing those. It seems XFS has been so
fundamentally deoptimised w.r.t. preallocation now that there are much
bigger fish to catch than freenet. Basically anything thct creates files,
even when it's just a single open/write/close, is now affected.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-26  8:08                   ` Marc Lehmann
@ 2011-08-31 12:45                     ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2011-08-31 12:45 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Fri, Aug 26, 2011 at 10:08:41AM +0200, Marc Lehmann wrote:
> On Fri, Aug 12, 2011 at 02:05:30PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > It only does that if the pattern of writes are such that keeping the
> > preallocation around for longer periods of time will reduce
> > potential fragmentation.
> 
> That can only be false. Here is a an example that I saw *just now*:
> 
> I have a process that takes a directory with jpg files (in this case,
> all around 64kb in size) and loslessly recompresses them. This works
> by reading a file, writing it under another name (single write() call)
> and using rename to replace the original file *iff* it got smaller. The
> typical reduction is 5%. no allocsize option is used. Kernel used was
> 2.6.39.
> 
> This workload would obviously benefit most by having no preallocaiton
> anywhere, i.e. have all files tightly packed.
> 
> Here is a "du" on a big directory where this process is running, every few
> minutes:
> 
>    6439892 .
>    6439888 .
>    6620168 .
>    6633156 .
>    6697588 .
>    6729092 .
>    6755808 .
>    6852192 .
>    6816632 .
>    6250824 .
> 
> Instead of decreasing, the size increased, until just before the last
> du. Thats where I did echo 3 >drop_caches, which presumably cleared all
> those inodes that have not been used for an hour and would never have been
> used again for writing.

That's the case of the unlinked inode being reused immediately and
not having all its state cleared correctly when recycled. That's the
problem that was diagnosed and fixed when you reported the first
problem. Can you tell me if your kernel has the bug fix or not, and
if not, does applying the fix make the problem go away?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-08-31 12:46 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-06 12:25 frequent kernel BUG and lockups - 2.6.39 + xfs_fsr Marc Lehmann
2011-08-06 14:20 ` Dave Chinner
2011-08-07  1:42   ` Marc Lehmann
2011-08-07 10:26     ` Dave Chinner
2011-08-08 19:02       ` Marc Lehmann
2011-08-09 10:10         ` Michael Monnerie
2011-08-09 11:15           ` Marc Lehmann
2011-08-10  6:59             ` Michael Monnerie
2011-08-11 22:04               ` Marc Lehmann
2011-08-12  4:05                 ` Dave Chinner
2011-08-26  8:08                   ` Marc Lehmann
2011-08-31 12:45                     ` Dave Chinner
2011-08-10 14:16             ` Dave Chinner
2011-08-11 22:07               ` Marc Lehmann
2011-08-09  9:16       ` Marc Lehmann
2011-08-09 11:35         ` Dave Chinner
2011-08-09 16:35           ` Marc Lehmann
2011-08-09 22:31             ` Dave Chinner
