* drastic changes to allocsize semantics in or around 2.6.38?
@ 2011-05-20  0:55 Marc Lehmann
  2011-05-20  2:56 ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Marc Lehmann @ 2011-05-20  0:55 UTC (permalink / raw)
  To: xfs

Hi!

I have "allocsize=64m" (or simialr sizes, such as 1m, 16m etc.) on many of my
xfs filesystems, in an attempt to fight fragmentation on logfiles.

I am not sure about it's effectiveness, but in 2.6.38 (but not in 2.6.32),
this leads to very unexpected and weird behaviour, namely that files being
written have semi-permanently allocated chunks of allocsize to them.

I realised this when I did a make clean and a make in a buildroot directory,
which cross-compiles uclibc, gcc, and lots of other packages, leading to a
lot of mostly small files.

After a few minutes, the job stopped because it ate 180GB of disk space and
the disk was full. When I came back in the morning (about 8 hours later), the
disk was still full, and investigation showed that even 3kb files were
allocated the full 64m (as seen with du).
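
(For reference, this is roughly how the difference shows up - the paths
below are just examples, not my actual build tree:

    ls -lh some/build/dir/foo.o             # apparent size: a few kb
    du -h  some/build/dir/foo.o             # allocated size: ~64m
    du -sh --apparent-size some/build/dir   # sum of apparent file sizes
    du -sh some/build/dir                   # blocks actually allocated

the gap between the last two numbers is the space held by the
preallocation; --apparent-size is GNU du.)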

After I deleted some files to get some space and rebooted, I suddenly had
180GB of space again, so it seems an unmount "fixes" this issue.

I often do these kinds of builds, and have had allocsize at these high values
for a very long time, without ever having run into this kind of problem.

It seems that files get temporarily allocated much larger chunks (which is
expected behaviour), but xfs doesn't free them until there is an unmount
(which is unexpected).

Is this the desired behaviour? I would assume that any allocsize > 0 could
lead to a lot of fragmentation if files that are closed and no longer
in use always have extra space allocated for expansion for extremely long
periods of time.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-20  0:55 drastic changes to allocsize semantics in or around 2.6.38? Marc Lehmann
@ 2011-05-20  2:56 ` Dave Chinner
  2011-05-20 15:49   ` Marc Lehmann
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2011-05-20  2:56 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Fri, May 20, 2011 at 02:55:11AM +0200, Marc Lehmann wrote:
> Hi!
> 
> I have "allocsize=64m" (or simialr sizes, such as 1m, 16m etc.) on many of my
> xfs filesystems, in an attempt to fight fragmentation on logfiles.
> 
> I am not sure about it's effectiveness, but in 2.6.38 (but not in 2.6.32),
> this leads to very unexpected and weird behaviour, namely that files being
> written have semi-permanently allocated chunks of allocsize to them.

The change that is causing this is in how the preallocation is
dropped. In normal use cases, the preallocation should be dropped
when the file descriptor is closed. The change in 2.6.38 was to make
this conditional on whether the inode had been closed multiple times
while dirty. If the inode is closed (.release is called) multiple
times while dirty, then the preallocation is not truncated away
until the inode is dropped from the caches, rather than immediately
on close.  This prevents writes on NFS servers from doing excessive
work and triggering excessive fragmentation, as the NFS server does
an "open-write-close" for every write that comes across the wire.

This was also coupled with a change to the default speculative
allocation behaviour to do more and larger speculative preallocation,
and so in most cases remove the need for ever using the allocsize
mount option. It dynamically increases the preallocation size as the
file size increases, so small file writes behave like pre-2.6.38
without the allocsize mount option, while large file writes behave as
if they have a large allocsize mount option set, thereby preventing
most known delayed allocation fragmentation cases from occurring.

> I realised this when I did a make clean and a make in a buildroot directory,
> which cross-compiles uclibc, gcc, and lots of other packages, leading to a
> lot of mostly small files.

So the question there: how is your workload accessing the files? Is
it opening and closing them multiple times in quick succession after
writing them? I think it is triggering the "NFS server access
pattern" logic and so keeping speculative preallocation around for
longer.

> After I deleted some files to get some space and rebooted, I suddenly had
> 180GB of space again, so it seems an unmount "fixes" this issue.
> 
> I often do these kinds of builds, and have had allocsize at these high values
> for a very long time, without ever having run into this kind of problem.
> 
> It seems that files get temporarily allocated much larger chunks (which is
> expected behaviour), but xfs doesn't free them until there is an unmount
> (which is unexpected).

"echo 3 > /proc/sys/vm/drop_caches" should free up the space as the
preallocation will be truncated as the inodes are removed from the
VFS inode cache.
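
Something along these lines should show it (paths are examples only, and
the echo needs root):

    du -sh /mnt/scratch/builddir        # inflated by the held preallocation
    echo 3 > /proc/sys/vm/drop_caches   # evict clean, unused inodes
    du -sh /mnt/scratch/builddir        # should drop back to the expected size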

> Is this the desired behaviour? I would assume that any allocsize > 0 could
> lead to a lot of fragmentation if files that are closed and no longer
> in use always have extra space allocated for expansion for extremely long
> periods of time.

I'd suggest removing the allocsize mount option - you shouldn't need
it anymore because the new default behaviour resists fragmentation a
whole lot better than pre-2.6.38 kernels.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-20  2:56 ` Dave Chinner
@ 2011-05-20 15:49   ` Marc Lehmann
  2011-05-21  0:45     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Marc Lehmann @ 2011-05-20 15:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, May 20, 2011 at 12:56:59PM +1000, Dave Chinner <david@fromorbit.com> wrote:
[thanks for the thorough explanation]
> 
> So the question there: how is your workload accessing the files? Is
> it opening and closing them multiple times in quick succession after
> writing them?

I don't think so, but of course, when compiling a file, it will be linked
afterwards, so I guess it would be accessed at least once.

> I think it is triggering the "NFS server access pattern" logic and so
> keeping speculative preallocation around for longer.

Longer meaning practically infinitely :)

> I'd suggest removing the allocsize mount option - you shouldn't need
> it anymore because the new default behaviour resists fragmentation a
> whole lot better than pre-2.6.38 kernels.

I did remove it already, and will actively try this on our production
servers, which suffer from severe fragmentation (though xfs_fsr, with some
work to suspend the logfile writing, fixes that anyway).

However, I would suggest that whatever heuristic 2.6.38 uses is deeply
broken at the moment, as NFS was not involved here at all (so no need for
it), the usage pattern was a simple compile-then-link pattern (which is
very common), and there is really no need to cache this preallocation for
files that have been closed 8 hours ago and never touched since then.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-20 15:49   ` Marc Lehmann
@ 2011-05-21  0:45     ` Dave Chinner
  2011-05-21  1:36       ` Marc Lehmann
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2011-05-21  0:45 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Fri, May 20, 2011 at 05:49:20PM +0200, Marc Lehmann wrote:
> On Fri, May 20, 2011 at 12:56:59PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> [thanks for the thorough explanation]
> > 
> > So the question there: how is your workload accessing the files? Is
> > it opening and closing them multiple times in quick succession after
> > writing them?
> 
> I don't think so, but of course, when compiling a file, it will be linked
> afterwards, so I guess it would be accessed at least once.

Ok, I'll see if I can reproduce it locally.

> > I think it is triggering the "NFS server access pattern" logic and so
> > keeping speculative preallocation around for longer.
> 
> Longer meaning practically infinitely :)

No, longer meaning the in-memory lifecycle of the inode.

> > I'd suggest removing the allocsize mount option - you shouldn't need
> > it anymore because the new default behaviour resists fragmentation a
> > whole lot better than pre-2.6.38 kernels.
> 
> I did remove it already, and will actively try this on our production
> servers, which suffer from severe fragmentation (though xfs_fsr, with some
> work to suspend the logfile writing, fixes that anyway).

log file writing - append only workloads - is one where the dynamic
speculative preallocation can make a significant difference.

> However, I would suggest that whatever heuristic 2.6.38 uses is deeply
> broken at the moment,

One bug report two months after general availability != deeply
broken.

> as NFS was not involved here at all (so no need for
> it), the usage pattern was a simple compile-then-link pattern (which is
> very common),

While using a large allocsize mount option, which is relatively
rare. Basically, you've told XFS to optimise allocation for large
files and then are running workloads with lots of small files. It's
no surprise that there are issues, and you don't need the changes
in 2.6.38 to get bitten by this problem....

> and there is really no need to cache this preallocation for
> files that have been closed 8 hours ago and never touched since then.

If the preallocation was the size of the dynamic behaviour, you
wouldn't have even noticed this. So really what you are saying is
that it is excessive for your current configuration and workload.

If I can reproduce it, I'll have a think about how to tweak it
better for allocsize filesystems. However, I'm not going to start to
add lots of workload-dependent tweaks to this code - the default
behaviour is much better and in most cases removes the problems that
led to using allocsize in the first place. So removing allocsize
from your config is, IMO, the correct fix, not tweaking heuristics in
the code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-21  0:45     ` Dave Chinner
@ 2011-05-21  1:36       ` Marc Lehmann
  2011-05-21  3:15         ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Marc Lehmann @ 2011-05-21  1:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, May 21, 2011 at 10:45:44AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > Longer meaning practically infinitely :)
> 
> No, longer meaning the in-memory lifecycle of the inode.

That makes no sense - if I have twice the memory I suddenly have half as
much (or some other factor less) free disk space.

The lifetime of the preallocated area should be tied to something sensible,
really - all that xfs has now is a broken heuristic that ties the wrong
statistic to the extra space allocated.

Or in other words, tying the amount of preallocation to the amount of
free RAM (for the inode) is not a sensible heuristic.

> log file writing - append only workloads - is one where the dynamic
> speculative preallocation can make a significant difference.

That's absolutely fantastic, as that will apply to a large range of files
that are problematic (while xfs performs really well in most cases).

> > However, I would suggest that whatever heuristic 2.6.38 uses is deeply
> > broken at the moment,
> 
> One bug report two months after general availability != deeply
> broken.

That makes no sense - I only found out about this broken behaviour because I
specified a large allocsize manually, which is rare.

However, the behaviour happens even without that, but might not be
immediately noticeable (how would you find out if you lost a few gigabytes
of disk space unless the disk runs full? most people would have no clue
where to look).

Just because the breakage is not obviously visible doesn't mean it's not
deeply broken.

Also, I just looked more thoroughly through the list - the problem has
been reported before, but was basically ignored, so you are wrong in that
there is only one report.

> While using a large allocsize mount option, which is relatively
> rare. Basically, you've told XFS to optimise allocation for large
> files and then are running workloads with lots of small files.

The allocsize isn't "optimise for large files", it's to reduce
fragmentation. 64MB is _hardly_ a big size for logfiles.

Note also that the breakage occurs with smaller allocsize values as well;
it's just less obvious. All you do right now is make up fantasy reasons to
ignore this report - the problem applies to any allocsize, and,
unless xfs uses a different heuristic for dynamic preallocation, even
without the mount option.

> It's no surprise that there are issues, and you don't need the changes
> in 2.6.38 to get bitten by this problem....

Really? I do know (by measuring it) that older kernels do not have this
problem, and you basically said the same thing, namely that there was a
behaviour change.

If your goal is to argue for yourself that the breakage has to stay, that's
fine, but don't expect users (like me) to follow your illogical train of
thought.

> > and there is really no need to cache this preallocation for
> > files that have been closed 8 hours ago and never touched since then.
> 
> If the preallocation was the size of the dynamic behaviour, you
> wouldn't have even noticed this.

Maybe, it certainly is a lot less noticeable. But the new xfs behaviour
basically means you have less space (potentially a lot less) on your disk
when you have more memory, and that disk space is lost indefinitely just
because I have some free RAM.

This is simply not a sensible heuristic - more RAM must not mean that
potentially large amounts of disk space are lost forever (if you have enough
RAM).

> So really what you are saying is that it is excessive for your current
> configuration and workload.

No, what I am saying is that the heuristic is simply buggy - it ties one
value (available RAM for cache) to a completely unrelated one (amount of free
space used for preallocation).

It also doesn't happen in my workload only.

> better for allocsize filesystems. However, I'm not going to start to
> add lots of workload-dependent tweaks to this code - the default
> behaviour is much better and in most cases removes the problems that
> led to using allocsize in the first place. So removing allocsize
> from your config is, IMO, the correct fix, not tweaking heuristics in
> the code...

I am fine with not using allocsize if the fragmentation problems in xfs (for
append-only cases) have been improved.

But you said the heuristic applies regardless of whether allocsize was
specified or not.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-21  1:36       ` Marc Lehmann
@ 2011-05-21  3:15         ` Dave Chinner
  2011-05-21  4:16           ` Marc Lehmann
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2011-05-21  3:15 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Sat, May 21, 2011 at 03:36:04AM +0200, Marc Lehmann wrote:
> On Sat, May 21, 2011 at 10:45:44AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > Longer meaning practically infinitely :)
> > 
> > No, longer meaning the in-memory lifecycle of the inode.
> 
> That makes no sense - if I have twice the memory I suddenly have half as
> much (or some other factor less) free disk space.
> 
> The lifetime of the preallocated area should be tied to something sensible,
> really - all that xfs has now is a broken heuristic that ties the wrong
> statistic to the extra space allocated.

So, instead of tying it to the lifecycle of the file descriptor, it
gets tied to the lifecycle of the inode. There isn't much in between
those that can be easily used.  When your workload spans hundreds of
thousands of inodes and they are cached in memory, switching to the
inode life-cycle heuristic works better than anything else that has
been tried.  One of those cases is large NFS servers, and the
changes made in 2.6.38 are intended to improve performance on NFS
servers by switching it to use inode life-cycle to control
speculative preallocation.

As it is, regardless of this change, we already have pre-existing
circumstances where speculative preallocation is controlled by the
inode life-cycle - inodes with manual preallocation (e.g. fallocate)
and append-only files - so this problem with allocsize causing
premature ENOSPC raises its head every couple of years regardless
of whether there have been any recent changes or not.

FWIW, I remember reading bug reports for Irix from 1998 about such
problems w.r.t. manual preallocation. In all cases that I can
remember, the problems went away with small configuration tweaks....

> > > However, I would suggest that whatever heuristic 2.6.38 uses
> > > is deeply broken at the moment,
> > 
> > One bug report two months after general availability != deeply
> > broken.
> 
> That makes no sense - I only found out about this broken behaviour
> because I specified a large allocsize manually, which is rare.
> 
> However, the behaviour happens even without that, but might not be
> immediately noticeable (how would you find out if you lost a few
> gigabytes of disk space unless the disk runs full? most people
> would have no clue where to look).

If most people never notice it and it reduces fragmentation
and improves performance, then I don't see a problem. Right now
evidence points to the "most people have not noticed it".

Just to point out what people do notice: when the dynamic
functionality was introduced into 2.6.38-rc1, it had a bug in a
calculation that was resulting in 32-bit machines always preallocating
8GB extents. That was noticed _immediately_ and reported by several
people independently. Once that bug was fixed there have been no
further reports until yours. That tells me that the new default
behaviour is not actually causing ENOSPC problems for most people.

I've already said I'll look into the allocsize interaction with the
new heuristic you've reported, and told you how to work around the
problem in the mean time. I can't do any more than that.

> Just because the breakage is not obviously visible doesn't mean it's not
> deeply broken.
> 
> Also, I just looked more thoroughly through the list - the problem has
> been reported before, but was basically ignored, so you are wrong in that
> there is only one report.

I stand corrected. I get at least 1000-1500 emails a day and I
occasionally forget/miss/delete one I shouldn't. Or maybe it was one
I put down to the above bug.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-21  3:15         ` Dave Chinner
@ 2011-05-21  4:16           ` Marc Lehmann
  2011-05-22  2:00             ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Marc Lehmann @ 2011-05-21  4:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, May 21, 2011 at 01:15:37PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > The lifetime of the preallocated area should be tied to something sensible,
> > really - all that xfs has now is a broken heuristic that ties the wrong
> > statistic to the extra space allocated.
> 
> So, instead of tying it to the lifecycle of the file descriptor, it
> gets tied to the lifecycle of the inode.

That's quite the difference, though - the former is in some relation to
the actual in-use files, while the latter is in no relation to it.

> those that can be easily used.  When your workload spans hundreds of
> thousands of inodes and they are cached in memory, switching to the
> inode life-cycle heuristic works better than anything else that has
> been tried.

The problem is that this is not anything like the normal case.

It simply doesn't make any sense to preallocate disk space for files that
are not in use and are unlikely to be in use again.

> One of those cases is large NFS servers, and the changes made in 2.6.38
> are intended to improve performance on NFS servers by switching it to
> use inode life-cycle to control speculative preallocation.

It's easy to get some gains in special situations at the expense of normal
ones - keep in mind that this optimisation makes little sense for non-NFS
cases, which is the majority of use cases.

The problem here is that XFS doesn't get enough feedback in the case of
an NFS server which might open and close files much more often than local
processes.

However, the solution to this is a better nfs server, not some dirty hacks
in some filesystem code in the hope that it works in the special case of
an NFS server, to the detriment of all other workloads which give better
feedback.

This heuristic is just that: a bad hack to improve benchmarks in a special
case.

The preallocation makes sense in relation to the working set, which can be
characterised by the open files, or recently opened files.

Tying it to the (in-memory) inode lifetime is an abysmal approximation to
this.

I understand that XFS does this to please a very suboptimal case - the NFS
server code which doesn't give you enough feedback on which files are open.

But keep in mind that in my case, XFS cached a large number of inodes that
have been closed many hours ago - and haven't been accessed for many hours
as well.

I have 8GB of ram, which is plenty, but not really an abnormal amount of
memory.

If I unpack a large tar file, this means that I get a lot of (internal)
fragmentation because all files are spread over a larger area than
necessary, and disk space is used for a potentially indefinite time.

> > However, the behaviour happens even without that, but might not be
> > immediately noticeable (how would you find out if you lost a few
> > gigabytes of disk space unless the disk runs full? most people
> > would have no clue where to look).
> 
> If most people never notice it and it reduces fragmentation
> and improves performance, then I don't see a problem. Right now

Preallocation surely also increases fragmentation when it's never going to be
used.

> evidence points to the "most people have not noticed it".

The problem with these statements is that they have no meaning. Most
people don't even notice filesystem fragmentation - or corruption, or bugs
in xfs_repair.

If I apply your style of arguing, that means it's no big deal - most people
don't even notice when a few files get corrupted, they will just reinstall
their box. And hey, who uses xfs_repair and notices some bugs in it.

Sorry, but this kind of arguing makes no sense to me.

> 8GB extents. That was noticed _immediately_ and reported by several
> people independently. Once that bug was fixed there have been no
> further reports until yours. That tells me that the new default
> behaviour is not actually causing ENOSPC problems for most people.

You of course know well enough that ENOSPC was just one symptom, and that
the real problem is allocating free disk space semi-permanently. Why do
you bring up this strawman of ENOSPC?

> I've already said I'll look into the allocsize interaction with the
> new heuristic you've reported, and told you how to work around the
> problem in the mean time. I can't do any more than that.

The problem is that you are selectively ignoring facts to downplay this
problem. That doesn't instill confidence, you really sound like "don't
insult my toy allocation heuristic, I'll just ignore the facts and claim
there is no problem lalala".

You simply ignore most of what I wrote - the problem is also clearly not
allocsize interaction, but the broken logic behind the heuristic - "NFS
servers have bad access patterns, so we assume every workload is like an
NFS server". It's simply wrong.

The heuristic clearly doesn't make sense with any normal workload, where
files that were closed long ago will not be used. Heck, in most workloads,
files that are closed will almost never be written to soon afterwards,
simply because it is a common-sense optimisation to not do unnecessary
operations.

If XFS contains dirty hacks that are meant for specific workloads only (to
work around bad access patterns by NFS servers), then it would make sense
to disable these to not hurt the common cases.

And this heuristic clearly is just a hack to suit a specific need. I know
that, and I am sure you know that too, otherwise you wouldn't be hammering
home the NFS server case :)

Hacking some NFS server access pattern heuristic into XFS is, however,
just a workaround for that case, not a fix, or a sensible thing to do in
the general case.

I would certainly appreciate that XFS has such hacks and heuristics, and
would certainly try them out (having lots of NFS servers :), but it's
clear that enforcing workarounds for uncommon cases at the expense of
normal workloads is a bad idea, in general.

So please give this a bit of consideration: is it really worth keeping
preallocation for files that are not used by anything on a computer just
to improve benchmark numbers for a client with bad access patterns (the
NFS server code)?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-21  4:16           ` Marc Lehmann
@ 2011-05-22  2:00             ` Dave Chinner
  2011-05-22  7:59               ` Matthias Schniedermeyer
  2011-05-23 13:35               ` Marc Lehmann
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2011-05-22  2:00 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Sat, May 21, 2011 at 06:16:52AM +0200, Marc Lehmann wrote:
> On Sat, May 21, 2011 at 01:15:37PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > The lifetime of the preallocated area should be tied to something sensible,
> > > really - all that xfs has now is a broken heuristic that ties the wrong
> > > statistic to the extra space allocated.
> > 
> > So, instead of tying it to the lifecycle of the file descriptor, it
> > gets tied to the lifecycle of the inode.
> 
> That's quite the difference, though - the former is in some relation to
> the actual in-use files, while the latter is in no relation to it.
> 
> > those that can be easily used.  When your workload spans hundreds of
> > thousands of inodes and they are cached in memory, switching to the
> > inode life-cycle heuristic works better than anything else that has
> > been tried.
> 
> The problem is that this is not anything like the normal case.

For you, maybe.

> It simply doesn't make any sense to preallocate disk space for files that
> are not in use and are unlikely to be in use again.

That's why the normal close case truncates it away. But there are
other cases where we don't want this to happen.

> > One of those cases is large NFS servers, and the changes made in 2.6.38
> > are intended to improve performance on NFS servers by switching it to
> > use inode life-cycle to control speculative preallocation.
> 
> It's easy to get some gains in special situations at the expense of normal
> ones - keep in mind that this optimisation makes little sense for non-NFS
> cases, which is the majority of use cases.

XFS is used extensively in NAS products, from small $100 ARM/MIPS
embedded NAS systems all the way up to high end commercial NAS
products. It is one of the main use cases we optimise XFS for.

> The problem here is that XFS doesn't get enough feedback in the case of
> an NFS server which might open and close files much more often than local
> processes.
> 
> However, the solution to this is a better nfs server, not some dirty hacks
> in some filesystem code in the hope that it works in the special case of
> an NFS server, to the detriment of all other workloads which give better
> feedback.

Sure, that would be my preferred approach. However, if you followed
the discussion when this first came up, you'd realise that we've
been trying to get NFS server changes to fix this operation for the
past 5 years, and I've just about  given up trying.  Hell, the NFS
OFC (open file cache) proposal that would have mostly solved this
(and other problems like readahead state thrashing) from 2-3 years
ago went nowhere...

> This heuristic is just that: a bad hack to improve benchmarks in a special
> case.

It wasn't aimed at improving benchmark performance - these changes
have been measured to reduce large file fragmentation in real-world
workloads on the default configuration by at least an order of
magnitude.

> The preallocation makes sense in relation to the working set, which can be
> characterised by the open files, or recently opened files.
> Tying it to the (in-memory) inode lifetime is an abysmal approximation to
> this.

So you keep saying, but you keep ignoring the fact that the inode
cache represents the _entire_ working set of inodes. It's not an
approximation - it is the _exact_ current working set of files we
currently have.

Hence falling back to "preallocation lasts for as long as the inode
is part of the working set" is an extremely good heuristic to use -
we move from preallocation for only the L1 cache lifecycle (open
fd's) to using the L2 cache lifecycle (recently opened inodes)
instead.

> If I unpack a large tar file, this means that I get a lot of (internal)
> fragmentation because all files are spread over a larger area than
> necessary, and disk space is used for a potentially indefinite time.

So you can reproduce this using a tar? Any details on size, # of
files, the untar command, etc? How do you know you get internal
fragmentation and that it is affecting fragmentation? Please provide
concrete examples (e.g. copy+paste the command lines and any
relevant output) so that I might be able to reproduce your problem
myself?

I don't really care what you think the problem is based on what
you've read in this email thread, or for that matter how you think
we should fix it. What I really want is your test cases that
reproduce the problem so I can analyse it for myself. Once I
understand what is going on, then we can talk about what the real
problem is and how to fix it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-22  2:00             ` Dave Chinner
@ 2011-05-22  7:59               ` Matthias Schniedermeyer
  2011-05-23  1:20                 ` Dave Chinner
  2011-05-23 13:35               ` Marc Lehmann
  1 sibling, 1 reply; 14+ messages in thread
From: Matthias Schniedermeyer @ 2011-05-22  7:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Marc Lehmann, xfs

On 22.05.2011 12:00, Dave Chinner wrote:
> 
> I don't really care what you think the problem is based on what
> you've read in this email thread, or for that matter how you think
> we should fix it. What I really want is your test cases that
> reproduce the problem so I can analyse it for myself. Once I
> understand what is going on, then we can talk about what the real
> problem is and how to fix it.

What would interest me is why the following creates files with large 
preallocations.

cp -a <somedir> target
rm -rf target
cp -a <somedir> target

After the first copy everything looks normal, `du` is about the 
original value.

After the second run a `du` shows a much higher value, until the 
preallocation is shrunk away.
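
The same thing as a small self-contained sketch, in case that helps
reproduction (the source and target paths are just placeholders):

    #!/bin/sh
    SRC=/usr/src/linux       # any reasonably large directory tree
    TGT=/mnt/xfs/target      # somewhere on the affected XFS filesystem
    cp -a "$SRC" "$TGT"
    du -sh "$TGT"            # first copy: roughly the original size
    rm -rf "$TGT"
    cp -a "$SRC" "$TGT"
    du -sh "$TGT"            # second copy: much larger, until the
                             # preallocation is eventually shrunk away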





See you then

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-22  7:59               ` Matthias Schniedermeyer
@ 2011-05-23  1:20                 ` Dave Chinner
  2011-05-23  9:01                   ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2011-05-23  1:20 UTC (permalink / raw)
  To: Matthias Schniedermeyer; +Cc: Marc Lehmann, xfs

On Sun, May 22, 2011 at 09:59:55AM +0200, Matthias Schniedermeyer wrote:
> On 22.05.2011 12:00, Dave Chinner wrote:
> > 
> > I don't really care what you think the problem is based on what
> > you've read in this email thread, or for that matter how you think
> > we should fix it. What I really want is your test cases that
> > reproduce the problem so I can analyse it for myself. Once I
> > understand what is going on, then we can talk about what the real
> > problem is and how to fix it.
> 
> What would interest me is why the following creates files with large 
> preallocations.
> 
> cp -a <somedir> target
> rm -rf target
> cp -a <somedir> target
> 
> After the first copy everything looks normal, `du` is about the 
> original value.
> 
> After the second run a `du` shows a much higher value, until the 
> preallocation is shrunk away.

That's obviously a bug. It's also a simple test case that is easy to
reproduce - exactly what I like in a bug report. ;)

The inodes are being recycled off the reclaimable list in the second
case, i.e. we're short-circuiting the inode lifecycle and making it
new again because it has been reallocated. The XFS_IDIRTY_RELEASE
flag is not being cleared in this case, so we are not removing the
speculative preallocation when the fd is closed for the second copy.

The patch below fixes this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: clear inode dirty release flag when recycling it

From: Dave Chinner <dchinner@redhat.com>

The state used to track dirty inode release calls is not reset when
an inode is reallocated and reused from the reclaimable state. This
leads to speculative preallocation not being truncated away in the
expected manner for local files until the inode is subsequently
truncated, freed or cycles out of the cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iget.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index cb9b6d1..e75e757 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -241,6 +241,13 @@ xfs_iget_cache_hit(
 		 */
 		ip->i_flags |= XFS_IRECLAIM;
 
+		/*
+		 * clear the dirty release state as we are now effectively a
+		 * new inode and so we need to treat speculative preallocation
+		 * accordingly.
+		 */
+		ip->i_flags &= ~XFS_IDIRTY_RELEASE;
+
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-23  1:20                 ` Dave Chinner
@ 2011-05-23  9:01                   ` Christoph Hellwig
  2011-05-24  0:20                     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2011-05-23  9:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Marc Lehmann, xfs

On Mon, May 23, 2011 at 11:20:34AM +1000, Dave Chinner wrote:
> 
> The state used to track dirty inode release calls is not reset when
> an inode is reallocated and reused from the reclaimable state. This
> leads to speculative preallocation not being truncated away in the
> expected manner for local files until the inode is subsequently
> truncated, freed or cycles out of the cache.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iget.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index cb9b6d1..e75e757 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -241,6 +241,13 @@ xfs_iget_cache_hit(
>  		 */
>  		ip->i_flags |= XFS_IRECLAIM;
>  
> +		/*
> +		 * clear the dirty release state as we are now effectively a
> +		 * new inode and so we need to treat speculative preallocation
> +		 * accordingly.
> +		 */
> +		ip->i_flags &= ~XFS_IDIRTY_RELEASE;

Btw, don't we need to clear even more flags here?  To me it seems we
need to clear XFS_ISTALE, XFS_IFILESTREAM and XFS_ITRUNCATED as well.


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-22  2:00             ` Dave Chinner
  2011-05-22  7:59               ` Matthias Schniedermeyer
@ 2011-05-23 13:35               ` Marc Lehmann
  2011-05-24  1:30                 ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Marc Lehmann @ 2011-05-23 13:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sun, May 22, 2011 at 12:00:24PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > The problem is that this is not anything like the normal case.
> 
> For you, maybe.

For the majority of boxes that use xfs - most desktop boxes are not heavy
NFS servers.

> > It's easy to get some gains in special situations at the expense of normal
> > ones - keep in mind that this optimisation makes little sense for non-NFS
> > cases, which is the majority of use cases.
> 
> XFS is used extensively in NAS products, from small $100 ARM/MIPS
> embedded NAS systems all the way up to high end commercial NAS
> products. It is one of the main use cases we optimise XFS for.

That's really sad - maybe people like me who use XFS on their servers
should rethink that decision, if XFS mainly optimises for commercial
NAS boxes only.

You aren't serious, are you?

> Sure, that would be my preferred approach. However, if you followed
> the discussion when this first came up, you'd realise that we've
> been trying to get NFS server changes to fix this operation for the
> past 5 years, and I've just about  given up trying.  Hell, the NFS
> OFC (open file cache) proposal that would have mostly solved this
> (and other problems like readahead state thrashing) from 2-3 years
> ago went nowhere...

In other words, if you can't do it right, you make ugly broken hacks, and
then tell people that it's expected behaviour, because xfs is optimised
for commercial NFS server boxes.

> > The preallocation makes sense in relation to the working set, which can be
> > characterised by the open files, or recently opened files.
> > Tying it to the (in-memory) inode lifetime is an abysmal approximation to
> > this.
> 
> So you keep saying, but you keep ignoring the fact that the inode
> cache represents the _entire_ working set of inodes. It's not an
> approximation - it is the _exact_ current working set of files we
> currently have.

I am sorry, but that is wrong and shows a serious lack of
understanding. The cached inode set is just that, a cache. It definitely
does not correspond to any working set, simply because it is a
*cache*.

ls -l in a directory will cache all inodes, but that doesn't mean that
those files are the working set 8 hours later.

Open files are in the working set, because applications open files to use
them.

The inode cache probably contains stuff that was in the working set
before, but is no longer.

> Hence falling back to "preallocation lasts for as long as the inode
> is part of the working set" is an extremely good heuristic to use -

It's of course extremely broken, because all it does is improve the
(fragmentation) performance for broken clients - for normal clients it
will reduce performance of course.

> we move from preallocation for only the L1 cache lifecycle (open
> fd's) to using the L2 cache lifecycle (recently opened inodes)
> instead.

That comparison is seriously flawed, as a cache is transparent, but the
xfs behaviour is not.

> > If I unpack a large tar file, this means that I get a lot of (internal)
> > fragmentation because all files are spread over a larger area than
> > necessary, and disk space is used for a potentially indefinite time.
> 
> So you can reproduce this using a tar? Any details on size, # of
> files, the untar command, etc?

I can reproduce it simply by running make in the uclibc source tree.

Since gas has the same access behaviour as tar, why would it be different?
What kind of broken heuristic is XFS now using, if these two use
cases make a difference?

> How do you know you get internal fragmentation and that it is affecting
> fragmentation?

If what you say is true, that's a logical conclusion; it doesn't need
evidence, it follows from your claims.

XFS can't preallocate for basically all files that are being written and
at the same time avoid fragmentation.

> Please provide concrete examples (e.g. copy+paste the command lines and
> any relevant output) so that I might be able to reproduce your problem
> myself?

"make" - I already told you in my first e-mail.

> we should fix it. What I really want is your test cases that
> reproduce the problem so I can analyse it for myself. Once I
> understand what is going on, then we can talk about what the real
> problem is and how to fix it.

Being a good citizen wanting to improve XFS I of course delivered that in
my first e-mail. Again, I used allocsize=64m and then made a buildroot
build, which stopped after a few minutes because 180GB of disk space were
gone.

The disk space was all used up by the buildroot, which is normally a few
gigabytes (after a successful build).

I found that the uclibc object directory uses 50GB of space, about 8 hours
after the compile - the object files were typically a few kb in size, but
du showed 64mb of usage, even though nobody was using that file more than
once, or ever after the make stopped.

I am sorry, I think you are more interested in forcing your personal
toy heuristic through reality - that's how you come across, because you
selectively ignore the bits that you don't like. It's also pretty telling
that XFS mainly optimises for commercial NAS boxes now, and no longer for
good performance on local boxes.

:(

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-23  9:01                   ` Christoph Hellwig
@ 2011-05-24  0:20                     ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2011-05-24  0:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Marc Lehmann, xfs

On Mon, May 23, 2011 at 05:01:44AM -0400, Christoph Hellwig wrote:
> On Mon, May 23, 2011 at 11:20:34AM +1000, Dave Chinner wrote:
> > 
> > The state used to track dirty inode release calls is not reset when
> > an inode is reallocated and reused from the reclaimable state. This
> > leads to speculative preallocation not being truncated away in the
> > expected manner for local files until the inode is subsequently
> > truncated, freed or cycles out of the cache.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_iget.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> > index cb9b6d1..e75e757 100644
> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> > @@ -241,6 +241,13 @@ xfs_iget_cache_hit(
> >  		 */
> >  		ip->i_flags |= XFS_IRECLAIM;
> >  
> > +		/*
> > +		 * clear the dirty release state as we are now effectively a
> > +		 * new inode and so we need to treat speculative preallocation
> > +		 * accordingly.
> > +		 */
> > +		ip->i_flags &= ~XFS_IDIRTY_RELEASE;
> 
> Btw, don't we need to clear even more flags here?  To me it seems we
> need to clear XFS_ISTALE, XFS_IFILESTREAM and XFS_ITRUNCATED as well.

XFS_ISTALE is cleared unconditionally at the end of the function,
which means that any lookup on a stale inode will clear it. I'm not
absolutely sure this is right now that I think about it but that's a
different issue.

XFS_ITRUNCATED is mostly harmless, so it isn't a big issue, but we
probably should clear it. I'm not sure what the end result of not
clearing XFS_IFILESTREAM is, but you are right in that it should not
pass through here, either. I'll respin the patch to clear all the
state flags that hold sub-lifecycle state.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: drastic changes to allocsize semantics in or around 2.6.38?
  2011-05-23 13:35               ` Marc Lehmann
@ 2011-05-24  1:30                 ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2011-05-24  1:30 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Mon, May 23, 2011 at 03:35:48PM +0200, Marc Lehmann wrote:
> On Sun, May 22, 2011 at 12:00:24PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > The problem is that this is not anything like the normal case.
> > 
> > For you, maybe.
> 
> For the majority of boxes that use xfs - most desktop boxes are not heavy
> NFS servers.

Desktops are not a use case we optimise XFS for. We make sure XFS
works adequately on the desktop, but other than that we focus on
server workloads as optimisation targets.

> > > It's easy to get some gains in special situations at the expense of normal
> > > ones - keep in mind that this optimisation makes little sense for non-NFS
> > > cases, which is the majority of use cases.
> > 
> > XFS is used extensively in NAS products, from small $100 ARM/MIPS
> > embedded NAS systems all the way up to high end commercial NAS
> > products. It is one of the main use cases we optimise XFS for.
> 
> That's really sad - maybe people like me who use XFS on their servers
> should rethink that decision, if XFS mainly optimises for commercial
> NAS boxes only.

Nice twist - you're trying to imply we do something very different
to what I said.

So to set the record straight, we optimise for several different
overlapping primary use cases. We make optimisation decisions that
benefit systems and workloads that fall into the following
categories:

	- large filesystems (e.g > 100TB)
	- large storage subsystems (hundreds to thousands of
	  spindles)
	- large amounts of RAM (tens of GBs to TBs of RAM)
	- high concurrency from large numbers of CPUs (thousands of
	  CPU cores)
	- high throughput, both IOPS and bandwidth
	- low fragmentation of large files
	- robust error detection and handling

IOWs, we optimise for high-performance, high-end servers and
workloads.  And that means that just because we make changes that
help high-performance, high-end NFS servers achieve these goals
_does not mean_ we only optimise for NFS servers.

I'm not going to continue this part of this thread - it's just a
waste of my time. If you want the regression fixed, then stop
trying to tell us what the bug is and instead try to help diagnose
the cause of the problem.

> > we should fix it. What I really want is your test cases that
> > reproduce the problem so I can analyse it for myself. Once I
> > understand what is going on, then we can talk about what the real
> > problem is and how to fix it.
> 
> Being a good citizen wanting to improve XFS I of course delivered that in
> my first e-mail. Again, I used allocsize=64m and then made a buildroot
> build, which stopped after a few minutes because 180GB of disk space were
> gone.
> 
> The disk space was all used up by the buildroot, which is normally a few
> gigabytes (after a successful build).
> 
> I found that the uclibc object directory uses 50GB of space, about 8 hours
> after the compile - the object files were typically a few kb in size, but
> du showed 64mb of usage, even though nobody was using that file more than
> once, or ever after the make stopped.

A vaguely specified 8-hour-long test involving building some large
number of packages is not a useful test case.  There are too many
variables, too much setup time, too much data to analyse, and taking
8 hours to get a result is far too long.  I did try a couple of
kernel builds and didn't see the problem you reported. Hence I came
to the conclusion that it was something specific to your build
environment and asked for a more exact test case.

Indeed, someone else presented a 100% reproducible test case in a 3-line
script using cp and rm that took 10s to run. It then took me 15
minutes to analyse, then write, test and post a patch that fixes the
problem their test case demonstrated. Does the patch in the
following email fix your buildroot space usage problem?

http://oss.sgi.com/pipermail/xfs/2011-May/050651.html

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

