From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>, linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
Date: Thu, 14 Feb 2019 21:35:21 -0500
Message-ID: <20190215023520.GA50265@bfoster>
In-Reply-To: <20190214215124.GU14116@dastard>

On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > case for extsz hints.
> > > > > 
> > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > work for buffered writes, I simply made them work the same way as
> > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > 
> > > > > > So would the close time eofb trim be as
> > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > changed back to using delayed allocation?
> > > > > 
> > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > write-many, then we'd retain the post-eof blocks...
> > > > > 
> > > > > > I think a patch for that was
> > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > functionality which still had unresolved issues (IIRC).
> > > > > 
> > > > > *nod*
> > > > > 
> > > > > > From another angle, would a system that held files open for a
> > > > > > significant amount of time relative to a one-time write such that close
> > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > above?
> > > > > 
> > > > > If the file is held open for writing for a long time, we have to
> > > > > assume that they are going to write again (and again) so we should
> > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > gc, then there's nothing more we can really do in that case...
> > > > > 
> > > > 
> > > > I'm not necessarily sure that this condition is always a matter of
> > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > where we could end up doing a large number of one time file writes and
> > > > not doing the release time trim until an underlying extent (with
> > > > post-eof blocks) has been converted in more cases than not.
> > > 
> > > I'm not sure I follow what sort of workload and situation you are
> > > describing here. Are you talking about the effect of an EOFB gc pass
> > > during ongoing writes?
> > > 
> > 
> > I'm not sure if it's an actual reproducible situation.. I'm just
> > wondering out loud if there are normal workloads that might still defeat
> > a one-time trim at release time. For example, copy enough files in
> > parallel such that writeback touches most of them before the copies
> > complete and we end up trimming physical blocks rather than delalloc
> > blocks.
> > 
> > This is not so much of a problem if those files are large, I think,
> > because then the preallocs and the resulting trimmed free space are on
> > the larger side as well. If we're copying a bunch of little files with
> > small preallocs, however, then we put ourselves in the pathological
> > situation shown in Darrick's test.
> > 
> > I was originally thinking about whether this could happen or not on a
> > highly parallel small file copy workload, but having thought about it a
> > bit more I think there is a more simple example. What about an untar
> > like workload that creates small files and calls fsync() before each fd
> > is released?
> 
> Which is the same as an O_SYNC write if it's the same fd, which
> means we'll trim allocated blocks on close. i.e. it's no different
> to the current behaviour.  If such files are written in parallel then,
> again, it is no different to the existing behaviour. i.e. it
> largely depends on the timing of allocation in writeback and EOF
> block clearing in close(). If close happens before the next
> allocation in that AG, then they'll pack because there's no EOF
> blocks that push out the new allocation.  If it's the other way
> around, we get some level of freespace fragmentation.
> 

I know it's the same as the current behavior. ;P I think we're talking
past each other on this. What I'm saying is that the downside to the
current behavior is that a simple copy file -> fsync -> copy next file
workload fragments free space.

Darrick demonstrated this better in his random size test with the
release time trim removed, but a simple loop to write one thousand 100k
files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
behavior:

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
   from      to extents  blocks    pct
      1       1      18      18   0.00
     16      31       1      25   0.00
     32      63       1      58   0.00
 131072  262143     924 242197739  24.97
 262144  524287       1  365696   0.04
134217728 242588672       3 727292183  74.99

vs. the same test without the fsync:

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
   from      to extents  blocks    pct
      1       1      16      16   0.00
     16      31       1      20   0.00
4194304 8388607       2 16752060   1.73
134217728 242588672       4 953103627  98.27

Clearly there is an advantage to trimming before delalloc conversion.
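
(For reference, the loop was essentially the following; the target path is
just illustrative:

for i in $(seq 1 1000); do
	xfs_io -fc "pwrite 0 100k" -c fsync /mnt/tmp/file.$i
done

... and the second case is the same loop with the "-c fsync" dropped.)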

Random thought: perhaps a one-time trim at writeback (where writeback
would convert delalloc across eof) _or_ release time, whichever happens
first on an inode with preallocation, might help mitigate this problem.

> This is mitigated by the fact that workloads like this tend to be
> separated into separate AGs - allocation first attempts to be
> non-blocking which means if it races with a free in progress it will
> skip to the next AG and it won't pack sequentially, anyway. So I
> don't think this is a major concern in terms of free space
> fragmentation - the allocator AG selection makes it impossible to
> predict what happens when racing alloc/frees occur in a single AG
> and we really try to avoid that as much as possible, anyway.
> 

Indeed..

> If it's a separate open/fsync/close pass after all the writes (and
> close) then the behaviour is the same with either the existing code
> or the "trim on first close" behaviour - and the delalloc blocks
> beyond EOF will get killed on the first close and not be present for
> the writeback, and all is good.
> 
> > Wouldn't that still defeat a one-time release heuristic and
> > produce the same layout issues as shown above? We'd prealloc,
> > writeback/convert then trim small/spurious fragments of post-eof space
> > back to the allocator.
> 
> Yes, but as far as I can reason (sanely), "trim on first close" is
> no worse than the existing heuristic in these cases, whilst being
> significantly better in others we have observed in the wild that
> cause problems...
> 

Agreed. Again, I'm thinking about this assuming that fix is already in
place.

> I'm not looking for perfect here, just "better with no obvious
> regressions". We can't predict every situation, so if it deals with
> all the problems we've had reported and a few similar cases we don't
> currently handle as well, then we should run with that and not really
> worry about the cases that it (or the existing code) does not
> solve until we have evidence that those workloads exist and are
> causing real world problems. It's a difficult enough issue to reason
> about without making it more complex by playing "what about" games..
> 

That's fair, but we have had users run into this situation. The whole
sparse inodes thing is partially a workaround for side effects of this
problem (free space fragmentation being so bad we can't allocate
inodes). Granted, some of those users may have also been able to avoid
that problem with better fs usage/configuration.
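
(For context, the sparse inodes feature here is the mkfs.xfs "-i sparse"
option, e.g.:

# mkfs.xfs -i sparse=1 <dev>

... which allows inode chunks to be allocated in smaller pieces when there
isn't enough contiguous free space for a full chunk.)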

> > > > It sounds like what you're saying is that it doesn't really matter
> > > > either way at this point. There's no perf advantage to keeping the eof
> > > > blocks in this scenario, but there's also no real harm in deferring the
> > > > eofb trim of physical post-eof blocks because any future free space
> > > > fragmentation damage has already been done (assuming no more writes come
> > > > in).
> > > 
> > > Essentially, yes.
> > > 
> > > > The thought above was tip-toeing around the idea of (in addition to the
> > > > one-time trim heuristic you mentioned above) never doing a release time
> > > > trim of non-delalloc post-eof blocks.
> > > 
> > > Except we want to trim blocks in the cases where it's a write
> > > once file that has been fsync()d or written by O_DIRECT w/ really
> > > large extent size hints....
> > 
> > We still would trim it. The one time write case is essentially
> > unaffected because the only advantage of that heuristic is to trim eof
> > blocks before they are converted. If the eof blocks are real, the
> > release time heuristic has already failed (i.e., it hasn't provided any
> > benefit that background trim doesn't already provide).
> 
> No, all it means is that the blocks were allocated before the fd was
> closed, not that the release time heuristic failed. The release time
> heuristic is deciding what to do /after/ the writes have been
> completed, whatever the post-eof situation is. It /can't fail/ if it
> hasn't been triggered before physical allocation has been done, it
> can only decide what to do about those extents once it is called...
> 

Confused. By "failed," I mean we physically allocated blocks that were
never intended to be used. This is basically referring to the negative
effect of delalloc conversion -> eof trim behavior on write-once files as
demonstrated above. If this negative effect didn't exist, we wouldn't
need the release time trim at all and could just rely on background
trim.
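
(By background trim I mean the periodic eofblocks gc; IIRC its interval is
the speculative_prealloc_lifetime sysctl, i.e.:

# cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
300

... where 300 seconds is the default.)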

> In which case, if it's a write-once file we want to kill the blocks,
> no matter whether they are allocated or not. And if it's a repeated
> open/X/close workload, then a single removal of EOF blocks won't
> greatly impact the file layout because subsequent closes won't
> trigger.
> 
> IOWs, in the general case it does the right thing, and when it's
> wrong the impact is negated by the fact it will do the right thing
> on all future closes on that inode....
> 
> > IOW, what we really want to avoid is trimming (small batches of) unused
> > physical eof blocks.
> 
> For the general case, I disagree. :)
> 
> > > > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > > > delalloc space accounting. (i.e. allows faster write rates into
> > > > > cache). IOWs, for anything more than really small files, we want
> > > > > to be doing speculative delalloc on the first time the file is
> > > > > written to.
> > > > > 
> > > > 
> > > > Ok, this paper refers to CPU overhead as it contributes to lack of
> > > > scalability.
> > > 
> > > Well, those were the experiments being performed. I'm using
> > > it as an example of how per-write overhead is actually important to
> > > throughput. Ignore the "global lock caused overall throughput
> > > issues" because we don't have that problem any more, and instead
> > > look at it as a demonstration of "anything that slows down a write()
> > > reduces per-thread throughput".
> > > 
> > 
> > Makes sense, and that's how I took it after reading through the paper.
> > My point was just that I think this is more of a tradeoff and caveat to
> > consider than something that outright rules out doing less aggressive
> > preallocation in certain cases.
> > 
> > I ran a few tests yesterday out of curiosity and was able to measure (a
> > small) difference in single-threaded buffered writes to cache with and
> > without preallocation. What I found a bit interesting was that my
> > original attempt to test this actually showed _faster_ throughput
> > without preallocation because the mechanism I happened to use to bypass
> > preallocation was an up front truncate.
> 
> So, like:
> 
> $ xfs_io -f -t -c "pwrite 0 1g" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6582 sec (1.519 GiB/sec and 398224.4691 ops/sec)
> $
> 
> Vs:
> 
> $ xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.7159 sec (1.397 GiB/sec and 366147.4512 ops/sec)
> $
> 
> I get an average of 1.53GiB/s write rate into cache with
> speculative prealloc active, and only 1.36GiB/s write rate in cache
> without it. I've run these 10 times each, ranges are
> 1.519-1.598GiB/s for prealloc and 1.291-1.361GiB/s when using
> truncate to prevent prealloc.
> 

Yes, that's basically the same test. I repeated it again and saw pretty
much the same numbers here. Hmm, not sure what happened there. Perhaps I
fat fingered something or mixed in a run with a different buffer size
for the truncate case. I recall seeing right around ~1.5GB/s for the
base case and closer to ~1.6GB/s for the truncate case somehow or
another, but I don't have a record of it to figure out what happened.
Anyways, sorry for the noise.. I did ultimately reproduce the prealloc
boost and still see the diminishing returns at about 64k or so (around
~2.7GB/s with or without preallocation).
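
(That's just the same write-into-cache test with larger write buffers, i.e.
something along the lines of:

$ xfs_io -f -t -c "pwrite 0 1g -b 64k" -c fsync /mnt/scratch/testfile

... with preallocation controlled via the allocsize mount option as noted
below.)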

> IOWs, on my test machine, the write rate into cache using 4kB writes
> is over 10% faster with prealloc enabled. And to point out the
> diminishing returns, with a write size of 1MB:
> 
> $  xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4975 sec (2.010 GiB/sec and 2058.1922 ops/sec)
> 
> The throughput with or without prealloc averages out to just over
> 2GiB/s and the difference is noise (2.03GiB/s vs 2.05GiB/s)
> 
> Note: I'm just measuring write-in rates here, and I've isolated them
> from writeback completely because the file size is small enough
> that dirty throttling (20% of memory) isn't kicking in because I
> have 16GB of RAM on this machine. If I go over 3GB in file size,
> dirty throttling and writeback kicks in and the results are a
> complete crap-shoot because it's dependent on when writeback kicks
> in and how the writeback rate ramps up in the first second of
> writeback.
> 

*nod*

> > I eventually realized that this
> > had other effects (i.e., one copy doing size updates vs. the other not
> > doing so) and just compared a fixed, full size (1G) preallocation with a
> > fixed 4k preallocation to reproduce the boost provided by the former.
> 
> That only matters for *writeback overhead*, not ingest efficiency.
> Indeed, using a preallocated extent:
> 

Not sure how we got into physical preallocation here. Assume any
reference to "preallocation" in this thread by me refers to post-eof
speculative preallocation. The above preallocation tweaks were
controlled in my tests via the allocsize mount option, not physical
block preallocation.
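
(Concretely, the fixed 4k vs. fixed 1G preallocation above means something
along the lines of:

# mount -o allocsize=4k <dev> /mnt/scratch
# mount -o allocsize=1g <dev> /mnt/scratch

... which pins the post-eof preallocation size rather than letting it scale
dynamically.)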

> $ time sudo xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6610 sec (1.513 GiB/sec and 396557.5926 ops/sec)
> 
> 
> $ xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4946 sec (2.022 GiB/sec and 2070.0795 ops/sec)
> 
> The write-in rates are identical to when speculative prealloc is
> enabled (always hitting a preexisting extent and not having to
> extend an extent on every iomap lookup). If I time the fsync, then
> things are different because there's additional writeback overhead
> (i.e. unwritten extent conversion), but this does not affect the
> per-write cache ingest CPU overhead.
> 
> > The point here is that while there is such a boost, there are also other
> > workload dependent factors that are out of our control. For example,
> > somebody who today cares about preallocation only for a boost on writes
> > to cache can apparently achieve a greater benefit by truncating the file
> > up front and disabling preallocation entirely.
> 
> Can you post your tests and results, because I'm getting very
> different results from running what I think are the same tests.
> 
> > Moving beyond the truncate thing, I also saw the benefit of
> > preallocation diminish as write buffer size was increased. As the write
> > size increases from 4k to around 64k, the pure performance benefit of
> > preallocation trailed off to zero.
> 
> Yes, because we've optimised the overhead out of larger writes with
> the iomap infrastructure. This used to be a whole lot worse when we
> did a delalloc mapping for every page instead of one per write()
> call. IOWs, a significant part of the 30-35% speed increase you see
> in my numbers above going from 4kB writes to 1MB writes is a result
> of iomap dropping the number of delalloc mapping calls by a factor
> of 256....
> 
> > Part of that could also be effects of
> > less frequent size updates and whatnot due to the larger writes, but I
> > also don't think that's an uncommon thing in practice.
> 
> Again, size updates are only done on writeback, not ingest....
> 
> > My only point here is that I don't think it's so cut and dry that we
> > absolutely need dynamic speculative preallocation for write to cache
> > performance.
> 
> History, experience with both embedded and high end NAS machines
> (where per-write CPU usage really matters!) and my own experiments
> tell me a different story :/
> 
> > > > BTW for historical context.. was speculative preallocation a thing when
> > > > this paper was written?
> > > 
> > > Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> > > was first made configurable in XFS via the biosize mount option
> > > added with v3 superblocks in 1997, but the initial linux port only
> > > allowed up to 64k.
> > > 
> > 
> > Hmm, Ok.. so it was originally speculative preallocation without the
> > "dynamic sizing" logic that we have today. Thanks for the background.
> > 
> > > In 2005, the linux mount option allowed biosize to be extended to
> > > 1GB, which made sense because >4GB allocation groups (mkfs enabled
> > > them late 2003) were now starting to be widely used and so users
> > > were reporting new large AG fragmentation issues that had never been
> > > seen before. i.e.  it was now practical to have contiguous multi-GB
> > > extents in files and the delalloc code was struggling to create
> > > them, so having EOF-prealloc be able to make use of that capability
> > > was needed....
> > > 
> > > And then auto-tuning made sense because more and more people were
> > > having to use the mount option in more general workloads to avoid
> > > fragmentation.
> > > 
> > 
> > "auto-tuning" means "dynamic sizing" here, yes?
> 
> Yes.
> 
> > FWIW, much of this
> > discussion also makes me wonder how appropriate the current size limit
> > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > availability), as opposed to something larger (on the order of hundreds
> > of MBs for example, perhaps 256-512MB).
> 
> RAM size is irrelevant. What matters is file size and the impact
> of allocation patterns on writeback IO patterns. i.e. the size limit
> is about optimising writeback, not preventing fragmentation or
> making more efficient use of memory, etc.
> 

I'm just suggesting that the more RAM that is available, the more we're
able to write into cache before writeback starts and thus the larger
physical extents we're able to allocate independent of speculative
preallocation (perf issues notwithstanding).

> i.e. when we have lots of small files, we want writeback to pack
> them so we get multiple-file sequentialisation of the write stream -
> this makes things like untarring a kernel tarball (which is a large
> number of small files) a sequential write workload rather than a
> seek-per-individual-file-write workload. That makes a massive
> difference to performance on spinning disks, and that's what the 64k
> threshold (and post-EOF block removal on close for larger files)
> tries to preserve.
> 

Sure, but what's the downside to increasing that threshold to even
something on the order of MBs? Wouldn't that at least help us leave
larger free extents around in those workloads/patterns that do fragment
free space?

> Realistically, we should probably change that 64k threshold to match
> sunit if it is set, so that we really do end up trying to pack
> any write-once file smaller than sunit as the allocator won't try
> to align them, anyway....
> 

That makes sense.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
