* [RFC PATCH 0/3]: Extreme fragmentation ahoy!
@ 2019-02-07  5:08 Dave Chinner
  2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

I've just finished analysing an IO trace from an application
generating an extreme filesystem fragmentation problem that started
with extent size hints and ended with spurious ENOSPC reports due to
massively fragmented files and free space. While the ENOSPC issue
looks to have previously been solved, I still wanted to understand
how the application had so comprehensively defeated extent size
hints as a method of avoiding file fragmentation.

The key behaviour that I discovered was that specific "append write
only" files that had extent size hints to prevent fragmentation
weren't actually write only.  The application didn't do a lot of
writes to the file, but it kept the file open and appended to the
file (from the traces I have) in chunks of between ~3000 bytes and
~160000 bytes. This didn't explain the problem. I did notice that
the files were opened O_SYNC, however.
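To make that pattern concrete, the writer side looks roughly like the
sketch below. This is reconstructed from the trace description, not
taken from the application; the path, chunk sizes and cadence are
illustrative only.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char	buf[160000];
	int		fd;

	memset(buf, 'x', sizeof(buf));

	/* opened once, kept open, appended to with O_SYNC writes */
	fd = open("/mnt/scratch/app.log",
		  O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
	if (fd < 0)
		return 1;

	for (;;) {
		/* appends of between ~3000 and ~160000 bytes */
		size_t len = 3000 + (size_t)(rand() % 157000);

		if (write(fd, buf, len) < 0)
			break;
		sleep(1);	/* actual cadence unknown */
	}
	close(fd);
	return 0;
}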

I then found another process that, once every second, opened the
log file O_RDONLY, read 28 bytes from offset zero, then closed the
file. Every second. IOWs, between every appending write that would
allocate an extent size hint worth of space beyond EOF and then
write a small chunk of it, there were numerous open/read/close
cycles being done on the same file.
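And the once-per-second reader, again reconstructed from the trace
with an illustrative path:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char	hdr[28];
	int	fd;

	for (;;) {
		/* open O_RDONLY, read 28 bytes from offset zero, close */
		fd = open("/mnt/scratch/app.log", O_RDONLY);
		if (fd >= 0) {
			ssize_t n = pread(fd, hdr, sizeof(hdr), 0);

			(void)n;	/* contents don't matter here */
			close(fd);	/* -> xfs_release() */
		}
		sleep(1);
	}
	return 0;
}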

And what do we do on close()? We call xfs_release() and that can
truncate away blocks beyond EOF. For some reason the close wasn't
triggering the IDIRTY_RELEASE heuristic that prevented close from
removing EOF blocks prematurely. Then I realised that O_SYNC writes
don't leave delayed allocation blocks behind - they are always
converted in the context of the write. That's why it wasn't
triggering, and that meant that the open/read/close cycle was
removing the extent size hint allocation beyond EOF prematurely.

Then it occurred to me that extent size hints don't use delalloc
either, so they behave the same way as O_SYNC writes in this
situation.

Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
modify the file without having write permissions.

I suspect there are more cases like this involving repeated
open/<do_something>/close operations on a file that is being
written, but the patches address just the cases described above.
The test script to reproduce them is below. Fragmentation
reduction results are in the commit descriptions. The series has
been running through fstests for a couple of hours now; no issues
have been noticed yet.

FWIW, I suspect we need to have a good hard think about whether we
should be trimming EOF blocks on close by default, or whether we
should only be doing it in very limited situations....

Comments, thoughts, flames welcome.

-Dave.


#!/bin/bash
#
# Test 1
#
# Write multiple files in parallel using synchronous buffered writes. Aim is to
# interleave allocations to fragment the files. Synchronous writes defeat the
# open/write/close heuristics in xfs_release() that prevent EOF block removal,
# so this should fragment badly.

workdir=/mnt/scratch
nfiles=8
wsize=4096
wcnt=1000

echo
echo "Test 1: sync write fragmentation counts"
echo
write_sync_file()
{
	idx=$1

	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -s -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$nfiles; n++)); do
	write_sync_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;


# Test 2
#
# Same as test 1, but instead of sync writes, use extent size hints to defeat
# the open/write/close heuristic

extent_size=16m

echo
echo "Test 2: Extent size hint fragmentation counts"
echo

write_extsz_file()
{
	idx=$1

	xfs_io -f -c "extsize $extent_size" $workdir/file.$idx
	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$nfiles; n++)); do
	write_extsz_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;



# Test 3
#
# Same as test 2, but instead of extent size hints, use open/read/close loops
# on the files to remove EOF blocks.

echo
echo "Test 3: Open/read/close loop fragmentation counts"
echo

write_file()
{
	idx=$1

	xfs_io -f -s -c "pwrite -b 64k 0 50m" $workdir/file.$idx
}

read_file()
{
	idx=$1

	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -r -c "pread 0 28" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$((nfiles * 4)); n++)); do
	write_file $n > /dev/null 2>&1 &
	read_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;


* [PATCH 1/3] xfs: Don't free EOF blocks on sync write close
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
@ 2019-02-07  5:08 ` Dave Chinner
  2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we have a workload that does open/sync write/close in parallel
with other allocation, the file becomes rapidly fragmented. This is
due to close() calling xfs_release() and removing the speculative
preallocation beyond EOF.

The existing open/write/close heuristic in xfs_release() does not
catch this because sync writes do not leave delayed allocation blocks
on the inode for later writeback that xfs_release() could detect, and
hence XFS_IDIRTY_RELEASE never gets set.

In xfs_file_release() we can tell whether the release context was a
synchronous write context, so communicate this to xfs_release() and
have it skip EOF block truncation in that case. This defers EOF block
cleanup for synchronous write contexts to the background EOF block
cleaner, which will clean it up within a few minutes.
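Note that testing f_flags for O_DSYNC also covers files opened O_SYNC
(the case described in the cover letter): on Linux O_SYNC is defined
as __O_SYNC|O_DSYNC, so an O_SYNC open always has the O_DSYNC bit
set. A trivial userspace check of that assumption:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	/* On Linux, O_SYNC == __O_SYNC | O_DSYNC, so the O_DSYNC check
	 * added below fires for both O_DSYNC and O_SYNC opens. */
	printf("O_DSYNC          = %#x\n", O_DSYNC);
	printf("O_SYNC           = %#x\n", O_SYNC);
	printf("O_SYNC & O_DSYNC = %#x\n", O_SYNC & O_DSYNC);
	return 0;
}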

Before:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 919
/mnt/scratch/file.1: 916
/mnt/scratch/file.2: 919
/mnt/scratch/file.3: 920
/mnt/scratch/file.4: 920
/mnt/scratch/file.5: 921
/mnt/scratch/file.6: 916
/mnt/scratch/file.7: 918

After:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 24
/mnt/scratch/file.1: 24
/mnt/scratch/file.2: 11
/mnt/scratch/file.3: 24
/mnt/scratch/file.4: 3
/mnt/scratch/file.5: 24
/mnt/scratch/file.6: 24
/mnt/scratch/file.7: 23

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 15 +++++++++++++--
 fs/xfs/xfs_inode.c |  9 +++++----
 fs/xfs/xfs_inode.h |  2 +-
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..02f76b8e6c03 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1019,12 +1019,23 @@ xfs_dir_open(
 	return error;
 }
 
+/*
+ * When we release the file, we don't want it to trim EOF blocks for synchronous
+ * write contexts as this leads to severe fragmentation when applications do
+ * repeated open/appending sync write/close to a file amongst other file IO.
+ */
 STATIC int
 xfs_file_release(
 	struct inode	*inode,
-	struct file	*filp)
+	struct file	*file)
 {
-	return xfs_release(XFS_I(inode));
+	bool		free_eof_blocks = true;
+
+	if ((file->f_mode & FMODE_WRITE) &&
+	    (file->f_flags & O_DSYNC))
+		free_eof_blocks = false;
+
+	return xfs_release(XFS_I(inode), free_eof_blocks);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ae667ba74a1c..a74dc240697f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1603,10 +1603,11 @@ xfs_itruncate_extents_flags(
 
 int
 xfs_release(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip,
+	bool			can_free_eofblocks)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	int		error;
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
 	if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0))
 		return 0;
@@ -1642,7 +1643,7 @@ xfs_release(
 	if (VFS_I(ip)->i_nlink == 0)
 		return 0;
 
-	if (xfs_can_free_eofblocks(ip, false)) {
+	if (can_free_eofblocks && xfs_can_free_eofblocks(ip, false)) {
 
 		/*
 		 * Check if the inode is being opened, written and closed
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index be2014520155..7e59b0e086d7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -397,7 +397,7 @@ enum layout_break_reason {
 	(((pip)->i_mount->m_flags & XFS_MOUNT_GRPID) || \
 	 (VFS_I(pip)->i_mode & S_ISGID))
 
-int		xfs_release(struct xfs_inode *ip);
+int		xfs_release(struct xfs_inode *ip, bool can_free_eofblocks);
 void		xfs_inactive(struct xfs_inode *ip);
 int		xfs_lookup(struct xfs_inode *dp, struct xfs_name *name,
 			   struct xfs_inode **ipp, struct xfs_name *ci_name);
-- 
2.20.1


* [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
  2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
@ 2019-02-07  5:08 ` Dave Chinner
  2019-02-07 15:51   ` Brian Foster
  2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

When we have a workload that does open/write/close on files with
extent size hints set in parallel with other allocation, the file
becomes rapidly fragmented. This is due to close() calling
xfs_release() and removing the preallocated extent beyond EOF.  This
occurs for both buffered and direct writes that append to files with
extent size hints.

The existing open/write/close heuristic in xfs_release() does not
catch this as writes to files using extent size hints do not use
delayed allocation and hence do not leave delayed allocation blocks
allocated on the inode that can be detected in xfs_release(). Hence
XFS_IDIRTY_RELEASE never gets set.

At release time we can tell whether the inode has extent size hints
set and skip EOF block truncation in that case. We add this check to
xfs_can_free_eofblocks() so that the post-EOF preallocated extent is
treated like intentional preallocation and is persistent unless
directly removed by userspace.
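For reference, the extent size hints in question are the per-inode
hints set with xfs_io's "extsize" command, as in the cover letter's
test script, or directly via the FS_IOC_FSSETXATTR ioctl. A minimal
illustrative sketch follows; the path is arbitrary, the 16MB hint
mirrors the test script, and the hint is normally set while the file
is still empty.

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	struct fsxattr	fsx;
	int		fd;

	/* set the hint before any data is written to the file */
	fd = open("/mnt/scratch/file.0", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return 1;

	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = 16 * 1024 * 1024;	/* hint is in bytes */

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		perror("FS_IOC_FSSETXATTR");

	close(fd);
	return 0;
}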

Before:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 1002
/mnt/scratch/file.1: 1002
/mnt/scratch/file.2: 1002
/mnt/scratch/file.3: 1002
/mnt/scratch/file.4: 1002
/mnt/scratch/file.5: 1002
/mnt/scratch/file.6: 1002
/mnt/scratch/file.7: 1002

After:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 4
/mnt/scratch/file.1: 4
/mnt/scratch/file.2: 4
/mnt/scratch/file.3: 4
/mnt/scratch/file.4: 4
/mnt/scratch/file.5: 4
/mnt/scratch/file.6: 4
/mnt/scratch/file.7: 4

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1ee8c5539fa4..98e5e305b789 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -761,12 +761,15 @@ xfs_can_free_eofblocks(struct xfs_inode *ip, bool force)
 		return false;
 
 	/*
-	 * Do not free real preallocated or append-only files unless the file
-	 * has delalloc blocks and we are forced to remove them.
+	 * Do not free extent size hints, real preallocated or append-only files
+	 * unless the file has delalloc blocks and we are forced to remove
+	 * them.
 	 */
-	if (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
+	if (xfs_get_extsz_hint(ip) ||
+	    (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))) {
 		if (!force || ip->i_delayed_blks == 0)
 			return false;
+	}
 
 	return true;
 }
-- 
2.20.1


* [PATCH 3/3] xfs: Don't free EOF blocks on sync write close
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
  2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
  2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
@ 2019-02-07  5:08 ` Dave Chinner
  2019-02-07  5:19   ` Dave Chinner
  2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
  2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
  4 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we have a workload that does open/read/close in parallel with
other synchronous buffered writes to long term open files, the file
becomes rapidly fragmented. This is due to close() after read
calling xfs_release() and removing the speculative preallocation
beyond EOF.

The existing open/write/close heuristic in xfs_release() does not
catch this because sync writes do not leave delayed allocation blocks
on the inode for later writeback that xfs_release() could detect, and
hence XFS_IDIRTY_RELEASE never gets set.

Further, the close context here is for a file opened O_RDONLY, and
so /modifying/ the file metadata on close doesn't pass muster.
Fortunately, we can tell in xfs_file_release() whether the release
context was a read-only context, and so we need to communicate this
to xfs_release() so it can do the right thing here and skip EOF
block truncation, hence ensuring that only contexts with write
permissions will remove post-EOF blocks from the file.

Before:

Test 3: Open/read/close loop fragmentation counts

/mnt/scratch/file.0: 150
/mnt/scratch/file.1: 342
/mnt/scratch/file.2: 113
/mnt/scratch/file.3: 165
/mnt/scratch/file.4: 86
/mnt/scratch/file.5: 363
/mnt/scratch/file.6: 129
/mnt/scratch/file.7: 233

After:

Test 3: Open/read/close loop fragmentation counts

/mnt/scratch/file.0: 12
/mnt/scratch/file.1: 12
/mnt/scratch/file.2: 12
/mnt/scratch/file.3: 12
/mnt/scratch/file.4: 12
/mnt/scratch/file.5: 12
/mnt/scratch/file.6: 12
/mnt/scratch/file.7: 12

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 02f76b8e6c03..e2d8a0b7f891 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1023,6 +1023,10 @@ xfs_dir_open(
  * When we release the file, we don't want it to trim EOF blocks for synchronous
  * write contexts as this leads to severe fragmentation when applications do
  * repeated open/appending sync write/close to a file amongst other file IO.
+ *
+ * We also don't want to trim the EOF blocks if it is a read only context. This
+ * prevents open/read/close workloads from removing EOF blocks that other
+ * writers are depending on to prevent fragmentation.
  */
 STATIC int
 xfs_file_release(
@@ -1031,8 +1035,9 @@ xfs_file_release(
 {
 	bool		free_eof_blocks = true;
 
-	if ((file->f_mode & FMODE_WRITE) &&
-	    (file->f_flags & O_DSYNC))
+	if ((file->f_mode & (FMODE_WRITE|FMODE_READ)) == FMODE_READ)
+		free_eof_blocks = false;
+	else if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_DSYNC))
 		free_eof_blocks = false;
 
 	return xfs_release(XFS_I(inode), free_eof_blocks);
-- 
2.20.1


* Re: [PATCH 3/3] xfs: Don't free EOF blocks on sync write close
  2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
@ 2019-02-07  5:19   ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:19 UTC (permalink / raw)
  To: linux-xfs


Ugh, forgot to rename the patch. Should be:

Subject: [PATCH 0/3] xfs: Don't free EOF blocks on O_RDONLY close

On Thu, Feb 07, 2019 at 04:08:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we have a workload that does open/read/close in parallel with
> other synchronous buffered writes to long term open files, the file
> becomes rapidly fragmented. This is due to close() after read
> calling xfs_release() and removing the speculative preallocation
> beyond EOF.
> 
> The existing open/write/close hueristic in xfs_release() does not
> catch this as sync writes do not leave delayed allocation blocks
> allocated on the inode for later writeback that can be detected in
> xfs_release() and hence XFS_IDIRTY_RELEASE never gets set.
> 
> Further, the close context here is for a file opened O_RDONLY, and
> so /modifying/ the file metadata on close doesn't pass muster.
> Fortunately, we can tell in xfs_file_release() whether the release
> context was a read-only context, and so we need to communicate this
> to xfs_release() so it can do the right thing here and skip EOF
> block truncation, hence ensuring that only contexts with write
> permissions will remove post-EOF blocks from the file.
> 
> Before:
> 
> Test 3: Open/read/close loop fragmentation counts
> 
> /mnt/scratch/file.0: 150
> /mnt/scratch/file.1: 342
> /mnt/scratch/file.2: 113
> /mnt/scratch/file.3: 165
> /mnt/scratch/file.4: 86
> /mnt/scratch/file.5: 363
> /mnt/scratch/file.6: 129
> /mnt/scratch/file.7: 233
> 
> After:
> 
> Test 3: Open/read/close loop fragmentation counts
> 
> /mnt/scratch/file.0: 12
> /mnt/scratch/file.1: 12
> /mnt/scratch/file.2: 12
> /mnt/scratch/file.3: 12
> /mnt/scratch/file.4: 12
> /mnt/scratch/file.5: 12
> /mnt/scratch/file.6: 12
> /mnt/scratch/file.7: 12
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 02f76b8e6c03..e2d8a0b7f891 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1023,6 +1023,10 @@ xfs_dir_open(
>   * When we release the file, we don't want it to trim EOF blocks for synchronous
>   * write contexts as this leads to severe fragmentation when applications do
>   * repeated open/appending sync write/close to a file amongst other file IO.
> + *
> + * We also don't want to trim the EOF blocks if it is a read only context. This
> + * prevents open/read/close workloads from removing EOF blocks that other
> + * writers are depending on to prevent fragmentation.
>   */
>  STATIC int
>  xfs_file_release(
> @@ -1031,8 +1035,9 @@ xfs_file_release(
>  {
>  	bool		free_eof_blocks = true;
>  
> -	if ((file->f_mode & FMODE_WRITE) &&
> -	    (file->f_flags & O_DSYNC))
> +	if ((file->f_mode & FMODE_WRITE|FMODE_READ) == FMODE_READ)
> +		free_eof_blocks = false;
> +	else if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_DSYNC))
>  		free_eof_blocks = false;
>  
>  	return xfs_release(XFS_I(inode), free_eof_blocks);
> -- 
> 2.20.1
> 
> 

-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
                   ` (2 preceding siblings ...)
  2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
@ 2019-02-07  5:21 ` Darrick J. Wong
  2019-02-07  5:39   ` Dave Chinner
  2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
  4 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2019-02-07  5:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> I've just finished analysing an IO trace from a application
> generating an extreme filesystem fragmentation problem that started
> with extent size hints and ended with spurious ENOSPC reports due to
> massively fragmented files and free space. While the ENOSPC issue
> looks to have previously been solved, I still wanted to understand
> how the application had so comprehensively defeated extent size
> hints as a method of avoiding file fragmentation.
> 
> The key behaviour that I discovered was that specific "append write
> only" files that had extent size hints to prevent fragmentation
> weren't actually write only.  The application didn't do a lot of
> writes to the file, but it kept the file open and appended to the
> file (from the traces I have) in chunks of between ~3000 bytes and
> ~160000 bytes. This didn't explain the problem. I did notice that
> the files were opened O_SYNC, however.
> 
> I then found was another process that, once every second, opened the
> log file O_RDONLY, read 28 bytes from offset zero, then closed the
> file. Every second. IOWs, between every appending write that would
> allocate an extent size hint worth of space beyond EOF and then
> write a small chunk of it, there were numerous open/read/close
> cycles being done on the same file.
> 
> And what do we do on close()? We call xfs_release() and that can
> truncate away blocks beyond EOF. For some reason the close wasn't
> triggering the IDIRTY_RELEASE heuristic that preventd close from
> removing EOF blocks prematurely. Then I realised that O_SYNC writes
> don't leave delayed allocation blocks behind - they are always
> converted in the context of the write. That's why it wasn't
> triggering, and that meant that the open/read/close cycle was
> removing the extent size hint allocation beyond EOF prematurely.

<urk>

> Then it occurred to me that extent size hints don't use delalloc
> either, so they behave the same was as O_SYNC writes in this
> situation.
> 
> Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> modify the file without having write permissions.

Yikes!

> I suspect there's more cases like this when combined with repeated
> open/<do_something>/close operations on a file that is being
> written, but the patches address just these ones I just talked
> about. The test script to reproduce them is below. Fragmentation
> reduction results are in the commit descriptions. It's running
> through fstests for a couple of hours now, no issues have been
> noticed yet.
> 
> FWIW, I suspect we need to have a good hard think about whether we
> should be trimming EOF blocks on close by default, or whether we
> should only be doing it in very limited situations....
> 
> Comments, thoughts, flames welcome.
> 
> -Dave.
> 
> 
> #!/bin/bash
> #
> # Test 1

Can you please turn these into fstests to cause the maintainer maximal
immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
for regression testing and finding whatever other problems we can find
from digging deeper? :)

--D

> #
> # Write multiple files in parallel using synchronous buffered writes. Aim is to
> # interleave allocations to fragment the files. Synchronous writes defeat the
> # open/write/close heuristics in xfs_release() that prevent EOF block removal,
> # so this should fragment badly.
> 
> workdir=/mnt/scratch
> nfiles=8
> wsize=4096
> wcnt=1000
> 
> echo
> echo "Test 1: sync write fragmentation counts"
> echo
> write_sync_file()
> {
> 	idx=$1
> 
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -s -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$nfiles; n++)); do
> 	write_sync_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 
> # Test 2
> #
> # Same as test 1, but instead of sync writes, use extent size hints to defeat
> # the open/write/close heuristic
> 
> extent_size=16m
> 
> echo
> echo "Test 2: Extent size hint fragmentation counts"
> echo
> 
> write_extsz_file()
> {
> 	idx=$1
> 
> 	xfs_io -f -c "extsize $extent_size" $workdir/file.$idx
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$nfiles; n++)); do
> 	write_extsz_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 
> 
> # Test 3
> #
> # Same as test 2, but instead of extent size hints, use open/read/close loops
> # on the files to remove EOF blocks.
> 
> echo
> echo "Test 3: Open/read/close loop fragmentation counts"
> echo
> 
> write_file()
> {
> 	idx=$1
> 
> 	xfs_io -f -s -c "pwrite -b 64k 0 50m" $workdir/file.$idx
> }
> 
> read_file()
> {
> 	idx=$1
> 
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -r -c "pread 0 28" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$((nfiles * 4)); n++)); do
> 	write_file $n > /dev/null 2>&1 &
> 	read_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
@ 2019-02-07  5:39   ` Dave Chinner
  2019-02-07 15:52     ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > I've just finished analysing an IO trace from a application
> > generating an extreme filesystem fragmentation problem that started
> > with extent size hints and ended with spurious ENOSPC reports due to
> > massively fragmented files and free space. While the ENOSPC issue
> > looks to have previously been solved, I still wanted to understand
> > how the application had so comprehensively defeated extent size
> > hints as a method of avoiding file fragmentation.
> > 
> > The key behaviour that I discovered was that specific "append write
> > only" files that had extent size hints to prevent fragmentation
> > weren't actually write only.  The application didn't do a lot of
> > writes to the file, but it kept the file open and appended to the
> > file (from the traces I have) in chunks of between ~3000 bytes and
> > ~160000 bytes. This didn't explain the problem. I did notice that
> > the files were opened O_SYNC, however.
> > 
> > I then found was another process that, once every second, opened the
> > log file O_RDONLY, read 28 bytes from offset zero, then closed the
> > file. Every second. IOWs, between every appending write that would
> > allocate an extent size hint worth of space beyond EOF and then
> > write a small chunk of it, there were numerous open/read/close
> > cycles being done on the same file.
> > 
> > And what do we do on close()? We call xfs_release() and that can
> > truncate away blocks beyond EOF. For some reason the close wasn't
> > triggering the IDIRTY_RELEASE heuristic that preventd close from
> > removing EOF blocks prematurely. Then I realised that O_SYNC writes
> > don't leave delayed allocation blocks behind - they are always
> > converted in the context of the write. That's why it wasn't
> > triggering, and that meant that the open/read/close cycle was
> > removing the extent size hint allocation beyond EOF prematurely.
> 
> <urk>
> 
> > Then it occurred to me that extent size hints don't use delalloc
> > either, so they behave the same was as O_SYNC writes in this
> > situation.
> > 
> > Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> > modify the file without having write permissions.
> 
> Yikes!
> 
> > I suspect there's more cases like this when combined with repeated
> > open/<do_something>/close operations on a file that is being
> > written, but the patches address just these ones I just talked
> > about. The test script to reproduce them is below. Fragmentation
> > reduction results are in the commit descriptions. It's running
> > through fstests for a couple of hours now, no issues have been
> > noticed yet.
> > 
> > FWIW, I suspect we need to have a good hard think about whether we
> > should be trimming EOF blocks on close by default, or whether we
> > should only be doing it in very limited situations....
> > 
> > Comments, thoughts, flames welcome.
> > 
> > -Dave.
> > 
> > 
> > #!/bin/bash
> > #
> > # Test 1
> 
> Can you please turn these into fstests to cause the maintainer maximal
> immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
> for regression testing and finding whatever other problems we can find
> from digging deeper? :)

I will, but not today - I only understood the cause well enough to
write a prototype reproducer about 4 hours ago. The rest of the time
since then has been fixing the issues and running smoke tests. My
brain is about fried now....

FWIW, I think the scope of the problem is quite widespread -
anything that does open/something/close repeatedly on a file that is
being written to with O_DSYNC or O_DIRECT appending writes will kill
the post-eof extent size hint allocated space. That's why I suspect
we need to think about not trimming by default and trying to
enumerate only the cases that need to trim eof blocks.

e.g. I closed the O_RDONLY case, but O_RDWR/read/close in a loop
will still trigger removal of post EOF extent size hint
preallocation and hence severe fragmentation.
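For illustration (path made up), that loop looks like the sketch
below; O_RDWR sets FMODE_WRITE, so the read-only check added in
patch 3 doesn't apply even though nothing is ever written:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char	hdr[28];
	int	fd;

	for (;;) {
		/* FMODE_READ|FMODE_WRITE are set, but we never write */
		fd = open("/mnt/scratch/app.log", O_RDWR);
		if (fd >= 0) {
			ssize_t n = pread(fd, hdr, sizeof(hdr), 0);

			(void)n;
			/* close -> xfs_release() with write permission,
			 * so post-EOF preallocation can still be trimmed */
			close(fd);
		}
		sleep(1);
	}
	return 0;
}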

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set
  2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
@ 2019-02-07 15:51   ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2019-02-07 15:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 07, 2019 at 04:08:12PM +1100, Dave Chinner wrote:
> When we have a workload that does open/write/close on files with
> extent size hints set in parallel with other allocation, the file
> becomes rapidly fragmented. This is due to close() calling
> xfs_release() and removing the preallocated extent beyond EOF.  This
> occurs for both buffered and direct writes that append to files with
> extent size hints.
> 
> The existing open/write/close hueristic in xfs_release() does not
> catch this as writes to files using extent size hints do not use
> delayed allocation and hence do not leave delayed allocation blocks
> allocated on the inode that can be detected in xfs_release(). Hence
> XFS_IDIRTY_RELEASE never gets set.
> 
> In xfs_file_release(), we can tell whether the inode has extent size
> hints set and skip EOF block truncation. We add this check to
> xfs_can_free_eofblocks() so that we treat the post-EOF preallocated
> extent like intentional preallocation and so are persistent unless
> directly removed by userspace.
> 
> Before:
> 
> Test 2: Extent size hint fragmentation counts
> 
> /mnt/scratch/file.0: 1002
> /mnt/scratch/file.1: 1002
> /mnt/scratch/file.2: 1002
> /mnt/scratch/file.3: 1002
> /mnt/scratch/file.4: 1002
> /mnt/scratch/file.5: 1002
> /mnt/scratch/file.6: 1002
> /mnt/scratch/file.7: 1002
> 
> After:
> 
> Test 2: Extent size hint fragmentation counts
> 
> /mnt/scratch/file.0: 4
> /mnt/scratch/file.1: 4
> /mnt/scratch/file.2: 4
> /mnt/scratch/file.3: 4
> /mnt/scratch/file.4: 4
> /mnt/scratch/file.5: 4
> /mnt/scratch/file.6: 4
> /mnt/scratch/file.7: 4
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 1ee8c5539fa4..98e5e305b789 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -761,12 +761,15 @@ xfs_can_free_eofblocks(struct xfs_inode *ip, bool force)
>  		return false;
>  
>  	/*
> -	 * Do not free real preallocated or append-only files unless the file
> -	 * has delalloc blocks and we are forced to remove them.
> +	 * Do not free extent size hints, real preallocated or append-only files
> +	 * unless the file has delalloc blocks and we are forced to remove
> +	 * them.
>  	 */
> -	if (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
> +	if (xfs_get_extsz_hint(ip) ||
> +	    (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))) {
>  		if (!force || ip->i_delayed_blks == 0)
>  			return false;
> +	}

Note that this will affect the background eofblocks scanner as well such
that we'll no longer ever trim files with an extent size hint. I'm not
saying that's necessarily a problem, but it should at minimum be
discussed in the commit log (which currently only refers to the more
problematic release context).

The consideration to be made is that this could affect the ability to
reclaim post-eof space on -ENOSPC mitigating eofblocks scans in cases
where there are large extent size hints (or many files with smaller
extsz hints, etc.). That might be something worth trying to accommodate
one way or another since it's slightly inconsistent behavior. Consider
that an unmount -> reclaim induced force eofb trim -> mount would
suddenly free up a bunch of space that the eofblocks scan didn't, for
example. This is already the case for preallocated files of course, so
this may very well be reasonable enough for extsz hints as well. What
might also be interesting is considering whether it's worth further
differentiating an -ENOSPC scan from a typical background scan to allow
the former to behave a bit more like reclaim in this regard.

Brian

>  
>  	return true;
>  }
> -- 
> 2.20.1
> 


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07  5:39   ` Dave Chinner
@ 2019-02-07 15:52     ` Brian Foster
  2019-02-08  2:47       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-07 15:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > I've just finished analysing an IO trace from a application
> > > generating an extreme filesystem fragmentation problem that started
> > > with extent size hints and ended with spurious ENOSPC reports due to
> > > massively fragmented files and free space. While the ENOSPC issue
> > > looks to have previously been solved, I still wanted to understand
> > > how the application had so comprehensively defeated extent size
> > > hints as a method of avoiding file fragmentation.
> > > 
> > > The key behaviour that I discovered was that specific "append write
> > > only" files that had extent size hints to prevent fragmentation
> > > weren't actually write only.  The application didn't do a lot of
> > > writes to the file, but it kept the file open and appended to the
> > > file (from the traces I have) in chunks of between ~3000 bytes and
> > > ~160000 bytes. This didn't explain the problem. I did notice that
> > > the files were opened O_SYNC, however.
> > > 
> > > I then found was another process that, once every second, opened the
> > > log file O_RDONLY, read 28 bytes from offset zero, then closed the
> > > file. Every second. IOWs, between every appending write that would
> > > allocate an extent size hint worth of space beyond EOF and then
> > > write a small chunk of it, there were numerous open/read/close
> > > cycles being done on the same file.
> > > 
> > > And what do we do on close()? We call xfs_release() and that can
> > > truncate away blocks beyond EOF. For some reason the close wasn't
> > > triggering the IDIRTY_RELEASE heuristic that preventd close from
> > > removing EOF blocks prematurely. Then I realised that O_SYNC writes
> > > don't leave delayed allocation blocks behind - they are always
> > > converted in the context of the write. That's why it wasn't
> > > triggering, and that meant that the open/read/close cycle was
> > > removing the extent size hint allocation beyond EOF prematurely.
> > 
> > <urk>
> > 
> > > Then it occurred to me that extent size hints don't use delalloc
> > > either, so they behave the same was as O_SYNC writes in this
> > > situation.
> > > 
> > > Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> > > modify the file without having write permissions.
> > 
> > Yikes!
> > 
> > > I suspect there's more cases like this when combined with repeated
> > > open/<do_something>/close operations on a file that is being
> > > written, but the patches address just these ones I just talked
> > > about. The test script to reproduce them is below. Fragmentation
> > > reduction results are in the commit descriptions. It's running
> > > through fstests for a couple of hours now, no issues have been
> > > noticed yet.
> > > 
> > > FWIW, I suspect we need to have a good hard think about whether we
> > > should be trimming EOF blocks on close by default, or whether we
> > > should only be doing it in very limited situations....
> > > 
> > > Comments, thoughts, flames welcome.
> > > 
> > > -Dave.
> > > 
> > > 
> > > #!/bin/bash
> > > #
> > > # Test 1
> > 
> > Can you please turn these into fstests to cause the maintainer maximal
> > immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
> > for regression testing and finding whatever other problems we can find
> > from digging deeper? :)
> 
> I will, but not today - I only understood the cause well enough to
> write a prototype reproducer about 4 hours ago. The rest of the time
> since then has been fixing the issues and running smoke tests. My
> brain is about fried now....
> 
> FWIW, I think the scope of the problem is quite widespread -
> anything that does open/something/close repeatedly on a file that is
> being written to with O_DSYNC or O_DIRECT appending writes will kill
> the post-eof extent size hint allocated space. That's why I suspect
> we need to think about not trimming by default and trying to
> enumerating only the cases that need to trim eof blocks.
> 

To further this point.. I think the eofblocks scanning stuff came long
after the speculative preallocation code and associated release time
post-eof truncate. I think the background scanning was initially an
enhancement to deal with things like the dirty release optimization
leaving these blocks around longer and being able to free up this
accumulated space when we're at -ENOSPC conditions. Now that we have the
scanning mechanism in place (and a 5 minute default background scan,
which really isn't all that long), it might be reasonable to just drop
the release time truncate completely and only trim post-eof blocks via
the bg scan or reclaim paths.

Brian

> e.g. I closed the O_RDONLY case, but O_RDWR/read/close in a loop
> will still trigger removal of post EOF extent size hint
> preallocation and hence severe fragmentation.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07 15:52     ` Brian Foster
@ 2019-02-08  2:47       ` Dave Chinner
  2019-02-08 12:34         ` Brian Foster
  2019-02-08 16:29         ` Darrick J. Wong
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-08  2:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > I've just finished analysing an IO trace from a application
> > > > generating an extreme filesystem fragmentation problem that started
> > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > massively fragmented files and free space. While the ENOSPC issue
> > > > looks to have previously been solved, I still wanted to understand
> > > > how the application had so comprehensively defeated extent size
> > > > hints as a method of avoiding file fragmentation.
....
> > FWIW, I think the scope of the problem is quite widespread -
> > anything that does open/something/close repeatedly on a file that is
> > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > the post-eof extent size hint allocated space. That's why I suspect
> > we need to think about not trimming by default and trying to
> > enumerating only the cases that need to trim eof blocks.
> > 
> 
> To further this point.. I think the eofblocks scanning stuff came long
> after the speculative preallocation code and associated release time
> post-eof truncate.

Yes, I cribbed a bit of the history of the xfs_release() behaviour
on #xfs yesterday afternoon:

 <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
 <dchinner>	historic. People doing operations then complaining du didn't match ls
 <dchinner>	stuff like that
 <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
 <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
 <dchinner>	(i.e. that's how we used to make nfs not suck)
 <dchinner>	that's when we started doing work in ->release
 <dchinner>	it was close enough to "last close" for most workloads it made no difference.
 <dchinner>	Except for concurrent NFS writes into the same directory
 <dchinner>	and now there's another pathological application that triggers problems
 <dchinner>	The NFS exception was prior to having the background reaper
 <dchinner>	as these things go, the background reaper is relatively recent functionality
 <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all

> I think the background scanning was initially an
> enhancement to deal with things like the dirty release optimization
> leaving these blocks around longer and being able to free up this
> accumulated space when we're at -ENOSPC conditions.

Yes, amongst other things like slow writes keeping the file open
forever.....

> Now that we have the
> scanning mechanism in place (and a 5 minute default background scan,
> which really isn't all that long), it might be reasonable to just drop
> the release time truncate completely and only trim post-eof blocks via
> the bg scan or reclaim paths.

Yeah, that's kinda the question I'm asking here. What's the likely
impact of not trimming EOF blocks at least on close apart from
people complaining about df/ls not matching du?

I don't really care about that anymore because, well, reflink/dedupe
completely break any remaining assumption that du reported space
consumption is related to the file size (if sparse files weren't
enough of a hint already)....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-08  2:47       ` Dave Chinner
@ 2019-02-08 12:34         ` Brian Foster
  2019-02-12  1:13           ` Darrick J. Wong
  2019-02-08 16:29         ` Darrick J. Wong
  1 sibling, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-08 12:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > I've just finished analysing an IO trace from a application
> > > > > generating an extreme filesystem fragmentation problem that started
> > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > looks to have previously been solved, I still wanted to understand
> > > > > how the application had so comprehensively defeated extent size
> > > > > hints as a method of avoiding file fragmentation.
> ....
> > > FWIW, I think the scope of the problem is quite widespread -
> > > anything that does open/something/close repeatedly on a file that is
> > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > the post-eof extent size hint allocated space. That's why I suspect
> > > we need to think about not trimming by default and trying to
> > > enumerating only the cases that need to trim eof blocks.
> > > 
> > 
> > To further this point.. I think the eofblocks scanning stuff came long
> > after the speculative preallocation code and associated release time
> > post-eof truncate.
> 
> Yes, I cribed a bit of the history of the xfs_release() behaviour
> on #xfs yesterday afternoon:
> 
>  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
>  <dchinner>	historic. People doing operations then complaining du didn't match ls
>  <dchinner>	stuff like that
>  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
>  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
>  <dchinner>	(i.e. that's how we used to make nfs not suck)
>  <dchinner>	that's when we started doing work in ->release
>  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
>  <dchinner>	Except for concurrent NFS writes into the same directory
>  <dchinner>	and now there's another pathological application that triggers problems
>  <dchinner>	The NFS exception was prior to having thebackground reaper
>  <dchinner>	as these things goes the background reaper is relatively recent functionality
>  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at al
> 

Thanks.

> > I think the background scanning was initially an
> > enhancement to deal with things like the dirty release optimization
> > leaving these blocks around longer and being able to free up this
> > accumulated space when we're at -ENOSPC conditions.
> 
> Yes, amongst other things like slow writes keeping the file open
> forever.....
> 
> > Now that we have the
> > scanning mechanism in place (and a 5 minute default background scan,
> > which really isn't all that long), it might be reasonable to just drop
> > the release time truncate completely and only trim post-eof blocks via
> > the bg scan or reclaim paths.
> 
> Yeah, that's kinda the question I'm asking here. What's the likely
> impact of not trimming EOF blocks at least on close apart from
> people complaining about df/ls not matching du?
> 

Ok. ISTM it's just a continuation of the same "might confuse some users"
scenario that pops up occasionally. It also seems that kind of thing has
died down as either most people don't really know or care about the
transient state or are just more familiar with it at this point. IME,
complex applications that depend on block ownership stats (userspace
filesystems for example) already have to account for speculative
preallocation with XFS, so tweaking the semantics of the optimization
shouldn't really have much of an impact that I can tell so long as the
broader/long-term behavior doesn't change[1].

I suppose there are all kinds of other applications that are technically
affected by dropping the release time trim (simple file copies, archive
extraction, etc.), but it's not clear to me that matters so long as we
have effective bg and -ENOSPC scans. The only thing I can think of so
far is whether we should consider changes to the bg scan heuristics to
accommodate scenarios currently covered by the release time trim. For
example, the release time scan doesn't consider whether the file is
dirty or not while the bg scan always skips "active" files.

Brian

[1] Which isn't the case for a change to not bg trim files with extent
size hints.

> I don't really care about that anymore because, well, reflink/dedupe
> completely break any remaining assumption that du reported space
> consumption is related to the file size (if sparse files wasn't
> enough of a hint arlready)....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-08  2:47       ` Dave Chinner
  2019-02-08 12:34         ` Brian Foster
@ 2019-02-08 16:29         ` Darrick J. Wong
  1 sibling, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2019-02-08 16:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > I've just finished analysing an IO trace from a application
> > > > > generating an extreme filesystem fragmentation problem that started
> > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > looks to have previously been solved, I still wanted to understand
> > > > > how the application had so comprehensively defeated extent size
> > > > > hints as a method of avoiding file fragmentation.
> ....
> > > FWIW, I think the scope of the problem is quite widespread -
> > > anything that does open/something/close repeatedly on a file that is
> > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > the post-eof extent size hint allocated space. That's why I suspect
> > > we need to think about not trimming by default and trying to
> > > enumerating only the cases that need to trim eof blocks.
> > > 
> > 
> > To further this point.. I think the eofblocks scanning stuff came long
> > after the speculative preallocation code and associated release time
> > post-eof truncate.
> 
> Yes, I cribed a bit of the history of the xfs_release() behaviour
> on #xfs yesterday afternoon:
> 
>  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
>  <dchinner>	historic. People doing operations then complaining du didn't match ls
>  <dchinner>	stuff like that
>  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
>  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
>  <dchinner>	(i.e. that's how we used to make nfs not suck)
>  <dchinner>	that's when we started doing work in ->release
>  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
>  <dchinner>	Except for concurrent NFS writes into the same directory
>  <dchinner>	and now there's another pathological application that triggers problems
>  <dchinner>	The NFS exception was prior to having thebackground reaper
>  <dchinner>	as these things goes the background reaper is relatively recent functionality
>  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at al
> 
> > I think the background scanning was initially an
> > enhancement to deal with things like the dirty release optimization
> > leaving these blocks around longer and being able to free up this
> > accumulated space when we're at -ENOSPC conditions.
> 
> Yes, amongst other things like slow writes keeping the file open
> forever.....
> 
> > Now that we have the
> > scanning mechanism in place (and a 5 minute default background scan,
> > which really isn't all that long), it might be reasonable to just drop
> > the release time truncate completely and only trim post-eof blocks via
> > the bg scan or reclaim paths.
> 
> Yeah, that's kinda the question I'm asking here. What's the likely
> impact of not trimming EOF blocks at least on close apart from
> people complaining about df/ls not matching du?
> 
> I don't really care about that anymore because, well, reflink/dedupe
> completely break any remaining assumption that du reported space
> consumption is related to the file size (if sparse files wasn't
> enough of a hint arlready)....

Not to mention the deferred inactivation series tracks "space we could
free if we did a bunch of inactivation work" so that we can lie to
statfs and pretend we already did the work.  It wouldn't be hard to
include speculative posteof blocks in that too.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-08 12:34         ` Brian Foster
@ 2019-02-12  1:13           ` Darrick J. Wong
  2019-02-12 11:46             ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2019-02-12  1:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > I've just finished analysing an IO trace from a application
> > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > how the application had so comprehensively defeated extent size
> > > > > > hints as a method of avoiding file fragmentation.
> > ....
> > > > FWIW, I think the scope of the problem is quite widespread -
> > > > anything that does open/something/close repeatedly on a file that is
> > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > we need to think about not trimming by default and trying to
> > > > enumerating only the cases that need to trim eof blocks.
> > > > 
> > > 
> > > To further this point.. I think the eofblocks scanning stuff came long
> > > after the speculative preallocation code and associated release time
> > > post-eof truncate.
> > 
> > Yes, I cribed a bit of the history of the xfs_release() behaviour
> > on #xfs yesterday afternoon:
> > 
> >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> >  <dchinner>	stuff like that
> >  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> >  <dchinner>	that's when we started doing work in ->release
> >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> >  <dchinner>	Except for concurrent NFS writes into the same directory
> >  <dchinner>	and now there's another pathological application that triggers problems
> >  <dchinner>	The NFS exception was prior to having the background reaper
> >  <dchinner>	as these things go the background reaper is relatively recent functionality
> >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > 
> 
> Thanks.
> 
> > > I think the background scanning was initially an
> > > enhancement to deal with things like the dirty release optimization
> > > leaving these blocks around longer and being able to free up this
> > > accumulated space when we're at -ENOSPC conditions.
> > 
> > Yes, amongst other things like slow writes keeping the file open
> > forever.....
> > 
> > > Now that we have the
> > > scanning mechanism in place (and a 5 minute default background scan,
> > > which really isn't all that long), it might be reasonable to just drop
> > > the release time truncate completely and only trim post-eof blocks via
> > > the bg scan or reclaim paths.
> > 
> > Yeah, that's kinda the question I'm asking here. What's the likely
> > impact of not trimming EOF blocks at least on close apart from
> > people complaining about df/ls not matching du?
> > 
> 
> Ok. ISTM it's just a continuation of the same "might confuse some users"
> scenario that pops up occasionally. It also seems that kind of thing has
> died down as either most people don't really know or care about the
> transient state or are just more familiar with it at this point. IME,
> complex applications that depend on block ownership stats (userspace
> filesystems for example) already have to account for speculative
> preallocation with XFS, so tweaking the semantics of the optimization
> shouldn't really have much of an impact that I can tell so long as the
> broader/long-term behavior doesn't change[1].
> 
> I suppose there are all kinds of other applications that are technically
> affected by dropping the release time trim (simple file copies, archive
> extraction, etc.), but it's not clear to me that matters so long as we
> have effective bg and -ENOSPC scans. The only thing I can think of so
> far is whether we should consider changes to the bg scan heuristics to
> accommodate scenarios currently covered by the release time trim. For
> example, the release time scan doesn't consider whether the file is
> dirty or not while the bg scan always skips "active" files.

I wrote a quick and dirty fstest that writes 999 files between 128k and
256k in size, to simulate untarring onto a filesystem.  No fancy
preallocation, just buffered writes.  I patched my kernel to skip the
posteof block freeing in xfs_release, so the preallocations get freed by
inode inactivation.  Then the freespace histogram looks like:

+   from      to extents  blocks    pct
+      1       1      36      36   0.00
+      2       3      69     175   0.01
+      4       7     122     698   0.02
+      8      15     237    2691   0.08
+     16      31       1      16   0.00
+     32      63     500   27843   0.88
+ 524288  806272       4 3141225  99.01

Pretty gnarly. :)  By comparison, a stock upstream kernel:

+   from      to extents  blocks    pct
+ 524288  806272       4 3172579 100.00

That's 969 free extents vs. 4, on a fs with 999 new files... which is
pretty bad.  Dave also suggested on IRC that maybe this should be a
little smarter -- possibly skipping the posteof removal only if the
filesystem has sunit/swidth set, or if the inode has extent size hints,
or whatever. :)
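
(For anyone who wants to poke at this outside the fstests harness, a
rough equivalent of that test is sketched below. The scratch device and
mount point are assumptions, it only measures the resulting free space
layout via xfs_db's freesp command, and reproducing the patched-kernel
numbers above obviously still requires the patched kernel.)

#!/bin/bash
# Write 999 smallish files with plain buffered writes, cycle the mount
# so anything still pending is settled by inode inactivation, then dump
# the free space histogram.
SCRATCH_DEV=/dev/sdb1
SCRATCH_MNT=/mnt/scratch

mkfs.xfs -f $SCRATCH_DEV > /dev/null
mount $SCRATCH_DEV $SCRATCH_MNT

for ((i = 0; i < 999; i++)); do
	# file sizes vary between 128k and 256k in 4k steps
	size=$(( (128 + 4 * (RANDOM % 33)) * 1024 ))
	xfs_io -f -c "pwrite 0 $size" $SCRATCH_MNT/file.$i > /dev/null
done

umount $SCRATCH_MNT
xfs_db -r -c "freesp -s" $SCRATCH_DEV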

--D

> Brian
> 
> [1] Which isn't the case for a change to not bg trim files with extent
> size hints.
> 
> > I don't really care about that anymore because, well, reflink/dedupe
> > completely break any remaining assumption that du reported space
> > consumption is related to the file size (if sparse files wasn't
> > enough of a hint already)....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-12  1:13           ` Darrick J. Wong
@ 2019-02-12 11:46             ` Brian Foster
  2019-02-12 20:21               ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-12 11:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > > Hi folks,
> > > > > > > 
> > > > > > > I've just finished analysing an IO trace from a application
> > > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > > how the application had so comprehensively defeated extent size
> > > > > > > hints as a method of avoiding file fragmentation.
> > > ....
> > > > > FWIW, I think the scope of the problem is quite widespread -
> > > > > anything that does open/something/close repeatedly on a file that is
> > > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > > we need to think about not trimming by default and trying to
> > > > > enumerating only the cases that need to trim eof blocks.
> > > > > 
> > > > 
> > > > To further this point.. I think the eofblocks scanning stuff came long
> > > > after the speculative preallocation code and associated release time
> > > > post-eof truncate.
> > > 
> > > Yes, I cribbed a bit of the history of the xfs_release() behaviour
> > > on #xfs yesterday afternoon:
> > > 
> > >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> > >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> > >  <dchinner>	stuff like that
> > >  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> > >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> > >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> > >  <dchinner>	that's when we started doing work in ->release
> > >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> > >  <dchinner>	Except for concurrent NFS writes into the same directory
> > >  <dchinner>	and now there's another pathological application that triggers problems
> > >  <dchinner>	The NFS exception was prior to having the background reaper
> > >  <dchinner>	as these things go the background reaper is relatively recent functionality
> > >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > > 
> > 
> > Thanks.
> > 
> > > > I think the background scanning was initially an
> > > > enhancement to deal with things like the dirty release optimization
> > > > leaving these blocks around longer and being able to free up this
> > > > accumulated space when we're at -ENOSPC conditions.
> > > 
> > > Yes, amongst other things like slow writes keeping the file open
> > > forever.....
> > > 
> > > > Now that we have the
> > > > scanning mechanism in place (and a 5 minute default background scan,
> > > > which really isn't all that long), it might be reasonable to just drop
> > > > the release time truncate completely and only trim post-eof blocks via
> > > > the bg scan or reclaim paths.
> > > 
> > > Yeah, that's kinda the question I'm asking here. What's the likely
> > > impact of not trimming EOF blocks at least on close apart from
> > > people complaining about df/ls not matching du?
> > > 
> > 
> > Ok. ISTM it's just a continuation of the same "might confuse some users"
> > scenario that pops up occasionally. It also seems that kind of thing has
> > died down as either most people don't really know or care about the
> > transient state or are just more familiar with it at this point. IME,
> > complex applications that depend on block ownership stats (userspace
> > filesystems for example) already have to account for speculative
> > preallocation with XFS, so tweaking the semantics of the optimization
> > shouldn't really have much of an impact that I can tell so long as the
> > broader/long-term behavior doesn't change[1].
> > 
> > I suppose there are all kinds of other applications that are technically
> > affected by dropping the release time trim (simple file copies, archive
> > extraction, etc.), but it's not clear to me that matters so long as we
> > have effective bg and -ENOSPC scans. The only thing I can think of so
> > far is whether we should consider changes to the bg scan heuristics to
> > accommodate scenarios currently covered by the release time trim. For
> > example, the release time scan doesn't consider whether the file is
> > dirty or not while the bg scan always skips "active" files.
> 
> I wrote a quick and dirty fstest that writes 999 files between 128k and
> 256k in size, to simulate untarring onto a filesystem.  No fancy
> preallocation, just buffered writes.  I patched my kernel to skip the
> posteof block freeing in xfs_release, so the preallocations get freed by
> inode inactivation.  Then the freespace histogram looks like:
> 

You didn't mention whether you disabled background eofb trims. Are you
just rendering that irrelevant by disabling the release time trim and
doing a mount cycle?
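
(Side note for anyone reproducing this: the background eofblocks scan
interval is a sysctl -- it should be fs.xfs.speculative_prealloc_lifetime,
300 seconds by default -- so it can be stretched out to keep the
background trim from interfering with a release-time-only test:)

# current background post-EOF trim interval, in seconds (default 300)
cat /proc/sys/fs/xfs/speculative_prealloc_lifetime

# push it out to a day so only the release time trim (or an unmount)
# can remove post-EOF blocks during the test window
echo 86400 > /proc/sys/fs/xfs/speculative_prealloc_lifetime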

> +   from      to extents  blocks    pct
> +      1       1      36      36   0.00
> +      2       3      69     175   0.01
> +      4       7     122     698   0.02
> +      8      15     237    2691   0.08
> +     16      31       1      16   0.00
> +     32      63     500   27843   0.88
> + 524288  806272       4 3141225  99.01
> 
> Pretty gnarly. :)  By comparison, a stock upstream kernel:
> 

Indeed, that's a pretty rapid degradation. Thanks for testing that.

> +   from      to extents  blocks    pct
> + 524288  806272       4 3172579 100.00
> 
> That's 969 free extents vs. 4, on a fs with 999 new files... which is
> pretty bad.  Dave also suggested on IRC that maybe this should be a
> little smarter -- possibly skipping the posteof removal only if the
> filesystem has sunit/swidth set, or if the inode has extent size hints,
> or whatever. :)
> 

This test implies that there's a significant difference between eofb
trims prior to delalloc conversion vs. after, which I suspect is the
primary difference between doing so on close vs. some time later. Is
there any good way to confirm that with your test? If that is the case,
it makes me wonder whether we should think about more generalized logic
as opposed to a battery of whatever particular inode state checks that
we've determined in practice contribute to free space fragmentation.

For example, extent size hints just happen to skip delayed allocation.
I don't recall the exact history, but I don't think this was always the
case for extsz hints. So would the close time eofb trim be as
problematic as for extsz hint files if the behavior of the latter
changed back to using delayed allocation? I think a patch for that was
proposed fairly recently, but it depended on delalloc -> unwritten
functionality which still had unresolved issues (IIRC).

From another angle, would a system that held files open for a
significant amount of time relative to a one-time write such that close
consistently occurred after writeback (and thus delalloc conversion) be
susceptible to the same level of free space fragmentation as shown
above? (I'm not sure if/why anybody would actually do that.. userspace
fs with an fd cache perhaps? It's somewhat besides the point,
anyways...)).

More testing and thought is probably required. I _was_ wondering if we
should consider something like always waiting as long as possible to
eofb trim already converted post-eof blocks, but I'm not totally
convinced that actually has value. For files that are not going to see
any further appends, we may have already lost since the real post-eof
blocks will end up truncated just the same whether it happens sooner or
not until inode reclaim.

Hmm, maybe we actually need to think about how to be smarter about when
to introduce speculative preallocation as opposed to how/when to reclaim
it. We currently limit speculative prealloc to files of a minimum size
(64k IIRC). Just thinking out loud, but what if we restricted
preallocation to files that have been appended after at least one
writeback cycle, for example?
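
(A quick way to see where that threshold sits on a given kernel is to
print stat from inside xfs_io, i.e. while the fd is still open and
before any release time trim can run -- the delalloc reservation should
show up in the allocated block count, so a file that got speculative
prealloc reports far more blocks than its size implies. The 32k/1m
sizes and the mount point below are only illustrative, and assume the
64k figure above is roughly right:)

# small file: expect the block count to roughly match the 32k written
xfs_io -f -c "pwrite 0 32k" -c stat /mnt/scratch/tiny

# larger file written in 64k chunks so EOF is extended repeatedly:
# expect the block count to run well ahead of the 1m file size
xfs_io -f -c "pwrite -b 64k 0 1m" -c stat /mnt/scratch/big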

Brian

> --D
> 
> > Brian
> > 
> > [1] Which isn't the case for a change to not bg trim files with extent
> > size hints.
> > 
> > > I don't really care about that anymore because, well, reflink/dedupe
> > > completely break any remaining assumption that du reported space
> > > consumption is related to the file size (if sparse files wasn't
> > > enough of a hint already)....
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-12 11:46             ` Brian Foster
@ 2019-02-12 20:21               ` Dave Chinner
  2019-02-13 13:50                 ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-12 20:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > > > Hi folks,
> > > > > > > > 
> > > > > > > > I've just finished analysing an IO trace from a application
> > > > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > > > how the application had so comprehensively defeated extent size
> > > > > > > > hints as a method of avoiding file fragmentation.
> > > > ....
> > > > > > FWIW, I think the scope of the problem is quite widespread -
> > > > > > anything that does open/something/close repeatedly on a file that is
> > > > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > > > we need to think about not trimming by default and trying to
> > > > > > enumerating only the cases that need to trim eof blocks.
> > > > > > 
> > > > > 
> > > > > To further this point.. I think the eofblocks scanning stuff came long
> > > > > after the speculative preallocation code and associated release time
> > > > > post-eof truncate.
> > > > 
> > > > Yes, I cribbed a bit of the history of the xfs_release() behaviour
> > > > on #xfs yesterday afternoon:
> > > > 
> > > >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> > > >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> > > >  <dchinner>	stuff like that
> > > >  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> > > >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> > > >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> > > >  <dchinner>	that's when we started doing work in ->release
> > > >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> > > >  <dchinner>	Except for concurrent NFS writes into the same directory
> > > >  <dchinner>	and now there's another pathological application that triggers problems
> > > >  <dchinner>	The NFS exception was prior to having the background reaper
> > > >  <dchinner>	as these things go the background reaper is relatively recent functionality
> > > >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > > > 
> > > 
> > > Thanks.
> > > 
> > > > > I think the background scanning was initially an
> > > > > enhancement to deal with things like the dirty release optimization
> > > > > leaving these blocks around longer and being able to free up this
> > > > > accumulated space when we're at -ENOSPC conditions.
> > > > 
> > > > Yes, amongst other things like slow writes keeping the file open
> > > > forever.....
> > > > 
> > > > > Now that we have the
> > > > > scanning mechanism in place (and a 5 minute default background scan,
> > > > > which really isn't all that long), it might be reasonable to just drop
> > > > > the release time truncate completely and only trim post-eof blocks via
> > > > > the bg scan or reclaim paths.
> > > > 
> > > > Yeah, that's kinda the question I'm asking here. What's the likely
> > > > impact of not trimming EOF blocks at least on close apart from
> > > > people complaining about df/ls not matching du?
> > > > 
> > > 
> > > Ok. ISTM it's just a continuation of the same "might confuse some users"
> > > scenario that pops up occasionally. It also seems that kind of thing has
> > > died down as either most people don't really know or care about the
> > > transient state or are just more familiar with it at this point. IME,
> > > complex applications that depend on block ownership stats (userspace
> > > filesystems for example) already have to account for speculative
> > > preallocation with XFS, so tweaking the semantics of the optimization
> > > shouldn't really have much of an impact that I can tell so long as the
> > > broader/long-term behavior doesn't change[1].
> > > 
> > > I suppose there are all kinds of other applications that are technically
> > > affected by dropping the release time trim (simple file copies, archive
> > > extraction, etc.), but it's not clear to me that matters so long as we
> > > have effective bg and -ENOSPC scans. The only thing I can think of so
> > > far is whether we should consider changes to the bg scan heuristics to
> > > accommodate scenarios currently covered by the release time trim. For
> > > example, the release time scan doesn't consider whether the file is
> > > dirty or not while the bg scan always skips "active" files.
> > 
> > I wrote a quick and dirty fstest that writes 999 files between 128k and
> > 256k in size, to simulate untarring onto a filesystem.  No fancy
> > preallocation, just buffered writes.  I patched my kernel to skip the
> > posteof block freeing in xfs_release, so the preallocations get freed by
> > inode inactivation.  Then the freespace histogram looks like:
> > 
> 
> You didn't mention whether you disabled background eofb trims. Are you
> just rendering that irrelevant by disabling the release time trim and
> doing a mount cycle?
> 
> > +   from      to extents  blocks    pct
> > +      1       1      36      36   0.00
> > +      2       3      69     175   0.01
> > +      4       7     122     698   0.02
> > +      8      15     237    2691   0.08
> > +     16      31       1      16   0.00
> > +     32      63     500   27843   0.88
> > + 524288  806272       4 3141225  99.01
> > 
> > Pretty gnarly. :)  By comparison, a stock upstream kernel:
> > 
> 
> Indeed, that's a pretty rapid degradation. Thanks for testing that.
> 
> > +   from      to extents  blocks    pct
> > + 524288  806272       4 3172579 100.00
> > 
> > That's 969 free extents vs. 4, on a fs with 999 new files... which is
> > pretty bad.  Dave also suggested on IRC that maybe this should be a
> > little smarter -- possibly skipping the posteof removal only if the
> > filesystem has sunit/swidth set, or if the inode has extent size hints,
> > or whatever. :)
> > 
> 
> This test implies that there's a significant difference between eofb
> trims prior to delalloc conversion vs. after, which I suspect is the
> primary difference between doing so on close vs. some time later.

Yes, it's the difference between trimming the excess off the
delalloc extent and trimming the excess off an allocated extent
after writeback. In the latter case, we end up fragmenting free space
because, while writeback is packing as tightly as it can, there is
unused space between the end of the one file and the start of the
next that ends up as free space.

> Is
> there any good way to confirm that with your test? If that is the case,
> it makes me wonder whether we should think about more generalized logic
> as opposed to a battery of whatever particular inode state checks that
> we've determined in practice contribute to free space fragmentation.

Yeah, we talked about that on #xfs, and it seems to me that the best
heuristic we can come up with is "trim on first close; if there are
multiple closes, treat it as a repeated open/write/close workload,
apply the IDIRTY_RELEASE heuristic to it, and don't remove the
prealloc on closes after the first".

> For example, extent size hints just happen to skip delayed allocation.
> I don't recall the exact history, but I don't think this was always the
> case for extsz hints.

It wasn't, but extent size hints + delalloc never played nicely and
could corrupt or expose stale data and caused all sorts of problems
at ENOSPC because delalloc reservations are unaware of alignment
requirements for extent size hints. Hence to make extent size hints
work for buffered writes, I simply made them work the same way as
direct writes (i.e. immediate allocation w/ unwritten extents).
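
(That difference is visible from userspace: with a hint set, the blocks
backing a buffered write show up as real unwritten allocations
immediately, before writeback ever runs. A minimal illustration,
assuming a scratch mount at /mnt/scratch -- the exact bmap output
format varies with the xfsprogs version, but the extent should be
reported as unwritten rather than delalloc:)

# buffered write into a file with a 1MiB extent size hint; the hint
# forces immediate allocation of unwritten extents instead of delalloc
xfs_io -f \
	-c "extsize 1m" \
	-c "pwrite 0 64k" \
	-c stat \
	-c "bmap -vp" \
	/mnt/scratch/hinted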

> So would the close time eofb trim be as
> problematic as for extsz hint files if the behavior of the latter
> changed back to using delayed allocation?

Yes, but if it's a write-once file that doesn't matter. If it's
write-many, then we'd retain the post-eof blocks...

> I think a patch for that was
> proposed fairly recently, but it depended on delalloc -> unwritten
> functionality which still had unresolved issues (IIRC).

*nod*

> From another angle, would a system that held files open for a
> significant amount of time relative to a one-time write such that close
> consistently occurred after writeback (and thus delalloc conversion) be
> susceptible to the same level of free space fragmentation as shown
> above?

If the file is held open for writing for a long time, we have to
assume that they are going to write again (and again) so we should
leave the EOF blocks there. If they are writing slower than the eofb
gc, then there's nothing more we can really do in that case...

> (I'm not sure if/why anybody would actually do that.. userspace
> fs with an fd cache perhaps? It's somewhat besides the point,
> anyways...)).

*nod*

> More testing and thought is probably required. I _was_ wondering if we
> should consider something like always waiting as long as possible to
> eofb trim already converted post-eof blocks, but I'm not totally
> convinced that actually has value. For files that are not going to see
> any further appends, we may have already lost since the real post-eof
> blocks will end up truncated just the same whether it happens sooner or
> not until inode reclaim.

If the writes are far enough apart, then we lose any IO optimisation
advantage of retaining post-eof blocks (induces seeks because
location of new writes is fixed ahead of time). Then it just becomes
a fragmentation avoidance mechanism. If the writes are slow enough,
fragmentation really doesn't matter a whole lot - it's when writes
are frequent and we trash the post-eof blocks quickly that it
matters.

> Hmm, maybe we actually need to think about how to be smarter about when
> to introduce speculative preallocation as opposed to how/when to reclaim
> it. We currently limit speculative prealloc to files of a minimum size
> (64k IIRC). Just thinking out loud, but what if we restricted
> preallocation to files that have been appended after at least one
> writeback cycle, for example?

Speculative delalloc for write once large files also has a massive
impact on things like allocation overhead - we can write gigabytes
into the page cache before writeback begins. If we take away the
speculative delalloc for these first write files, then we are
essentially doing an extent manipulation (extending the delalloc extent)
on every write() call we make.

Right now we only do that extent btree work when we hit the end of
the current speculative delalloc extent, so the normal write case is
just extending the in-memory EOF location rather than running the
entirety of xfs_bmapi_reserve_delalloc() and doing space accounting,
etc.

/me points at his 2006 OLS paper about scaling write performance
as an example of just how important keeping delalloc overhead down
is for high throughput write performance:

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

IOWs, speculative prealloc beyond EOF is not just about preventing
fragmentation - it also helps minimise the per-write CPU overhead of
delalloc space accounting. (i.e. allows faster write rates into
cache). IOWs, for anything more than really small files, we want
to be doing speculative delalloc on the first time the file is
written to.
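
(A crude userspace way to see that per-write cost is to compare
buffered write rates -- xfs_io's pwrite reports throughput -- between a
default mount, where the dynamic EOF prealloc keeps delalloc extensions
rare, and a mount with a small fixed allocsize=, which forces the
delalloc extent to be extended far more often. A sketch, assuming a
scratch fs and that the delta is even measurable on the CPU in
question:)

# default mount: dynamic speculative prealloc beyond EOF
mount /dev/sdb1 /mnt/scratch
xfs_io -f -c "pwrite -b 4k 0 1g" /mnt/scratch/dynamic
umount /mnt/scratch

# small fixed prealloc: the delalloc extent is extended every 64k
mount -o allocsize=64k /dev/sdb1 /mnt/scratch
xfs_io -f -c "pwrite -b 4k 0 1g" /mnt/scratch/fixed
umount /mnt/scratch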

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-12 20:21               ` Dave Chinner
@ 2019-02-13 13:50                 ` Brian Foster
  2019-02-13 22:27                   ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-13 13:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > > > > Hi folks,
> > > > > > > > > 
> > > > > > > > > I've just finished analysing an IO trace from a application
> > > > > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > > > > how the application had so comprehensively defeated extent size
> > > > > > > > > hints as a method of avoiding file fragmentation.
> > > > > ....
> > > > > > > FWIW, I think the scope of the problem is quite widespread -
> > > > > > > anything that does open/something/close repeatedly on a file that is
> > > > > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > > > > we need to think about not trimming by default and trying to
> > > > > > > enumerating only the cases that need to trim eof blocks.
> > > > > > > 
> > > > > > 
> > > > > > To further this point.. I think the eofblocks scanning stuff came long
> > > > > > after the speculative preallocation code and associated release time
> > > > > > post-eof truncate.
> > > > > 
> > > > > Yes, I cribbed a bit of the history of the xfs_release() behaviour
> > > > > on #xfs yesterday afternoon:
> > > > > 
> > > > >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> > > > >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> > > > >  <dchinner>	stuff like that
> > > > >  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> > > > >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> > > > >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> > > > >  <dchinner>	that's when we started doing work in ->release
> > > > >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> > > > >  <dchinner>	Except for concurrent NFS writes into the same directory
> > > > >  <dchinner>	and now there's another pathological application that triggers problems
> > > > >  <dchinner>	The NFS exception was prior to having the background reaper
> > > > >  <dchinner>	as these things go the background reaper is relatively recent functionality
> > > > >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > > > > 
> > > > 
> > > > Thanks.
> > > > 
> > > > > > I think the background scanning was initially an
> > > > > > enhancement to deal with things like the dirty release optimization
> > > > > > leaving these blocks around longer and being able to free up this
> > > > > > accumulated space when we're at -ENOSPC conditions.
> > > > > 
> > > > > Yes, amongst other things like slow writes keeping the file open
> > > > > forever.....
> > > > > 
> > > > > > Now that we have the
> > > > > > scanning mechanism in place (and a 5 minute default background scan,
> > > > > > which really isn't all that long), it might be reasonable to just drop
> > > > > > the release time truncate completely and only trim post-eof blocks via
> > > > > > the bg scan or reclaim paths.
> > > > > 
> > > > > Yeah, that's kinda the question I'm asking here. What's the likely
> > > > > impact of not trimming EOF blocks at least on close apart from
> > > > > people complaining about df/ls not matching du?
> > > > > 
> > > > 
> > > > Ok. ISTM it's just a continuation of the same "might confuse some users"
> > > > scenario that pops up occasionally. It also seems that kind of thing has
> > > > died down as either most people don't really know or care about the
> > > > transient state or are just more familiar with it at this point. IME,
> > > > complex applications that depend on block ownership stats (userspace
> > > > filesystems for example) already have to account for speculative
> > > > preallocation with XFS, so tweaking the semantics of the optimization
> > > > shouldn't really have much of an impact that I can tell so long as the
> > > > broader/long-term behavior doesn't change[1].
> > > > 
> > > > I suppose there are all kinds of other applications that are technically
> > > > affected by dropping the release time trim (simple file copies, archive
> > > > extraction, etc.), but it's not clear to me that matters so long as we
> > > > have effective bg and -ENOSPC scans. The only thing I can think of so
> > > > far is whether we should consider changes to the bg scan heuristics to
> > > > accommodate scenarios currently covered by the release time trim. For
> > > > example, the release time scan doesn't consider whether the file is
> > > > dirty or not while the bg scan always skips "active" files.
> > > 
> > > I wrote a quick and dirty fstest that writes 999 files between 128k and
> > > 256k in size, to simulate untarring onto a filesystem.  No fancy
> > > preallocation, just buffered writes.  I patched my kernel to skip the
> > > posteof block freeing in xfs_release, so the preallocations get freed by
> > > inode inactivation.  Then the freespace histogram looks like:
> > > 
> > 
> > You didn't mention whether you disabled background eofb trims. Are you
> > just rendering that irrelevant by disabling the release time trim and
> > doing a mount cycle?
> > 
> > > +   from      to extents  blocks    pct
> > > +      1       1      36      36   0.00
> > > +      2       3      69     175   0.01
> > > +      4       7     122     698   0.02
> > > +      8      15     237    2691   0.08
> > > +     16      31       1      16   0.00
> > > +     32      63     500   27843   0.88
> > > + 524288  806272       4 3141225  99.01
> > > 
> > > Pretty gnarly. :)  By comparison, a stock upstream kernel:
> > > 
> > 
> > Indeed, that's a pretty rapid degradation. Thanks for testing that.
> > 
> > > +   from      to extents  blocks    pct
> > > + 524288  806272       4 3172579 100.00
> > > 
> > > That's 969 free extents vs. 4, on a fs with 999 new files... which is
> > > pretty bad.  Dave also suggested on IRC that maybe this should be a
> > > little smarter -- possibly skipping the posteof removal only if the
> > > filesystem has sunit/swidth set, or if the inode has extent size hints,
> > > or whatever. :)
> > > 
> > 
> > This test implies that there's a significant difference between eofb
> > trims prior to delalloc conversion vs. after, which I suspect is the
> > primary difference between doing so on close vs. some time later.
> 
> Yes, it's the difference between trimming the excess off the
> delalloc extent and trimming the excess off an allocated extent
> after writeback. In the latter case, we end up fragmenting free space
> because, while writeback is packing as tightly as it can, there is
> unused space between the end of the one file and the start of the
> next that ends up as free space.
> 

Makes sense, kind of what I expected. I think it does raise the question
of the value of small-ish speculative preallocations.

> > Is
> > there any good way to confirm that with your test? If that is the case,
> > it makes me wonder whether we should think about more generalized logic
> > as opposed to a battery of whatever particular inode state checks that
> > we've determined in practice contribute to free space fragmentation.
> 
> Yeah, we talked about that on #xfs, and it seems to me that the best
> heuristic we can come up with is "trim on first close; if there are
> multiple closes, treat it as a repeated open/write/close workload,
> apply the IDIRTY_RELEASE heuristic to it, and don't remove the
> prealloc on closes after the first".
> 

Darrick mentioned this yesterday on irc. This sounds like a reasonable
variation of the change to me. It filters out the write once use case
that we clearly can't unconditionally defer to background trim based on
the test results above.

> > For example, extent size hints just happen to skip delayed allocation.
> > I don't recall the exact history, but I don't think this was always the
> > case for extsz hints.
> 
> It wasn't, but extent size hints + delalloc never played nicely and
> could corrupt or expose stale data and caused all sorts of problems
> at ENOSPC because delalloc reservations are unaware of alignment
> requirements for extent size hints. Hence to make extent size hints
> work for buffered writes, I simply made them work the same way as
> direct writes (i.e. immediate allocation w/ unwritten extents).
> 
> > So would the close time eofb trim be as
> > problematic as for extsz hint files if the behavior of the latter
> > changed back to using delayed allocation?
> 
> Yes, but if it's a write-once file that doesn't matter. If it's
> write-many, then we'd retain the post-eof blocks...
> 
> > I think a patch for that was
> > proposed fairly recently, but it depended on delalloc -> unwritten
> > functionality which still had unresolved issues (IIRC).
> 
> *nod*
> 
> > From another angle, would a system that held files open for a
> > significant amount of time relative to a one-time write such that close
> > consistently occurred after writeback (and thus delalloc conversion) be
> > susceptible to the same level of free space fragmentation as shown
> > above?
> 
> If the file is held open for writing for a long time, we have to
> assume that they are going to write again (and again) so we should
> leave the EOF blocks there. If they are writing slower than the eofb
> gc, then there's nothing more we can really do in that case...
> 

I'm not necessarily sure that this condition is always a matter of
writing too slow. It very well may be true, but I'm curious if there are
parallel copy scenarios (perhaps under particular cpu/RAM configs)
where we could end up doing a large number of one time file writes and
not doing the release time trim until an underlying extent (with
post-eof blocks) has been converted in more cases than not.

Perhaps this is more of a question of writeback behavior than anything
in XFS, but if that can occur with files on the smaller side it's a
clear path to free space fragmentation. I think it's at least worth
considering whether we can improve things, but of course this requires
significantly more thought and analysis to determine whether that's an
actual problem and if the cure is worse than the disease. :P

> > (I'm not sure if/why anybody would actually do that.. userspace
> > fs with an fd cache perhaps? It's somewhat besides the point,
> > anyways...)).
> 
> *nod*
> 
> > More testing and thought is probably required. I _was_ wondering if we
> > should consider something like always waiting as long as possible to
> > eofb trim already converted post-eof blocks, but I'm not totally
> > convinced that actually has value. For files that are not going to see
> > any further appends, we may have already lost since the real post-eof
> > blocks will end up truncated just the same whether it happens sooner or
> > not until inode reclaim.
> 
> If the writes are far enough apart, then we lose any IO optimisation
> advantage of retaining post-eof blocks (induces seeks because
> location of new writes is fixed ahead of time). Then it just becomes
> a fragmentation avoidance mechanism. If the writes are slow enough,
> fragmentation really doesn't matter a whole lot - it's when writes
> are frequent and we trash the post-eof blocks quickly that it
> matters.
> 

It sounds like what you're saying is that it doesn't really matter
either way at this point. There's no perf advantage to keeping the eof
blocks in this scenario, but there's also no real harm in deferring the
eofb trim of physical post-eof blocks because any future free space
fragmentation damage has already been done (assuming no more writes come
in).

The thought above was tip-toeing around the idea of (in addition to the
one-time trim heuristic you mentioned above) never doing a release time
trim of non-delalloc post-eof blocks.

> > Hmm, maybe we actually need to think about how to be smarter about when
> > to introduce speculative preallocation as opposed to how/when to reclaim
> > it. We currently limit speculative prealloc to files of a minimum size
> > (64k IIRC). Just thinking out loud, but what if we restricted
> > preallocation to files that have been appended after at least one
> > writeback cycle, for example?
> 
> Speculative delalloc for write once large files also has a massive
> impact on things like allocation overhead - we can write gigabytes
> into the page cache before writeback begins. If we take away the
> speculative delalloc for these first write files, then we are
> essentially doing an extent manipulation (extending the delalloc extent)
> on every write() call we make.
> 

I was kind of banking on the ability to write large amounts of data to
cache before a particular file sees writeback activity. That means we'll
allocate as large as possible extents for actual file data and only
start doing preallocations on the larger side as well.

I hadn't considered the angle of preallocation reducing (del)allocation
overhead, however..

> Right now we only do that extent btree work when we hit the end of
> the current speculative delalloc extent, so the normal write case is
> just extending the in-memory EOF location rather than running the
> entirety of xfs_bmapi_reserve_delalloc() and doing space accounting,
> etc.
> 
> /me points at his 2006 OLS paper about scaling write performance
> as an example of just how important keeping delalloc overhead down
> is for high throughput write performance:
> 
> https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf
> 

Interesting work, thanks.

> IOWs, speculative prealloc beyond EOF is not just about preventing
> fragmentation - it also helps minimise the per-write CPU overhead of
> delalloc space accounting. (i.e. allows faster write rates into
> cache). IOWs, for anything more than really small files, we want
> to be doing speculative delalloc on the first time the file is
> written to.
> 

Ok, this paper refers to CPU overhead as it contributes to lack of
scalability. It certainly makes sense that a delayed allocation buffered
write does more work than a non-alloc write, but this document isn't
necessarily claiming that as evidence of measurably faster or slower
single threaded buffered writes. Rather, that overhead becomes
measurable with enough parallel buffered writes contending on a global
allocation critical section (section 5.1 "Spinlocks in Hot Paths").

Subsequently, section 6.1.2 addresses that particular issue via the
introduction of per-cpu accounting. It looks like there is further
discussion around the general link between efficiency and performance,
which makes sense, but I don't think that draws a definite conclusion
that speculative preallocation needs to be introduced immediately on
sequential buffered writes. What I think it suggests is that we need to
consider the potential scalability impact of any prospective change in
speculative preallocation behavior (along with the other tradeoffs
associated with preallocation) because less aggressive preallocation
means more buffered write overhead.

BTW for historical context.. was speculative preallocation a thing when
this paper was written? The doc suggests per-page block allocation needs
to be reduced on XFS, but also indicates that XFS had minimal
fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
fragmentation avoidance to delayed allocation (not necessarily
preallocation).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-13 13:50                 ` Brian Foster
@ 2019-02-13 22:27                   ` Dave Chinner
  2019-02-14 13:00                     ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-13 22:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > For example, extent size hints just happen to skip delayed allocation.
> > > I don't recall the exact history, but I don't think this was always the
> > > case for extsz hints.
> > 
> > It wasn't, but extent size hints + delalloc never played nicely and
> > could corrupt or expose stale data and caused all sorts of problems
> > at ENOSPC because delalloc reservations are unaware of alignment
> > requirements for extent size hints. Hence to make extent size hints
> > work for buffered writes, I simply made them work the same way as
> > direct writes (i.e. immediate allocation w/ unwritten extents).
> > 
> > > So would the close time eofb trim be as
> > > problematic as for extsz hint files if the behavior of the latter
> > > changed back to using delayed allocation?
> > 
> > Yes, but if it's a write-once file that doesn't matter. If it's
> > write-many, then we'd retain the post-eof blocks...
> > 
> > > I think a patch for that was
> > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > functionality which still had unresolved issues (IIRC).
> > 
> > *nod*
> > 
> > > From another angle, would a system that held files open for a
> > > significant amount of time relative to a one-time write such that close
> > > consistently occurred after writeback (and thus delalloc conversion) be
> > > susceptible to the same level of free space fragmentation as shown
> > > above?
> > 
> > If the file is held open for writing for a long time, we have to
> > assume that they are going to write again (and again) so we should
> > leave the EOF blocks there. If they are writing slower than the eofb
> > gc, then there's nothing more we can really do in that case...
> > 
> 
> I'm not necessarily sure that this condition is always a matter of
> writing too slow. It very well may be true, but I'm curious if there are
> parallel copy scenarios (perhaps under particular cpu/RAM configs)
> where we could end up doing a large number of one time file writes and
> not doing the release time trim until an underlying extent (with
> post-eof blocks) has been converted in more cases than not.

I'm not sure I follow what sort of workload and situation you are
describing here. Are you talking about the effect of an EOFB gc pass
during ongoing writes?

> Perhaps this is more of a question of writeback behavior than anything
> in XFS, but if that can occur with files on the smaller side it's a
> clear path to free space fragmentation. I think it's at least worth
> considering whether we can improve things, but of course this requires
> significantly more thought and analysis to determine whether that's an
> actual problem and if the cure is worse than the disease. :P
> 
> > > (I'm not sure if/why anybody would actually do that.. userspace
> > > fs with an fd cache perhaps? It's somewhat besides the point,
> > > anyways...)).
> > 
> > *nod*
> > 
> > > More testing and thought is probably required. I _was_ wondering if we
> > > should consider something like always waiting as long as possible to
> > > eofb trim already converted post-eof blocks, but I'm not totally
> > > convinced that actually has value. For files that are not going to see
> > > any further appends, we may have already lost since the real post-eof
> > > blocks will end up truncated just the same whether it happens sooner or
> > > not until inode reclaim.
> > 
> > If the writes are far enough apart, then we lose any IO optimisation
> > advantage of retaining post-eof blocks (induces seeks because
> > location of new writes is fixed ahead of time). Then it just becomes
> > a fragmentation avoidance If the writes are slow enough,
> > fragmentation really doesn't matter a whole lot - it's when writes
> > are frequent and we trash the post-eof blocks quickly that it
> > matters.
> > 
> 
> It sounds like what you're saying is that it doesn't really matter
> either way at this point. There's no perf advantage to keeping the eof
> blocks in this scenario, but there's also no real harm in deferring the
> eofb trim of physical post-eof blocks because any future free space
> fragmentation damage has already been done (assuming no more writes come
> in).

Essentially, yes.

> The thought above was tip-toeing around the idea of (in addition to the
> one-time trim heuristic you mentioned above) never doing a release time
> trim of non-delalloc post-eof blocks.

Except we want to trim blocks in the cases where it's a write
once file that has been fsync()d or written by O_DIRECT w/ really
large extent size hints....

It seems to me that once we start making exceptions to the general
rule, we end up in a world of corner cases and, most likely,
unreliable heuristics.
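
(The O_DIRECT + large hint case is easy to watch from userspace: print
the allocated block count while the fd is still open, then again after
the close, and the effect of the release time trim is the difference.
What actually gets trimmed depends on the kernel in question; the
device, mount point, and sizes below are assumptions:)

# 1MB O_DIRECT write into a file carrying a 16MB extent size hint
xfs_io -d -f \
	-c "extsize 16m" \
	-c "pwrite 0 1m" \
	-c stat \
	/mnt/scratch/hinted_dio

# after the close above, see how much of the allocation survived
stat -c "size=%s blocks=%b (block size %B)" /mnt/scratch/hinted_dio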

> > IOWs, speculative prealloc beyond EOF is not just about preventing
> > fragmentation - it also helps minimise the per-write CPU overhead of
> > delalloc space accounting. (i.e. allows faster write rates into
> > cache). IOWs, for anything more than a really small files, we want
> > to be doing speculative delalloc on the first time the file is
> > written to.
> > 
> 
> Ok, this paper refers to CPU overhead as it contributes to lack of
> scalability.

Well, that was the experiments that were being performed. I'm using
it as an example of how per-write overhead is actually important to
throughput. Ignore the "global lock caused overall throughput
issues" because we don't have that problem any more, and instead
look at it as a demonstration of "anything that slows down a write()
reduces per-thread throughput".

> It certainly makes sense that a delayed allocation buffered
> write does more work than a non-alloc write, but this document isn't
> necessarily claiming that as evidence of measurably faster or slower
> single threaded buffered writes. Rather, that overhead becomes
> measurable with enough parallel buffered writes contending on a global
> allocation critical section (section 5.1 "Spinlocks in Hot Paths").

It's just an example of how increasing per-write overhead impacts
performance. The same can be said for all the memory reclaim
mods - they also reduced per-write overhead because memory
allocation for the page cache was much, much faster and required
much less CPU, so each write spent less time allocating pages to
cache the data to be written. Hence, faster writes....

> Subsequently, section 6.1.2 addresses that particular issue via the
> introduction of per-cpu accounting. It looks like there is further
> discussion around the general link between efficiency and performance,
> which makes sense, but I don't think that draws a definite conclusion
> that speculative preallocation needs to be introduced immediately on
> sequential buffered writes.

Sure, because the post-EOF preallocation that was already happening
wasn't the problem, and we couldn't easily address the per-page
block mapping overhead with generic code changes that would be
useful in any circumstance (Linux community at the time was hostile
towards bringing anything useful to XFS into the core kernel code).

And this was only a small machine (24 CPUs) so we had to address the
locking problem because increasing speculative prealloc wouldn't
help solve the global lock contention problems on the 2048p machines
that SGI was shipping at the time....

> What I think it suggests is that we need to
> consider the potential scalability impact of any prospective change in
> speculative preallocation behavior (along with the other tradeoffs
> associated with preallocation) because less aggressive preallocation
> means more buffered write overhead.

It's not a scalability problem now because of the fixes made back
then - it's now just a per-write thread latency and throughput
issue.

What is not obvious from the paper is that I designed the XFS
benchmarks so that they wrote each file into a separate directory to
place them into different AGs so that they didn't fragment due to
interleaving writes. i.e. so we could directly control the number of
sequential write streams going to the disks.

We had to do that to minimise the seek overhead of the disks so
they'd run at full bandwidth rather than being seek bound - we had
to keep the 256 disks at >97% bandwidth utilisation to get to
10GB/s, so even one extra seek per disk per second and we wouldn't
get there.

IOWs, we'd carefully taken anything to do with fragmentation out of
the picture entirely, both for direct and buffered writes. We were
just focussed on write efficiency and maximising total throughput
and that largely made post-EOF preallocation irrelevant to us.

> BTW for historical context.. was speculative preallocation a thing when
> this paper was written?

Yes. Speculative prealloc goes way back into the 90s from Irix.  It
was first made configurable in XFS via the biosize mount option
added with v3 superblocks in 1997, but the initial linux port only
allowed up to 64k.

In 2005, the linux mount option allowed biosize to be extended to
1GB, which made sense because >4GB allocation groups (mkfs enabled
them late 2003) were now starting to be widely used and so users
were reporting new large AG fragmentation issues that had never been
seen before. i.e.  it was now practical to have contiguous multi-GB
extents in files and the delalloc code was struggling to create
them, so having EOF-prealloc be able to make use of that capability
was needed....

And then auto-tuning made sense because more and more people were
having to use the mount option in more general workloads to avoid
fragmentation.

And ever since then it's been about tweaking the algorithms to stop
userspace from defeating it and re-introducing fragmentation
vectors...

> The doc suggests per-page block allocation needs
> to be reduced on XFS,

Yup. It still stands - that was the specific design premise that
led to the iomap infrastructure we now have, i.e. do block mapping
once per write() call, not once per page per write call. Early
prototypes got us a 10-20% increase in single threaded buffered write
throughput on slow CPUs, and with the current code some high
throughput buffered write cases either went up by 30-40% or CPU
usage went down by that amount.

> but also indicates that XFS had minimal
> fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
> fragmentation avoidance to delayed allocation (not necessarily
> preallocation).

Right, because none of the other filesystems had delayed allocation
and they were interleaving concurent allocations at write() time.
i.e. they serialised write() for long periods of time while they did
global fs allocation, whilst XFS had a low overhead delalloc path
with a tiny critical global section. i.e. delalloc allowed us to
defer allocation to writeback where it isn't so throughput critical
and better allocation decisions can be made.

IOWs, delalloc was the primary difference in both the fragmentation
and scalability differences reported in the paper. That XFS did
small amounts of in-memory-only prealloc beyond EOF during delalloc
was irrelevant - it's just part of the delalloc mechanism in this
context....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-13 22:27                   ` Dave Chinner
@ 2019-02-14 13:00                     ` Brian Foster
  2019-02-14 21:51                       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-14 13:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > For example, extent size hints just happen to skip delayed allocation.
> > > > I don't recall the exact history, but I don't think this was always the
> > > > case for extsz hints.
> > > 
> > > It wasn't, but extent size hints + delalloc never played nicely and
> > > could corrupt or expose stale data and caused all sorts of problems
> > > at ENOSPC because delalloc reservations are unaware of alignment
> > > requirements for extent size hints. Hence to make extent size hints
> > > work for buffered writes, I simply made them work the same way as
> > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > 
> > > > So would the close time eofb trim be as
> > > > problematic as for extsz hint files if the behavior of the latter
> > > > changed back to using delayed allocation?
> > > 
> > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > write-many, then we'd retain the post-eof blocks...
> > > 
> > > > I think a patch for that was
> > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > functionality which still had unresolved issues (IIRC).
> > > 
> > > *nod*
> > > 
> > > > From another angle, would a system that held files open for a
> > > > significant amount of time relative to a one-time write such that close
> > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > susceptible to the same level of free space fragmentation as shown
> > > > above?
> > > 
> > > If the file is held open for writing for a long time, we have to
> > > assume that they are going to write again (and again) so we should
> > > leave the EOF blocks there. If they are writing slower than the eofb
> > > gc, then there's nothing more we can really do in that case...
> > > 
> > 
> > I'm not necessarily sure that this condition is always a matter of
> > writing too slow. It very well may be true, but I'm curious if there are
> > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > where we could end up doing a large number of one time file writes and
> > not doing the release time trim until an underlying extent (with
> > post-eof blocks) has been converted in more cases than not.
> 
> I'm not sure I follow what sort of workload and situation you are
> describing here. Are you talking about the effect of an EOFB gc pass
> during ongoing writes?
> 

I'm not sure if it's an actual reproducible situation.. I'm just
wondering out loud if there are normal workloads that might still defeat
a one-time trim at release time. For example, copy enough files in
parallel such that writeback touches most of them before the copies
complete and we end up trimming physical blocks rather than delalloc
blocks.

This is not so much of a problem if those files are large, I think,
because then the preallocs and the resulting trimmed free space is on
the larger side as well. If we're copying a bunch of little files with
small preallocs, however, then we put ourselves in the pathological
situation shown in Darrick's test.

I was originally thinking about whether this could happen or not on a
highly parallel small file copy workload, but having thought about it a
bit more I think there is a more simple example. What about an untar
like workload that creates small files and calls fsync() before each fd
is released? Wouldn't that still defeat a one-time release heuristic and
produce the same layout issues as shown above? We'd prealloc,
writeback/convert then trim small/spurious fragments of post-eof space
back to the allocator.

> > Perhaps this is more of a question of writeback behavior than anything
> > in XFS, but if that can occur with files on the smaller side it's a
> > clear path to free space fragmentation. I think it's at least worth
> > considering whether we can improve things, but of course this requires
> > significantly more thought and analysis to determine whether that's an
> > actual problem and if the cure is worse than the disease. :P
> > 
> > > > (I'm not sure if/why anybody would actually do that.. userspace
> > > > fs with an fd cache perhaps? It's somewhat besides the point,
> > > > anyways...)).
> > > 
> > > *nod*
> > > 
> > > > More testing and thought is probably required. I _was_ wondering if we
> > > > should consider something like always waiting as long as possible to
> > > > eofb trim already converted post-eof blocks, but I'm not totally
> > > > convinced that actually has value. For files that are not going to see
> > > > any further appends, we may have already lost since the real post-eof
> > > > blocks will end up truncated just the same whether it happens sooner or
> > > > not until inode reclaim.
> > > 
> > > If the writes are far enough apart, then we lose any IO optimisation
> > > advantage of retaining post-eof blocks (induces seeks because
> > > location of new writes is fixed ahead of time). Then it just becomes
> > > a fragmentation avoidance If the writes are slow enough,
> > > fragmentation really doesn't matter a whole lot - it's when writes
> > > are frequent and we trash the post-eof blocks quickly that it
> > > matters.
> > > 
> > 
> > It sounds like what you're saying is that it doesn't really matter
> > either way at this point. There's no perf advantage to keeping the eof
> > blocks in this scenario, but there's also no real harm in deferring the
> > eofb trim of physical post-eof blocks because any future free space
> > fragmentation damage has already been done (assuming no more writes come
> > in).
> 
> Essentially, yes.
> 
> > The thought above was tip-toeing around the idea of (in addition to the
> > one-time trim heuristic you mentioned above) never doing a release time
> > trim of non-delalloc post-eof blocks.
> 
> Except we want to trim blocks in the cases where it's a write
> once file that has been fsync()d or written by O_DIRECT w/ really
> large extent size hints....
> 

We still would trim it. The one time write case is essentially
unaffected because the only advantage of that heuristic is to trim eof
blocks before they are converted. If the eof blocks are real, the
release time heuristic has already failed (i.e., it hasn't provided any
benefit that background trim doesn't already provide).

IOW, what we really want to avoid is trimming (small batches of) unused
physical eof blocks.

> It seems to me that once we start making exceptions to the general
> rule, we end up in a world of corner cases and, most likely,
> unreliable heuristics.
> 

I agree in principle, but my purpose for this discussion is to try and
think about whether we can come up with a better set of rules/behaviors.

> > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > delalloc space accounting. (i.e. allows faster write rates into
> > > cache). IOWs, for anything more than a really small files, we want
> > > to be doing speculative delalloc on the first time the file is
> > > written to.
> > > 
> > 
> > Ok, this paper refers to CPU overhead as it contributes to lack of
> > scalability.
> 
> Well, that was the experiments that were being performed. I'm using
> it as an example of how per-write overhead is actually important to
> throughput. Ignore the "global lock caused overall throughput
> issues" because we don't have that problem any more, and instead
> look at it as a demonstration of "anything that slows down a write()
> reduces per-thread throughput".
> 

Makes sense, and that's how I took it after reading through the paper.
My point was just that I think this is more of a tradeoff and caveat to
consider than something that outright rules out doing less aggressive
preallocation in certain cases.

I ran a few tests yesterday out of curiosity and was able to measure (a
small) difference in single-threaded buffered writes to cache with and
without preallocation. What I found a bit interesting was that my
original attempt to test this actually showed _faster_ throughput
without preallocation because the mechanism I happened to use to bypass
preallocation was an up front truncate. I eventually realized that this
had other effects (i.e., one copy doing size updates vs. the other not
doing so) and just compared a fixed, full size (1G) preallocation with a
fixed 4k preallocation to reproduce the boost provided by the former.
The point here is that while there is such a boost, there are also other
workload dependent factors that are out of our control. For example,
somebody who today cares about preallocation only for a boost on writes
to cache can apparently achieve a greater benefit by truncating the file
up front and disabling preallocation entirely.

Moving beyond the truncate thing, I also saw the benefit of
preallocation diminish as write buffer size was increased. As the write
size increases from 4k to around 64k, the pure performance benefit of
preallocation trailed off to zero. Part of that could also be effects of
less frequent size updates and whatnot due to the larger writes, but I
also don't think that's an uncommon thing in practice.
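
(The buffer size here is just the pwrite I/O size, i.e. varying xfs_io's
-b option with everything else held constant, along the lines of:

	# /mnt/test/file is a placeholder for the test file
	xfs_io -f -c "pwrite -b 64k 0 1g" /mnt/test/file

...run once per buffer size from 4k up to 64k.)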

My only point here is that I don't think it's so cut and dry that we
absolutely need dynamic speculative preallocation for write to cache
performance. There is definitely an existing benefit and
perf/scalability caveats to consider before changing anything as such.
TBH it's probably not worth getting too much into the weeds here because
I don't really have any great solution/alternative in mind at the moment
beyond what has already been proposed. It's much easier to reason about
(and test) an actual prototype of some sort as opposed to handwavy and
poorly described ideas :P, but I also don't think it's reasonable or
wise to restrict ourselves to a particular implementation for what seem
like secondary benefits.

> > It certainly makes sense that a delayed allocation buffered
> > write does more work than a non-alloc write, but this document isn't
> > necessarily claiming that as evidence of measurably faster or slower
> > single threaded buffered writes. Rather, that overhead becomes
> > measurable with enough parallel buffered writes contending on a global
> > allocation critical section (section 5.1 "Spinlocks in Hot Paths").
> 
> It's just an example of how increasing per-write overhead impacts
> performance. The same can be said for all the memory reclaim
> mods - they also reduced per-write overhead because memory
> allocation for the page cache was much, much faster and required
> much less CPU, so each write spent less time allocating pages to
> cache the data to be written. Hence, faster writes....
> 
> > Subsequently, section 6.1.2 addresses that particular issue via the
> > introduction of per-cpu accounting. It looks like there is further
> > discussion around the general link between efficiency and performance,
> > which makes sense, but I don't think that draws a definite conclusion
> > that speculative preallocation needs to be introduced immediately on
> > sequential buffered writes.
> 
> Sure, because the post-EOF preallocation that was already happening
> wasn't the problem, and we couldn't easily address the per-page
> block mapping overhead with generic code changes that would be
> useful in any circumstance (Linux community at the time was hostile
> towards bringing anything useful to XFS into the core kernel code).
> 
> And this was only a small machine (24 CPUs) so we had to address the
> locking problem because increasing speculative prealloc wouldn't
> help solve the global lock contention problems on the 2048p machines
> that SGI was shipping at the time....
> 

Makes sense.

> > What I think it suggests is that we need to
> > consider the potential scalability impact of any prospective change in
> > speculative preallocation behavior (along with the other tradeoffs
> > associated with preallocation) because less aggressive preallocation
> > means more buffered write overhead.
> 
> It's not a scalability problem now because of the fixes made back
> then - it's now just a per-write thread latency and throughput
> issue.
> 
> What is not obvious from the paper is that I designed the XFS
> benchmarks so that it wrote each file into separate directories to
> place them into different AGs so that they didn't fragment due to
> interleaving writes. i.e. so we could directly control the number of
> sequential write streams going to the disks.
> 
> We had to do that to minimise the seek overhead of the disks so
> they'd run at full bandwidth rather than being seek bound - we had
> to keep the 256 disks at >97% bandwidth utilisation to get to
> 10GB/S, so even one extra seek per disk per second and we wouldn't
> get there.
> 
> IOWs, we'd carefully taken anything to do with fragmentation out of
> the picture entirely, both for direct and buffered writes. We were
> just focussed on write efficiency and maximising total throughput
> and that largely made post-EOF preallocation irrelevant to us.
> 

Ok.

> > BTW for historical context.. was speculative preallocation a thing when
> > this paper was written?
> 
> Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> was first made configurable in XFS via the biosize mount option
> added with v3 superblocks in 1997, but the initial linux port only
> allowed up to 64k.
> 

Hmm, Ok.. so it was originally speculative preallocation without the
"dynamic sizing" logic that we have today. Thanks for the background.

> In 2005, the linux mount option allowed biosize to be extended to
> 1GB, which made sense because >4GB allocation groups (mkfs enabled
> them late 2003) were now starting to be widely used and so users
> were reporting new large AG fragmentation issues that had never been
> seen before. i.e.  it was now practical to have contiguous multi-GB
> extents in files and the delalloc code was struggling to create
> them, so having EOF-prealloc be able to make use of that capability
> was needed....
> 
> And then auto-tuning made sense because more and more people were
> having to use the mount option in more general workloads to avoid
> fragmentation.
> 

"auto-tuning" means "dynamic sizing" here, yes? FWIW, much of this
discussion also makes me wonder how appropriate the current size limit
(64k) on preallocation is for today's filesystems and systems (e.g. RAM
availability), as opposed to something larger (on the order of hundreds
of MBs for example, perhaps 256-512MB).

Brian

> And ever since then it's been about tweaking the algorithms to stop
> userspace from defeating it and re-introducing fragmentation
> vectors...
> 
> > The doc suggests per-page block allocation needs
> > to be reduced on XFS,
> 
> Yup. It still stands - that was the specific design premise that
> led to the iomap infrastructure we now have. i.e. do block mapping
> once per write() call, not once per page per write call. Early
> prototypes got us a 10-20% increase in single threaded buffered write
> throughput on slow CPUs, and with the current code some high
> throughput buffered write cases either went up by 30-40% or CPU
> usage went down by that amount.
> 
> > but also indicates that XFS had minimal
> > fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
> > fragmentation avoidance to delayed allocation (not necessarily
> > preallocation).
> 
> Right, because none of the other filesystems had delayed allocation
> and they were interleaving concurrent allocations at write() time.
> i.e. they serialised write() for long periods of time while they did
> global fs allocation, whilst XFS had a low overhead delalloc path
> with a tiny critical global section. i.e. delalloc allowed us to
> defer allocation to writeback where it isn't so throughput critical
> and better allocation decisions can be made.
> 
> IOWs, delalloc was the primary difference in both the fragmentation
> and scalability differences reported in the paper. That XFS did
> small amounts of in-memory-only prealloc beyond EOF during delalloc
> was irrelevant - it's just part of the delalloc mechanism in this
> context....
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-14 13:00                     ` Brian Foster
@ 2019-02-14 21:51                       ` Dave Chinner
  2019-02-15  2:35                         ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-14 21:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > I don't recall the exact history, but I don't think this was always the
> > > > > case for extsz hints.
> > > > 
> > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > could corrupt or expose stale data and caused all sorts of problems
> > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > requirements for extent size hints. Hence to make extent size hints
> > > > work for buffered writes, I simply made them work the same way as
> > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > 
> > > > > So would the close time eofb trim be as
> > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > changed back to using delayed allocation?
> > > > 
> > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > write-many, then we'd retain the post-eof blocks...
> > > > 
> > > > > I think a patch for that was
> > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > functionality which still had unresolved issues (IIRC).
> > > > 
> > > > *nod*
> > > > 
> > > > > From another angle, would a system that held files open for a
> > > > > significant amount of time relative to a one-time write such that close
> > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > susceptible to the same level of free space fragmentation as shown
> > > > > above?
> > > > 
> > > > If the file is held open for writing for a long time, we have to
> > > > assume that they are going to write again (and again) so we should
> > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > gc, then there's nothing more we can really do in that case...
> > > > 
> > > 
> > > I'm not necessarily sure that this condition is always a matter of
> > > writing too slow. It very well may be true, but I'm curious if there are
> > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > where we could end up doing a large number of one time file writes and
> > > not doing the release time trim until an underlying extent (with
> > > post-eof blocks) has been converted in more cases than not.
> > 
> > I'm not sure I follow what sort of workload and situation you are
> > describing here. Are you talking about the effect of an EOFB gc pass
> > during ongoing writes?
> > 
> 
> I'm not sure if it's an actual reproducible situation.. I'm just
> wondering out loud if there are normal workloads that might still defeat
> a one-time trim at release time. For example, copy enough files in
> parallel such that writeback touches most of them before the copies
> complete and we end up trimming physical blocks rather than delalloc
> blocks.
> 
> This is not so much of a problem if those files are large, I think,
> because then the preallocs and the resulting trimmed free space is on
> the larger side as well. If we're copying a bunch of little files with
> small preallocs, however, then we put ourselves in the pathological
> situation shown in Darrick's test.
> 
> I was originally thinking about whether this could happen or not on a
> highly parallel small file copy workload, but having thought about it a
> bit more I think there is a more simple example. What about an untar
> like workload that creates small files and calls fsync() before each fd
> is released?

Which is the same as an O_SYNC write if it's the same fd, which
means we'll trim allocated blocks on close. i.e. it's no different
to the current behaviour.  If such files are written in parallel then,
again, it is no different to the existing behaviour. i.e. it
largely depends on the timing of allocation in writeback and EOF
block clearing in close(). If close happens before the next
allocation in that AG, then they'll pack because there's no EOF
blocks that push out the new allocation.  If it's the other way
around, we get some level of freespace fragmentation.

This is mitigated by the fact that workloads like this tend to be
separated into separate AGs - allocation first attempts to be
non-blocking which means if it races with a free in progress it will
skip to the next AG and it won't pack sequentially, anyway. So I
don't think this is a major concern in terms of free space
fragmentation - the allocator AG selection makes it impossible to
predict what happens when racing alloc/frees occur in a single AG
and we really try to avoid that as much as possible, anyway.

If it's a separate open/fsync/close pass after all the writes (and
close) then the behaviour is the same with either the existing code
or the "trim on first close" behaviour - and the delalloc blocks
beyond EOF will get killed on the first close and not be present for
the writeback, and all is good.

> Wouldn't that still defeat a one-time release heuristic and
> produce the same layout issues as shown above? We'd prealloc,
> writeback/convert then trim small/spurious fragments of post-eof space
> back to the allocator.

Yes, but as far as I can reason (sanely), "trim on first close" is
no worse than the existing heuristic in these cases, whilst being
significantly better in others we have observed in the wild that
cause problems...

I'm not looking for perfect here, just "better with no obvious
regressions". We can't predict every situation, so if it deals with
all the problems we've had reported and a few similar cases we don't
currently handle as well, then we should run with that and not really
worry about the cases that it (or the existing code) does not
solve until we have evidence that those workloads exist and are
causing real world problems. It's a difficult enough issue to reason
about without making it more complex by playing "what about" games..

> > > It sounds like what you're saying is that it doesn't really matter
> > > either way at this point. There's no perf advantage to keeping the eof
> > > blocks in this scenario, but there's also no real harm in deferring the
> > > eofb trim of physical post-eof blocks because any future free space
> > > fragmentation damage has already been done (assuming no more writes come
> > > in).
> > 
> > Essentially, yes.
> > 
> > > The thought above was tip-toeing around the idea of (in addition to the
> > > one-time trim heuristic you mentioned above) never doing a release time
> > > trim of non-delalloc post-eof blocks.
> > 
> > Except we want to trim blocks in the cases where it's a write
> > once file that has been fsync()d or written by O_DIRECT w/ really
> > large extent size hints....
> 
> We still would trim it. The one time write case is essentially
> unaffected because the only advantage of that heuristic is to trim eof
> blocks before they are converted. If the eof blocks are real, the
> release time heuristic has already failed (i.e., it hasn't provided any
> benefit that background trim doesn't already provide).

No, all it means is that the blocks were allocated before the fd was
closed, not that the release time heuristic failed. The release time
heuristic is deciding what to do /after/ the writes have been
completed, whatever the post-eof situation is. It /can't fail/ if it
hasn't been triggered before physical allocation has been done, it
can only decide what to do about those extents once it is called...

In which case, if it's a write-once file we want to kill the blocks,
no matter whether they are allocated or not. And if it's a repeated
open/X/close workload, then a single removal of EOF blocks won't
greatly impact the file layout because subsequent closes won't
trigger.

IOWs, in the general case it does the right thing, and when it's
wrong the impact is negated by the fact it will do the right thing
on all future closes on that inode....

> IOW, what we really want to avoid is trimming (small batches of) unused
> physical eof blocks.

For the general case, I disagree. :)

> > > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > > delalloc space accounting. (i.e. allows faster write rates into
> > > > cache). IOWs, for anything more than a really small files, we want
> > > > to be doing speculative delalloc on the first time the file is
> > > > written to.
> > > > 
> > > 
> > > Ok, this paper refers to CPU overhead as it contributes to lack of
> > > scalability.
> > 
> > Well, that was the experiments that were being performed. I'm using
> > it as an example of how per-write overhead is actually important to
> > throughput. Ignore the "global lock caused overall throughput
> > issues" because we don't have that problem any more, and instead
> > look at it as a demonstration of "anything that slows down a write()
> > reduces per-thread throughput".
> > 
> 
> Makes sense, and that's how I took it after reading through the paper.
> My point was just that I think this is more of a tradeoff and caveat to
> consider than something that outright rules out doing less aggressive
> preallocation in certain cases.
> 
> I ran a few tests yesterday out of curiosity and was able to measure (a
> small) difference in single-threaded buffered writes to cache with and
> without preallocation. What I found a bit interesting was that my
> original attempt to test this actually showed _faster_ throughput
> without preallocation because the mechanism I happened to use to bypass
> preallocation was an up front truncate.

So, like:

$ xfs_io -f -t -c "pwrite 0 1g" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.6582 sec (1.519 GiB/sec and 398224.4691 ops/sec)
$

Vs:

$ xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.7159 sec (1.397 GiB/sec and 366147.4512 ops/sec)
$

I get an average of 1.53GiB/s write rate into cache with
speculative prealloc active, and only 1.36GiB/s write rate in cache
without it. I've run these 10 times each, ranges are
1.519-1.598GiB/s for prealloc and 1.291-1.361GiB/s when using
truncate to prevent prealloc.

IOWs, on my test machine, the write rate into cache using 4kB writes
is over 10% faster with prealloc enabled. And to point out the
diminishing returns, with a write size of 1MB:

$  xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0.4975 sec (2.010 GiB/sec and 2058.1922 ops/sec)

The throughput with or without prealloc average out just over
2GiB/s and the difference is noise (2.03GiB/s vs 2.05GiB/s)

Note: I'm just measuring write-in rates here, and I've isolated them
from writeback completely because the file size is small enough
that dirty throttling (20% of memory) isn't kicking in because I
have 16GB of RAM on this machine. If I go over 3GB in file size,
dirty throttling and writeback kicks in and the results are a
complete crap-shoot because it's dependent on when writeback kicks
in and how the writeback rate ramps up in the first second of
writeback.
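
(That 20% is just the default vm.dirty_ratio:

$ sysctl vm.dirty_ratio
vm.dirty_ratio = 20

which on a 16GB machine works out to a bit over 3GB of dirty page
cache before throttling kicks in.)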

> I eventually realized that this
> had other effects (i.e., one copy doing size updates vs. the other not
> doing so) and just compared a fixed, full size (1G) preallocation with a
> fixed 4k preallocation to reproduce the boost provided by the former.

That only matters for *writeback overhead*, not ingest efficiency.
Indeed, using a preallocated extent:

$ time sudo xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.6610 sec (1.513 GiB/sec and 396557.5926 ops/sec)


$ xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0.4946 sec (2.022 GiB/sec and 2070.0795 ops/sec)

The write-in rates are identical to when speculative prealloc is
enabled (always hitting a preexisting extent and not having to
extend an extent on every iomap lookup). If I time the fsync, then
things are different because there's additional writeback overhead
(i.e. unwritten extent conversion), but this does not affect the
per-write cache ingest CPU overhead.

> The point here is that while there is such a boost, there are also other
> workload dependent factors that are out of our control. For example,
> somebody who today cares about preallocation only for a boost on writes
> to cache can apparently achieve a greater benefit by truncating the file
> up front and disabling preallocation entirely.

Can you post your tests and results, because I'm getting very
different results from running what I think are the same tests.

> Moving beyond the truncate thing, I also saw the benefit of
> preallocation diminish as write buffer size was increased. As the write
> size increases from 4k to around 64k, the pure performance benefit of
> preallocation trailed off to zero.

Yes, because we've optimised the overhead out of larger writes with
the iomap infrastructure. This used to be a whole lot worse when we
did a delalloc mapping for every page instead of one per write()
call. IOWs, a significant part of the 30-35% speed increase you see in my numbers above
going from 4kB writes to 1MB writes is a result of iomap dropping
the number of delalloc mapping calls by a factor of 256....

> Part of that could also be effects of
> less frequent size updates and whatnot due to the larger writes, but I
> also don't think that's an uncommon thing in practice.

Again, size updates are only done on writeback, not ingest....

> My only point here is that I don't think it's so cut and dry that we
> absolutely need dynamic speculative preallocation for write to cache
> performance.

History, experience with both embedded and high end NAS machines
(where per-write CPU usage really matters!) and my own experiments
tell me a different story :/

> > > BTW for historical context.. was speculative preallocation a thing when
> > > this paper was written?
> > 
> > Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> > was first made configurable in XFS via the biosize mount option
> > added with v3 superblocks in 1997, but the initial linux port only
> > allowed up to 64k.
> > 
> 
> Hmm, Ok.. so it was originally speculative preallocation without the
> "dynamic sizing" logic that we have today. Thanks for the background.
> 
> > In 2005, the linux mount option allowed biosize to be extended to
> > 1GB, which made sense because >4GB allocation groups (mkfs enabled
> > them late 2003) were now starting to be widely used and so users
> > were reporting new large AG fragmentation issues that had never been
> > seen before. i.e.  it was now practical to have contiguous multi-GB
> > extents in files and the delalloc code was struggling to create
> > them, so having EOF-prealloc be able to make use of that capability
> > was needed....
> > 
> > And then auto-tuning made sense because more and more people were
> > having to use the mount option in more general workloads to avoid
> > fragmentation.
> > 
> 
> "auto-tuning" means "dynamic sizing" here, yes?

Yes.

> FWIW, much of this
> discussion also makes me wonder how appropriate the current size limit
> (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> availability), as opposed to something larger (on the order of hundreds
> of MBs for example, perhaps 256-512MB).

RAM size is irrelevant. What matters is file size and the impact
of allocation patterns on writeback IO patterns. i.e. the size limit
is about optimising writeback, not preventing fragmentation or
making more efficient use of memory, etc.

i.e. when we have lots of small files, we want writeback to pack
them so we get multiple-file sequentialisation of the write stream -
this makes things like untarring a kernel tarball (which is a large
number of small files) a sequential write workload rather than a
seek-per-individual-file-write workload. That makes a massive
difference to performance on spinning disks, and that's what the 64k
threshold (and post-EOF block removal on close for larger files)
tries to preserve.

Realistically, we should probably change that 64k threshold to match
sunit if it is set, so that we really do end up trying to pack
any write once file smaller than sunit as the allocator won't try
to align them, anyway....
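
(i.e. whatever stripe unit the filesystem was made with, e.g.:

xfs_info /mnt/scratch | grep sunit	# mount point is a placeholder

and if sunit is zero the existing 64k threshold would still apply.)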

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-14 21:51                       ` Dave Chinner
@ 2019-02-15  2:35                         ` Brian Foster
  2019-02-15  7:23                           ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-15  2:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > case for extsz hints.
> > > > > 
> > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > work for buffered writes, I simply made them work the same way as
> > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > 
> > > > > > So would the close time eofb trim be as
> > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > changed back to using delayed allocation?
> > > > > 
> > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > write-many, then we'd retain the post-eof blocks...
> > > > > 
> > > > > > I think a patch for that was
> > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > functionality which still had unresolved issues (IIRC).
> > > > > 
> > > > > *nod*
> > > > > 
> > > > > > From another angle, would a system that held files open for a
> > > > > > significant amount of time relative to a one-time write such that close
> > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > above?
> > > > > 
> > > > > If the file is held open for writing for a long time, we have to
> > > > > assume that they are going to write again (and again) so we should
> > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > gc, then there's nothing more we can really do in that case...
> > > > > 
> > > > 
> > > > I'm not necessarily sure that this condition is always a matter of
> > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > where we could end up doing a large number of one time file writes and
> > > > not doing the release time trim until an underlying extent (with
> > > > post-eof blocks) has been converted in more cases than not.
> > > 
> > > I'm not sure I follow what sort of workload and situation you are
> > > describing here. Are you talking about the effect of an EOFB gc pass
> > > during ongoing writes?
> > > 
> > 
> > I'm not sure if it's an actual reproducible situation.. I'm just
> > wondering out loud if there are normal workloads that might still defeat
> > a one-time trim at release time. For example, copy enough files in
> > parallel such that writeback touches most of them before the copies
> > complete and we end up trimming physical blocks rather than delalloc
> > blocks.
> > 
> > This is not so much of a problem if those files are large, I think,
> > because then the preallocs and the resulting trimmed free space is on
> > the larger side as well. If we're copying a bunch of little files with
> > small preallocs, however, then we put ourselves in the pathological
> > situation shown in Darrick's test.
> > 
> > I was originally thinking about whether this could happen or not on a
> > highly parallel small file copy workload, but having thought about it a
> > bit more I think there is a more simple example. What about an untar
> > like workload that creates small files and calls fsync() before each fd
> > is released?
> 
> Which is the same as an O_SYNC write if it's the same fd, which
> means we'll trim allocated blocks on close. i.e. it's no different
> to the current behaviour.  If such files are written in parallel then,
> again, it is no different to the existing behaviour. i.e. it
> largely depends on the timing of allocation in writeback and EOF
> block clearing in close(). If close happens before the next
> allocation in that AG, then they'll pack because there's no EOF
> blocks that push out the new allocation.  If it's the other way
> around, we get some level of freespace fragmentation.
> 

I know it's the same as the current behavior. ;P I think we're talking
past each other on this. What I'm saying is that the downside to the
current behavior is that a simple copy file -> fsync -> copy next file
workload fragments free space.

Darrick demonstrated this better in his random size test with the
release time trim removed, but a simple loop to write one thousand 100k
files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
behavior:
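
For reference, the loop was essentially just:

	# /mnt/test is a placeholder for the scratch mount
	for i in $(seq 0 999); do
		xfs_io -f -c "pwrite 0 100k" -c fsync /mnt/test/file.$i
	done

which leaves free space looking like this: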

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
   from      to extents  blocks    pct
      1       1      18      18   0.00
     16      31       1      25   0.00
     32      63       1      58   0.00
 131072  262143     924 242197739  24.97
 262144  524287       1  365696   0.04
134217728 242588672       3 727292183  74.99

vs. the same test without the fsync:

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
   from      to extents  blocks    pct
      1       1      16      16   0.00
     16      31       1      20   0.00
4194304 8388607       2 16752060   1.73
134217728 242588672       4 953103627  98.27

Clearly there is an advantage to trimming before delalloc conversion.

Random thought: perhaps a one time trim at writeback (where writeback
would convert delalloc across eof) _or_ release time, whichever happens
first on an inode with preallocation, might help mitigate this problem.

> This is mitigated by the fact that workloads like this tend to be
> separated into separate AGs - allocation first attempts to be
> non-blocking which means if it races with a free in progress it will
> skip to the next AG and it won't pack sequentially, anyway. So I
> don't think this is a major concern in terms of free space
> fragmentation - the allocator AG selection makes it impossible to
> predict what happens when racing alloc/frees occur in a single AG
> and we really try to avoid that as much as possible, anyway.
> 

Indeed..

> If it's a separate open/fsync/close pass after all the writes (and
> close) then the behaviour is the same with either the existing code
> or the "trim on first close" behaviour - and the delalloc blocks
> beyond EOF will get killed on the first close and not be present for
> the writeback, and all is good.
> 
> > Wouldn't that still defeat a one-time release heuristic and
> > produce the same layout issues as shown above? We'd prealloc,
> > writeback/convert then trim small/spurious fragments of post-eof space
> > back to the allocator.
> 
> Yes, but as far as I can reason (sanely), "trim on first close" is
> no worse than the existing heuristic in these cases, whilst being
> significantly better in others we have observed in the wild that
> cause problems...
> 

Agreed. Again, I'm thinking about this assuming that fix is already in
place.

> I'm not looking for perfect here, just "better with no obvious
> regressions". We can't predict every situation, so if it deals with
> all the problems we've had reported and a few similar cases we don't
> currently handle as well, then we should run with that and not really
> worry about the cases that it (or the existing code) does not
> solve until we have evidence that those workloads exist and are
> causing real world problems. It's a difficult enough issue to reason
> about without making it more complex by playing "what about" games..
> 

That's fair, but we have had users run into this situation. The whole
sparse inodes thing is partially a workaround for side effects of this
problem (free space fragmentation being so bad we can't allocate
inodes). Granted, some of those users may have also been able to avoid
that problem with better fs usage/configuration.

> > > > It sounds like what you're saying is that it doesn't really matter
> > > > either way at this point. There's no perf advantage to keeping the eof
> > > > blocks in this scenario, but there's also no real harm in deferring the
> > > > eofb trim of physical post-eof blocks because any future free space
> > > > fragmentation damage has already been done (assuming no more writes come
> > > > in).
> > > 
> > > Essentially, yes.
> > > 
> > > > The thought above was tip-toeing around the idea of (in addition to the
> > > > one-time trim heuristic you mentioned above) never doing a release time
> > > > trim of non-delalloc post-eof blocks.
> > > 
> > > Except we want to trim blocks in the cases where it's a write
> > > once file that has been fsync()d or written by O_DIRECT w/ really
> > > large extent size hints....
> > 
> > We still would trim it. The one time write case is essentially
> > unaffected because the only advantage of that heuristic is to trim eof
> > blocks before they are converted. If the eof blocks are real, the
> > release time heuristic has already failed (i.e., it hasn't provided any
> > benefit that background trim doesn't already provide).
> 
> No, all it means is that the blocks were allocated before the fd was
> closed, not that the release time heuristic failed. The release time
> heuristic is deciding what to do /after/ the writes have been
> completed, whatever the post-eof situation is. It /can't fail/ if it
> hasn't been triggered before physical allocation has been done, it
> can only decide what to do about those extents once it is called...
> 

Confused. By "failed," I mean we physically allocated blocks that were
never intended to be used. This is basically referring to the negative
effect of delalloc conversion -> eof trim behavior on once written files
demonstrated above. If this negative effect didn't exist, we wouldn't
need the release time trim at all and could just rely on background
trim.

> In which case, if it's a write-once file we want to kill the blocks,
> no matter whether they are allocated or not. And if it's a repeated
> open/X/close workload, then a single removal of EOF blocks won't
> greatly impact the file layout because subsequent closes won't
> trigger.
> 
> IOWs, in the general case it does the right thing, and when it's
> wrong the impact is negated by the fact it will do the right thing
> on all future closes on that inode....
> 
> > IOW, what we really want to avoid is trimming (small batches of) unused
> > physical eof blocks.
> 
> For the general case, I disagree. :)
> 
> > > > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > > > delalloc space accounting. (i.e. allows faster write rates into
> > > > > cache). IOWs, for anything more than a really small files, we want
> > > > > to be doing speculative delalloc on the first time the file is
> > > > > written to.
> > > > > 
> > > > 
> > > > Ok, this paper refers to CPU overhead as it contributes to lack of
> > > > scalability.
> > > 
> > > Well, that was the experiments that were being performed. I'm using
> > > it as an example of how per-write overhead is actually important to
> > > throughput. Ignore the "global lock caused overall throughput
> > > issues" because we don't have that problem any more, and instead
> > > look at it as a demonstration of "anything that slows down a write()
> > > reduces per-thread throughput".
> > > 
> > 
> > Makes sense, and that's how I took it after reading through the paper.
> > My point was just that I think this is more of a tradeoff and caveat to
> > consider than something that outright rules out doing less aggressive
> > preallocation in certain cases.
> > 
> > I ran a few tests yesterday out of curiosity and was able to measure (a
> > small) difference in single-threaded buffered writes to cache with and
> > without preallocation. What I found a bit interesting was that my
> > original attempt to test this actually showed _faster_ throughput
> > without preallocation because the mechanism I happened to use to bypass
> > preallocation was an up front truncate.
> 
> So, like:
> 
> $ xfs_io -f -t -c "pwrite 0 1g" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6582 sec (1.519 GiB/sec and 398224.4691 ops/sec)
> $
> 
> Vs:
> 
> $ xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.7159 sec (1.397 GiB/sec and 366147.4512 ops/sec)
> $
> 
> I get an average of 1.53GiB/s write rate into cache with
> speculative prealloc active, and only 1.36GiB/s write rate in cache
> without it. I've run these 10 times each, ranges are
> 1.519-1.598GiB/s for prealloc and 1.291-1.361GiB/s when using
> truncate to prevent prealloc.
> 

Yes, that's basically the same test. I repeated it again and saw pretty
much the same numbers here. Hmm, not sure what happened there. Perhaps I
fat fingered something or mixed in a run with a different buffer size
for the truncate case. I recall seeing right around ~1.5GB/s for the
base case and closer to ~1.6GB/s for the truncate case somehow or
another, but I don't have a record of it to figure out what happened.
Anyways, sorry for the noise.. I did ultimately reproduce the prealloc
boost and still see the diminishing returns at about 64k or so (around
~2.7GB/s with or without preallocation).

> IOWs, on my test machine, the write rate into cache using 4kB writes
> is over 10% faster with prealloc enabled. And to point out the
> diminishing returns, with a write size of 1MB:
> 
> $  xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4975 sec (2.010 GiB/sec and 2058.1922 ops/sec)
> 
> The throughput with or without prealloc average out just over
> 2GiB/s and the difference is noise (2.03GiB/s vs 2.05GiB/s)
> 
> Note: I'm just measuring write-in rates here, and I've isolated them
> from writeback completely because the file size is small enough
> that dirty throttling (20% of memory) isn't kicking in because I
> have 16GB of RAM on this machine. If I go over 3GB in file size,
> dirty throttling and writeback kicks in and the results are a
> complete crap-shoot because it's dependent on when writeback kicks
> in and how the writeback rate ramps up in the first second of
> writeback.
> 

*nod*

> > I eventually realized that this
> > had other effects (i.e., one copy doing size updates vs. the other not
> > doing so) and just compared a fixed, full size (1G) preallocation with a
> > fixed 4k preallocation to reproduce the boost provided by the former.
> 
> That only matters for *writeback overhead*, not ingest efficiency.
> Indeed, using a preallocated extent:
> 

Not sure how we got into physical preallocation here. Assume any
reference to "preallocation" in this thread by me refers to post-eof
speculative preallocation. The above preallocation tweaks were
controlled in my tests via the allocsize mount option, not physical
block preallocation.
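
(i.e. roughly something like the following for the fixed 1G vs. fixed
4k comparison, with the device and mount point as placeholders:

	# fixed 1G EOF prealloc vs. effectively none beyond the write
	mount -o allocsize=1g /dev/sdX /mnt/scratch
	mount -o allocsize=4k /dev/sdX /mnt/scratch

...and the same pwrite-to-cache test run against each mount.)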

> $ time sudo xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6610 sec (1.513 GiB/sec and 396557.5926 ops/sec)
> 
> 
> $ xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4946 sec (2.022 GiB/sec and 2070.0795 ops/sec)
> 
> The write-in rates are identical to when speculative prealloc is
> enabled (always hitting a preexisting extent and not having to
> extend an extent on every iomap lookup). If I time the fsync, then
> things are different because there's additional writeback overhead
> (i.e. unwritten extent conversion), but this does not affect the
> per-write cache ingest CPU overhead.
> 
> > The point here is that while there is such a boost, there are also other
> > workload dependent factors that are out of our control. For example,
> > somebody who today cares about preallocation only for a boost on writes
> > to cache can apparently achieve a greater benefit by truncating the file
> > up front and disabling preallocation entirely.
> 
> Can you post your tests and results, because I'm getting very
> different results from running what I think are the same tests.
> 
> > Moving beyond the truncate thing, I also saw the benefit of
> > preallocation diminish as write buffer size was increased. As the write
> > size increases from 4k to around 64k, the pure performance benefit of
> > preallocation trailed off to zero.
> 
> Yes, because we've optimised the overhead out of larger writes with
> the iomap infrastructure. This used to be a whole lot worse when we
> did a delalloc mapping for every page instead of one per write()
> call. IOWs, a significant part of the 30-35% speed increase you see in my numbers above
> going from 4kB writes to 1MB writes is a result of iomap dropping
> the number of delalloc mapping calls by a factor of 256....
> 
> > Part of that could also be effects of
> > less frequent size updates and whatnot due to the larger writes, but I
> > also don't think that's an uncommon thing in practice.
> 
> Again, size updates are only done on writeback, not ingest....
> 
> > My only point here is that I don't think it's so cut and dry that we
> > absolutely need dynamic speculative preallocation for write to cache
> > performance.
> 
> History, experience with both embedded and high end NAS machines
> (where per-write CPU usage really matters!) and my own experiments
> tell me a different story :/
> 
> > > > BTW for historical context.. was speculative preallocation a thing when
> > > > this paper was written?
> > > 
> > > Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> > > was first made configurable in XFS via the biosize mount option
> > > added with v3 superblocks in 1997, but the initial linux port only
> > > allowed up to 64k.
> > > 
> > 
> > Hmm, Ok.. so it was originally speculative preallocation without the
> > "dynamic sizing" logic that we have today. Thanks for the background.
> > 
> > > In 2005, the linux mount option allowed biosize to be extended to
> > > 1GB, which made sense because >4GB allocation groups (mkfs enabled
> > > them late 2003) were now starting to be widely used and so users
> > > were reporting new large AG fragmentation issues that had never been
> > > seen before. i.e.  it was now practical to have contiguous multi-GB
> > > extents in files and the delalloc code was struggling to create
> > > them, so having EOF-prealloc be able to make use of that capability
> > > was needed....
> > > 
> > > And then auto-tuning made sense because more and more people were
> > > having to use the mount option in more general workloads to avoid
> > > fragmentation.
> > > 
> > 
> > "auto-tuning" means "dynamic sizing" here, yes?
> 
> Yes.
> 
> > FWIW, much of this
> > discussion also makes me wonder how appropriate the current size limit
> > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > availability), as opposed to something larger (on the order of hundreds
> > of MBs for example, perhaps 256-512MB).
> 
> RAM size is irrelevant. What matters is file size and the impact
> of allocation patterns on writeback IO patterns. i.e. the size limit
> is about optimising writeback, not preventing fragmentation or
> making more efficient use of memory, etc.
> 

I'm just suggesting that the more RAM that is available, the more we're
able to write into cache before writeback starts and thus the larger
physical extents we're able to allocate independent of speculative
preallocation (perf issues notwithstanding).

> i.e. when we have lots of small files, we want writeback to pack
> them so we get multiple-file sequentialisation of the write stream -
> this makes things like untarring a kernel tarball (which is a large
> number of small files) a sequential write workload rather than a
> seek-per-individual-file-write workload. That makes a massive
> difference to performance on spinning disks, and that's what the 64k
> threshold (and post-EOF block removal on close for larger files)
> tries to preserve.
> 

Sure, but what's the downside to increasing that threshold to even
something on the order of MBs? Wouldn't that at least help us leave
larger free extents around in those workloads/patterns that do fragment
free space?

> Realistically, we should probably change that 64k threshold to match
> sunit if it is set, so that we really do end up trying to pack
> any write once file smaller than sunit as the allocator won't try
> to align them, anyway....
> 

That makes sense.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-15  2:35                         ` Brian Foster
@ 2019-02-15  7:23                           ` Dave Chinner
  2019-02-15 20:33                             ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-15  7:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 14, 2019 at 09:35:21PM -0500, Brian Foster wrote:
> On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> > On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > > case for extsz hints.
> > > > > > 
> > > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > > work for buffered writes, I simply made them work the same way as
> > > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > > 
> > > > > > > So would the close time eofb trim be as
> > > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > > changed back to using delayed allocation?
> > > > > > 
> > > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > > write-many, then we'd retain the post-eof blocks...
> > > > > > 
> > > > > > > I think a patch for that was
> > > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > > functionality which still had unresolved issues (IIRC).
> > > > > > 
> > > > > > *nod*
> > > > > > 
> > > > > > > From another angle, would a system that held files open for a
> > > > > > > significant amount of time relative to a one-time write such that close
> > > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > > above?
> > > > > > 
> > > > > > If the file is held open for writing for a long time, we have to
> > > > > > assume that they are going to write again (and again) so we should
> > > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > > gc, then there's nothing more we can really do in that case...
> > > > > > 
> > > > > 
> > > > > I'm not necessarily sure that this condition is always a matter of
> > > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > > where we could end up doing a large number of one time file writes and
> > > > > not doing the release time trim until an underlying extent (with
> > > > > post-eof blocks) has been converted in more cases than not.
> > > > 
> > > > I'm not sure I follow what sort of workload and situation you are
> > > > describing here. Are you talking about the effect of an EOFB gc pass
> > > > during ongoing writes?
> > > > 
> > > 
> > > I'm not sure if it's an actual reproducible situation.. I'm just
> > > wondering out loud if there are normal workloads that might still defeat
> > > a one-time trim at release time. For example, copy enough files in
> > > parallel such that writeback touches most of them before the copies
> > > complete and we end up trimming physical blocks rather than delalloc
> > > blocks.
> > > 
> > > This is not so much of a problem if those files are large, I think,
> > > because then the preallocs and the resulting trimmed free space is on
> > > the larger side as well. If we're copying a bunch of little files with
> > > small preallocs, however, then we put ourselves in the pathological
> > > situation shown in Darrick's test.
> > > 
> > > I was originally thinking about whether this could happen or not on a
> > > highly parallel small file copy workload, but having thought about it a
> > > bit more I think there is a more simple example. What about an untar
> > > like workload that creates small files and calls fsync() before each fd
> > > is released?
> > 
> > Which is the same as an O_SYNC write if it's the same fd, which
> > means we'll trim allocated blocks on close. i.e. it's no different
> > to the current behaviour.  If such files are written in parallel then,
> > again, it is no different to the existing behaviour. i.e. it
> > largely depends on the timing of allocation in writeback and EOF
> > block clearing in close(). If close happens before the next
> > allocation in that AG, then they'll pack because there's no EOF
> > blocks that push out the new allocation.  If it's the other way
> > around, we get some level of freespace fragmentation.
> > 
> 
> I know it's the same as the current behavior. ;P I think we're talking
> past each other on this.

Probably :P

> What I'm saying is that the downside to the
> current behavior is that a simple copy file -> fsync -> copy next file
> workload fragments free space.

Yes. But it's also one of the cases that "always release on first
close" fixes.

> Darrick demonstrated this better in his random size test with the
> release time trim removed, but a simple loop to write one thousand 100k
> files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
> behavior:
> 
> # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
>    from      to extents  blocks    pct
>       1       1      18      18   0.00
>      16      31       1      25   0.00
>      32      63       1      58   0.00
>  131072  262143     924 242197739  24.97
>  262144  524287       1  365696   0.04
> 134217728 242588672       3 727292183  74.99

That should not be leaving freespace fragments behind - it should be
trimming the EOF blocks on close after fsync() and the next
allocation should pack tightly.

/me goes off to trace it because it's not doing what he knows it
should be doing.

Ngggh. Busy extent handling.

Basically, the extent we trimmed is busy because it is a user data
extent that has been freed and not yet committed (even though it was
not used), so it gets trimmed out of the free space range that is
allocated.

IOWs, how we handle busy extents results in this behaviour, not the
speculative prealloc which has already been removed and returned to
the free space pool....
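
For reference, the per-file pattern that trips over this boils down to
the sketch below: buffered write, fsync, close, then move on to the
next file. It simply mirrors the xfs_io loop quoted above; the file
count, size and names are arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[100 * 1024];	/* 100k per file, as in the test */
	char path[64];
	int i, fd;

	memset(buf, 0xab, sizeof(buf));

	for (i = 0; i < 1000; i++) {
		snprintf(path, sizeof(path), "file.%d", i);
		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0)
			return 1;

		/* buffered write, then fsync forces writeback and allocation */
		if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
			return 1;
		fsync(fd);

		/* close() is where the post-EOF trim decision is made */
		close(fd);
	}
	return 0;
}

Running xfs_db "freesp -s" afterwards, as above, shows whether the
files packed tightly or left small free space fragments behind.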

> vs. the same test without the fsync:
> 
> # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
>    from      to extents  blocks    pct
>       1       1      16      16   0.00
>      16      31       1      20   0.00
> 4194304 8388607       2 16752060   1.73
> 134217728 242588672       4 953103627  98.27
> 
> Clearly there is an advantage to trimming before delalloc conversion.

The advantage is in avoiding busy extents by trimming before
physical allocation occurs. I think we need to fix the busy extent
handling here, not the speculative prealloc...

> Random thought: perhaps a one time trim at writeback (where writeback
> would convert delalloc across eof) _or_ release time, whichever happens
> first on an inode with preallocation, might help mitigate this problem.

Not sure how you'd do that reliably - there's no serialisation with
incoming writes, and no context as to what is changing EOF. Hence I
don't know how we'd decide that trimming was required or not...

> > I'm not looking for perfect here, just "better with no obvious
> > regressions". We can't predict every situation, so if it deals with
> > all the problems we've had reported and a few similar cases we don't
> > currently handle as well, then we should run with that and not really
> > worry about the cases that it (or the existing code) does not
> > solve until we have evidence that those workloads exist and are
> > causing real world problems. It's a difficult enough issue to reason
> > about without making it more complex by playing "what about" games..
> > 
> 
> That's fair, but we have had users run into this situation. The whole
> sparse inodes thing is partially a workaround for side effects of this
> problem (free space fragmentation being so bad we can't allocate
> inodes). Granted, some of those users may have also been able to avoid
> that problem with better fs usage/configuration.

I think systems that required sparse inodes are orthogonal - those
issues could be caused just with well packed small files and large
inodes, and had nothing in common with the workload we've recently
seen. Indeed, in the cases other than the specific small file
workload gluster used to fragment free space to prevent inode
allocation, we never got to the bottom of what caused the free space
fragmentation. All we could do is make the fs more tolerant of
freespace fragmentation.

This time, we have direct evidence of what caused freespace
fragmentation on this specific system. It was caused by the app
doing something bizarre and we can extrapolate several similar
behaviours from that workload.

But beyond that, we're well into "whatabout" and "whatif" territory.

> > > We still would trim it. The one time write case is essentially
> > > unaffected because the only advantage of that heuristic is to trim eof
> > > blocks before they are converted. If the eof blocks are real, the
> > > release time heuristic has already failed (i.e., it hasn't provided any
> > > benefit that background trim doesn't already provide).
> > 
> > No, all it means is that the blocks were allocated before the fd was
> > closed, not that the release time heuristic failed. The release time
> > heuristic is deciding what to do /after/ the writes have been
> > completed, whatever the post-eof situation is. It /can't fail/ if it
> > hasn't been triggered before physical allocation has been done, it
> > can only decide what to do about those extents once it is called...
> > 
> 
> Confused. By "failed," I mean we physically allocated blocks that were
> never intended to be used.

How can we know they are never intended to be used at writeback
time?

> This is basically referring to the negative
> effect of delalloc conversion -> eof trim behavior on once written files
> demonstrated above. If this negative effect didn't exist, we wouldn't
> need the release time trim at all and could just rely on background
> trim.

We also cannot predict what the application intends. Hence we have
heuristics that trigger once the application signals that it is
"done with this file". i.e. it has closed the fd.

> > > I eventually realized that this
> > > had other effects (i.e., one copy doing size updates vs. the other not
> > > doing so) and just compared a fixed, full size (1G) preallocation with a
> > > fixed 4k preallocation to reproduce the boost provided by the former.
> > 
> > That only matters for *writeback overhead*, not ingest efficiency.
> > Indeed, using a preallocated extent:
> > 
> 
> Not sure how we got into physical preallocation here. Assume any
> reference to "preallocation" in this thread by me refers to post-eof
> speculative preallocation. The above preallocation tweaks were
> controlled in my tests via the allocsize mount option, not physical
> block preallocation.

I'm just using it to demonstrate the difference is in continually
extending the delalloc extent in memory. I could have just done an
overwrite - it's the same thing.

> > > FWIW, much of this
> > > discussion also makes me wonder how appropriate the current size limit
> > > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > > availability), as opposed to something larger (on the order of hundreds
> > > of MBs for example, perhaps 256-512MB).
> > 
> > RAM size is irrelevant. What matters is file size and the impact
> > of allocation patterns on writeback IO patterns. i.e. the size limit
> > is about optimising writeback, not preventing fragmentation or
> > making more efficient use of memory, etc.
> 
> I'm just suggesting that the more RAM that is available, the more we're
> able to write into cache before writeback starts and thus the larger
> physical extents we're able to allocate independent of speculative
> preallocation (perf issues notwithstanding).

ISTR I looked at that years ago and couldn't get it to work
reliably. It works well for initial ingest, but once the writes go
on for long enough dirty throttling starts chopping ingest up in
smaller and smaller chunks as it rotors writeback bandwidth around
all the processes dirtying the page cache. This chunking happens
regardless of the size of the file being written. And so the more
processes that are dirtying the page cache, the smaller the file
fragments get because each file gets a smaller amount of the overall
writeback bandwidth each time writeback occurs. i.e.
fragmentation increases as memory pressure, load and concurrency
increases, which are exactly the conditions we want to be avoiding
fragmentation as much as possible...
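
One way to observe that effect directly is to count how many extents
each file ends up with once writeback has chopped it up; FIEMAP in
count-only mode is enough for that. A minimal sketch (error handling
mostly omitted):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Print the number of mapped extents backing each file argument. */
int main(int argc, char **argv)
{
	struct fiemap *fm;
	int i, fd;

	fm = calloc(1, sizeof(*fm));
	if (!fm)
		return 1;

	for (i = 1; i < argc; i++) {
		fd = open(argv[i], O_RDONLY);
		if (fd < 0)
			continue;

		memset(fm, 0, sizeof(*fm));
		fm->fm_length = FIEMAP_MAX_OFFSET;
		fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
		fm->fm_extent_count = 0;		/* count only, no extent array */

		if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
			printf("%s: %u extents\n", argv[i], fm->fm_mapped_extents);
		close(fd);
	}
	free(fm);
	return 0;
}

Pointing that at the files written by a concurrent buffered write
workload makes it easy to check whether per-file extent counts really
do grow with load and concurrency as described above.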

The only way I found to prevent this in a fair and predictable
manner is the auto-grow algorithm we have now.  There are very few
real world corner cases where it breaks down, so we do not need
fundamental changes here. We've found one corner case where it is
defeated, so let's address that corner case with the minimal change
that is necessary but otherwise leave the underlying algorithm
alone so we can observe the longer term effects of the tweak we
need to make....

> > i.e. when we have lots of small files, we want writeback to pack
> > them so we get multiple-file sequentialisation of the write stream -
> > this makes things like untarring a kernel tarball (which is a large
> > number of small files) a sequential write workload rather than a
> > seek-per-individual-file-write workload. That makes a massive
> > difference to performance on spinning disks, and that's what the 64k
> > threshold (and post-EOF block removal on close for larger files)
> > tries to preserve.
> 
> Sure, but what's the downside to increasing that threshold to even
> something on the order of MBs? Wouldn't that at least help us leave
> larger free extents around in those workloads/patterns that do fragment
> free space?

Because even for files in the "few MB" size range, worst case
fragmentation is thousands of extents and IO performance that
absolutely sucks. Especially on RAID5/6 devices. We have to ensure
file fragmentation at its worst does not affect filesystem
throughput and that means in the general case less than ~1MB sized
extents is just not acceptable even for smallish files....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-15  7:23                           ` Dave Chinner
@ 2019-02-15 20:33                             ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2019-02-15 20:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 15, 2019 at 06:23:32PM +1100, Dave Chinner wrote:
> On Thu, Feb 14, 2019 at 09:35:21PM -0500, Brian Foster wrote:
> > On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > > > case for extsz hints.
> > > > > > > 
> > > > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > > > work for buffered writes, I simply made them work the same way as
> > > > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > > > 
> > > > > > > > So would the close time eofb trim be as
> > > > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > > > changed back to using delayed allocation?
> > > > > > > 
> > > > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > > > write-many, then we'd retain the post-eof blocks...
> > > > > > > 
> > > > > > > > I think a patch for that was
> > > > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > > > functionality which still had unresolved issues (IIRC).
> > > > > > > 
> > > > > > > *nod*
> > > > > > > 
> > > > > > > > From another angle, would a system that held files open for a
> > > > > > > > significant amount of time relative to a one-time write such that close
> > > > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > > > above?
> > > > > > > 
> > > > > > > If the file is held open for writing for a long time, we have to
> > > > > > > assume that they are going to write again (and again) so we should
> > > > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > > > gc, then there's nothing more we can really do in that case...
> > > > > > > 
> > > > > > 
> > > > > > I'm not necessarily sure that this condition is always a matter of
> > > > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > > > where we could end up doing a large number of one time file writes and
> > > > > > not doing the release time trim until an underlying extent (with
> > > > > > post-eof blocks) has been converted in more cases than not.
> > > > > 
> > > > > I'm not sure I follow what sort of workload and situation you are
> > > > > describing here. Are you talking about the effect of an EOFB gc pass
> > > > > during ongoing writes?
> > > > > 
> > > > 
> > > > I'm not sure if it's an actual reproducible situation.. I'm just
> > > > wondering out loud if there are normal workloads that might still defeat
> > > > a one-time trim at release time. For example, copy enough files in
> > > > parallel such that writeback touches most of them before the copies
> > > > complete and we end up trimming physical blocks rather than delalloc
> > > > blocks.
> > > > 
> > > > This is not so much of a problem if those files are large, I think,
> > > > because then the preallocs and the resulting trimmed free space is on
> > > > the larger side as well. If we're copying a bunch of little files with
> > > > small preallocs, however, then we put ourselves in the pathological
> > > > situation shown in Darrick's test.
> > > > 
> > > > I was originally thinking about whether this could happen or not on a
> > > > highly parallel small file copy workload, but having thought about it a
> > > > bit more I think there is a more simple example. What about an untar
> > > > like workload that creates small files and calls fsync() before each fd
> > > > is released?
> > > 
> > > Which is the same as an O_SYNC write if it's the same fd, which
> > > means we'll trim allocated blocks on close. i.e. it's no different
> > > to the current behaviour.  If such files are written in parallel then,
> > > again, it is no different to the existing behaviour. i.e. it
> > > largely depends on the timing of allocation in writeback and EOF
> > > block clearing in close(). If close happens before the next
> > > allocation in that AG, then they'll pack because there's no EOF
> > > blocks that push out the new allocation.  If it's the other way
> > > around, we get some level of freespace fragmentation.
> > > 
> > 
> > I know it's the same as the current behavior. ;P I think we're talking
> > past each other on this.
> 
> Probably :P
> 
> > What I'm saying is that the downside to the
> > current behavior is that a simple copy file -> fsync -> copy next file
> > workload fragments free space.
> 
> Yes. But it's also one of the cases that "always release on first
> close" fixes.
> 
> > Darrick demonstrated this better in his random size test with the
> > release time trim removed, but a simple loop to write one thousand 100k
> > files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
> > behavior:
> > 
> > # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
> >    from      to extents  blocks    pct
> >       1       1      18      18   0.00
> >      16      31       1      25   0.00
> >      32      63       1      58   0.00
> >  131072  262143     924 242197739  24.97
> >  262144  524287       1  365696   0.04
> > 134217728 242588672       3 727292183  74.99
> 
> That should not be leaving freespace fragments behind - it should be
> trimming the EOF blocks on close after fsync() and the next
> allocation should pack tightly.
> 
> /me goes off to trace it because it's not doing what he knows it
> should be doing.
> 
> Ngggh. Busy extent handling.
> 
> Basically, the extent we trimmed is busy because it is a user data
> extent that has been freed and not yet committed (even though it was
> not used), so it gets trimmed out of the free space range that is
> allocated.
> 
> IOWs, how we handle busy extents results in this behaviour, not the
> speculative prealloc which has already been removed and returned to
> the free space pool....
> 
> > vs. the same test without the fsync:
> > 
> > # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
> >    from      to extents  blocks    pct
> >       1       1      16      16   0.00
> >      16      31       1      20   0.00
> > 4194304 8388607       2 16752060   1.73
> > 134217728 242588672       4 953103627  98.27
> > 
> > Clearly there is an advantage to trimming before delalloc conversion.
> 
> The advantage is in avoiding busy extents by trimming before
> physical allocation occurs. I think we need to fix the busy extent
> handling here, not the speculative prealloc...
> 

Ok, that might be a reasonable approach for this particular pattern. It
might not help for other (probably less common) patterns. In any event,
I hope this at least shows why I'm fishing for further improvements..

> > Random thought: perhaps a one time trim at writeback (where writeback
> > would convert delalloc across eof) _or_ release time, whichever happens
> > first on an inode with preallocation, might help mitigate this problem.
> 
> Not sure how you'd do that reliably - there's no serialisation with
> incoming writes, and no context as to what is changing EOF. Hence I
> don't know how we'd decide that trimming was required or not...
> 

That's not that difficult to work around. If writeback can't reliably
add/remove blocks, it can make decisions about how many blocks to
convert in particular situations. A writeback time trim could be
implemented as a trimmed allocation/conversion that would also allow (or
refuse) a subsequent release to still perform the actual post-eof trim.
IOW, consider the approach as more of having a smarter post-eof
writeback delalloc conversion policy rather than exclusively attempting
to address post-eof blocks at release time.

A simple example could be something like not converting post-eof blocks
if the current writeback cycle is attempting to convert the only
delalloc extent in the file and/or if the extent extends from offset
zero to i_size (or whatever heuristic best describes a write once file),
otherwise behave exactly as we do today. That filters out smallish,
write-once files without requiring any changes to the buffered write
time speculative prealloc algorithm. It also could eliminate the need
for a release time trim because unused post-eof blocks are unlikely to
ever turn into physical blocks for small write once files, but I'd have
to think about that angle a bit more.
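
To make that a bit more concrete, the policy could be expressed as a
predicate consulted when writeback goes to convert a delalloc extent.
The sketch below is purely hypothetical (the type and fields don't
correspond to real XFS structures); it just restates the heuristic
described above:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the extent writeback is converting. */
struct wb_extent {
	uint64_t start;		/* byte offset where the delalloc extent starts */
	uint64_t end;		/* byte offset just past its last block */
	bool	 only_delalloc;	/* the only delalloc extent in the file */
};

/*
 * Sketch of the heuristic above: if this looks like a small write-once
 * file (its only delalloc extent covers offset 0 through i_size), skip
 * converting the post-EOF delalloc blocks; otherwise convert them
 * exactly as writeback does today.
 */
static bool convert_posteof_delalloc(const struct wb_extent *ext,
				     uint64_t i_size)
{
	if (ext->only_delalloc && ext->start == 0 && ext->end >= i_size)
		return false;

	return true;
}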

A tradeoff would be potential for higher fragmentation in cases where
writeback occurs on a single delalloc extent file before the file
finishes being copied, but we're only talking an extra extent or so if
it's a one-time "writeback allocation trim" and it's likely the
file is large for that to occur in the first place. IOW, the extra
extent should be sized at or around the amount of data we were able
to ingest into pagecache before writeback touched the inode.

This is of course just an idea and still subject to prototyping and
testing/experimentation and whatnot...

> > > I'm not looking for perfect here, just "better with no obvious
> > > regressions". We can't predict every situation, so if it deals with
> > > all the problems we've had reported and a few similar cases we don't
> > > currently handle as well, then we should run with that and not really
> > > worry about the cases that it (or the existing code) does not
> > > solve until we have evidence that those workloads exist and are
> > > causing real world problems. It's a difficult enough issue to reason
> > > about without making it more complex by playing "what about" games..
> > > 
> > 
> > That's fair, but we have had users run into this situation. The whole
> > sparse inodes thing is partially a workaround for side effects of this
> > problem (free space fragmentation being so bad we can't allocate
> > inodes). Granted, some of those users may have also been able to avoid
> > that problem with better fs usage/configuration.
> 
> I think systems that required sparse inodes are orthogonal - those
> issues could be caused just with well packed small files and large
> inodes, and had nothing in common with the workload we've recently
> seen. Indeed, in the cases other than the specific small file
> workload gluster used to fragment free space to prevent inode
> allocation, we never got to the bottom of what caused the free space
> fragmentation. All we could do is make the fs more tolerant of
> freespace fragmentation.
> 
> This time, we have direct evidence of what caused freespace
> fragmentation on this specific system. It was caused by the app
> doing something bizarre and we can extrapolate several similar
> behaviours from that workload.
> 
> But beyond that, we're well into "whatabout" and "whatif" territory.
> 
> > > > We still would trim it. The one time write case is essentially
> > > > unaffected because the only advantage of that heuristic is to trim eof
> > > > blocks before they are converted. If the eof blocks are real, the
> > > > release time heuristic has already failed (i.e., it hasn't provided any
> > > > benefit that background trim doesn't already provide).
> > > 
> > > No, all it means is that the blocks were allocated before the fd was
> > > closed, not that the release time heuristic failed. The release time
> > > heuristic is deciding what to do /after/ the writes have been
> > > completed, whatever the post-eof situation is. It /can't fail/ if it
> > > hasn't been triggered before physical allocation has been done, it
> > > can only decide what to do about those extents once it is called...
> > > 
> > 
> > Confused. By "failed," I mean we physically allocated blocks that were
> > never intended to be used.
> 
> How can we know they are never intended to be used at writeback
> time?
> 

We never really know if the blocks will be used or not. That's not the
point. I'm using the term "failed" simply to refer to the case where
post-eof blocks are unused and end up converted to physical blocks
before they are trimmed. I chose the term because that's the scenario
where the current heuristic contributes to free space fragmentation.
This has been demonstrated by Darrick's test and my more simplistic test
above. Feel free to choose a different term for that scenario, but it's
clearly a case that could stand to improve in XFS (whether it be via
something like the writeback time trim, busy extent fixes, etc.).

> > This is basically referring to the negative
> > effect of delalloc conversion -> eof trim behavior on once written files
> > demonstrated above. If this negative effect didn't exist, we wouldn't
> > need the release time trim at all and could just rely on background
> > trim.
> 
> We also cannot predict what the application intends. Hence we have
> heuristics that trigger once the application signals that it is
> "done with this file". i.e. it has closed the fd.
> 

Of course, that doesn't mean we can't try to make it smarter or dampen
negative side effects of the "failure" case.

> > > > I eventually realized that this
> > > > had other effects (i.e., one copy doing size updates vs. the other not
> > > > doing so) and just compared a fixed, full size (1G) preallocation with a
> > > > fixed 4k preallocation to reproduce the boost provided by the former.
> > > 
> > > That only matters for *writeback overhead*, not ingest efficiency.
> > > Indeed, using a preallocated extent:
> > > 
> > 
> > Not sure how we got into physical preallocation here. Assume any
> > reference to "preallocation" in this thread by me refers to post-eof
> > speculative preallocation. The above preallocation tweaks were
> > controlled in my tests via the allocsize mount option, not physical
> > block preallocation.
> 
> I'm just using it to demonstrate the difference is in continually
> extending the delalloc extent in memory. I could have just done an
> overwrite - it's the same thing.
> 
> > > > FWIW, much of this
> > > > discussion also makes me wonder how appropriate the current size limit
> > > > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > > > availability), as opposed to something larger (on the order of hundreds
> > > > of MBs for example, perhaps 256-512MB).
> > > 
> > > RAM size is irrelevant. What matters is file size and the impact
> > > of allocation patterns on writeback IO patterns. i.e. the size limit
> > > is about optimising writeback, not preventing fragmentation or
> > > making more efficient use of memory, etc.
> > 
> > I'm just suggesting that the more RAM that is available, the more we're
> > able to write into cache before writeback starts and thus the larger
> > physical extents we're able to allocate independent of speculative
> > preallocation (perf issues notwithstanding).
> 
> ISTR I looked at that years ago and couldn't get it to work
> reliably. It works well for initial ingest, but once the writes go
> on for long enough dirty throttling starts chopping ingest up in
> smaller and smaller chunks as it rotors writeback bandwidth around
> all the processes dirtying the page cache. This chunking happens
> regardless of the size of the file being written. And so the more
> processes that are dirtying the page cache, the smaller the file
> fragments get because each file gets a smaller amount of the overall
> writeback bandwidth each time writeback occurs. i.e.
> fragmentation increases as memory pressure, load and concurrency
> increases, which are exactly the conditions we want to be avoiding
> fragmentation as much as possible...
> 

Ok, well that suggests less aggressive preallocation is at least worth
considering/investigating.

> The only way I found to prevent this in a fair and predictable
> manner is the auto-grow algorithm we have now.  There are very few
> real world corner cases where it breaks down, so we do not need
> fundamental changes here. We've found one corner case where it is
> defeated, so let's address that corner case with the minimal change
> that is necessary but otherwise leave the underlying algorithm
> alone so we can observe the longer term effects of the tweak we
> need to make....
> 

I agree that an incremental fix is fine for now. I don't agree that the
current implementation is only defeated by corner cases. The write+fsync
example above is a pretty straightforward usage pattern. Userspace
filesystems are also increasingly common and may implement things like
open file caches that distort the filesystem's perception of how it's
being used, and thus could defeat a release time trim heuristic in
exactly the same manner. I recall that gluster had something like this,
but I'd have to dig into it to see exactly if it applies here..

> > > i.e. when we have lots of small files, we want writeback to pack
> > > them so we get multiple-file sequentialisation of the write stream -
> > > this makes things like untarring a kernel tarball (which is a large
> > > number of small files) a sequential write workload rather than a
> > > seek-per-individual-file-write workload. That makes a massive
> > > difference to performance on spinning disks, and that's what the 64k
> > > threshold (and post-EOF block removal on close for larger files)
> > > tries to preserve.
> > 
> > Sure, but what's the downside to increasing that threshold to even
> > something on the order of MBs? Wouldn't that at least help us leave
> > larger free extents around in those workloads/patterns that do fragment
> > free space?
> 
> Because even for files in the "few MB" size range, worst case
> fragmentation is thousands of extents and IO performance that
> absolutely sucks. Especially on RAID5/6 devices. We have to ensure
> file fragmentation at its worst does not affect filesystem
> throughput and that means in the general case less than ~1MB sized
> extents is just not acceptable even for smallish files....
> 

Sure, fragmentation is bad. That doesn't answer my question though.
Under what conditions would you expect less aggressive speculative
preallocation in the form of, say, a 1MB threshold to manifest in a level
of fragmentation "where performance absolutely sucks?" Highly parallel
small file (<1MB) copies in a low memory environment perhaps? If so, how
much is low memory? I would think that if we had gobs of RAM, a good
number of 1MB file copies would make it to cache completely before files
currently being written start ever being affected by writeback, but I
could be mistaken about reclaim/writeback behavior..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/3] xfs: EOF blocks are not busy extents
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
                   ` (3 preceding siblings ...)
  2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
@ 2019-02-18  2:26 ` Dave Chinner
  2019-02-20 15:12   ` Brian Foster
  4 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-18  2:26 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Userdata extents are considered as "busy extents" when they are
freed to ensure that they are not reallocated and written to before
the transaction that frees the user data extent has been committed
to disk.

However, in the case of post EOF blocks, these blocks have
never been exposed to user applications and so don't contain valid
data. Hence they don't need to be considered "busy" when they've
been freed because there is no data in them that can be destroyed
if they are reallocated and have data written to them before the
free transaction is committed.

We already have XFS_BMAPI_NODISCARD to extent freeing that the data
extent has never been used so it doesn't need discards issued on it.
This new functionality is just an extension of that concept - the
extent is actually unused, so doesn't even need to be marked busy.

Hence fix this by adding XFS_BMAPI_UNUSED and update the EOF block
data extent truncate with XFS_BMAPI_UNUSED and propagate that all
the way through the various structures and use it to avoid inserting
the extent into the busy list.

[ Note: this seems like a bit of a hack, but I just don't see the
point of inserting it into the busy list, then adding a new busy
list unused flag, then allowing every allocation type to use it in
the _trim code, then have every caller have to call _reuse to remove
the range from the busy tree. It just seems like complexity that
doesn't need to exist because anyone can reallocate an
unused extent for immediate use. ]

This avoids the problem of free space fragmentation when multiple
files are written sequentially  via synchronous writes or post-write
fsync calls before the next file is written. Previously, the
post-eof blocks were marked busy and couldn't be immediately
reallocated, resulting in the files packing poorly and unnecessarily
leaving free space between them.

Freespace fragmentation from sequential multi-file synchronous write
workload:

Before:
  from      to extents  blocks    pct
      1       1       7       7   0.00
      2       3      34      80   0.00
      4       7      65     345   0.01
      8      15     208    2417   0.05
     16      31     147    2982   0.06
     32      63       1      49   0.00
1048576 1310720       4 5185064  99.89

After:
   from      to extents  blocks    pct
      1       1       3       3   0.00
      2       3       1       3   0.00
1048576 1310720       4 5190871 100.00

Much better.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c  | 7 +++----
 fs/xfs/libxfs/xfs_bmap.c   | 6 +++---
 fs/xfs/libxfs/xfs_bmap.h   | 8 ++++++--
 fs/xfs/xfs_bmap_util.c     | 2 +-
 fs/xfs/xfs_trans_extfree.c | 6 +++---
 5 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 659bb9133955..729136ef0ed1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3014,7 +3014,7 @@ __xfs_free_extent(
 	xfs_extlen_t			len,
 	const struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type		type,
-	bool				skip_discard)
+	bool				unused)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_buf			*agbp;
@@ -3045,9 +3045,8 @@ __xfs_free_extent(
 	if (error)
 		goto err;
 
-	if (skip_discard)
-		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
-	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
+	if (!unused)
+		xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
 	return 0;
 
 err:
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 332eefa2700b..ba0fd80eede4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -537,7 +537,7 @@ __xfs_bmap_add_free(
 	xfs_fsblock_t			bno,
 	xfs_filblks_t			len,
 	const struct xfs_owner_info	*oinfo,
-	bool				skip_discard)
+	bool				unused)
 {
 	struct xfs_extent_free_item	*new;		/* new element */
 #ifdef DEBUG
@@ -565,7 +565,7 @@ __xfs_bmap_add_free(
 		new->xefi_oinfo = *oinfo;
 	else
 		new->xefi_oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
-	new->xefi_skip_discard = skip_discard;
+	new->xefi_unused = unused;
 	trace_xfs_bmap_free_defer(tp->t_mountp,
 			XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0,
 			XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
@@ -5069,7 +5069,7 @@ xfs_bmap_del_extent_real(
 		} else {
 			__xfs_bmap_add_free(tp, del->br_startblock,
 					del->br_blockcount, NULL,
-					(bflags & XFS_BMAPI_NODISCARD) ||
+					(bflags & XFS_BMAPI_UNUSED) ||
 					del->br_state == XFS_EXT_UNWRITTEN);
 		}
 	}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 09d3ea97cc15..33fb95f84ea0 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -54,7 +54,7 @@ struct xfs_extent_free_item
 	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
 	struct list_head	xefi_list;
 	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
-	bool			xefi_skip_discard;
+	bool			xefi_unused;
 };
 
 #define	XFS_BMAP_MAX_NMAP	4
@@ -107,6 +107,9 @@ struct xfs_extent_free_item
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	0x2000
 
+/* Unused freed data extent, no need to mark busy */
+#define XFS_BMAPI_UNUSED	0x4000
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -120,7 +123,8 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_DELALLOC,	"DELALLOC" }, \
 	{ XFS_BMAPI_CONVERT_ONLY, "CONVERT_ONLY" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
-	{ XFS_BMAPI_NORMAP,	"NORMAP" }
+	{ XFS_BMAPI_NORMAP,	"NORMAP" }, \
+	{ XFS_BMAPI_UNUSED,	"UNUSED" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index af2e30d33794..f5a8a4385512 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -841,7 +841,7 @@ xfs_free_eofblocks(
 		 * may be full of holes (ie NULL files bug).
 		 */
 		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
-					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
+					XFS_ISIZE(ip), XFS_BMAPI_UNUSED);
 		if (error) {
 			/*
 			 * If we get an error at this point we simply don't
diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
index 0710434eb240..d06fb2cd6ffb 100644
--- a/fs/xfs/xfs_trans_extfree.c
+++ b/fs/xfs/xfs_trans_extfree.c
@@ -58,7 +58,7 @@ xfs_trans_free_extent(
 	xfs_fsblock_t			start_block,
 	xfs_extlen_t			ext_len,
 	const struct xfs_owner_info	*oinfo,
-	bool				skip_discard)
+	bool				unused)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent		*extp;
@@ -71,7 +71,7 @@ xfs_trans_free_extent(
 	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
 
 	error = __xfs_free_extent(tp, start_block, ext_len,
-				  oinfo, XFS_AG_RESV_NONE, skip_discard);
+				  oinfo, XFS_AG_RESV_NONE, unused);
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
 	 * transaction is aborted, which:
@@ -184,7 +184,7 @@ xfs_extent_free_finish_item(
 	error = xfs_trans_free_extent(tp, done_item,
 			free->xefi_startblock,
 			free->xefi_blockcount,
-			&free->xefi_oinfo, free->xefi_skip_discard);
+			&free->xefi_oinfo, free->xefi_unused);
 	kmem_free(free);
 	return error;
 }

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/3] xfs: EOF blocks are not busy extents
  2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
@ 2019-02-20 15:12   ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2019-02-20 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Feb 18, 2019 at 01:26:19PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Userdata extents are considered as "busy extents" when they are
> freed to ensure that they are not reallocated and written to before
> the transaction that frees the user data extent has been committed
> to disk.
> 
> However, in the case of post EOF blocks, these blocks have
> never been exposed to user applications and so don't contain valid
> data. Hence they don't need to be considered "busy" when they've
> been freed because there is no data in them that can be destroyed
> if they are reallocated and have data written to them before the
> free transaction is committed.
> 
> We already have XFS_BMAPI_NODISCARD to extent freeing that the data

"to extent freeing" ?

> extent has never been used so it doesn't need discards issued on it.
> This new functionality is just an extension of that concept - the
> extent is actually unused, so doesn't even need to be marked busy.
> 
> Hence fix this by adding XFS_BMAPI_UNUSED and update the EOF block
> data extent truncate with XFS_BMAPI_UNUSED and propagate that all
> the way through the various structures and use it to avoid inserting
> the extent into the busy list.
> 
> [ Note: this seems like a bit of a hack, but I just don't see the
> point of inserting it into the busy list, then adding a new busy
> list unused flag, then allowing every allocation type to use it in
> the _trim code, then have every caller have to call _reuse to remove
> the range from the busy tree. It just seems like complexity that
> doesn't need to exist because anyone can reallocate an
> unused extent for immediate use. ]
> 

Yeah, I'm not a fan of adding that kind of noop code to achieve behavior
that's equivalent to bypassing subsystems or removing unnecessary code.

> This avoids the problem of free space fragmentation when multiple
> files are written sequentially  via synchronous writes or post-write
> fsync calls before the next file is written. Previously, the
> post-eof blocks were marked busy and couldn't be immediately
> reallocated, resulting in the files packing poorly and unnecessarily
> leaving free space between them.
> 
> Freespace fragmentation from sequential multi-file synchronous write
> workload:
> 
> Before:
>   from      to extents  blocks    pct
>       1       1       7       7   0.00
>       2       3      34      80   0.00
>       4       7      65     345   0.01
>       8      15     208    2417   0.05
>      16      31     147    2982   0.06
>      32      63       1      49   0.00
> 1048576 1310720       4 5185064  99.89
> 
> After:
>    from      to extents  blocks    pct
>       1       1       3       3   0.00
>       2       3       1       3   0.00
> 1048576 1310720       4 5190871 100.00
> 
> Much better.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

This seems reasonable to me in principle. Some comments...

>  fs/xfs/libxfs/xfs_alloc.c  | 7 +++----
>  fs/xfs/libxfs/xfs_bmap.c   | 6 +++---
>  fs/xfs/libxfs/xfs_bmap.h   | 8 ++++++--
>  fs/xfs/xfs_bmap_util.c     | 2 +-
>  fs/xfs/xfs_trans_extfree.c | 6 +++---
>  5 files changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 659bb9133955..729136ef0ed1 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -3014,7 +3014,7 @@ __xfs_free_extent(
>  	xfs_extlen_t			len,
>  	const struct xfs_owner_info	*oinfo,
>  	enum xfs_ag_resv_type		type,
> -	bool				skip_discard)
> +	bool				unused)
>  {
>  	struct xfs_mount		*mp = tp->t_mountp;
>  	struct xfs_buf			*agbp;
> @@ -3045,9 +3045,8 @@ __xfs_free_extent(
>  	if (error)
>  		goto err;
>  
> -	if (skip_discard)
> -		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> -	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
> +	if (!unused)
> +		xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);

busy_flags now serves no purpose in this function (it's hardcoded to
zero).
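
i.e. presumably busy_flags could just be dropped and the call end up
as something like (a sketch of the quoted hunk only, not a tested
change):

	if (!unused)
		xfs_extent_busy_insert(tp, agno, agbno, len, 0);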

>  	return 0;
>  
>  err:
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 332eefa2700b..ba0fd80eede4 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -537,7 +537,7 @@ __xfs_bmap_add_free(
>  	xfs_fsblock_t			bno,
>  	xfs_filblks_t			len,
>  	const struct xfs_owner_info	*oinfo,
> -	bool				skip_discard)
> +	bool				unused)
>  {
>  	struct xfs_extent_free_item	*new;		/* new element */
>  #ifdef DEBUG
> @@ -565,7 +565,7 @@ __xfs_bmap_add_free(
>  		new->xefi_oinfo = *oinfo;
>  	else
>  		new->xefi_oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
> -	new->xefi_skip_discard = skip_discard;
> +	new->xefi_unused = unused;
>  	trace_xfs_bmap_free_defer(tp->t_mountp,
>  			XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0,
>  			XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
> @@ -5069,7 +5069,7 @@ xfs_bmap_del_extent_real(
>  		} else {
>  			__xfs_bmap_add_free(tp, del->br_startblock,
>  					del->br_blockcount, NULL,
> -					(bflags & XFS_BMAPI_NODISCARD) ||
> +					(bflags & XFS_BMAPI_UNUSED) ||
>  					del->br_state == XFS_EXT_UNWRITTEN);

Note that this does technically change unrelated behavior as we now
consider unwritten extents as unused (along with post-eof blocks). I
think that's still fine for similar reasons as for eofblocks, but this
should at least be noted in the commit log.

>  		}
>  	}
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 09d3ea97cc15..33fb95f84ea0 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -54,7 +54,7 @@ struct xfs_extent_free_item
>  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
>  	struct list_head	xefi_list;
>  	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
> -	bool			xefi_skip_discard;
> +	bool			xefi_unused;
>  };
>  
>  #define	XFS_BMAP_MAX_NMAP	4
> @@ -107,6 +107,9 @@ struct xfs_extent_free_item
>  /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>  #define XFS_BMAPI_NORMAP	0x2000
>  
> +/* Unused freed data extent, no need to mark busy */
> +#define XFS_BMAPI_UNUSED	0x4000
> +
>  #define XFS_BMAPI_FLAGS \
>  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> @@ -120,7 +123,8 @@ struct xfs_extent_free_item
>  	{ XFS_BMAPI_DELALLOC,	"DELALLOC" }, \
>  	{ XFS_BMAPI_CONVERT_ONLY, "CONVERT_ONLY" }, \
>  	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
> -	{ XFS_BMAPI_NORMAP,	"NORMAP" }
> +	{ XFS_BMAPI_NORMAP,	"NORMAP" }, \
> +	{ XFS_BMAPI_UNUSED,	"UNUSED" }

We can kill XFS_BMAPI_NODISCARD since _UNUSED replaces the only usage of
it. I suppose we might as well just rename it as you've done for the
related plumbing.

Brian

>  
>  
>  static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index af2e30d33794..f5a8a4385512 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -841,7 +841,7 @@ xfs_free_eofblocks(
>  		 * may be full of holes (ie NULL files bug).
>  		 */
>  		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
> -					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
> +					XFS_ISIZE(ip), XFS_BMAPI_UNUSED);
>  		if (error) {
>  			/*
>  			 * If we get an error at this point we simply don't
> diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> index 0710434eb240..d06fb2cd6ffb 100644
> --- a/fs/xfs/xfs_trans_extfree.c
> +++ b/fs/xfs/xfs_trans_extfree.c
> @@ -58,7 +58,7 @@ xfs_trans_free_extent(
>  	xfs_fsblock_t			start_block,
>  	xfs_extlen_t			ext_len,
>  	const struct xfs_owner_info	*oinfo,
> -	bool				skip_discard)
> +	bool				unused)
>  {
>  	struct xfs_mount		*mp = tp->t_mountp;
>  	struct xfs_extent		*extp;
> @@ -71,7 +71,7 @@ xfs_trans_free_extent(
>  	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
>  
>  	error = __xfs_free_extent(tp, start_block, ext_len,
> -				  oinfo, XFS_AG_RESV_NONE, skip_discard);
> +				  oinfo, XFS_AG_RESV_NONE, unused);
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
>  	 * transaction is aborted, which:
> @@ -184,7 +184,7 @@ xfs_extent_free_finish_item(
>  	error = xfs_trans_free_extent(tp, done_item,
>  			free->xefi_startblock,
>  			free->xefi_blockcount,
> -			&free->xefi_oinfo, free->xefi_skip_discard);
> +			&free->xefi_oinfo, free->xefi_unused);
>  	kmem_free(free);
>  	return error;
>  }

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-02-20 15:12 UTC | newest]

Thread overview: 24+ messages
2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
2019-02-07 15:51   ` Brian Foster
2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
2019-02-07  5:19   ` Dave Chinner
2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
2019-02-07  5:39   ` Dave Chinner
2019-02-07 15:52     ` Brian Foster
2019-02-08  2:47       ` Dave Chinner
2019-02-08 12:34         ` Brian Foster
2019-02-12  1:13           ` Darrick J. Wong
2019-02-12 11:46             ` Brian Foster
2019-02-12 20:21               ` Dave Chinner
2019-02-13 13:50                 ` Brian Foster
2019-02-13 22:27                   ` Dave Chinner
2019-02-14 13:00                     ` Brian Foster
2019-02-14 21:51                       ` Dave Chinner
2019-02-15  2:35                         ` Brian Foster
2019-02-15  7:23                           ` Dave Chinner
2019-02-15 20:33                             ` Brian Foster
2019-02-08 16:29         ` Darrick J. Wong
2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
2019-02-20 15:12   ` Brian Foster
