* [RFC PATCH 0/3]: Extreme fragmentation ahoy!
@ 2019-02-07  5:08 Dave Chinner
  2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

I've just finished analysing an IO trace from an application
generating an extreme filesystem fragmentation problem that started
with extent size hints and ended with spurious ENOSPC reports due to
massively fragmented files and free space. While the ENOSPC issue
looks to have previously been solved, I still wanted to understand
how the application had so comprehensively defeated extent size
hints as a method of avoiding file fragmentation.

The key behaviour that I discovered was that specific "append write
only" files that had extent size hints to prevent fragmentation
weren't actually write only.  The application didn't do a lot of
writes to the file, but it kept the file open and appended to the
file (from the traces I have) in chunks of between ~3000 bytes and
~160000 bytes. This didn't explain the problem. I did notice that
the files were opened O_SYNC, however.
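To make that pattern concrete, the writer side looks roughly like the
sketch below. This is reconstructed from the trace description, not
taken from the application; the path, chunk sizes and cadence are
illustrative only.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char	buf[160000];
	int		fd;

	memset(buf, 'x', sizeof(buf));

	/* opened once, kept open, appended to with O_SYNC writes */
	fd = open("/mnt/scratch/app.log",
		  O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
	if (fd < 0)
		return 1;

	for (;;) {
		/* appends of between ~3000 and ~160000 bytes */
		size_t len = 3000 + (size_t)(rand() % 157000);

		if (write(fd, buf, len) < 0)
			break;
		sleep(1);	/* actual cadence unknown */
	}
	close(fd);
	return 0;
}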

I then found another process that, once every second, opened the
log file O_RDONLY, read 28 bytes from offset zero, then closed the
file. Every second. IOWs, between every appending write that would
allocate an extent size hint worth of space beyond EOF and then
write a small chunk of it, there were numerous open/read/close
cycles being done on the same file.
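And the once-per-second reader, again reconstructed from the trace
with an illustrative path:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char	hdr[28];
	int	fd;

	for (;;) {
		/* open O_RDONLY, read 28 bytes from offset zero, close */
		fd = open("/mnt/scratch/app.log", O_RDONLY);
		if (fd >= 0) {
			ssize_t n = pread(fd, hdr, sizeof(hdr), 0);

			(void)n;	/* contents don't matter here */
			close(fd);	/* -> xfs_release() */
		}
		sleep(1);
	}
	return 0;
}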

And what do we do on close()? We call xfs_release() and that can
truncate away blocks beyond EOF. For some reason the close wasn't
triggering the IDIRTY_RELEASE heuristic that prevented close from
removing EOF blocks prematurely. Then I realised that O_SYNC writes
don't leave delayed allocation blocks behind - they are always
converted in the context of the write. That's why it wasn't
triggering, and that meant that the open/read/close cycle was
removing the extent size hint allocation beyond EOF prematurely.

Then it occurred to me that extent size hints don't use delalloc
either, so they behave the same way as O_SYNC writes in this
situation.

Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
modify the file without having write permissions.

I suspect there are more cases like this involving repeated
open/<do_something>/close operations on a file that is being
written, but the patches address just the cases described above.
The test script to reproduce them is below. Fragmentation
reduction results are in the commit descriptions. The series has
been running through fstests for a couple of hours now; no issues
have been noticed yet.

FWIW, I suspect we need to have a good hard think about whether we
should be trimming EOF blocks on close by default, or whether we
should only be doing it in very limited situations....

Comments, thoughts, flames welcome.

-Dave.


#!/bin/bash
#
# Test 1
#
# Write multiple files in parallel using synchronous buffered writes. Aim is to
# interleave allocations to fragment the files. Synchronous writes defeat the
# open/write/close heuristics in xfs_release() that prevent EOF block removal,
# so this should fragment badly.

workdir=/mnt/scratch
nfiles=8
wsize=4096
wcnt=1000

echo
echo "Test 1: sync write fragmentation counts"
echo
write_sync_file()
{
	idx=$1

	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -s -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$nfiles; n++)); do
	write_sync_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;


# Test 2
#
# Same as test 1, but instead of sync writes, use extent size hints to defeat
# the open/write/close heuristic

extent_size=16m

echo
echo "Test 2: Extent size hint fragmentation counts"
echo

write_extsz_file()
{
	idx=$1

	xfs_io -f -c "extsize $extent_size" $workdir/file.$idx
	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$nfiles; n++)); do
	write_extsz_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;



# Test 3
#
# Same as test 2, but instead of extent size hints, use open/read/close loops
# on the files to remove EOF blocks.

echo
echo "Test 3: Open/read/close loop fragmentation counts"
echo

write_file()
{
	idx=$1

	xfs_io -f -s -c "pwrite -b 64k 0 50m" $workdir/file.$idx
}

read_file()
{
	idx=$1

	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -r -c "pread 0 28" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$((nfiles * 4)); n++)); do
	write_file $n > /dev/null 2>&1 &
	read_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;


* [PATCH 1/3] xfs: Don't free EOF blocks on sync write close
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
@ 2019-02-07  5:08 ` Dave Chinner
  2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we have a workload that does open/sync write/close in parallel
with other allocation, the file becomes rapidly fragmented. This is
due to close() calling xfs_release() and removing the speculative
preallocation beyond EOF.

The existing open/write/close heuristic in xfs_release() does not
catch this because sync writes do not leave delayed allocation blocks
on the inode for later writeback that xfs_release() could detect, and
hence XFS_IDIRTY_RELEASE never gets set.

In xfs_file_release() we can tell whether the release context was a
synchronous write context, so communicate this to xfs_release() and
have it skip EOF block truncation in that case. This defers EOF block
cleanup for synchronous write contexts to the background EOF block
cleaner, which will clean it up within a few minutes.
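Note that testing f_flags for O_DSYNC also covers files opened O_SYNC
(the case described in the cover letter): on Linux O_SYNC is defined
as __O_SYNC|O_DSYNC, so an O_SYNC open always has the O_DSYNC bit
set. A trivial userspace check of that assumption:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	/* On Linux, O_SYNC == __O_SYNC | O_DSYNC, so the O_DSYNC check
	 * added below fires for both O_DSYNC and O_SYNC opens. */
	printf("O_DSYNC          = %#x\n", O_DSYNC);
	printf("O_SYNC           = %#x\n", O_SYNC);
	printf("O_SYNC & O_DSYNC = %#x\n", O_SYNC & O_DSYNC);
	return 0;
}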

Before:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 919
/mnt/scratch/file.1: 916
/mnt/scratch/file.2: 919
/mnt/scratch/file.3: 920
/mnt/scratch/file.4: 920
/mnt/scratch/file.5: 921
/mnt/scratch/file.6: 916
/mnt/scratch/file.7: 918

After:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 24
/mnt/scratch/file.1: 24
/mnt/scratch/file.2: 11
/mnt/scratch/file.3: 24
/mnt/scratch/file.4: 3
/mnt/scratch/file.5: 24
/mnt/scratch/file.6: 24
/mnt/scratch/file.7: 23

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 15 +++++++++++++--
 fs/xfs/xfs_inode.c |  9 +++++----
 fs/xfs/xfs_inode.h |  2 +-
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..02f76b8e6c03 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1019,12 +1019,23 @@ xfs_dir_open(
 	return error;
 }
 
+/*
+ * When we release the file, we don't want it to trim EOF blocks for synchronous
+ * write contexts as this leads to severe fragmentation when applications do
+ * repeated open/appending sync write/close to a file amongst other file IO.
+ */
 STATIC int
 xfs_file_release(
 	struct inode	*inode,
-	struct file	*filp)
+	struct file	*file)
 {
-	return xfs_release(XFS_I(inode));
+	bool		free_eof_blocks = true;
+
+	if ((file->f_mode & FMODE_WRITE) &&
+	    (file->f_flags & O_DSYNC))
+		free_eof_blocks = false;
+
+	return xfs_release(XFS_I(inode), free_eof_blocks);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ae667ba74a1c..a74dc240697f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1603,10 +1603,11 @@ xfs_itruncate_extents_flags(
 
 int
 xfs_release(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip,
+	bool			can_free_eofblocks)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	int		error;
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
 	if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0))
 		return 0;
@@ -1642,7 +1643,7 @@ xfs_release(
 	if (VFS_I(ip)->i_nlink == 0)
 		return 0;
 
-	if (xfs_can_free_eofblocks(ip, false)) {
+	if (can_free_eofblocks && xfs_can_free_eofblocks(ip, false)) {
 
 		/*
 		 * Check if the inode is being opened, written and closed
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index be2014520155..7e59b0e086d7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -397,7 +397,7 @@ enum layout_break_reason {
 	(((pip)->i_mount->m_flags & XFS_MOUNT_GRPID) || \
 	 (VFS_I(pip)->i_mode & S_ISGID))
 
-int		xfs_release(struct xfs_inode *ip);
+int		xfs_release(struct xfs_inode *ip, bool can_free_eofblocks);
 void		xfs_inactive(struct xfs_inode *ip);
 int		xfs_lookup(struct xfs_inode *dp, struct xfs_name *name,
 			   struct xfs_inode **ipp, struct xfs_name *ci_name);
-- 
2.20.1


* [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
  2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
@ 2019-02-07  5:08 ` Dave Chinner
  2019-02-07 15:51   ` Brian Foster
  2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

When we have a workload that does open/write/close on files with
extent size hints set in parallel with other allocation, the file
becomes rapidly fragmented. This is due to close() calling
xfs_release() and removing the preallocated extent beyond EOF.  This
occurs for both buffered and direct writes that append to files with
extent size hints.

The existing open/write/close heuristic in xfs_release() does not
catch this as writes to files using extent size hints do not use
delayed allocation and hence do not leave delayed allocation blocks
allocated on the inode that can be detected in xfs_release(). Hence
XFS_IDIRTY_RELEASE never gets set.

At release time we can tell whether the inode has extent size hints
set and skip EOF block truncation in that case. We add this check to
xfs_can_free_eofblocks() so that the post-EOF preallocated extent is
treated like intentional preallocation and is persistent unless
directly removed by userspace.
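For reference, the extent size hints in question are the per-inode
hints set with xfs_io's "extsize" command, as in the cover letter's
test script, or directly via the FS_IOC_FSSETXATTR ioctl. A minimal
illustrative sketch follows; the path is arbitrary, the 16MB hint
mirrors the test script, and the hint is normally set while the file
is still empty.

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	struct fsxattr	fsx;
	int		fd;

	/* set the hint before any data is written to the file */
	fd = open("/mnt/scratch/file.0", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return 1;

	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = 16 * 1024 * 1024;	/* hint is in bytes */

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		perror("FS_IOC_FSSETXATTR");

	close(fd);
	return 0;
}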

Before:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 1002
/mnt/scratch/file.1: 1002
/mnt/scratch/file.2: 1002
/mnt/scratch/file.3: 1002
/mnt/scratch/file.4: 1002
/mnt/scratch/file.5: 1002
/mnt/scratch/file.6: 1002
/mnt/scratch/file.7: 1002

After:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 4
/mnt/scratch/file.1: 4
/mnt/scratch/file.2: 4
/mnt/scratch/file.3: 4
/mnt/scratch/file.4: 4
/mnt/scratch/file.5: 4
/mnt/scratch/file.6: 4
/mnt/scratch/file.7: 4

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1ee8c5539fa4..98e5e305b789 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -761,12 +761,15 @@ xfs_can_free_eofblocks(struct xfs_inode *ip, bool force)
 		return false;
 
 	/*
-	 * Do not free real preallocated or append-only files unless the file
-	 * has delalloc blocks and we are forced to remove them.
+	 * Do not free extent size hints, real preallocated or append-only files
+	 * unless the file has delalloc blocks and we are forced to remove
+	 * them.
 	 */
-	if (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
+	if (xfs_get_extsz_hint(ip) ||
+	    (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))) {
 		if (!force || ip->i_delayed_blks == 0)
 			return false;
+	}
 
 	return true;
 }
-- 
2.20.1


* [PATCH 3/3] xfs: Don't free EOF blocks on sync write close
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
  2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
  2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
@ 2019-02-07  5:08 ` Dave Chinner
  2019-02-07  5:19   ` Dave Chinner
  2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
  2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
  4 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:08 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we have a workload that does open/read/close in parallel with
other synchronous buffered writes to long term open files, the file
becomes rapidly fragmented. This is due to close() after read
calling xfs_release() and removing the speculative preallocation
beyond EOF.

The existing open/write/close heuristic in xfs_release() does not
catch this because sync writes do not leave delayed allocation blocks
on the inode for later writeback that xfs_release() could detect, and
hence XFS_IDIRTY_RELEASE never gets set.

Further, the close context here is for a file opened O_RDONLY, and
so /modifying/ the file metadata on close doesn't pass muster.
Fortunately, we can tell in xfs_file_release() whether the release
context was a read-only context, and so we need to communicate this
to xfs_release() so it can do the right thing here and skip EOF
block truncation, hence ensuring that only contexts with write
permissions will remove post-EOF blocks from the file.

Before:

Test 3: Open/read/close loop fragmentation counts

/mnt/scratch/file.0: 150
/mnt/scratch/file.1: 342
/mnt/scratch/file.2: 113
/mnt/scratch/file.3: 165
/mnt/scratch/file.4: 86
/mnt/scratch/file.5: 363
/mnt/scratch/file.6: 129
/mnt/scratch/file.7: 233

After:

Test 3: Open/read/close loop fragmentation counts

/mnt/scratch/file.0: 12
/mnt/scratch/file.1: 12
/mnt/scratch/file.2: 12
/mnt/scratch/file.3: 12
/mnt/scratch/file.4: 12
/mnt/scratch/file.5: 12
/mnt/scratch/file.6: 12
/mnt/scratch/file.7: 12

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 02f76b8e6c03..e2d8a0b7f891 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1023,6 +1023,10 @@ xfs_dir_open(
  * When we release the file, we don't want it to trim EOF blocks for synchronous
  * write contexts as this leads to severe fragmentation when applications do
  * repeated open/appending sync write/close to a file amongst other file IO.
+ *
+ * We also don't want to trim the EOF blocks if it is a read only context. This
+ * prevents open/read/close workloads from removing EOF blocks that other
+ * writers are depending on to prevent fragmentation.
  */
 STATIC int
 xfs_file_release(
@@ -1031,8 +1035,9 @@ xfs_file_release(
 {
 	bool		free_eof_blocks = true;
 
-	if ((file->f_mode & FMODE_WRITE) &&
-	    (file->f_flags & O_DSYNC))
+	if ((file->f_mode & (FMODE_WRITE|FMODE_READ)) == FMODE_READ)
+		free_eof_blocks = false;
+	else if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_DSYNC))
 		free_eof_blocks = false;
 
 	return xfs_release(XFS_I(inode), free_eof_blocks);
-- 
2.20.1


* Re: [PATCH 3/3] xfs: Don't free EOF blocks on sync write close
  2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
@ 2019-02-07  5:19   ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:19 UTC (permalink / raw)
  To: linux-xfs


Ugh, forgot to rename the patch. Should be:

Subject: [PATCH 0/3] xfs: Don't free EOF blocks on O_RDONLY close

On Thu, Feb 07, 2019 at 04:08:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we have a workload that does open/read/close in parallel with
> other synchronous buffered writes to long term open files, the file
> becomes rapidly fragmented. This is due to close() after read
> calling xfs_release() and removing the speculative preallocation
> beyond EOF.
> 
> The existing open/write/close hueristic in xfs_release() does not
> catch this as sync writes do not leave delayed allocation blocks
> allocated on the inode for later writeback that can be detected in
> xfs_release() and hence XFS_IDIRTY_RELEASE never gets set.
> 
> Further, the close context here is for a file opened O_RDONLY, and
> so /modifying/ the file metadata on close doesn't pass muster.
> Fortunately, we can tell in xfs_file_release() whether the release
> context was a read-only context, and so we need to communicate this
> to xfs_release() so it can do the right thing here and skip EOF
> block truncation, hence ensuring that only contexts with write
> permissions will remove post-EOF blocks from the file.
> 
> Before:
> 
> Test 3: Open/read/close loop fragmentation counts
> 
> /mnt/scratch/file.0: 150
> /mnt/scratch/file.1: 342
> /mnt/scratch/file.2: 113
> /mnt/scratch/file.3: 165
> /mnt/scratch/file.4: 86
> /mnt/scratch/file.5: 363
> /mnt/scratch/file.6: 129
> /mnt/scratch/file.7: 233
> 
> After:
> 
> Test 3: Open/read/close loop fragmentation counts
> 
> /mnt/scratch/file.0: 12
> /mnt/scratch/file.1: 12
> /mnt/scratch/file.2: 12
> /mnt/scratch/file.3: 12
> /mnt/scratch/file.4: 12
> /mnt/scratch/file.5: 12
> /mnt/scratch/file.6: 12
> /mnt/scratch/file.7: 12
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 02f76b8e6c03..e2d8a0b7f891 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1023,6 +1023,10 @@ xfs_dir_open(
>   * When we release the file, we don't want it to trim EOF blocks for synchronous
>   * write contexts as this leads to severe fragmentation when applications do
>   * repeated open/appending sync write/close to a file amongst other file IO.
> + *
> + * We also don't want to trim the EOF blocks if it is a read only context. This
> + * prevents open/read/close workloads from removing EOF blocks that other
> + * writers are depending on to prevent fragmentation.
>   */
>  STATIC int
>  xfs_file_release(
> @@ -1031,8 +1035,9 @@ xfs_file_release(
>  {
>  	bool		free_eof_blocks = true;
>  
> -	if ((file->f_mode & FMODE_WRITE) &&
> -	    (file->f_flags & O_DSYNC))
> +	if ((file->f_mode & FMODE_WRITE|FMODE_READ) == FMODE_READ)
> +		free_eof_blocks = false;
> +	else if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_DSYNC))
>  		free_eof_blocks = false;
>  
>  	return xfs_release(XFS_I(inode), free_eof_blocks);
> -- 
> 2.20.1
> 
> 

-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
                   ` (2 preceding siblings ...)
  2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
@ 2019-02-07  5:21 ` Darrick J. Wong
  2019-02-07  5:39   ` Dave Chinner
  2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
  4 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2019-02-07  5:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> I've just finished analysing an IO trace from a application
> generating an extreme filesystem fragmentation problem that started
> with extent size hints and ended with spurious ENOSPC reports due to
> massively fragmented files and free space. While the ENOSPC issue
> looks to have previously been solved, I still wanted to understand
> how the application had so comprehensively defeated extent size
> hints as a method of avoiding file fragmentation.
> 
> The key behaviour that I discovered was that specific "append write
> only" files that had extent size hints to prevent fragmentation
> weren't actually write only.  The application didn't do a lot of
> writes to the file, but it kept the file open and appended to the
> file (from the traces I have) in chunks of between ~3000 bytes and
> ~160000 bytes. This didn't explain the problem. I did notice that
> the files were opened O_SYNC, however.
> 
> I then found was another process that, once every second, opened the
> log file O_RDONLY, read 28 bytes from offset zero, then closed the
> file. Every second. IOWs, between every appending write that would
> allocate an extent size hint worth of space beyond EOF and then
> write a small chunk of it, there were numerous open/read/close
> cycles being done on the same file.
> 
> And what do we do on close()? We call xfs_release() and that can
> truncate away blocks beyond EOF. For some reason the close wasn't
> triggering the IDIRTY_RELEASE heuristic that preventd close from
> removing EOF blocks prematurely. Then I realised that O_SYNC writes
> don't leave delayed allocation blocks behind - they are always
> converted in the context of the write. That's why it wasn't
> triggering, and that meant that the open/read/close cycle was
> removing the extent size hint allocation beyond EOF prematurely.

<urk>

> Then it occurred to me that extent size hints don't use delalloc
> either, so they behave the same was as O_SYNC writes in this
> situation.
> 
> Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> modify the file without having write permissions.

Yikes!

> I suspect there's more cases like this when combined with repeated
> open/<do_something>/close operations on a file that is being
> written, but the patches address just these ones I just talked
> about. The test script to reproduce them is below. Fragmentation
> reduction results are in the commit descriptions. It's running
> through fstests for a couple of hours now, no issues have been
> noticed yet.
> 
> FWIW, I suspect we need to have a good hard think about whether we
> should be trimming EOF blocks on close by default, or whether we
> should only be doing it in very limited situations....
> 
> Comments, thoughts, flames welcome.
> 
> -Dave.
> 
> 
> #!/bin/bash
> #
> # Test 1

Can you please turn these into fstests to cause the maintainer maximal
immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
for regression testing and finding whatever other problems we can find
from digging deeper? :)

--D

> #
> # Write multiple files in parallel using synchronous buffered writes. Aim is to
> # interleave allocations to fragment the files. Synchronous writes defeat the
> # open/write/close heuristics in xfs_release() that prevent EOF block removal,
> # so this should fragment badly.
> 
> workdir=/mnt/scratch
> nfiles=8
> wsize=4096
> wcnt=1000
> 
> echo
> echo "Test 1: sync write fragmentation counts"
> echo
> write_sync_file()
> {
> 	idx=$1
> 
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -s -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$nfiles; n++)); do
> 	write_sync_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 
> # Test 2
> #
> # Same as test 1, but instead of sync writes, use extent size hints to defeat
> # the open/write/close heuristic
> 
> extent_size=16m
> 
> echo
> echo "Test 2: Extent size hint fragmentation counts"
> echo
> 
> write_extsz_file()
> {
> 	idx=$1
> 
> 	xfs_io -f -c "extsize $extent_size" $workdir/file.$idx
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$nfiles; n++)); do
> 	write_extsz_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 
> 
> # Test 3
> #
> # Same as test 2, but instead of extent size hints, use open/read/close loops
> # on the files to remove EOF blocks.
> 
> echo
> echo "Test 3: Open/read/close loop fragmentation counts"
> echo
> 
> write_file()
> {
> 	idx=$1
> 
> 	xfs_io -f -s -c "pwrite -b 64k 0 50m" $workdir/file.$idx
> }
> 
> read_file()
> {
> 	idx=$1
> 
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -r -c "pread 0 28" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$((nfiles * 4)); n++)); do
> 	write_file $n > /dev/null 2>&1 &
> 	read_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
@ 2019-02-07  5:39   ` Dave Chinner
  2019-02-07 15:52     ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-07  5:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > I've just finished analysing an IO trace from a application
> > generating an extreme filesystem fragmentation problem that started
> > with extent size hints and ended with spurious ENOSPC reports due to
> > massively fragmented files and free space. While the ENOSPC issue
> > looks to have previously been solved, I still wanted to understand
> > how the application had so comprehensively defeated extent size
> > hints as a method of avoiding file fragmentation.
> > 
> > The key behaviour that I discovered was that specific "append write
> > only" files that had extent size hints to prevent fragmentation
> > weren't actually write only.  The application didn't do a lot of
> > writes to the file, but it kept the file open and appended to the
> > file (from the traces I have) in chunks of between ~3000 bytes and
> > ~160000 bytes. This didn't explain the problem. I did notice that
> > the files were opened O_SYNC, however.
> > 
> > I then found was another process that, once every second, opened the
> > log file O_RDONLY, read 28 bytes from offset zero, then closed the
> > file. Every second. IOWs, between every appending write that would
> > allocate an extent size hint worth of space beyond EOF and then
> > write a small chunk of it, there were numerous open/read/close
> > cycles being done on the same file.
> > 
> > And what do we do on close()? We call xfs_release() and that can
> > truncate away blocks beyond EOF. For some reason the close wasn't
> > triggering the IDIRTY_RELEASE heuristic that preventd close from
> > removing EOF blocks prematurely. Then I realised that O_SYNC writes
> > don't leave delayed allocation blocks behind - they are always
> > converted in the context of the write. That's why it wasn't
> > triggering, and that meant that the open/read/close cycle was
> > removing the extent size hint allocation beyond EOF prematurely.
> 
> <urk>
> 
> > Then it occurred to me that extent size hints don't use delalloc
> > either, so they behave the same was as O_SYNC writes in this
> > situation.
> > 
> > Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> > modify the file without having write permissions.
> 
> Yikes!
> 
> > I suspect there's more cases like this when combined with repeated
> > open/<do_something>/close operations on a file that is being
> > written, but the patches address just these ones I just talked
> > about. The test script to reproduce them is below. Fragmentation
> > reduction results are in the commit descriptions. It's running
> > through fstests for a couple of hours now, no issues have been
> > noticed yet.
> > 
> > FWIW, I suspect we need to have a good hard think about whether we
> > should be trimming EOF blocks on close by default, or whether we
> > should only be doing it in very limited situations....
> > 
> > Comments, thoughts, flames welcome.
> > 
> > -Dave.
> > 
> > 
> > #!/bin/bash
> > #
> > # Test 1
> 
> Can you please turn these into fstests to cause the maintainer maximal
> immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
> for regression testing and finding whatever other problems we can find
> from digging deeper? :)

I will, but not today - I only understood the cause well enough to
write a prototype reproducer about 4 hours ago. The rest of the time
since then has been fixing the issues and running smoke tests. My
brain is about fried now....

FWIW, I think the scope of the problem is quite widespread -
anything that does open/something/close repeatedly on a file that is
being written to with O_DSYNC or O_DIRECT appending writes will kill
the post-eof extent size hint allocated space. That's why I suspect
we need to think about not trimming by default and trying to
enumerate only the cases that need to trim eof blocks.

e.g. I closed the O_RDONLY case, but O_RDWR/read/close in a loop
will still trigger removal of post EOF extent size hint
preallocation and hence severe fragmentation.
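For illustration (path made up), that loop looks like the sketch
below; O_RDWR sets FMODE_WRITE, so the read-only check added in
patch 3 doesn't apply even though nothing is ever written:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char	hdr[28];
	int	fd;

	for (;;) {
		/* FMODE_READ|FMODE_WRITE are set, but we never write */
		fd = open("/mnt/scratch/app.log", O_RDWR);
		if (fd >= 0) {
			ssize_t n = pread(fd, hdr, sizeof(hdr), 0);

			(void)n;
			/* close -> xfs_release() with write permission,
			 * so post-EOF preallocation can still be trimmed */
			close(fd);
		}
		sleep(1);
	}
	return 0;
}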

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set
  2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
@ 2019-02-07 15:51   ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2019-02-07 15:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 07, 2019 at 04:08:12PM +1100, Dave Chinner wrote:
> When we have a workload that does open/write/close on files with
> extent size hints set in parallel with other allocation, the file
> becomes rapidly fragmented. This is due to close() calling
> xfs_release() and removing the preallocated extent beyond EOF.  This
> occurs for both buffered and direct writes that append to files with
> extent size hints.
> 
> The existing open/write/close hueristic in xfs_release() does not
> catch this as writes to files using extent size hints do not use
> delayed allocation and hence do not leave delayed allocation blocks
> allocated on the inode that can be detected in xfs_release(). Hence
> XFS_IDIRTY_RELEASE never gets set.
> 
> In xfs_file_release(), we can tell whether the inode has extent size
> hints set and skip EOF block truncation. We add this check to
> xfs_can_free_eofblocks() so that we treat the post-EOF preallocated
> extent like intentional preallocation and so are persistent unless
> directly removed by userspace.
> 
> Before:
> 
> Test 2: Extent size hint fragmentation counts
> 
> /mnt/scratch/file.0: 1002
> /mnt/scratch/file.1: 1002
> /mnt/scratch/file.2: 1002
> /mnt/scratch/file.3: 1002
> /mnt/scratch/file.4: 1002
> /mnt/scratch/file.5: 1002
> /mnt/scratch/file.6: 1002
> /mnt/scratch/file.7: 1002
> 
> After:
> 
> Test 2: Extent size hint fragmentation counts
> 
> /mnt/scratch/file.0: 4
> /mnt/scratch/file.1: 4
> /mnt/scratch/file.2: 4
> /mnt/scratch/file.3: 4
> /mnt/scratch/file.4: 4
> /mnt/scratch/file.5: 4
> /mnt/scratch/file.6: 4
> /mnt/scratch/file.7: 4
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 1ee8c5539fa4..98e5e305b789 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -761,12 +761,15 @@ xfs_can_free_eofblocks(struct xfs_inode *ip, bool force)
>  		return false;
>  
>  	/*
> -	 * Do not free real preallocated or append-only files unless the file
> -	 * has delalloc blocks and we are forced to remove them.
> +	 * Do not free extent size hints, real preallocated or append-only files
> +	 * unless the file has delalloc blocks and we are forced to remove
> +	 * them.
>  	 */
> -	if (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
> +	if (xfs_get_extsz_hint(ip) ||
> +	    (ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))) {
>  		if (!force || ip->i_delayed_blks == 0)
>  			return false;
> +	}

Note that this will affect the background eofblocks scanner as well such
that we'll no longer ever trim files with an extent size hint. I'm not
saying that's necessarily a problem, but it should at minimum be
discussed in the commit log (which currently only refers to the more
problematic release context).

The consideration to be made is that this could affect the ability to
reclaim post-eof space on -ENOSPC mitigating eofblocks scans in cases
where there are large extent size hints (or many files with smaller
extsz hints, etc.). That might be something worth trying to accommodate
one way or another since it's slightly inconsistent behavior. Consider
that an unmount -> reclaim induced force eofb trim -> mount would
suddenly free up a bunch of space that the eofblocks scan didn't, for
example. This is already the case for preallocated files of course, so
this may very well be reasonable enough for extsz hints as well. What
might also be interesting is considering whether it's worth further
differentiating an -ENOSPC scan from a typical background scan to allow
the former to behave a bit more like reclaim in this regard.

Brian

>  
>  	return true;
>  }
> -- 
> 2.20.1
> 


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07  5:39   ` Dave Chinner
@ 2019-02-07 15:52     ` Brian Foster
  2019-02-08  2:47       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-07 15:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > I've just finished analysing an IO trace from a application
> > > generating an extreme filesystem fragmentation problem that started
> > > with extent size hints and ended with spurious ENOSPC reports due to
> > > massively fragmented files and free space. While the ENOSPC issue
> > > looks to have previously been solved, I still wanted to understand
> > > how the application had so comprehensively defeated extent size
> > > hints as a method of avoiding file fragmentation.
> > > 
> > > The key behaviour that I discovered was that specific "append write
> > > only" files that had extent size hints to prevent fragmentation
> > > weren't actually write only.  The application didn't do a lot of
> > > writes to the file, but it kept the file open and appended to the
> > > file (from the traces I have) in chunks of between ~3000 bytes and
> > > ~160000 bytes. This didn't explain the problem. I did notice that
> > > the files were opened O_SYNC, however.
> > > 
> > > I then found was another process that, once every second, opened the
> > > log file O_RDONLY, read 28 bytes from offset zero, then closed the
> > > file. Every second. IOWs, between every appending write that would
> > > allocate an extent size hint worth of space beyond EOF and then
> > > write a small chunk of it, there were numerous open/read/close
> > > cycles being done on the same file.
> > > 
> > > And what do we do on close()? We call xfs_release() and that can
> > > truncate away blocks beyond EOF. For some reason the close wasn't
> > > triggering the IDIRTY_RELEASE heuristic that preventd close from
> > > removing EOF blocks prematurely. Then I realised that O_SYNC writes
> > > don't leave delayed allocation blocks behind - they are always
> > > converted in the context of the write. That's why it wasn't
> > > triggering, and that meant that the open/read/close cycle was
> > > removing the extent size hint allocation beyond EOF prematurely.
> > 
> > <urk>
> > 
> > > Then it occurred to me that extent size hints don't use delalloc
> > > either, so they behave the same was as O_SYNC writes in this
> > > situation.
> > > 
> > > Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> > > modify the file without having write permissions.
> > 
> > Yikes!
> > 
> > > I suspect there's more cases like this when combined with repeated
> > > open/<do_something>/close operations on a file that is being
> > > written, but the patches address just these ones I just talked
> > > about. The test script to reproduce them is below. Fragmentation
> > > reduction results are in the commit descriptions. It's running
> > > through fstests for a couple of hours now, no issues have been
> > > noticed yet.
> > > 
> > > FWIW, I suspect we need to have a good hard think about whether we
> > > should be trimming EOF blocks on close by default, or whether we
> > > should only be doing it in very limited situations....
> > > 
> > > Comments, thoughts, flames welcome.
> > > 
> > > -Dave.
> > > 
> > > 
> > > #!/bin/bash
> > > #
> > > # Test 1
> > 
> > Can you please turn these into fstests to cause the maintainer maximal
> > immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
> > for regression testing and finding whatever other problems we can find
> > from digging deeper? :)
> 
> I will, but not today - I only understood the cause well enough to
> write a prototype reproducer about 4 hours ago. The rest of the time
> since then has been fixing the issues and running smoke tests. My
> brain is about fried now....
> 
> FWIW, I think the scope of the problem is quite widespread -
> anything that does open/something/close repeatedly on a file that is
> being written to with O_DSYNC or O_DIRECT appending writes will kill
> the post-eof extent size hint allocated space. That's why I suspect
> we need to think about not trimming by default and trying to
> enumerating only the cases that need to trim eof blocks.
> 

To further this point.. I think the eofblocks scanning stuff came long
after the speculative preallocation code and associated release time
post-eof truncate. I think the background scanning was initially an
enhancement to deal with things like the dirty release optimization
leaving these blocks around longer and being able to free up this
accumulated space when we're at -ENOSPC conditions. Now that we have the
scanning mechanism in place (and a 5 minute default background scan,
which really isn't all that long), it might be reasonable to just drop
the release time truncate completely and only trim post-eof blocks via
the bg scan or reclaim paths.

Brian

> e.g. I closed the O_RDONLY case, but O_RDWR/read/close in a loop
> will still trigger removal of post EOF extent size hint
> preallocation and hence severe fragmentation.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-07 15:52     ` Brian Foster
@ 2019-02-08  2:47       ` Dave Chinner
  2019-02-08 12:34         ` Brian Foster
  2019-02-08 16:29         ` Darrick J. Wong
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2019-02-08  2:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > I've just finished analysing an IO trace from a application
> > > > generating an extreme filesystem fragmentation problem that started
> > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > massively fragmented files and free space. While the ENOSPC issue
> > > > looks to have previously been solved, I still wanted to understand
> > > > how the application had so comprehensively defeated extent size
> > > > hints as a method of avoiding file fragmentation.
....
> > FWIW, I think the scope of the problem is quite widespread -
> > anything that does open/something/close repeatedly on a file that is
> > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > the post-eof extent size hint allocated space. That's why I suspect
> > we need to think about not trimming by default and trying to
> > enumerating only the cases that need to trim eof blocks.
> > 
> 
> To further this point.. I think the eofblocks scanning stuff came long
> after the speculative preallocation code and associated release time
> post-eof truncate.

Yes, I cribbed a bit of the history of the xfs_release() behaviour
on #xfs yesterday afternoon:

 <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
 <dchinner>	historic. People doing operations then complaining du didn't match ls
 <dchinner>	stuff like that
 <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
 <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
 <dchinner>	(i.e. that's how we used to make nfs not suck)
 <dchinner>	that's when we started doing work in ->release
 <dchinner>	it was close enough to "last close" for most workloads it made no difference.
 <dchinner>	Except for concurrent NFS writes into the same directory
 <dchinner>	and now there's another pathological application that triggers problems
 <dchinner>	The NFS exception was prior to having the background reaper
 <dchinner>	as these things go, the background reaper is relatively recent functionality
 <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all

> I think the background scanning was initially an
> enhancement to deal with things like the dirty release optimization
> leaving these blocks around longer and being able to free up this
> accumulated space when we're at -ENOSPC conditions.

Yes, amongst other things like slow writes keeping the file open
forever.....

> Now that we have the
> scanning mechanism in place (and a 5 minute default background scan,
> which really isn't all that long), it might be reasonable to just drop
> the release time truncate completely and only trim post-eof blocks via
> the bg scan or reclaim paths.

Yeah, that's kinda the question I'm asking here. What's the likely
impact of not trimming EOF blocks at least on close apart from
people complaining about df/ls not matching du?

I don't really care about that anymore because, well, reflink/dedupe
completely break any remaining assumption that du reported space
consumption is related to the file size (if sparse files weren't
enough of a hint already)....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-08  2:47       ` Dave Chinner
@ 2019-02-08 12:34         ` Brian Foster
  2019-02-12  1:13           ` Darrick J. Wong
  2019-02-08 16:29         ` Darrick J. Wong
  1 sibling, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-08 12:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > I've just finished analysing an IO trace from a application
> > > > > generating an extreme filesystem fragmentation problem that started
> > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > looks to have previously been solved, I still wanted to understand
> > > > > how the application had so comprehensively defeated extent size
> > > > > hints as a method of avoiding file fragmentation.
> ....
> > > FWIW, I think the scope of the problem is quite widespread -
> > > anything that does open/something/close repeatedly on a file that is
> > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > the post-eof extent size hint allocated space. That's why I suspect
> > > we need to think about not trimming by default and trying to
> > > enumerating only the cases that need to trim eof blocks.
> > > 
> > 
> > To further this point.. I think the eofblocks scanning stuff came long
> > after the speculative preallocation code and associated release time
> > post-eof truncate.
> 
> Yes, I cribed a bit of the history of the xfs_release() behaviour
> on #xfs yesterday afternoon:
> 
>  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
>  <dchinner>	historic. People doing operations then complaining du didn't match ls
>  <dchinner>	stuff like that
>  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
>  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
>  <dchinner>	(i.e. that's how we used to make nfs not suck)
>  <dchinner>	that's when we started doing work in ->release
>  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
>  <dchinner>	Except for concurrent NFS writes into the same directory
>  <dchinner>	and now there's another pathological application that triggers problems
>  <dchinner>	The NFS exception was prior to having thebackground reaper
>  <dchinner>	as these things goes the background reaper is relatively recent functionality
>  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at al
> 

Thanks.

> > I think the background scanning was initially an
> > enhancement to deal with things like the dirty release optimization
> > leaving these blocks around longer and being able to free up this
> > accumulated space when we're at -ENOSPC conditions.
> 
> Yes, amongst other things like slow writes keeping the file open
> forever.....
> 
> > Now that we have the
> > scanning mechanism in place (and a 5 minute default background scan,
> > which really isn't all that long), it might be reasonable to just drop
> > the release time truncate completely and only trim post-eof blocks via
> > the bg scan or reclaim paths.
> 
> Yeah, that's kinda the question I'm asking here. What's the likely
> impact of not trimming EOF blocks at least on close apart from
> people complaining about df/ls not matching du?
> 

Ok. ISTM it's just a continuation of the same "might confuse some users"
scenario that pops up occasionally. It also seems that kind of thing has
died down as either most people don't really know or care about the
transient state or are just more familiar with it at this point. IME,
complex applications that depend on block ownership stats (userspace
filesystems for example) already have to account for speculative
preallocation with XFS, so tweaking the semantics of the optimization
shouldn't really have much of an impact that I can tell so long as the
broader/long-term behavior doesn't change[1].

I suppose there are all kinds of other applications that are technically
affected by dropping the release time trim (simple file copies, archive
extraction, etc.), but it's not clear to me that matters so long as we
have effective bg and -ENOSPC scans. The only thing I can think of so
far is whether we should consider changes to the bg scan heuristics to
accommodate scenarios currently covered by the release time trim. For
example, the release time scan doesn't consider whether the file is
dirty or not while the bg scan always skips "active" files.

Brian

[1] Which isn't the case for a change to not bg trim files with extent
size hints.

> I don't really care about that anymore because, well, reflink/dedupe
> completely break any remaining assumption that du reported space
> consumption is related to the file size (if sparse files wasn't
> enough of a hint arlready)....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-08  2:47       ` Dave Chinner
  2019-02-08 12:34         ` Brian Foster
@ 2019-02-08 16:29         ` Darrick J. Wong
  1 sibling, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2019-02-08 16:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > I've just finished analysing an IO trace from a application
> > > > > generating an extreme filesystem fragmentation problem that started
> > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > looks to have previously been solved, I still wanted to understand
> > > > > how the application had so comprehensively defeated extent size
> > > > > hints as a method of avoiding file fragmentation.
> ....
> > > FWIW, I think the scope of the problem is quite widespread -
> > > anything that does open/something/close repeatedly on a file that is
> > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > the post-eof extent size hint allocated space. That's why I suspect
> > > we need to think about not trimming by default and trying to
> > > enumerating only the cases that need to trim eof blocks.
> > > 
> > 
> > To further this point.. I think the eofblocks scanning stuff came long
> > after the speculative preallocation code and associated release time
> > post-eof truncate.
> 
> Yes, I cribed a bit of the history of the xfs_release() behaviour
> on #xfs yesterday afternoon:
> 
>  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
>  <dchinner>	historic. People doing operations then complaining du didn't match ls
>  <dchinner>	stuff like that
>  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
>  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
>  <dchinner>	(i.e. that's how we used to make nfs not suck)
>  <dchinner>	that's when we started doing work in ->release
>  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
>  <dchinner>	Except for concurrent NFS writes into the same directory
>  <dchinner>	and now there's another pathological application that triggers problems
>  <dchinner>	The NFS exception was prior to having thebackground reaper
>  <dchinner>	as these things goes the background reaper is relatively recent functionality
>  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at al
> 
> > I think the background scanning was initially an
> > enhancement to deal with things like the dirty release optimization
> > leaving these blocks around longer and being able to free up this
> > accumulated space when we're at -ENOSPC conditions.
> 
> Yes, amongst other things like slow writes keeping the file open
> forever.....
> 
> > Now that we have the
> > scanning mechanism in place (and a 5 minute default background scan,
> > which really isn't all that long), it might be reasonable to just drop
> > the release time truncate completely and only trim post-eof blocks via
> > the bg scan or reclaim paths.
> 
> Yeah, that's kinda the question I'm asking here. What's the likely
> impact of not trimming EOF blocks at least on close apart from
> people complaining about df/ls not matching du?
> 
> I don't really care about that anymore because, well, reflink/dedupe
> completely break any remaining assumption that du reported space
> consumption is related to the file size (if sparse files wasn't
> enough of a hint arlready)....

Not to mention the deferred inactivation series tracks "space we could
free if we did a bunch of inactivation work" so that we can lie to
statfs and pretend we already did the work.  It wouldn't be hard to
include speculative posteof blocks in that too.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-08 12:34         ` Brian Foster
@ 2019-02-12  1:13           ` Darrick J. Wong
  2019-02-12 11:46             ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2019-02-12  1:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > I've just finished analysing an IO trace from a application
> > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > how the application had so comprehensively defeated extent size
> > > > > > hints as a method of avoiding file fragmentation.
> > ....
> > > > FWIW, I think the scope of the problem is quite widespread -
> > > > anything that does open/something/close repeatedly on a file that is
> > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > we need to think about not trimming by default and trying to
> > > > enumerating only the cases that need to trim eof blocks.
> > > > 
> > > 
> > > To further this point.. I think the eofblocks scanning stuff came long
> > > after the speculative preallocation code and associated release time
> > > post-eof truncate.
> > 
> > Yes, I cribed a bit of the history of the xfs_release() behaviour
> > on #xfs yesterday afternoon:
> > 
> >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> >  <dchinner>	stuff like that
> >  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> >  <dchinner>	that's when we started doing work in ->release
> >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> >  <dchinner>	Except for concurrent NFS writes into the same directory
> >  <dchinner>	and now there's another pathological application that triggers problems
> >  <dchinner>	The NFS exception was prior to having the background reaper
> >  <dchinner>	as these things go the background reaper is relatively recent functionality
> >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > 
> 
> Thanks.
> 
> > > I think the background scanning was initially an
> > > enhancement to deal with things like the dirty release optimization
> > > leaving these blocks around longer and being able to free up this
> > > accumulated space when we're at -ENOSPC conditions.
> > 
> > Yes, amongst other things like slow writes keeping the file open
> > forever.....
> > 
> > > Now that we have the
> > > scanning mechanism in place (and a 5 minute default background scan,
> > > which really isn't all that long), it might be reasonable to just drop
> > > the release time truncate completely and only trim post-eof blocks via
> > > the bg scan or reclaim paths.
> > 
> > Yeah, that's kinda the question I'm asking here. What's the likely
> > impact of not trimming EOF blocks at least on close apart from
> > people complaining about df/ls not matching du?
> > 
> 
> Ok. ISTM it's just a continuation of the same "might confuse some users"
> scenario that pops up occasionally. It also seems that kind of thing has
> died down as either most people don't really know or care about the
> transient state or are just more familiar with it at this point. IME,
> complex applications that depend on block ownership stats (userspace
> filesystems for example) already have to account for speculative
> preallocation with XFS, so tweaking the semantics of the optimization
> shouldn't really have much of an impact that I can tell so long as the
> broader/long-term behavior doesn't change[1].
> 
> I suppose there are all kinds of other applications that are technically
> affected by dropping the release time trim (simple file copies, archive
> extraction, etc.), but it's not clear to me that matters so long as we
> have effective bg and -ENOSPC scans. The only thing I can think of so
> far is whether we should consider changes to the bg scan heuristics to
> accommodate scenarios currently covered by the release time trim. For
> example, the release time scan doesn't consider whether the file is
> dirty or not while the bg scan always skips "active" files.

I wrote a quick and dirty fstest that writes 999 files between 128k and
256k in size, to simulate untarring onto a filesystem.  No fancy
preallocation, just buffered writes.  I patched my kernel to skip the
posteof block freeing in xfs_release, so the preallocations get freed by
inode inactivation.  Then the freespace histogram looks like:

+   from      to extents  blocks    pct
+      1       1      36      36   0.00
+      2       3      69     175   0.01
+      4       7     122     698   0.02
+      8      15     237    2691   0.08
+     16      31       1      16   0.00
+     32      63     500   27843   0.88
+ 524288  806272       4 3141225  99.01

Pretty gnarly. :)  By comparison, a stock upstream kernel:

+   from      to extents  blocks    pct
+ 524288  806272       4 3172579 100.00

That's 969 free extents vs. 4, on a fs with 999 new files... which is
pretty bad.  Dave also suggested on IRC that maybe this should be a
little smarter -- possibly skipping the posteof removal only if the
filesystem has sunit/swidth set, or if the inode has extent size hints,
or whatever. :)
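
(For anyone who wants to poke at this outside the fstests harness, a
rough equivalent of that test is sketched below. The scratch device and
mount point are assumptions, it only measures the resulting free space
layout via xfs_db's freesp command, and reproducing the patched-kernel
numbers above obviously still requires the patched kernel.)

#!/bin/bash
# Write 999 smallish files with plain buffered writes, cycle the mount
# so anything still pending is settled by inode inactivation, then dump
# the free space histogram.
SCRATCH_DEV=/dev/sdb1
SCRATCH_MNT=/mnt/scratch

mkfs.xfs -f $SCRATCH_DEV > /dev/null
mount $SCRATCH_DEV $SCRATCH_MNT

for ((i = 0; i < 999; i++)); do
	# file sizes vary between 128k and 256k in 4k steps
	size=$(( (128 + 4 * (RANDOM % 33)) * 1024 ))
	xfs_io -f -c "pwrite 0 $size" $SCRATCH_MNT/file.$i > /dev/null
done

umount $SCRATCH_MNT
xfs_db -r -c "freesp -s" $SCRATCH_DEV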

--D

> Brian
> 
> [1] Which isn't the case for a change to not bg trim files with extent
> size hints.
> 
> > I don't really care about that anymore because, well, reflink/dedupe
> > completely break any remaining assumption that du reported space
> > consumption is related to the file size (if sparse files wasn't
> > enough of a hint already)....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-12  1:13           ` Darrick J. Wong
@ 2019-02-12 11:46             ` Brian Foster
  2019-02-12 20:21               ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-12 11:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > > Hi folks,
> > > > > > > 
> > > > > > > I've just finished analysing an IO trace from a application
> > > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > > how the application had so comprehensively defeated extent size
> > > > > > > hints as a method of avoiding file fragmentation.
> > > ....
> > > > > FWIW, I think the scope of the problem is quite widespread -
> > > > > anything that does open/something/close repeatedly on a file that is
> > > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > > we need to think about not trimming by default and trying to
> > > > > enumerating only the cases that need to trim eof blocks.
> > > > > 
> > > > 
> > > > To further this point.. I think the eofblocks scanning stuff came long
> > > > after the speculative preallocation code and associated release time
> > > > post-eof truncate.
> > > 
> > > Yes, I cribbed a bit of the history of the xfs_release() behaviour
> > > on #xfs yesterday afternoon:
> > > 
> > >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> > >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> > >  <dchinner>	stuff like that
> > >  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> > >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> > >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> > >  <dchinner>	that's when we started doing work in ->release
> > >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> > >  <dchinner>	Except for concurrent NFS writes into the same directory
> > >  <dchinner>	and now there's another pathological application that triggers problems
> > >  <dchinner>	The NFS exception was prior to having the background reaper
> > >  <dchinner>	as these things go the background reaper is relatively recent functionality
> > >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > > 
> > 
> > Thanks.
> > 
> > > > I think the background scanning was initially an
> > > > enhancement to deal with things like the dirty release optimization
> > > > leaving these blocks around longer and being able to free up this
> > > > accumulated space when we're at -ENOSPC conditions.
> > > 
> > > Yes, amongst other things like slow writes keeping the file open
> > > forever.....
> > > 
> > > > Now that we have the
> > > > scanning mechanism in place (and a 5 minute default background scan,
> > > > which really isn't all that long), it might be reasonable to just drop
> > > > the release time truncate completely and only trim post-eof blocks via
> > > > the bg scan or reclaim paths.
> > > 
> > > Yeah, that's kinda the question I'm asking here. What's the likely
> > > impact of not trimming EOF blocks at least on close apart from
> > > people complaining about df/ls not matching du?
> > > 
> > 
> > Ok. ISTM it's just a continuation of the same "might confuse some users"
> > scenario that pops up occasionally. It also seems that kind of thing has
> > died down as either most people don't really know or care about the
> > transient state or are just more familiar with it at this point. IME,
> > complex applications that depend on block ownership stats (userspace
> > filesystems for example) already have to account for speculative
> > preallocation with XFS, so tweaking the semantics of the optimization
> > shouldn't really have much of an impact that I can tell so long as the
> > broader/long-term behavior doesn't change[1].
> > 
> > I suppose there are all kinds of other applications that are technically
> > affected by dropping the release time trim (simple file copies, archive
> > extraction, etc.), but it's not clear to me that matters so long as we
> > have effective bg and -ENOSPC scans. The only thing I can think of so
> > far is whether we should consider changes to the bg scan heuristics to
> > accommodate scenarios currently covered by the release time trim. For
> > example, the release time scan doesn't consider whether the file is
> > dirty or not while the bg scan always skips "active" files.
> 
> I wrote a quick and dirty fstest that writes 999 files between 128k and
> 256k in size, to simulate untarring onto a filesystem.  No fancy
> preallocation, just buffered writes.  I patched my kernel to skip the
> posteof block freeing in xfs_release, so the preallocations get freed by
> inode inactivation.  Then the freespace histogram looks like:
> 

You didn't mention whether you disabled background eofb trims. Are you
just rendering that irrelevant by disabling the release time trim and
doing a mount cycle?
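
(Side note for anyone reproducing this: the background eofblocks scan
interval is a sysctl -- it should be fs.xfs.speculative_prealloc_lifetime,
300 seconds by default -- so it can be stretched out to keep the
background trim from interfering with a release-time-only test:)

# current background post-EOF trim interval, in seconds (default 300)
cat /proc/sys/fs/xfs/speculative_prealloc_lifetime

# push it out to a day so only the release time trim (or an unmount)
# can remove post-EOF blocks during the test window
echo 86400 > /proc/sys/fs/xfs/speculative_prealloc_lifetime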

> +   from      to extents  blocks    pct
> +      1       1      36      36   0.00
> +      2       3      69     175   0.01
> +      4       7     122     698   0.02
> +      8      15     237    2691   0.08
> +     16      31       1      16   0.00
> +     32      63     500   27843   0.88
> + 524288  806272       4 3141225  99.01
> 
> Pretty gnarly. :)  By comparison, a stock upstream kernel:
> 

Indeed, that's a pretty rapid degradation. Thanks for testing that.

> +   from      to extents  blocks    pct
> + 524288  806272       4 3172579 100.00
> 
> That's 969 free extents vs. 4, on a fs with 999 new files... which is
> pretty bad.  Dave also suggested on IRC that maybe this should be a
> little smarter -- possibly skipping the posteof removal only if the
> filesystem has sunit/swidth set, or if the inode has extent size hints,
> or whatever. :)
> 

This test implies that there's a significant difference between eofb
trims prior to delalloc conversion vs. after, which I suspect is the
primary difference between doing so on close vs. some time later. Is
there any good way to confirm that with your test? If that is the case,
it makes me wonder whether we should think about more generalized logic
as opposed to a battery of whatever particular inode state checks that
we've determined in practice contribute to free space fragmentation.

For example, extent size hints just happen to skip delayed allocation.
I don't recall the exact history, but I don't think this was always the
case for extsz hints. So would the close time eofb trim be as
problematic as for extsz hint files if the behavior of the latter
changed back to using delayed allocation? I think a patch for that was
proposed fairly recently, but it depended on delalloc -> unwritten
functionality which still had unresolved issues (IIRC).

From another angle, would a system that held files open for a
significant amount of time relative to a one-time write such that close
consistently occurred after writeback (and thus delalloc conversion) be
susceptible to the same level of free space fragmentation as shown
above? (I'm not sure if/why anybody would actually do that.. userspace
fs with an fd cache perhaps? It's somewhat besides the point,
anyways...)).

More testing and thought is probably required. I _was_ wondering if we
should consider something like always waiting as long as possible to
eofb trim already converted post-eof blocks, but I'm not totally
convinced that actually has value. For files that are not going to see
any further appends, we may have already lost since the real post-eof
blocks will end up truncated just the same whether it happens sooner or
not until inode reclaim.

Hmm, maybe we actually need to think about how to be smarter about when
to introduce speculative preallocation as opposed to how/when to reclaim
it. We currently limit speculative prealloc to files of a minimum size
(64k IIRC). Just thinking out loud, but what if we restricted
preallocation to files that have been appended after at least one
writeback cycle, for example?
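
(A quick way to see where that threshold sits on a given kernel is to
print stat from inside xfs_io, i.e. while the fd is still open and
before any release time trim can run -- the delalloc reservation should
show up in the allocated block count, so a file that got speculative
prealloc reports far more blocks than its size implies. The 32k/1m
sizes and the mount point below are only illustrative, and assume the
64k figure above is roughly right:)

# small file: expect the block count to roughly match the 32k written
xfs_io -f -c "pwrite 0 32k" -c stat /mnt/scratch/tiny

# larger file written in 64k chunks so EOF is extended repeatedly:
# expect the block count to run well ahead of the 1m file size
xfs_io -f -c "pwrite -b 64k 0 1m" -c stat /mnt/scratch/big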

Brian

> --D
> 
> > Brian
> > 
> > [1] Which isn't the case for a change to not bg trim files with extent
> > size hints.
> > 
> > > I don't really care about that anymore because, well, reflink/dedupe
> > > completely break any remaining assumption that du reported space
> > > consumption is related to the file size (if sparse files wasn't
> > > enough of a hint already)....
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-12 11:46             ` Brian Foster
@ 2019-02-12 20:21               ` Dave Chinner
  2019-02-13 13:50                 ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-12 20:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > > > Hi folks,
> > > > > > > > 
> > > > > > > > I've just finished analysing an IO trace from a application
> > > > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > > > how the application had so comprehensively defeated extent size
> > > > > > > > hints as a method of avoiding file fragmentation.
> > > > ....
> > > > > > FWIW, I think the scope of the problem is quite widespread -
> > > > > > anything that does open/something/close repeatedly on a file that is
> > > > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > > > we need to think about not trimming by default and trying to
> > > > > > enumerating only the cases that need to trim eof blocks.
> > > > > > 
> > > > > 
> > > > > To further this point.. I think the eofblocks scanning stuff came long
> > > > > after the speculative preallocation code and associated release time
> > > > > post-eof truncate.
> > > > 
> > > > Yes, I cribbed a bit of the history of the xfs_release() behaviour
> > > > on #xfs yesterday afternoon:
> > > > 
> > > >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> > > >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> > > >  <dchinner>	stuff like that
> > > >  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> > > >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> > > >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> > > >  <dchinner>	that's when we started doing work in ->release
> > > >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> > > >  <dchinner>	Except for concurrent NFS writes into the same directory
> > > >  <dchinner>	and now there's another pathological application that triggers problems
> > > >  <dchinner>	The NFS exception was prior to having the background reaper
> > > >  <dchinner>	as these things go the background reaper is relatively recent functionality
> > > >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > > > 
> > > 
> > > Thanks.
> > > 
> > > > > I think the background scanning was initially an
> > > > > enhancement to deal with things like the dirty release optimization
> > > > > leaving these blocks around longer and being able to free up this
> > > > > accumulated space when we're at -ENOSPC conditions.
> > > > 
> > > > Yes, amongst other things like slow writes keeping the file open
> > > > forever.....
> > > > 
> > > > > Now that we have the
> > > > > scanning mechanism in place (and a 5 minute default background scan,
> > > > > which really isn't all that long), it might be reasonable to just drop
> > > > > the release time truncate completely and only trim post-eof blocks via
> > > > > the bg scan or reclaim paths.
> > > > 
> > > > Yeah, that's kinda the question I'm asking here. What's the likely
> > > > impact of not trimming EOF blocks at least on close apart from
> > > > people complaining about df/ls not matching du?
> > > > 
> > > 
> > > Ok. ISTM it's just a continuation of the same "might confuse some users"
> > > scenario that pops up occasionally. It also seems that kind of thing has
> > > died down as either most people don't really know or care about the
> > > transient state or are just more familiar with it at this point. IME,
> > > complex applications that depend on block ownership stats (userspace
> > > filesystems for example) already have to account for speculative
> > > preallocation with XFS, so tweaking the semantics of the optimization
> > > shouldn't really have much of an impact that I can tell so long as the
> > > broader/long-term behavior doesn't change[1].
> > > 
> > > I suppose there are all kinds of other applications that are technically
> > > affected by dropping the release time trim (simple file copies, archive
> > > extraction, etc.), but it's not clear to me that matters so long as we
> > > have effective bg and -ENOSPC scans. The only thing I can think of so
> > > far is whether we should consider changes to the bg scan heuristics to
> > > accommodate scenarios currently covered by the release time trim. For
> > > example, the release time scan doesn't consider whether the file is
> > > dirty or not while the bg scan always skips "active" files.
> > 
> > I wrote a quick and dirty fstest that writes 999 files between 128k and
> > 256k in size, to simulate untarring onto a filesystem.  No fancy
> > preallocation, just buffered writes.  I patched my kernel to skip the
> > posteof block freeing in xfs_release, so the preallocations get freed by
> > inode inactivation.  Then the freespace histogram looks like:
> > 
> 
> You didn't mention whether you disabled background eofb trims. Are you
> just rendering that irrelevant by disabling the release time trim and
> doing a mount cycle?
> 
> > +   from      to extents  blocks    pct
> > +      1       1      36      36   0.00
> > +      2       3      69     175   0.01
> > +      4       7     122     698   0.02
> > +      8      15     237    2691   0.08
> > +     16      31       1      16   0.00
> > +     32      63     500   27843   0.88
> > + 524288  806272       4 3141225  99.01
> > 
> > Pretty gnarly. :)  By comparison, a stock upstream kernel:
> > 
> 
> Indeed, that's a pretty rapid degradation. Thanks for testing that.
> 
> > +   from      to extents  blocks    pct
> > + 524288  806272       4 3172579 100.00
> > 
> > That's 969 free extents vs. 4, on a fs with 999 new files... which is
> > pretty bad.  Dave also suggested on IRC that maybe this should be a
> > little smarter -- possibly skipping the posteof removal only if the
> > filesystem has sunit/swidth set, or if the inode has extent size hints,
> > or whatever. :)
> > 
> 
> This test implies that there's a significant difference between eofb
> trims prior to delalloc conversion vs. after, which I suspect is the
> primary difference between doing so on close vs. some time later.

Yes, it's the difference between trimming the excess off the
delalloc extent and trimming the excess off an allocated extent
after writeback. In the latter case, we end up fragmenting free space
because, while writeback is packing as tightly as it can, there is
unused space between the end of the one file and the start of the
next that ends up as free space.

> Is
> there any good way to confirm that with your test? If that is the case,
> it makes me wonder whether we should think about more generalized logic
> as opposed to a battery of whatever particular inode state checks that
> we've determined in practice contribute to free space fragmentation.

Yeah, we talked about that on #xfs, and it seems to me that the best
heuristic we can come up with is "trim on first close; if there are
multiple closes, treat it as a repeated open/write/close workload,
apply the IDIRTY_RELEASE heuristic to it, and don't remove the
prealloc on closes after the first".

> For example, extent size hints just happen to skip delayed allocation.
> I don't recall the exact history, but I don't think this was always the
> case for extsz hints.

It wasn't, but extent size hints + delalloc never played nicely and
could corrupt or expose stale data and caused all sorts of problems
at ENOSPC because delalloc reservations are unaware of alignment
requirements for extent size hints. Hence to make extent size hints
work for buffered writes, I simply made them work the same way as
direct writes (i.e. immediate allocation w/ unwritten extents).
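
(That difference is visible from userspace: with a hint set, the blocks
backing a buffered write show up as real unwritten allocations
immediately, before writeback ever runs. A minimal illustration,
assuming a scratch mount at /mnt/scratch -- the exact bmap output
format varies with the xfsprogs version, but the extent should be
reported as unwritten rather than delalloc:)

# buffered write into a file with a 1MiB extent size hint; the hint
# forces immediate allocation of unwritten extents instead of delalloc
xfs_io -f \
	-c "extsize 1m" \
	-c "pwrite 0 64k" \
	-c stat \
	-c "bmap -vp" \
	/mnt/scratch/hinted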

> So would the close time eofb trim be as
> problematic as for extsz hint files if the behavior of the latter
> changed back to using delayed allocation?

Yes, but if it's a write-once file that doesn't matter. If it's
write-many, then we'd retain the post-eof blocks...

> I think a patch for that was
> proposed fairly recently, but it depended on delalloc -> unwritten
> functionality which still had unresolved issues (IIRC).

*nod*

> From another angle, would a system that held files open for a
> significant amount of time relative to a one-time write such that close
> consistently occurred after writeback (and thus delalloc conversion) be
> susceptible to the same level of free space fragmentation as shown
> above?

If the file is held open for writing for a long time, we have to
assume that they are going to write again (and again) so we should
leave the EOF blocks there. If they are writing slower than the eofb
gc, then there's nothing more we can really do in that case...

> (I'm not sure if/why anybody would actually do that.. userspace
> fs with an fd cache perhaps? It's somewhat besides the point,
> anyways...)).

*nod*

> More testing and thought is probably required. I _was_ wondering if we
> should consider something like always waiting as long as possible to
> eofb trim already converted post-eof blocks, but I'm not totally
> convinced that actually has value. For files that are not going to see
> any further appends, we may have already lost since the real post-eof
> blocks will end up truncated just the same whether it happens sooner or
> not until inode reclaim.

If the writes are far enough apart, then we lose any IO optimisation
advantage of retaining post-eof blocks (induces seeks because
location of new writes is fixed ahead of time). Then it just becomes
a fragmentation avoidance mechanism. If the writes are slow enough,
fragmentation really doesn't matter a whole lot - it's when writes
are frequent and we trash the post-eof blocks quickly that it
matters.

> Hmm, maybe we actually need to think about how to be smarter about when
> to introduce speculative preallocation as opposed to how/when to reclaim
> it. We currently limit speculative prealloc to files of a minimum size
> (64k IIRC). Just thinking out loud, but what if we restricted
> preallocation to files that have been appended after at least one
> writeback cycle, for example?

Speculative delalloc for write once large files also has a massive
impact on things like allocation overhead - we can write gigabytes
into the page cache before writeback begins. If we take away the
speculative delalloc for these first write files, then we are
essentially doing an extent manipulation (extending the delalloc extent)
on every write() call we make.

Right now we only do that extent btree work when we hit the end of
the current speculative delalloc extent, so the normal write case is
just extending the in-memory EOF location rather than running the
entirety of xfs_bmapi_reserve_delalloc() and doing space accounting,
etc.

/me points at his 2006 OLS paper about scaling write performance
as an example of just how important keeping delalloc overhead down
is for high throughput write performance:

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

IOWs, speculative prealloc beyond EOF is not just about preventing
fragmentation - it also helps minimise the per-write CPU overhead of
delalloc space accounting. (i.e. allows faster write rates into
cache). IOWs, for anything more than really small files, we want
to be doing speculative delalloc on the first time the file is
written to.
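
(A crude userspace way to see that per-write cost is to compare
buffered write rates -- xfs_io's pwrite reports throughput -- between a
default mount, where the dynamic EOF prealloc keeps delalloc extensions
rare, and a mount with a small fixed allocsize=, which forces the
delalloc extent to be extended far more often. A sketch, assuming a
scratch fs and that the delta is even measurable on the CPU in
question:)

# default mount: dynamic speculative prealloc beyond EOF
mount /dev/sdb1 /mnt/scratch
xfs_io -f -c "pwrite -b 4k 0 1g" /mnt/scratch/dynamic
umount /mnt/scratch

# small fixed prealloc: the delalloc extent is extended every 64k
mount -o allocsize=64k /dev/sdb1 /mnt/scratch
xfs_io -f -c "pwrite -b 4k 0 1g" /mnt/scratch/fixed
umount /mnt/scratch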

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-12 20:21               ` Dave Chinner
@ 2019-02-13 13:50                 ` Brian Foster
  2019-02-13 22:27                   ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-13 13:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > > > > Hi folks,
> > > > > > > > > 
> > > > > > > > > I've just finished analysing an IO trace from a application
> > > > > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > > > > how the application had so comprehensively defeated extent size
> > > > > > > > > hints as a method of avoiding file fragmentation.
> > > > > ....
> > > > > > > FWIW, I think the scope of the problem is quite widespread -
> > > > > > > anything that does open/something/close repeatedly on a file that is
> > > > > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > > > > we need to think about not trimming by default and trying to
> > > > > > > enumerating only the cases that need to trim eof blocks.
> > > > > > > 
> > > > > > 
> > > > > > To further this point.. I think the eofblocks scanning stuff came long
> > > > > > after the speculative preallocation code and associated release time
> > > > > > post-eof truncate.
> > > > > 
> > > > > Yes, I cribbed a bit of the history of the xfs_release() behaviour
> > > > > on #xfs yesterday afternoon:
> > > > > 
> > > > >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> > > > >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> > > > >  <dchinner>	stuff like that
> > > > >  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> > > > >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> > > > >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> > > > >  <dchinner>	that's when we started doing work in ->release
> > > > >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> > > > >  <dchinner>	Except for concurrent NFS writes into the same directory
> > > > >  <dchinner>	and now there's another pathological application that triggers problems
> > > > >  <dchinner>	The NFS exception was prior to having the background reaper
> > > > >  <dchinner>	as these things go the background reaper is relatively recent functionality
> > > > >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > > > > 
> > > > 
> > > > Thanks.
> > > > 
> > > > > > I think the background scanning was initially an
> > > > > > enhancement to deal with things like the dirty release optimization
> > > > > > leaving these blocks around longer and being able to free up this
> > > > > > accumulated space when we're at -ENOSPC conditions.
> > > > > 
> > > > > Yes, amongst other things like slow writes keeping the file open
> > > > > forever.....
> > > > > 
> > > > > > Now that we have the
> > > > > > scanning mechanism in place (and a 5 minute default background scan,
> > > > > > which really isn't all that long), it might be reasonable to just drop
> > > > > > the release time truncate completely and only trim post-eof blocks via
> > > > > > the bg scan or reclaim paths.
> > > > > 
> > > > > Yeah, that's kinda the question I'm asking here. What's the likely
> > > > > impact of not trimming EOF blocks at least on close apart from
> > > > > people complaining about df/ls not matching du?
> > > > > 
> > > > 
> > > > Ok. ISTM it's just a continuation of the same "might confuse some users"
> > > > scenario that pops up occasionally. It also seems that kind of thing has
> > > > died down as either most people don't really know or care about the
> > > > transient state or are just more familiar with it at this point. IME,
> > > > complex applications that depend on block ownership stats (userspace
> > > > filesystems for example) already have to account for speculative
> > > > preallocation with XFS, so tweaking the semantics of the optimization
> > > > shouldn't really have much of an impact that I can tell so long as the
> > > > broader/long-term behavior doesn't change[1].
> > > > 
> > > > I suppose there are all kinds of other applications that are technically
> > > > affected by dropping the release time trim (simple file copies, archive
> > > > extraction, etc.), but it's not clear to me that matters so long as we
> > > > have effective bg and -ENOSPC scans. The only thing I can think of so
> > > > far is whether we should consider changes to the bg scan heuristics to
> > > > accommodate scenarios currently covered by the release time trim. For
> > > > example, the release time scan doesn't consider whether the file is
> > > > dirty or not while the bg scan always skips "active" files.
> > > 
> > > I wrote a quick and dirty fstest that writes 999 files between 128k and
> > > 256k in size, to simulate untarring onto a filesystem.  No fancy
> > > preallocation, just buffered writes.  I patched my kernel to skip the
> > > posteof block freeing in xfs_release, so the preallocations get freed by
> > > inode inactivation.  Then the freespace histogram looks like:
> > > 
> > 
> > You didn't mention whether you disabled background eofb trims. Are you
> > just rendering that irrelevant by disabling the release time trim and
> > doing a mount cycle?
> > 
> > > +   from      to extents  blocks    pct
> > > +      1       1      36      36   0.00
> > > +      2       3      69     175   0.01
> > > +      4       7     122     698   0.02
> > > +      8      15     237    2691   0.08
> > > +     16      31       1      16   0.00
> > > +     32      63     500   27843   0.88
> > > + 524288  806272       4 3141225  99.01
> > > 
> > > Pretty gnarly. :)  By comparison, a stock upstream kernel:
> > > 
> > 
> > Indeed, that's a pretty rapid degradation. Thanks for testing that.
> > 
> > > +   from      to extents  blocks    pct
> > > + 524288  806272       4 3172579 100.00
> > > 
> > > That's 969 free extents vs. 4, on a fs with 999 new files... which is
> > > pretty bad.  Dave also suggested on IRC that maybe this should be a
> > > little smarter -- possibly skipping the posteof removal only if the
> > > filesystem has sunit/swidth set, or if the inode has extent size hints,
> > > or whatever. :)
> > > 
> > 
> > This test implies that there's a significant difference between eofb
> > trims prior to delalloc conversion vs. after, which I suspect is the
> > primary difference between doing so on close vs. some time later.
> 
> Yes, it's the difference between trimming the excess off the
> delalloc extent and trimming the excess off an allocated extent
> after writeback. In the latter case, we end up fragmenting free space
> because, while writeback is packing as tightly as it can, there is
> unused space between the end of the one file and the start of the
> next that ends up as free space.
> 

Makes sense, kind of what I expected. I think it does raise the question
of the value of small-ish speculative preallocations.

> > Is
> > there any good way to confirm that with your test? If that is the case,
> > it makes me wonder whether we should think about more generalized logic
> > as opposed to a battery of whatever particular inode state checks that
> > we've determined in practice contribute to free space fragmentation.
> 
> Yeah, we talked about that on #xfs, and it seems to me that the best
> heuristic we can come up with is "trim on first close; if there are
> multiple closes, treat it as a repeated open/write/close workload,
> apply the IDIRTY_RELEASE heuristic to it, and don't remove the
> prealloc on closes after the first".
> 

Darrick mentioned this yesterday on irc. This sounds like a reasonable
variation of the change to me. It filters out the write once use case
that we clearly can't unconditionally defer to background trim based on
the test results above.

> > For example, extent size hints just happen to skip delayed allocation.
> > I don't recall the exact history, but I don't think this was always the
> > case for extsz hints.
> 
> It wasn't, but extent size hints + delalloc never played nicely and
> could corrupt or expose stale data and caused all sorts of problems
> at ENOSPC because delalloc reservations are unaware of alignment
> requirements for extent size hints. Hence to make extent size hints
> work for buffered writes, I simply made them work the same way as
> direct writes (i.e. immediate allocation w/ unwritten extents).
> 
> > So would the close time eofb trim be as
> > problematic as for extsz hint files if the behavior of the latter
> > changed back to using delayed allocation?
> 
> Yes, but if it's a write-once file that doesn't matter. If it's
> write-many, then we'd retain the post-eof blocks...
> 
> > I think a patch for that was
> > proposed fairly recently, but it depended on delalloc -> unwritten
> > functionality which still had unresolved issues (IIRC).
> 
> *nod*
> 
> > From another angle, would a system that held files open for a
> > significant amount of time relative to a one-time write such that close
> > consistently occurred after writeback (and thus delalloc conversion) be
> > susceptible to the same level of free space fragmentation as shown
> > above?
> 
> If the file is held open for writing for a long time, we have to
> assume that they are going to write again (and again) so we should
> leave the EOF blocks there. If they are writing slower than the eofb
> gc, then there's nothing more we can really do in that case...
> 

I'm not necessarily sure that this condition is always a matter of
writing too slow. It very well may be true, but I'm curious if there are
parallel copy scenarios (perhaps under particular cpu/RAM configs)
where we could end up doing a large number of one time file writes and
not doing the release time trim until an underlying extent (with
post-eof blocks) has been converted in more cases than not.

Perhaps this is more of a question of writeback behavior than anything
in XFS, but if that can occur with files on the smaller side it's a
clear path to free space fragmentation. I think it's at least worth
considering whether we can improve things, but of course this requires
significantly more thought and analysis to determine whether that's an
actual problem and if the cure is worse than the disease. :P

> > (I'm not sure if/why anybody would actually do that.. userspace
> > fs with an fd cache perhaps? It's somewhat besides the point,
> > anyways...)).
> 
> *nod*
> 
> > More testing and thought is probably required. I _was_ wondering if we
> > should consider something like always waiting as long as possible to
> > eofb trim already converted post-eof blocks, but I'm not totally
> > convinced that actually has value. For files that are not going to see
> > any further appends, we may have already lost since the real post-eof
> > blocks will end up truncated just the same whether it happens sooner or
> > not until inode reclaim.
> 
> If the writes are far enough apart, then we lose any IO optimisation
> advantage of retaining post-eof blocks (induces seeks because
> location of new writes is fixed ahead of time). Then it just becomes
> a fragmentation avoidance mechanism. If the writes are slow enough,
> fragmentation really doesn't matter a whole lot - it's when writes
> are frequent and we trash the post-eof blocks quickly that it
> matters.
> 

It sounds like what you're saying is that it doesn't really matter
either way at this point. There's no perf advantage to keeping the eof
blocks in this scenario, but there's also no real harm in deferring the
eofb trim of physical post-eof blocks because any future free space
fragmentation damage has already been done (assuming no more writes come
in).

The thought above was tip-toeing around the idea of (in addition to the
one-time trim heuristic you mentioned above) never doing a release time
trim of non-delalloc post-eof blocks.

> > Hmm, maybe we actually need to think about how to be smarter about when
> > to introduce speculative preallocation as opposed to how/when to reclaim
> > it. We currently limit speculative prealloc to files of a minimum size
> > (64k IIRC). Just thinking out loud, but what if we restricted
> > preallocation to files that have been appended after at least one
> > writeback cycle, for example?
> 
> Speculative delalloc for write once large files also has a massive
> impact on things like allocation overhead - we can write gigabytes
> into the page cache before writeback begins. If we take away the
> speculative delalloc for these first write files, then we are
> essentially doing an extent manipulation (extending the delalloc extent)
> on every write() call we make.
> 

I was kind of banking on the ability to write large amounts of data to
cache before a particular file sees writeback activity. That means we'll
allocate as large as possible extents for actual file data and only
start doing preallocations on the larger side as well.

I hadn't considered the angle of preallocation reducing (del)allocation
overhead, however..

> Right now we only do that extent btree work when we hit the end of
> the current speculative delalloc extent, so the normal write case is
> just extending the in-memory EOF location rather than running the
> entirety of xfs_bmapi_reserve_delalloc() and doing space accounting,
> etc.
> 
> /me points at his 2006 OLS paper about scaling write performance
> as an example of just how important keeping delalloc overhead down
> is for high throughput write performance:
> 
> https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf
> 

Interesting work, thanks.

> IOWs, speculative prealloc beyond EOF is not just about preventing
> fragmentation - it also helps minimise the per-write CPU overhead of
> delalloc space accounting. (i.e. allows faster write rates into
> cache). IOWs, for anything more than really small files, we want
> to be doing speculative delalloc on the first time the file is
> written to.
> 

Ok, this paper refers to CPU overhead as it contributes to lack of
scalability. It certainly makes sense that a delayed allocation buffered
write does more work than a non-alloc write, but this document isn't
necessarily claiming that as evidence of measurably faster or slower
single threaded buffered writes. Rather, that overhead becomes
measurable with enough parallel buffered writes contending on a global
allocation critical section (section 5.1 "Spinlocks in Hot Paths").

Subsequently, section 6.1.2 addresses that particular issue via the
introduction of per-cpu accounting. It looks like there is further
discussion around the general link between efficiency and performance,
which makes sense, but I don't think that draws a definite conclusion
that speculative preallocation needs to be introduced immediately on
sequential buffered writes. What I think it suggests is that we need to
consider the potential scalability impact of any prospective change in
speculative preallocation behavior (along with the other tradeoffs
associated with preallocation) because less aggressive preallocation
means more buffered write overhead.

BTW for historical context.. was speculative preallocation a thing when
this paper was written? The doc suggests per-page block allocation needs
to be reduced on XFS, but also indicates that XFS had minimal
fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
fragmentation avoidance to delayed allocation (not necessarily
preallocation).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-13 13:50                 ` Brian Foster
@ 2019-02-13 22:27                   ` Dave Chinner
  2019-02-14 13:00                     ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-13 22:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > For example, extent size hints just happen to skip delayed allocation.
> > > I don't recall the exact history, but I don't think this was always the
> > > case for extsz hints.
> > 
> > It wasn't, but extent size hints + delalloc never played nicely and
> > could corrupt or expose stale data and caused all sorts of problems
> > at ENOSPC because delalloc reservations are unaware of alignment
> > requirements for extent size hints. Hence to make extent size hints
> > work for buffered writes, I simply made them work the same way as
> > direct writes (i.e. immediate allocation w/ unwritten extents).
> > 
> > > So would the close time eofb trim be as
> > > problematic as for extsz hint files if the behavior of the latter
> > > changed back to using delayed allocation?
> > 
> > Yes, but if it's a write-once file that doesn't matter. If it's
> > write-many, then we'd retain the post-eof blocks...
> > 
> > > I think a patch for that was
> > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > functionality which still had unresolved issues (IIRC).
> > 
> > *nod*
> > 
> > > From another angle, would a system that held files open for a
> > > significant amount of time relative to a one-time write such that close
> > > consistently occurred after writeback (and thus delalloc conversion) be
> > > susceptible to the same level of free space fragmentation as shown
> > > above?
> > 
> > If the file is held open for writing for a long time, we have to
> > assume that they are going to write again (and again) so we should
> > leave the EOF blocks there. If they are writing slower than the eofb
> > gc, then there's nothing more we can really do in that case...
> > 
> 
> I'm not necessarily sure that this condition is always a matter of
> writing too slow. It very well may be true, but I'm curious if there are
> parallel copy scenarios (perhaps under particular cpu/RAM configs)
> where we could end up doing a large number of one time file writes and
> not doing the release time trim until an underlying extent (with
> post-eof blocks) has been converted in more cases than not.

I'm not sure I follow what sort of workload and situation you are
describing here. Are you talking about the effect of an EOFB gc pass
during ongoing writes?

> Perhaps this is more of a question of writeback behavior than anything
> in XFS, but if that can occur with files on the smaller side it's a
> clear path to free space fragmentation. I think it's at least worth
> considering whether we can improve things, but of course this requires
> significantly more thought and analysis to determine whether that's an
> actual problem and if the cure is worse than the disease. :P
> 
> > > (I'm not sure if/why anybody would actually do that.. userspace
> > > fs with an fd cache perhaps? It's somewhat besides the point,
> > > anyways...)).
> > 
> > *nod*
> > 
> > > More testing and thought is probably required. I _was_ wondering if we
> > > should consider something like always waiting as long as possible to
> > > eofb trim already converted post-eof blocks, but I'm not totally
> > > convinced that actually has value. For files that are not going to see
> > > any further appends, we may have already lost since the real post-eof
> > > blocks will end up truncated just the same whether it happens sooner or
> > > not until inode reclaim.
> > 
> > If the writes are far enough apart, then we lose any IO optimisation
> > advantage of retaining post-eof blocks (induces seeks because
> > location of new writes is fixed ahead of time). Then it just becomes
> > a fragmentation avoidance If the writes are slow enough,
> > fragmentation really doesn't matter a whole lot - it's when writes
> > are frequent and we trash the post-eof blocks quickly that it
> > matters.
> > 
> 
> It sounds like what you're saying is that it doesn't really matter
> either way at this point. There's no perf advantage to keeping the eof
> blocks in this scenario, but there's also no real harm in deferring the
> eofb trim of physical post-eof blocks because any future free space
> fragmentation damage has already been done (assuming no more writes come
> in).

Essentially, yes.

> The thought above was tip-toeing around the idea of (in addition to the
> one-time trim heuristic you mentioned above) never doing a release time
> trim of non-delalloc post-eof blocks.

Except we want to trim blocks in the cases where it's a write
once file that has been fsync()d or written by O_DIRECT w/ really
large extent size hints....

It seems to me that once we start making exceptions to the general
rule, we end up in a world of corner cases and, most likely,
unreliable heuristics.
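
(The O_DIRECT + large hint case is easy to watch from userspace: print
the allocated block count while the fd is still open, then again after
the close, and the effect of the release time trim is the difference.
What actually gets trimmed depends on the kernel in question; the
device, mount point, and sizes below are assumptions:)

# 1MB O_DIRECT write into a file carrying a 16MB extent size hint
xfs_io -d -f \
	-c "extsize 16m" \
	-c "pwrite 0 1m" \
	-c stat \
	/mnt/scratch/hinted_dio

# after the close above, see how much of the allocation survived
stat -c "size=%s blocks=%b (block size %B)" /mnt/scratch/hinted_dio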

> > IOWs, speculative prealloc beyond EOF is not just about preventing
> > fragmentation - it also helps minimise the per-write CPU overhead of
> > delalloc space accounting. (i.e. allows faster write rates into
> > cache). IOWs, for anything more than a really small files, we want
> > to be doing speculative delalloc on the first time the file is
> > written to.
> > 
> 
> Ok, this paper refers to CPU overhead as it contributes to lack of
> scalability.

Well, that was the experiments that were being performed. I'm using
it as an example of how per-write overhead is actually important to
throughput. Ignore the "global lock caused overall throughput
issues" because we don't have that problem any more, and instead
look at it as a demonstration of "anything that slows down a write()
reduces per-thread throughput".

> It certainly makes sense that a delayed allocation buffered
> write does more work than a non-alloc write, but this document isn't
> necessarily claiming that as evidence of measurably faster or slower
> single threaded buffered writes. Rather, that overhead becomes
> measurable with enough parallel buffered writes contending on a global
> allocation critical section (section 5.1 "Spinlocks in Hot Paths").

It's just an example of how increasing per-write overhead impacts
performance. The same can be said for all the memory reclaim
mods - they also reduced per-write overhead because memory
allocation for the page cache was much, much faster and required
much less CPU, so each write spent less time allocating pages to
cache the data to be written. Hence, faster writes....

> Subsequently, section 6.1.2 addresses that particular issue via the
> introduction of per-cpu accounting. It looks like there is further
> discussion around the general link between efficiency and performance,
> which makes sense, but I don't think that draws a definite conclusion
> that speculative preallocation needs to be introduced immediately on
> sequential buffered writes.

Sure, because the post-EOF preallocation that was already happening
wasn't the problem, and we couldn't easily address the per-page
block mapping overhead with generic code changes that would be
useful in any circumstance (Linux community at the time was hostile
towards bringing anything useful to XFS into the core kernel code).

And this was only a small machine (24 CPUs) so we had to address the
locking problem because increasing speculative prealloc wouldn't
help solve the global lock contention problems on the 2048p machines
that SGI was shipping at the time....

> What I think it suggests is that we need to
> consider the potential scalability impact of any prospective change in
> speculative preallocation behavior (along with the other tradeoffs
> associated with preallocation) because less aggressive preallocation
> means more buffered write overhead.

It's not a scalability problem now because of the fixes made back
then - it's now just a per-write thread latency and throughput
issue.

What is not obvious from the paper is that I designed the XFS
benchmarks so that they wrote each file into a separate directory to
place them into different AGs so that they didn't fragment due to
interleaving writes. i.e. so we could directly control the number of
sequential write streams going to the disks.

We had to do that to minimise the seek overhead of the disks so
they'd run at full bandwidth rather than being seek bound - we had
to keep the 256 disks at >97% bandwidth utilisation to get to
10GB/s, so even one extra seek per disk per second and we wouldn't
get there.

IOWs, we'd carefully taken anything to do with fragmentation out of
the picture entirely, both for direct and buffered writes. We were
just focussed on write efficiency and maximising total throughput
and that largely made post-EOF preallocation irrelevant to us.

> BTW for historical context.. was speculative preallocation a thing when
> this paper was written?

Yes. Speculative prealloc goes way back into the 90s from Irix.  It
was first made configurable in XFS via the biosize mount option
added with v3 superblocks in 1997, but the initial linux port only
allowed up to 64k.

In 2005, the linux mount option allowed biosize to be extended to
1GB, which made sense because >4GB allocation groups (mkfs enabled
them late 2003) were now starting to be widely used and so users
were reporting new large AG fragmentation issues that had never been
seen before. i.e.  it was now practical to have contiguous multi-GB
extents in files and the delalloc code was struggling to create
them, so having EOF-prealloc be able to make use of that capability
was needed....

And then auto-tuning made sense because more and more people were
having to use the mount option in more general workloads to avoid
fragmentation.

And ever since then it's been about tweaking the algorithms to stop
userspace from defeating it and re-introducing fragmentation
vectors...

> The doc suggests per-page block allocation needs
> to be reduced on XFS,

Yup. It still stands - that was the specific design premise that
led to the iomap infrastructure we now have, i.e. do block mapping
once per write() call, not once per page per write call. Early
prototypes got us a 10-20% increase in single threaded buffered write
throughput on slow CPUs, and with the current code some high
throughput buffered write cases either went up by 30-40% or CPU
usage went down by that amount.

> but also indicates that XFS had minimal
> fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
> fragmentation avoidance to delayed allocation (not necessarily
> preallocation).

Right, because none of the other filesystems had delayed allocation
and they were interleaving concurent allocations at write() time.
i.e. they serialised write() for long periods of time while they did
global fs allocation, whilst XFS had a low overhead delalloc path
with a tiny critical global section. i.e. delalloc allowed us to
defer allocation to writeback where it isn't so throughput critical
and better allocation decisions can be made.

IOWs, delalloc was the primary difference in both the fragmentation
and scalability differences reported in the paper. That XFS did
small amounts of in-memory-only prealloc beyond EOF during delalloc
was irrelevant - it's just part of the delalloc mechanism in this
context....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-13 22:27                   ` Dave Chinner
@ 2019-02-14 13:00                     ` Brian Foster
  2019-02-14 21:51                       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-14 13:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > For example, extent size hints just happen to skip delayed allocation.
> > > > I don't recall the exact history, but I don't think this was always the
> > > > case for extsz hints.
> > > 
> > > It wasn't, but extent size hints + delalloc never played nicely and
> > > could corrupt or expose stale data and caused all sorts of problems
> > > at ENOSPC because delalloc reservations are unaware of alignment
> > > requirements for extent size hints. Hence to make extent size hints
> > > work for buffered writes, I simply made them work the same way as
> > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > 
> > > > So would the close time eofb trim be as
> > > > problematic as for extsz hint files if the behavior of the latter
> > > > changed back to using delayed allocation?
> > > 
> > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > write-many, then we'd retain the post-eof blocks...
> > > 
> > > > I think a patch for that was
> > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > functionality which still had unresolved issues (IIRC).
> > > 
> > > *nod*
> > > 
> > > > From another angle, would a system that held files open for a
> > > > significant amount of time relative to a one-time write such that close
> > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > susceptible to the same level of free space fragmentation as shown
> > > > above?
> > > 
> > > If the file is held open for writing for a long time, we have to
> > > assume that they are going to write again (and again) so we should
> > > leave the EOF blocks there. If they are writing slower than the eofb
> > > gc, then there's nothing more we can really do in that case...
> > > 
> > 
> > I'm not necessarily sure that this condition is always a matter of
> > writing too slow. It very well may be true, but I'm curious if there are
> > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > where we could end up doing a large number of one time file writes and
> > not doing the release time trim until an underlying extent (with
> > post-eof blocks) has been converted in more cases than not.
> 
> I'm not sure I follow what sort of workload and situation you are
> describing here. Are you talking about the effect of an EOFB gc pass
> during ongoing writes?
> 

I'm not sure if it's an actual reproducible situation.. I'm just
wondering out loud if there are normal workloads that might still defeat
a one-time trim at release time. For example, copy enough files in
parallel such that writeback touches most of them before the copies
complete and we end up trimming physical blocks rather than delalloc
blocks.

This is not so much of a problem if those files are large, I think,
because then the preallocs and the resulting trimmed free space is on
the larger side as well. If we're copying a bunch of little files with
small preallocs, however, then we put ourselves in the pathological
situation shown in Darrick's test.

I was originally thinking about whether this could happen or not on a
highly parallel small file copy workload, but having thought about it a
bit more I think there is a more simple example. What about an untar
like workload that creates small files and calls fsync() before each fd
is released? Wouldn't that still defeat a one-time release heuristic and
produce the same layout issues as shown above? We'd prealloc,
writeback/convert then trim small/spurious fragments of post-eof space
back to the allocator.

> > Perhaps this is more of a question of writeback behavior than anything
> > in XFS, but if that can occur with files on the smaller side it's a
> > clear path to free space fragmentation. I think it's at least worth
> > considering whether we can improve things, but of course this requires
> > significantly more thought and analysis to determine whether that's an
> > actual problem and if the cure is worse than the disease. :P
> > 
> > > > (I'm not sure if/why anybody would actually do that.. userspace
> > > > fs with an fd cache perhaps? It's somewhat besides the point,
> > > > anyways...)).
> > > 
> > > *nod*
> > > 
> > > > More testing and thought is probably required. I _was_ wondering if we
> > > > should consider something like always waiting as long as possible to
> > > > eofb trim already converted post-eof blocks, but I'm not totally
> > > > convinced that actually has value. For files that are not going to see
> > > > any further appends, we may have already lost since the real post-eof
> > > > blocks will end up truncated just the same whether it happens sooner or
> > > > not until inode reclaim.
> > > 
> > > If the writes are far enough apart, then we lose any IO optimisation
> > > advantage of retaining post-eof blocks (induces seeks because
> > > location of new writes is fixed ahead of time). Then it just becomes
> > > a fragmentation avoidance If the writes are slow enough,
> > > fragmentation really doesn't matter a whole lot - it's when writes
> > > are frequent and we trash the post-eof blocks quickly that it
> > > matters.
> > > 
> > 
> > It sounds like what you're saying is that it doesn't really matter
> > either way at this point. There's no perf advantage to keeping the eof
> > blocks in this scenario, but there's also no real harm in deferring the
> > eofb trim of physical post-eof blocks because any future free space
> > fragmentation damage has already been done (assuming no more writes come
> > in).
> 
> Essentially, yes.
> 
> > The thought above was tip-toeing around the idea of (in addition to the
> > one-time trim heuristic you mentioned above) never doing a release time
> > trim of non-delalloc post-eof blocks.
> 
> Except we want to trim blocks in the cases where it's a write
> once file that has been fsync()d or written by O_DIRECT w/ really
> large extent size hints....
> 

We still would trim it. The one time write case is essentially
unaffected because the only advantage of that heuristic is to trim eof
blocks before they are converted. If the eof blocks are real, the
release time heuristic has already failed (i.e., it hasn't provided any
benefit that background trim doesn't already provide).

IOW, what we really want to avoid is trimming (small batches of) unused
physical eof blocks.

> It seems to me that once we start making exceptions to the general
> rule, we end up in a world of corner cases and, most likely,
> unreliable heuristics.
> 

I agree in principle, but my purpose for this discussion is to try and
think about whether we can come up with a better set of rules/behaviors.

> > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > delalloc space accounting. (i.e. allows faster write rates into
> > > cache). IOWs, for anything more than a really small files, we want
> > > to be doing speculative delalloc on the first time the file is
> > > written to.
> > > 
> > 
> > Ok, this paper refers to CPU overhead as it contributes to lack of
> > scalability.
> 
> Well, that was the experiments that were being performed. I'm using
> it as an example of how per-write overhead is actually important to
> throughput. Ignore the "global lock caused overall throughput
> issues" because we don't have that problem any more, and instead
> look at it as a demonstration of "anything that slows down a write()
> reduces per-thread throughput".
> 

Makes sense, and that's how I took it after reading through the paper.
My point was just that I think this is more of a tradeoff and caveat to
consider than something that outright rules out doing less aggressive
preallocation in certain cases.

I ran a few tests yesterday out of curiosity and was able to measure (a
small) difference in single-threaded buffered writes to cache with and
without preallocation. What I found a bit interesting was that my
original attempt to test this actually showed _faster_ throughput
without preallocation because the mechanism I happened to use to bypass
preallocation was an up front truncate. I eventually realized that this
had other effects (i.e., one copy doing size updates vs. the other not
doing so) and just compared a fixed, full size (1G) preallocation with a
fixed 4k preallocation to reproduce the boost provided by the former.
The point here is that while there is such a boost, there are also other
workload dependent factors that are out of our control. For example,
somebody who today cares about preallocation only for a boost on writes
to cache can apparently achieve a greater benefit by truncating the file
up front and disabling preallocation entirely.

Moving beyond the truncate thing, I also saw the benefit of
preallocation diminish as write buffer size was increased. As the write
size increases from 4k to around 64k, the pure performance benefit of
preallocation trailed off to zero. Part of that could also be effects of
less frequent size updates and whatnot due to the larger writes, but I
also don't think that's an uncommon thing in practice.
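
(The buffer size here is just the pwrite I/O size, i.e. varying xfs_io's
-b option with everything else held constant, along the lines of:

	# /mnt/test/file is a placeholder for the test file
	xfs_io -f -c "pwrite -b 64k 0 1g" /mnt/test/file

...run once per buffer size from 4k up to 64k.)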

My only point here is that I don't think it's so cut and dry that we
absolutely need dynamic speculative preallocation for write to cache
performance. There is definitely an existing benefit and
perf/scalability caveats to consider before changing anything as such.
TBH it's probably not worth getting too much into the weeds here because
I don't really have any great solution/alternative in mind at the moment
beyond what has already been proposed. It's much easier to reason about
(and test) an actual prototype of some sort as opposed to handwavy and
poorly described ideas :P, but I also don't think it's reasonable or
wise to restrict ourselves to a particular implementation for what seem
like secondary benefits.

> > It certainly makes sense that a delayed allocation buffered
> > write does more work than a non-alloc write, but this document isn't
> > necessarily claiming that as evidence of measurably faster or slower
> > single threaded buffered writes. Rather, that overhead becomes
> > measurable with enough parallel buffered writes contending on a global
> > allocation critical section (section 5.1 "Spinlocks in Hot Paths").
> 
> It's just an example of how increasing per-write overhead impacts
> performance. The same can be said for all the memory reclaim
> mods - they also reduced per-write overhead because memory
> allocation for the page cache was much, much faster and required
> much less CPU, so each write spent less time allocating pages to
> cache the data to be written. Hence, faster writes....
> 
> > Subsequently, section 6.1.2 addresses that particular issue via the
> > introduction of per-cpu accounting. It looks like there is further
> > discussion around the general link between efficiency and performance,
> > which makes sense, but I don't think that draws a definite conclusion
> > that speculative preallocation needs to be introduced immediately on
> > sequential buffered writes.
> 
> Sure, because the post-EOF preallocation that was already happening
> wasn't the problem, and we couldn't easily address the per-page
> block mapping overhead with generic code changes that would be
> useful in any circumstance (Linux community at the time was hostile
> towards bringing anything useful to XFS into the core kernel code).
> 
> And this was only a small machine (24 CPUs) so we had to address the
> locking problem because increasing speculative prealloc wouldn't
> help solve the global lock contention problems on the 2048p machines
> that SGI was shipping at the time....
> 

Makes sense.

> > What I think it suggests is that we need to
> > consider the potential scalability impact of any prospective change in
> > speculative preallocation behavior (along with the other tradeoffs
> > associated with preallocation) because less aggressive preallocation
> > means more buffered write overhead.
> 
> It's not a scalability problem now because of the fixes made back
> then - it's now just a per-write thread latency and throughput
> issue.
> 
> What is not obvious from the paper is that I designed the XFS
> benchmarks so that it wrote each file into separate directories to
> place them into different AGs so that they didn't fragment due to
> interleaving writes. i.e. so we could directly control the number of
> sequential write streams going to the disks.
> 
> We had to do that to minimise the seek overhead of the disks so
> they'd run at full bandwidth rather than being seek bound - we had
> to keep the 256 disks at >97% bandwidth utilisation to get to
> 10GB/S, so even one extra seek per disk per second and we wouldn't
> get there.
> 
> IOWs, we'd carefully taken anything to do with fragmentation out of
> the picture entirely, both for direct and buffered writes. We were
> just focussed on write efficiency and maximising total throughput
> and that largely made post-EOF preallocation irrelevant to us.
> 

Ok.

> > BTW for historical context.. was speculative preallocation a thing when
> > this paper was written?
> 
> Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> was first made configurable in XFS via the biosize mount option
> added with v3 superblocks in 1997, but the initial linux port only
> allowed up to 64k.
> 

Hmm, Ok.. so it was originally speculative preallocation without the
"dynamic sizing" logic that we have today. Thanks for the background.

> In 2005, the linux mount option allowed biosize to be extended to
> 1GB, which made sense because >4GB allocation groups (mkfs enabled
> them late 2003) were now starting to be widely used and so users
> were reporting new large AG fragmentation issues that had never been
> seen before. i.e.  it was now practical to have contiguous multi-GB
> extents in files and the delalloc code was struggling to create
> them, so having EOF-prealloc be able to make use of that capability
> was needed....
> 
> And then auto-tuning made sense because more and more people were
> having to use the mount option in more general workloads to avoid
> fragmentation.
> 

"auto-tuning" means "dynamic sizing" here, yes? FWIW, much of this
discussion also makes me wonder how appropriate the current size limit
(64k) on preallocation is for today's filesystems and systems (e.g. RAM
availability), as opposed to something larger (on the order of hundreds
of MBs for example, perhaps 256-512MB).

Brian

> And ever since then it's been about tweaking the algorithms to stop
> userspace from defeating it and re-introducing fragmentation
> vectors...
> 
> > The doc suggests per-page block allocation needs
> > to be reduced on XFS,
> 
> Yup. It still stands - that was the specific design premise that
> led to the iomap infrastructure we now have. i.e. do block mapping
> once per write() call, not once per page per write call. Early
> prototypes got us a 10-20% increase in single threaded buffered write
> throughput on slow CPUs, and with the current code some high
> throughput buffered write cases either went up by 30-40% or CPU
> usage went down by that amount.
> 
> > but also indicates that XFS had minimal
> > fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
> > fragmentation avoidance to delayed allocation (not necessarily
> > preallocation).
> 
> Right, because none of the other filesystems had delayed allocation
> and they were interleaving concurrent allocations at write() time.
> i.e. they serialised write() for long periods of time while they did
> global fs allocation, whilst XFS had a low overhead delalloc path
> with a tiny critical global section. i.e. delalloc allowed us to
> defer allocation to writeback where it isn't so throughput critical
> and better allocation decisions can be made.
> 
> IOWs, delalloc was the primary difference in both the fragmentation
> and scalability differences reported in the paper. That XFS did
> small amounts of in-memory-only prealloc beyond EOF during delalloc
> was irrelevant - it's just part of the delalloc mechanism in this
> context....
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-14 13:00                     ` Brian Foster
@ 2019-02-14 21:51                       ` Dave Chinner
  2019-02-15  2:35                         ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-14 21:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > I don't recall the exact history, but I don't think this was always the
> > > > > case for extsz hints.
> > > > 
> > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > could corrupt or expose stale data and caused all sorts of problems
> > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > requirements for extent size hints. Hence to make extent size hints
> > > > work for buffered writes, I simply made them work the same way as
> > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > 
> > > > > So would the close time eofb trim be as
> > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > changed back to using delayed allocation?
> > > > 
> > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > write-many, then we'd retain the post-eof blocks...
> > > > 
> > > > > I think a patch for that was
> > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > functionality which still had unresolved issues (IIRC).
> > > > 
> > > > *nod*
> > > > 
> > > > > From another angle, would a system that held files open for a
> > > > > significant amount of time relative to a one-time write such that close
> > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > susceptible to the same level of free space fragmentation as shown
> > > > > above?
> > > > 
> > > > If the file is held open for writing for a long time, we have to
> > > > assume that they are going to write again (and again) so we should
> > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > gc, then there's nothing more we can really do in that case...
> > > > 
> > > 
> > > I'm not necessarily sure that this condition is always a matter of
> > > writing too slow. It very well may be true, but I'm curious if there are
> > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > where we could end up doing a large number of one time file writes and
> > > not doing the release time trim until an underlying extent (with
> > > post-eof blocks) has been converted in more cases than not.
> > 
> > I'm not sure I follow what sort of workload and situation you are
> > describing here. Are you talking about the effect of an EOFB gc pass
> > during ongoing writes?
> > 
> 
> I'm not sure if it's an actual reproducible situation.. I'm just
> wondering out loud if there are normal workloads that might still defeat
> a one-time trim at release time. For example, copy enough files in
> parallel such that writeback touches most of them before the copies
> complete and we end up trimming physical blocks rather than delalloc
> blocks.
> 
> This is not so much of a problem if those files are large, I think,
> because then the preallocs and the resulting trimmed free space is on
> the larger side as well. If we're copying a bunch of little files with
> small preallocs, however, then we put ourselves in the pathological
> situation shown in Darrick's test.
> 
> I was originally thinking about whether this could happen or not on a
> highly parallel small file copy workload, but having thought about it a
> bit more I think there is a more simple example. What about an untar
> like workload that creates small files and calls fsync() before each fd
> is released?

Which is the same as an O_SYNC write if it's the same fd, which
means we'll trim allocated blocks on close. i.e. it's no different
to the current behaviour.  If such files are written in parallel then,
again, it is no different to the existing behaviour. i.e. it
largely depends on the timing of allocation in writeback and EOF
block clearing in close(). If close happens before the next
allocation in that AG, then they'll pack because there's no EOF
blocks that push out the new allocation.  If it's the other way
around, we get some level of freespace fragmentation.

This is mitigated by the fact that workloads like this tend to be
separated into separate AGs - allocation first attempts to be
non-blocking which means if it races with a free in progress it will
skip to the next AG and it won't pack sequentially, anyway. So I
don't think this is a major concern in terms of free space
fragmentation - the allocator AG selection makes it impossible to
predict what happens when racing alloc/frees occur in a single AG
and we really try to avoid that as much as possible, anyway.

If it's a separate open/fsync/close pass after all the writes (and
close) then the behaviour is the same with either the existing code
or the "trim on first close" behaviour - and the delalloc blocks
beyond EOF will get killed on the first close and not be present for
the writeback, and all is good.

> Wouldn't that still defeat a one-time release heuristic and
> produce the same layout issues as shown above? We'd prealloc,
> writeback/convert then trim small/spurious fragments of post-eof space
> back to the allocator.

Yes, but as far as I can reason (sanely), "trim on first close" is
no worse than the existing heuristic in these cases, whilst being
significantly better in others we have observed in the wild that
cause problems...

I'm not looking for perfect here, just "better with no obvious
regressions". We can't predict every situation, so if it deals with
all the problems we've had reported and a few similar cases we don't
currently handle as well, then we should run with that and not really
worry about the cases that it (or the existing code) does not
solve until we have evidence that those workloads exist and are
causing real world problems. It's a difficult enough issue to reason
about without making it more complex by playing "what about" games..

> > > It sounds like what you're saying is that it doesn't really matter
> > > either way at this point. There's no perf advantage to keeping the eof
> > > blocks in this scenario, but there's also no real harm in deferring the
> > > eofb trim of physical post-eof blocks because any future free space
> > > fragmentation damage has already been done (assuming no more writes come
> > > in).
> > 
> > Essentially, yes.
> > 
> > > The thought above was tip-toeing around the idea of (in addition to the
> > > one-time trim heuristic you mentioned above) never doing a release time
> > > trim of non-delalloc post-eof blocks.
> > 
> > Except we want to trim blocks in the cases where it's a write
> > once file that has been fsync()d or written by O_DIRECT w/ really
> > large extent size hints....
> 
> We still would trim it. The one time write case is essentially
> unaffected because the only advantage of that heuristic is to trim eof
> blocks before they are converted. If the eof blocks are real, the
> release time heuristic has already failed (i.e., it hasn't provided any
> benefit that background trim doesn't already provide).

No, all it means is that the blocks were allocated before the fd was
closed, not that the release time heuristic failed. The release time
heuristic is deciding what to do /after/ the writes have been
completed, whatever the post-eof situation is. It /can't fail/ if it
hasn't been triggered before physical allocation has been done, it
can only decide what to do about those extents once it is called...

In which case, if it's a write-once file we want to kill the blocks,
no matter whether they are allocated or not. And if it's a repeated
open/X/close workload, then a single removal of EOF blocks won't
greatly impact the file layout because subsequent closes won't
trigger.

IOWs, in the general case it does the right thing, and when it's
wrong the impact is negated by the fact it will do the right thing
on all future closes on that inode....

> IOW, what we really want to avoid is trimming (small batches of) unused
> physical eof blocks.

For the general case, I disagree. :)

> > > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > > delalloc space accounting. (i.e. allows faster write rates into
> > > > cache). IOWs, for anything more than a really small files, we want
> > > > to be doing speculative delalloc on the first time the file is
> > > > written to.
> > > > 
> > > 
> > > Ok, this paper refers to CPU overhead as it contributes to lack of
> > > scalability.
> > 
> > Well, that was the experiments that were being performed. I'm using
> > it as an example of how per-write overhead is actually important to
> > throughput. Ignore the "global lock caused overall throughput
> > issues" because we don't have that problem any more, and instead
> > look at it as a demonstration of "anything that slows down a write()
> > reduces per-thread throughput".
> > 
> 
> Makes sense, and that's how I took it after reading through the paper.
> My point was just that I think this is more of a tradeoff and caveat to
> consider than something that outright rules out doing less aggressive
> preallocation in certain cases.
> 
> I ran a few tests yesterday out of curiosity and was able to measure (a
> small) difference in single-threaded buffered writes to cache with and
> without preallocation. What I found a bit interesting was that my
> original attempt to test this actually showed _faster_ throughput
> without preallocation because the mechanism I happened to use to bypass
> preallocation was an up front truncate.

So, like:

$ xfs_io -f -t -c "pwrite 0 1g" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.6582 sec (1.519 GiB/sec and 398224.4691 ops/sec)
$

Vs:

$ xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.7159 sec (1.397 GiB/sec and 366147.4512 ops/sec)
$

I get an average of 1.53GiB/s write rate into cache with
speculative prealloc active, and only 1.36GiB/s write rate in cache
without it. I've run these 10 times each, ranges are
1.519-1.598GiB/s for prealloc and 1.291-1.361GiB/s when using
truncate to prevent prealloc.

IOWs, on my test machine, the write rate into cache using 4kB writes
is over 10% faster with prealloc enabled. And to point out the
diminishing returns, with a write size of 1MB:

$  xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0.4975 sec (2.010 GiB/sec and 2058.1922 ops/sec)

The throughput with or without prealloc average out just over
2GiB/s and the difference is noise (2.03GiB/s vs 2.05GiB/s)

Note: I'm just measuring write-in rates here, and I've isolated them
from writeback completely because the file size is small enough
that dirty throttling (20% of memory) isn't kicking in because I
have 16GB of RAM on this machine. If I go over 3GB in file size,
dirty throttling and writeback kicks in and the results are a
complete crap-shoot because it's dependent on when writeback kicks
in and how the writeback rate ramps up in the first second of
writeback.
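
(That 20% is just the default vm.dirty_ratio:

$ sysctl vm.dirty_ratio
vm.dirty_ratio = 20

which on a 16GB machine works out to a bit over 3GB of dirty page
cache before throttling kicks in.)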

> I eventually realized that this
> had other effects (i.e., one copy doing size updates vs. the other not
> doing so) and just compared a fixed, full size (1G) preallocation with a
> fixed 4k preallocation to reproduce the boost provided by the former.

That only matters for *writeback overhead*, not ingest efficiency.
Indeed, using a preallocated extent:

$ time sudo xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.6610 sec (1.513 GiB/sec and 396557.5926 ops/sec)


$ xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0.4946 sec (2.022 GiB/sec and 2070.0795 ops/sec)

The write-in rates are identical to when speculative prealloc is
enabled (always hitting a preexisting extent and not having to
extend an extent on every iomap lookup). If I time the fsync, then
things are different because there's additional writeback overhead
(i.e. unwritten extent conversion), but this does not affect the
per-write cache ingest CPU overhead.

> The point here is that while there is such a boost, there are also other
> workload dependent factors that are out of our control. For example,
> somebody who today cares about preallocation only for a boost on writes
> to cache can apparently achieve a greater benefit by truncating the file
> up front and disabling preallocation entirely.

Can you post your tests and results, because I'm getting very
different results from running what I think are the same tests.

> Moving beyond the truncate thing, I also saw the benefit of
> preallocation diminish as write buffer size was increased. As the write
> size increases from 4k to around 64k, the pure performance benefit of
> preallocation trailed off to zero.

Yes, because we've optimised the overhead out of larger writes with
the iomap infrastructure. This used to be a whole lot worse when we
did a delalloc mapping for every page instead of one per write()
call. IOWs, a significant part of the 30-35% speed increase you see in my numbers above
going from 4kB writes to 1MB writes is a result of iomap dropping
the number of delalloc mapping calls by a factor of 256....

> Part of that could also be effects of
> less frequent size updates and whatnot due to the larger writes, but I
> also don't think that's an uncommon thing in practice.

Again, size updates are only done on writeback, not ingest....

> My only point here is that I don't think it's so cut and dry that we
> absolutely need dynamic speculative preallocation for write to cache
> performance.

History, experience with both embedded and high end NAS machines
(where per-write CPU usage really matters!) and my own experiments
tell me a different story :/

> > > BTW for historical context.. was speculative preallocation a thing when
> > > this paper was written?
> > 
> > Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> > was first made configurable in XFS via the biosize mount option
> > added with v3 superblocks in 1997, but the initial linux port only
> > allowed up to 64k.
> > 
> 
> Hmm, Ok.. so it was originally speculative preallocation without the
> "dynamic sizing" logic that we have today. Thanks for the background.
> 
> > In 2005, the linux mount option allowed biosize to be extended to
> > 1GB, which made sense because >4GB allocation groups (mkfs enabled
> > them late 2003) were now starting to be widely used and so users
> > were reporting new large AG fragmentation issues that had never been
> > seen before. i.e.  it was now practical to have contiguous multi-GB
> > extents in files and the delalloc code was struggling to create
> > them, so having EOF-prealloc be able to make use of that capability
> > was needed....
> > 
> > And then auto-tuning made sense because more and more people were
> > having to use the mount option in more general workloads to avoid
> > fragmentation.
> > 
> 
> "auto-tuning" means "dynamic sizing" here, yes?

Yes.

> FWIW, much of this
> discussion also makes me wonder how appropriate the current size limit
> (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> availability), as opposed to something larger (on the order of hundreds
> of MBs for example, perhaps 256-512MB).

RAM size is irrelevant. What matters is file size and the impact
of allocation patterns on writeback IO patterns. i.e. the size limit
is about optimising writeback, not preventing fragmentation or
making more efficient use of memory, etc.

i.e. when we have lots of small files, we want writeback to pack
them so we get multiple-file sequentialisation of the write stream -
this makes things like untarring a kernel tarball (which is a large
number of small files) a sequential write workload rather than a
seek-per-individual-file-write workload. That makes a massive
difference to performance on spinning disks, and that's what the 64k
threshold (and post-EOF block removal on close for larger files)
tries to preserve.

Realistically, we should probably change that 64k threshold to match
sunit if it is set, so that we really do end up trying to pack
any write once file smaller than sunit as the allocator won't try
to align them, anyway....
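
(i.e. whatever stripe unit the filesystem was made with, e.g.:

xfs_info /mnt/scratch | grep sunit	# mount point is a placeholder

and if sunit is zero the existing 64k threshold would still apply.)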

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-14 21:51                       ` Dave Chinner
@ 2019-02-15  2:35                         ` Brian Foster
  2019-02-15  7:23                           ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2019-02-15  2:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > case for extsz hints.
> > > > > 
> > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > work for buffered writes, I simply made them work the same way as
> > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > 
> > > > > > So would the close time eofb trim be as
> > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > changed back to using delayed allocation?
> > > > > 
> > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > write-many, then we'd retain the post-eof blocks...
> > > > > 
> > > > > > I think a patch for that was
> > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > functionality which still had unresolved issues (IIRC).
> > > > > 
> > > > > *nod*
> > > > > 
> > > > > > From another angle, would a system that held files open for a
> > > > > > significant amount of time relative to a one-time write such that close
> > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > above?
> > > > > 
> > > > > If the file is held open for writing for a long time, we have to
> > > > > assume that they are going to write again (and again) so we should
> > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > gc, then there's nothing more we can really do in that case...
> > > > > 
> > > > 
> > > > I'm not necessarily sure that this condition is always a matter of
> > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > where we could end up doing a large number of one time file writes and
> > > > not doing the release time trim until an underlying extent (with
> > > > post-eof blocks) has been converted in more cases than not.
> > > 
> > > I'm not sure I follow what sort of workload and situation you are
> > > describing here. Are you talking about the effect of an EOFB gc pass
> > > during ongoing writes?
> > > 
> > 
> > I'm not sure if it's an actual reproducible situation.. I'm just
> > wondering out loud if there are normal workloads that might still defeat
> > a one-time trim at release time. For example, copy enough files in
> > parallel such that writeback touches most of them before the copies
> > complete and we end up trimming physical blocks rather than delalloc
> > blocks.
> > 
> > This is not so much of a problem if those files are large, I think,
> > because then the preallocs and the resulting trimmed free space is on
> > the larger side as well. If we're copying a bunch of little files with
> > small preallocs, however, then we put ourselves in the pathological
> > situation shown in Darrick's test.
> > 
> > I was originally thinking about whether this could happen or not on a
> > highly parallel small file copy workload, but having thought about it a
> > bit more I think there is a more simple example. What about an untar
> > like workload that creates small files and calls fsync() before each fd
> > is released?
> 
> Which is the same as an O_SYNC write if it's the same fd, which
> means we'll trim allocated blocks on close. i.e. it's no different
> to the current behaviour.  If such files are written in parallel then,
> again, it is no different to the existing behaviour. i.e. it
> largely depends on the timing of allocation in writeback and EOF
> block clearing in close(). If close happens before the next
> allocation in that AG, then they'll pack because there's no EOF
> blocks that push out the new allocation.  If it's the other way
> around, we get some level of freespace fragmentation.
> 

I know it's the same as the current behavior. ;P I think we're talking
past each other on this. What I'm saying is that the downside to the
current behavior is that a simple copy file -> fsync -> copy next file
workload fragments free space.

Darrick demonstrated this better in his random size test with the
release time trim removed, but a simple loop to write one thousand 100k
files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
behavior:
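
For reference, the loop was essentially just:

	# /mnt/test is a placeholder for the scratch mount
	for i in $(seq 0 999); do
		xfs_io -f -c "pwrite 0 100k" -c fsync /mnt/test/file.$i
	done

which leaves free space looking like this: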

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
   from      to extents  blocks    pct
      1       1      18      18   0.00
     16      31       1      25   0.00
     32      63       1      58   0.00
 131072  262143     924 242197739  24.97
 262144  524287       1  365696   0.04
134217728 242588672       3 727292183  74.99

vs. the same test without the fsync:

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
   from      to extents  blocks    pct
      1       1      16      16   0.00
     16      31       1      20   0.00
4194304 8388607       2 16752060   1.73
134217728 242588672       4 953103627  98.27

Clearly there is an advantage to trimming before delalloc conversion.

Random thought: perhaps a one time trim at writeback (where writeback
would convert delalloc across eof) _or_ release time, whichever happens
first on an inode with preallocation, might help mitigate this problem.

> This is mitigated by the fact that workloads like this tend to be
> separated into separate AGs - allocation first attempts to be
> non-blocking which means if it races with a free in progress it will
> skip to the next AG and it won't pack sequentially, anyway. So I
> don't think this is a major concern in terms of free space
> fragmentation - the allocator AG selection makes it impossible to
> predict what happens when racing alloc/frees occur in a single AG
> and we really try to avoid that as much as possible, anyway.
> 

Indeed..

> If it's a separate open/fsync/close pass after all the writes (and
> close) then the behaviour is the same with either the existing code
> or the "trim on first close" behaviour - and the delalloc blocks
> beyond EOF will get killed on the first close and not be present for
> the writeback, and all is good.
> 
> > Wouldn't that still defeat a one-time release heuristic and
> > produce the same layout issues as shown above? We'd prealloc,
> > writeback/convert then trim small/spurious fragments of post-eof space
> > back to the allocator.
> 
> Yes, but as far as I can reason (sanely), "trim on first close" is
> no worse than the existing heuristic in these cases, whilst being
> significantly better in others we have observed in the wild that
> cause problems...
> 

Agreed. Again, I'm thinking about this assuming that fix is already in
place.

> I'm not looking for perfect here, just "better with no obvious
> regressions". We can't predict every situation, so if it deals with
> all the problems we've had reported and a few similar cases we don't
> currently handle as well, then we should run with that and not really
> worry about the cases that it (or the existing code) does not
> solve until we have evidence that those workloads exist and are
> causing real world problems. It's a difficult enough issue to reason
> about without making it more complex by playing "what about" games..
> 

That's fair, but we have had users run into this situation. The whole
sparse inodes thing is partially a workaround for side effects of this
problem (free space fragmentation being so bad we can't allocate
inodes). Granted, some of those users may have also been able to avoid
that problem with better fs usage/configuration.

> > > > It sounds like what you're saying is that it doesn't really matter
> > > > either way at this point. There's no perf advantage to keeping the eof
> > > > blocks in this scenario, but there's also no real harm in deferring the
> > > > eofb trim of physical post-eof blocks because any future free space
> > > > fragmentation damage has already been done (assuming no more writes come
> > > > in).
> > > 
> > > Essentially, yes.
> > > 
> > > > The thought above was tip-toeing around the idea of (in addition to the
> > > > one-time trim heuristic you mentioned above) never doing a release time
> > > > trim of non-delalloc post-eof blocks.
> > > 
> > > Except we want to trim blocks in the cases where it's a write
> > > once file that has been fsync()d or written by O_DIRECT w/ really
> > > large extent size hints....
> > 
> > We still would trim it. The one time write case is essentially
> > unaffected because the only advantage of that heuristic is to trim eof
> > blocks before they are converted. If the eof blocks are real, the
> > release time heuristic has already failed (i.e., it hasn't provided any
> > benefit that background trim doesn't already provide).
> 
> No, all it means is that the blocks were allocated before the fd was
> closed, not that the release time heuristic failed. The release time
> heuristic is deciding what to do /after/ the writes have been
> completed, whatever the post-eof situation is. It /can't fail/ if it
> hasn't been triggered before physical allocation has been done, it
> can only decide what to do about those extents once it is called...
> 

Confused. By "failed," I mean we physically allocated blocks that were
never intended to be used. This is basically referring to the negative
effect of delalloc conversion -> eof trim behavior on once written files
demonstrated above. If this negative effect didn't exist, we wouldn't
need the release time trim at all and could just rely on background
trim.

> In which case, if it's a write-once file we want to kill the blocks,
> no matter whether they are allocated or not. And if it's a repeated
> open/X/close workload, then a single removal of EOF blocks won't
> greatly impact the file layout because subsequent closes won't
> trigger.
> 
> IOWs, in the general case it does the right thing, and when it's
> wrong the impact is negated by the fact it will do the right thing
> on all future closes on that inode....
> 
> > IOW, what we really want to avoid is trimming (small batches of) unused
> > physical eof blocks.
> 
> For the general case, I disagree. :)
> 
> > > > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > > > delalloc space accounting. (i.e. allows faster write rates into
> > > > > cache). IOWs, for anything more than a really small files, we want
> > > > > to be doing speculative delalloc on the first time the file is
> > > > > written to.
> > > > > 
> > > > 
> > > > Ok, this paper refers to CPU overhead as it contributes to lack of
> > > > scalability.
> > > 
> > > Well, that was the experiments that were being performed. I'm using
> > > it as an example of how per-write overhead is actually important to
> > > throughput. Ignore the "global lock caused overall throughput
> > > issues" because we don't have that problem any more, and instead
> > > look at it as a demonstration of "anything that slows down a write()
> > > reduces per-thread throughput".
> > > 
> > 
> > Makes sense, and that's how I took it after reading through the paper.
> > My point was just that I think this is more of a tradeoff and caveat to
> > consider than something that outright rules out doing less aggressive
> > preallocation in certain cases.
> > 
> > I ran a few tests yesterday out of curiosity and was able to measure (a
> > small) difference in single-threaded buffered writes to cache with and
> > without preallocation. What I found a bit interesting was that my
> > original attempt to test this actually showed _faster_ throughput
> > without preallocation because the mechanism I happened to use to bypass
> > preallocation was an up front truncate.
> 
> So, like:
> 
> $ xfs_io -f -t -c "pwrite 0 1g" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6582 sec (1.519 GiB/sec and 398224.4691 ops/sec)
> $
> 
> Vs:
> 
> $ xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.7159 sec (1.397 GiB/sec and 366147.4512 ops/sec)
> $
> 
> I get an average of 1.53GiB/s write rate into cache with
> speculative prealloc active, and only 1.36GiB/s write rate in cache
> without it. I've run these 10 times each, ranges are
> 1.519-1.598GiB/s for prealloc and 1.291-1.361GiB/s when using
> truncate to prevent prealloc.
> 

Yes, that's basically the same test. I repeated it again and saw pretty
much the same numbers here. Hmm, not sure what happened there. Perhaps I
fat fingered something or mixed in a run with a different buffer size
for the truncate case. I recall seeing right around ~1.5GB/s for the
base case and closer to ~1.6GB/s for the truncate case somehow or
another, but I don't have a record of it to figure out what happened.
Anyways, sorry for the noise.. I did ultimately reproduce the prealloc
boost and still see the diminishing returns at about 64k or so (around
~2.7GB/s with or without preallocation).

> IOWs, on my test machine, the write rate into cache using 4kB writes
> is over 10% faster with prealloc enabled. And to point out the
> diminishing returns, with a write size of 1MB:
> 
> $  xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4975 sec (2.010 GiB/sec and 2058.1922 ops/sec)
> 
> The throughput with or without prealloc average out just over
> 2GiB/s and the difference is noise (2.03GiB/s vs 2.05GiB/s)
> 
> Note: I'm just measuring write-in rates here, and I've isolated them
> from writeback completely because the file size is small enough
> that dirty throttling (20% of memory) isn't kicking in because I
> have 16GB of RAM on this machine. If I go over 3GB in file size,
> dirty throttling and writeback kicks in and the results are a
> complete crap-shoot because it's dependent on when writeback kicks
> in and how the writeback rate ramps up in the first second of
> writeback.
> 

*nod*

> > I eventually realized that this
> > had other effects (i.e., one copy doing size updates vs. the other not
> > doing so) and just compared a fixed, full size (1G) preallocation with a
> > fixed 4k preallocation to reproduce the boost provided by the former.
> 
> That only matters for *writeback overhead*, not ingest efficiency.
> Indeed, using a preallocated extent:
> 

Not sure how we got into physical preallocation here. Assume any
reference to "preallocation" in this thread by me refers to post-eof
speculative preallocation. The above preallocation tweaks were
controlled in my tests via the allocsize mount option, not physical
block preallocation.
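
(i.e. roughly something like the following for the fixed 1G vs. fixed
4k comparison, with the device and mount point as placeholders:

	# fixed 1G EOF prealloc vs. effectively none beyond the write
	mount -o allocsize=1g /dev/sdX /mnt/scratch
	mount -o allocsize=4k /dev/sdX /mnt/scratch

...and the same pwrite-to-cache test run against each mount.)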

> $ time sudo xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6610 sec (1.513 GiB/sec and 396557.5926 ops/sec)
> 
> 
> $ xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4946 sec (2.022 GiB/sec and 2070.0795 ops/sec)
> 
> The write-in rates are identical to when speculative prealloc is
> enabled (always hitting a preexisting extent and not having to
> extend an extent on every iomap lookup). If I time the fsync, then
> things are different because there's additional writeback overhead
> (i.e. unwritten extent conversion), but this does not affect the
> per-write cache ingest CPU overhead.
> 
> > The point here is that while there is such a boost, there are also other
> > workload dependent factors that are out of our control. For example,
> > somebody who today cares about preallocation only for a boost on writes
> > to cache can apparently achieve a greater benefit by truncating the file
> > up front and disabling preallocation entirely.
> 
> Can you post your tests and results, because I'm getting very
> different results from running what I think are the same tests.
> 
> > Moving beyond the truncate thing, I also saw the benefit of
> > preallocation diminish as write buffer size was increased. As the write
> > size increases from 4k to around 64k, the pure performance benefit of
> > preallocation trailed off to zero.
> 
> Yes, because we've optimised the overhead out of larger writes with
> the iomap infrastructure. This used to be a whole lot worse when we
> did a delalloc mapping for every page instead of one per write()
> call. IOWs, a significant part of the 30-35% speed increase you see in my numbers above
> going from 4kB writes to 1MB writes is a result of iomap dropping
> the number of delalloc mapping calls by a factor of 256....
> 
> > Part of that could also be effects of
> > less frequent size updates and whatnot due to the larger writes, but I
> > also don't think that's an uncommon thing in practice.
> 
> Again, size updates are only done on writeback, not ingest....
> 
> > My only point here is that I don't think it's so cut and dry that we
> > absolutely need dynamic speculative preallocation for write to cache
> > performance.
> 
> History, experience with both embedded and high end NAS machines
> (where per-write CPU usage really matters!) and my own experiments
> tell me a different story :/
> 
> > > > BTW for historical context.. was speculative preallocation a thing when
> > > > this paper was written?
> > > 
> > > Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> > > was first made configurable in XFS via the biosize mount option
> > > added with v3 superblocks in 1997, but the initial linux port only
> > > allowed up to 64k.
> > > 
> > 
> > Hmm, Ok.. so it was originally speculative preallocation without the
> > "dynamic sizing" logic that we have today. Thanks for the background.
> > 
> > > In 2005, the linux mount option allowed biosize to be extended to
> > > 1GB, which made sense because >4GB allocation groups (mkfs enabled
> > > them late 2003) were now starting to be widely used and so users
> > > were reporting new large AG fragmentation issues that had never been
> > > seen before. i.e.  it was now practical to have contiguous multi-GB
> > > extents in files and the delalloc code was struggling to create
> > > them, so having EOF-prealloc be able to make use of that capability
> > > was needed....
> > > 
> > > And then auto-tuning made sense because more and more people were
> > > having to use the mount option in more general workloads to avoid
> > > fragmentation.
> > > 
> > 
> > "auto-tuning" means "dynamic sizing" here, yes?
> 
> Yes.
> 
> > FWIW, much of this
> > discussion also makes me wonder how appropriate the current size limit
> > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > availability), as opposed to something larger (on the order of hundreds
> > of MBs for example, perhaps 256-512MB).
> 
> RAM size is irrelevant. What matters is file size and the impact
> of allocation patterns on writeback IO patterns. i.e. the size limit
> is about optimising writeback, not preventing fragmentation or
> making more efficient use of memory, etc.
> 

I'm just suggesting that the more RAM that is available, the more we're
able to write into cache before writeback starts and thus the larger
physical extents we're able to allocate independent of speculative
preallocation (perf issues notwithstanding).

> i.e. when we have lots of small files, we want writeback to pack
> them so we get multiple-file sequentialisation of the write stream -
> this makes things like untarring a kernel tarball (which is a large
> number of small files) a sequential write workload rather than a
> seek-per-individual-file-write workload. That makes a massive
> difference to performance on spinning disks, and that's what the 64k
> threshold (and post-EOF block removal on close for larger files)
> tries to preserve.
> 

Sure, but what's the downside to increasing that threshold to even
something on the order of MBs? Wouldn't that at least help us leave
larger free extents around in those workloads/patterns that do fragment
free space?

> Realistically, we should probably change that 64k threshold to match
> sunit if it is set, so that we really do end up trying to pack
> any write once file smaller than sunit as the allocator won't try
> to align them, anyway....
> 

That makes sense.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-15  2:35                         ` Brian Foster
@ 2019-02-15  7:23                           ` Dave Chinner
  2019-02-15 20:33                             ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-15  7:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 14, 2019 at 09:35:21PM -0500, Brian Foster wrote:
> On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> > On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > > case for extsz hints.
> > > > > > 
> > > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > > work for buffered writes, I simply made them work the same way as
> > > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > > 
> > > > > > > So would the close time eofb trim be as
> > > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > > changed back to using delayed allocation?
> > > > > > 
> > > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > > write-many, then we'd retain the post-eof blocks...
> > > > > > 
> > > > > > > I think a patch for that was
> > > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > > functionality which still had unresolved issues (IIRC).
> > > > > > 
> > > > > > *nod*
> > > > > > 
> > > > > > > From another angle, would a system that held files open for a
> > > > > > > significant amount of time relative to a one-time write such that close
> > > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > > above?
> > > > > > 
> > > > > > If the file is held open for writing for a long time, we have to
> > > > > > assume that they are going to write again (and again) so we should
> > > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > > gc, then there's nothing more we can really do in that case...
> > > > > > 
> > > > > 
> > > > > I'm not necessarily sure that this condition is always a matter of
> > > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > > where we could end up doing a large number of one time file writes and
> > > > > not doing the release time trim until an underlying extent (with
> > > > > post-eof blocks) has been converted in more cases than not.
> > > > 
> > > > I'm not sure I follow what sort of workload and situation you are
> > > > describing here. Are you talking about the effect of an EOFB gc pass
> > > > during ongoing writes?
> > > > 
> > > 
> > > I'm not sure if it's an actual reproducible situation.. I'm just
> > > wondering out loud if there are normal workloads that might still defeat
> > > a one-time trim at release time. For example, copy enough files in
> > > parallel such that writeback touches most of them before the copies
> > > complete and we end up trimming physical blocks rather than delalloc
> > > blocks.
> > > 
> > > This is not so much of a problem if those files are large, I think,
> > > because then the preallocs and the resulting trimmed free space is on
> > > the larger side as well. If we're copying a bunch of little files with
> > > small preallocs, however, then we put ourselves in the pathological
> > > situation shown in Darrick's test.
> > > 
> > > I was originally thinking about whether this could happen or not on a
> > > highly parallel small file copy workload, but having thought about it a
> > > bit more I think there is a more simple example. What about an untar
> > > like workload that creates small files and calls fsync() before each fd
> > > is released?
> > 
> > Which is the same as an O_SYNC write if it's the same fd, which
> > means we'll trim allocated blocks on close. i.e. it's no different
> > to the current behaviour.  If such files are written in parallel then,
> > again, it is no different to the existing behaviour. i.e. it
> > largely depends on the timing of allocation in writeback and EOF
> > block clearing in close(). If close happens before the next
> > allocation in that AG, then they'll pack because there's no EOF
> > blocks that push out the new allocation.  If it's the other way
> > around, we get some level of freespace fragmentation.
> > 
> 
> I know it's the same as the current behavior. ;P I think we're talking
> past each other on this.

Probably :P

> What I'm saying is that the downside to the
> current behavior is that a simple copy file -> fsync -> copy next file
> workload fragments free space.

Yes. But it's also one of the cases that "always release on first
close" fixes.

> Darrick demonstrated this better in his random size test with the
> release time trim removed, but a simple loop to write one thousand 100k
> files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
> behavior:
> 
> # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
>    from      to extents  blocks    pct
>       1       1      18      18   0.00
>      16      31       1      25   0.00
>      32      63       1      58   0.00
>  131072  262143     924 242197739  24.97
>  262144  524287       1  365696   0.04
> 134217728 242588672       3 727292183  74.99

That should not be leaving freespace fragments behind - it should be
trimming the EOF blocks on close after fsync() and the next
allocation should pack tightly.

/me goes off to trace it because it's not doing what he knows it
should be doing.

Ngggh. Busy extent handling.

Basically, the extent we trimmed is busy because it is a user data
extent that has been freed and not yet committed (even though it was
not used), so it gets trimmed out of the free space range that is
allocated.

IOWs, how we handle busy extents results in this behaviour, not the
speculative prealloc which has already been removed and returned to
the free space pool....
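
For reference, the per-file pattern that trips over this boils down to
the sketch below: buffered write, fsync, close, then move on to the
next file. It simply mirrors the xfs_io loop quoted above; the file
count, size and names are arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[100 * 1024];	/* 100k per file, as in the test */
	char path[64];
	int i, fd;

	memset(buf, 0xab, sizeof(buf));

	for (i = 0; i < 1000; i++) {
		snprintf(path, sizeof(path), "file.%d", i);
		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0)
			return 1;

		/* buffered write, then fsync forces writeback and allocation */
		if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
			return 1;
		fsync(fd);

		/* close() is where the post-EOF trim decision is made */
		close(fd);
	}
	return 0;
}

Running xfs_db "freesp -s" afterwards, as above, shows whether the
files packed tightly or left small free space fragments behind.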

> vs. the same test without the fsync:
> 
> # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
>    from      to extents  blocks    pct
>       1       1      16      16   0.00
>      16      31       1      20   0.00
> 4194304 8388607       2 16752060   1.73
> 134217728 242588672       4 953103627  98.27
> 
> Clearly there is an advantage to trimming before delalloc conversion.

The advantage is in avoiding busy extents by trimming before
physical allocation occurs. I think we need to fix the busy extent
handling here, not the speculative prealloc...

> Random thought: perhaps a one time trim at writeback (where writeback
> would convert delalloc across eof) _or_ release time, whichever happens
> first on an inode with preallocation, might help mitigate this problem.

Not sure how you'd do that reliably - there's no serialisation with
incoming writes, and no context as to what is changing EOF. Hence I
don't know how we'd decide that trimming was required or not...

> > I'm not looking for perfect here, just "better with no obvious
> > regressions". We can't predict every situation, so if it deals with
> > all the problems we've had reported and a few similar cases we don't
> > currently handle as well, then we should run with that and not really
> > worry about the cases that it (or the existing code) does not
> > solve until we have evidence that those workloads exist and are
> > causing real world problems. It's a difficult enough issue to reason
> > about without making it more complex by playing "what about" games..
> > 
> 
> That's fair, but we have had users run into this situation. The whole
> sparse inodes thing is partially a workaround for side effects of this
> problem (free space fragmentation being so bad we can't allocate
> inodes). Granted, some of those users may have also been able to avoid
> that problem with better fs usage/configuration.

I think systems that required sparse inodes are orthogonal - those
issues could be caused just with well packed small files and large
inodes, and had nothing in common with the workload we've recently
seen. Indeed, in the cases other than the specific small file
workload gluster used to fragment free space to prevent inode
allocation, we never got to the bottom of what caused the free space
fragmentation. All we could do is make the fs more tolerant of
freespace fragmentation.

This time, we have direct evidence of what caused freespace
fragmentation on this specific system. It was caused by the app
doing something bizarre and we can extrapolate several similar
behaviours from that workload.

But beyond that, we're well into "whatabout" and "whatif" territory.

> > > We still would trim it. The one time write case is essentially
> > > unaffected because the only advantage of that heuristic is to trim eof
> > > blocks before they are converted. If the eof blocks are real, the
> > > release time heuristic has already failed (i.e., it hasn't provided any
> > > benefit that background trim doesn't already provide).
> > 
> > No, all it means is that the blocks were allocated before the fd was
> > closed, not that the release time heuristic failed. The release time
> > heuristic is deciding what to do /after/ the writes have been
> > completed, whatever the post-eof situation is. It /can't fail/ if it
> > hasn't been triggered before physical allocation has been done, it
> > can only decide what to do about those extents once it is called...
> > 
> 
> Confused. By "failed," I mean we physically allocated blocks that were
> never intended to be used.

How can we know they are never intended to be used at writeback
time?

> This is basically referring to the negative
> effect of delalloc conversion -> eof trim behavior on once written files
> demonstrated above. If this negative effect didn't exist, we wouldn't
> need the release time trim at all and could just rely on background
> trim.

We also cannot predict what the application intends. Hence we have
heuristics that trigger once the application signals that it is
"done with this file". i.e. it has closed the fd.

> > > I eventually realized that this
> > > had other effects (i.e., one copy doing size updates vs. the other not
> > > doing so) and just compared a fixed, full size (1G) preallocation with a
> > > fixed 4k preallocation to reproduce the boost provided by the former.
> > 
> > That only matters for *writeback overhead*, not ingest efficiency.
> > Indeed, using a preallocated extent:
> > 
> 
> Not sure how we got into physical preallocation here. Assume any
> reference to "preallocation" in this thread by me refers to post-eof
> speculative preallocation. The above preallocation tweaks were
> controlled in my tests via the allocsize mount option, not physical
> block preallocation.

I'm just using it to demonstrate the difference is in continually
extending the delalloc extent in memory. I could have just done an
overwrite - it's the same thing.

> > > FWIW, much of this
> > > discussion also makes me wonder how appropriate the current size limit
> > > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > > availability), as opposed to something larger (on the order of hundreds
> > > of MBs for example, perhaps 256-512MB).
> > 
> > RAM size is irrelevant. What matters is file size and the impact
> > of allocation patterns on writeback IO patterns. i.e. the size limit
> > is about optimising writeback, not preventing fragmentation or
> > making more efficient use of memory, etc.
> 
> I'm just suggesting that the more RAM that is available, the more we're
> able to write into cache before writeback starts and thus the larger
> physical extents we're able to allocate independent of speculative
> preallocation (perf issues notwithstanding).

ISTR I looked at that years ago and couldn't get it to work
reliably. It works well for initial ingest, but once the writes go
on for long enough dirty throttling starts chopping ingest up in
smaller and smaller chunks as it rotors writeback bandwidth around
all the processes dirtying the page cache. This chunking happens
regardless of the size of the file being written. And so the more
processes that are dirtying the page cache, the smaller the file
fragments get because each file gets a smaller amount of the overall
writeback bandwidth each time writeback occurs. i.e.
fragmentation increases as memory pressure, load and concurrency
increases, which are exactly the conditions we want to be avoiding
fragmentation as much as possible...
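
One way to observe that effect directly is to count how many extents
each file ends up with once writeback has chopped it up; FIEMAP in
count-only mode is enough for that. A minimal sketch (error handling
mostly omitted):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Print the number of mapped extents backing each file argument. */
int main(int argc, char **argv)
{
	struct fiemap *fm;
	int i, fd;

	fm = calloc(1, sizeof(*fm));
	if (!fm)
		return 1;

	for (i = 1; i < argc; i++) {
		fd = open(argv[i], O_RDONLY);
		if (fd < 0)
			continue;

		memset(fm, 0, sizeof(*fm));
		fm->fm_length = FIEMAP_MAX_OFFSET;
		fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
		fm->fm_extent_count = 0;		/* count only, no extent array */

		if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
			printf("%s: %u extents\n", argv[i], fm->fm_mapped_extents);
		close(fd);
	}
	free(fm);
	return 0;
}

Pointing that at the files written by a concurrent buffered write
workload makes it easy to check whether per-file extent counts really
do grow with load and concurrency as described above.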

The only way I found to prevent this in a fair and predictable
manner is the auto-grow algorithm we have now.  There are very few
real world corner cases where it breaks down, so we do not need
fundamental changes here. We've found one corner case where it is
defeated, so let's address that corner case with the minimal change
that is necessary but otherwise leave the underlying algorithm
alone so we can observe the longer term effects of the tweak we
need to make....

> > i.e. when we have lots of small files, we want writeback to pack
> > them so we get multiple-file sequentialisation of the write stream -
> > this makes things like untarring a kernel tarball (which is a large
> > number of small files) a sequential write workload rather than a
> > seek-per-individual-file-write workload. That makes a massive
> > difference to performance on spinning disks, and that's what the 64k
> > threshold (and post-EOF block removal on close for larger files)
> > tries to preserve.
> 
> Sure, but what's the downside to increasing that threshold to even
> something on the order of MBs? Wouldn't that at least help us leave
> larger free extents around in those workloads/patterns that do fragment
> free space?

Because even for files in the "few MB" size range, worst case
fragmentation is thousands of extents and IO performance that
absolutely sucks. Especially on RAID5/6 devices. We have to ensure
file fragmentation at its worst does not affect filesystem
throughput and that means in the general case less than ~1MB sized
extents is just not acceptable even for smallish files....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/3]: Extreme fragmentation ahoy!
  2019-02-15  7:23                           ` Dave Chinner
@ 2019-02-15 20:33                             ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2019-02-15 20:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 15, 2019 at 06:23:32PM +1100, Dave Chinner wrote:
> On Thu, Feb 14, 2019 at 09:35:21PM -0500, Brian Foster wrote:
> > On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > > > case for extsz hints.
> > > > > > > 
> > > > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > > > work for buffered writes, I simply made them work the same way as
> > > > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > > > 
> > > > > > > > So would the close time eofb trim be as
> > > > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > > > changed back to using delayed allocation?
> > > > > > > 
> > > > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > > > write-many, then we'd retain the post-eof blocks...
> > > > > > > 
> > > > > > > > I think a patch for that was
> > > > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > > > functionality which still had unresolved issues (IIRC).
> > > > > > > 
> > > > > > > *nod*
> > > > > > > 
> > > > > > > > From another angle, would a system that held files open for a
> > > > > > > > significant amount of time relative to a one-time write such that close
> > > > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > > > above?
> > > > > > > 
> > > > > > > If the file is held open for writing for a long time, we have to
> > > > > > > assume that they are going to write again (and again) so we should
> > > > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > > > gc, then there's nothing more we can really do in that case...
> > > > > > > 
> > > > > > 
> > > > > > I'm not necessarily sure that this condition is always a matter of
> > > > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > > > where we could end up doing a large number of one time file writes and
> > > > > > not doing the release time trim until an underlying extent (with
> > > > > > post-eof blocks) has been converted in more cases than not.
> > > > > 
> > > > > I'm not sure I follow what sort of workload and situation you are
> > > > > describing here. Are you talking about the effect of an EOFB gc pass
> > > > > during ongoing writes?
> > > > > 
> > > > 
> > > > I'm not sure if it's an actual reproducible situation.. I'm just
> > > > wondering out loud if there are normal workloads that might still defeat
> > > > a one-time trim at release time. For example, copy enough files in
> > > > parallel such that writeback touches most of them before the copies
> > > > complete and we end up trimming physical blocks rather than delalloc
> > > > blocks.
> > > > 
> > > > This is not so much of a problem if those files are large, I think,
> > > > because then the preallocs and the resulting trimmed free space is on
> > > > the larger side as well. If we're copying a bunch of little files with
> > > > small preallocs, however, then we put ourselves in the pathological
> > > > situation shown in Darrick's test.
> > > > 
> > > > I was originally thinking about whether this could happen or not on a
> > > > highly parallel small file copy workload, but having thought about it a
> > > > bit more I think there is a more simple example. What about an untar
> > > > like workload that creates small files and calls fsync() before each fd
> > > > is released?
> > > 
> > > Which is the same as an O_SYNC write if it's the same fd, which
> > > means we'll trim allocated blocks on close. i.e. it's no different
> > > to the current behaviour.  If such files are written in parallel then,
> > > again, it is no different to the existing behaviour. i.e. it
> > > largely depends on the timing of allocation in writeback and EOF
> > > block clearing in close(). If close happens before the next
> > > allocation in that AG, then they'll pack because there's no EOF
> > > blocks that push out the new allocation.  If it's the other way
> > > around, we get some level of freespace fragmentation.
> > > 
> > 
> > I know it's the same as the current behavior. ;P I think we're talking
> > past each other on this.
> 
> Probably :P
> 
> > What I'm saying is that the downside to the
> > current behavior is that a simple copy file -> fsync -> copy next file
> > workload fragments free space.
> 
> Yes. But it's also one of the cases that "always release on first
> close" fixes.
> 
> > Darrick demonstrated this better in his random size test with the
> > release time trim removed, but a simple loop to write one thousand 100k
> > files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
> > behavior:
> > 
> > # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
> >    from      to extents  blocks    pct
> >       1       1      18      18   0.00
> >      16      31       1      25   0.00
> >      32      63       1      58   0.00
> >  131072  262143     924 242197739  24.97
> >  262144  524287       1  365696   0.04
> > 134217728 242588672       3 727292183  74.99
> 
> That should not be leaving freespace fragments behind - it should be
> trimming the EOF blocks on close after fsync() and the next
> allocation should pack tightly.
> 
> /me goes off to trace it because it's not doing what he knows it
> should be doing.
> 
> Ngggh. Busy extent handling.
> 
> Basically, the extent we trimmed is busy because it is a user data
> extent that has been freed and not yet committed (even though it was
> not used), so it gets trimmed out of the free space range that is
> allocated.
> 
> IOWs, how we handle busy extents results in this behaviour, not the
> speculative prealloc which has already been removed and returned to
> the free space pool....
> 
> > vs. the same test without the fsync:
> > 
> > # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
> >    from      to extents  blocks    pct
> >       1       1      16      16   0.00
> >      16      31       1      20   0.00
> > 4194304 8388607       2 16752060   1.73
> > 134217728 242588672       4 953103627  98.27
> > 
> > Clearly there is an advantage to trimming before delalloc conversion.
> 
> The advantage is in avoiding busy extents by trimming before
> physical allocation occurs. I think we need to fix the busy extent
> handling here, not the speculative prealloc...
> 

Ok, that might be a reasonable approach for this particular pattern. It
might not help for other (probably less common) patterns. In any event,
I hope this at least shows why I'm fishing for further improvements..

> > Random thought: perhaps a one time trim at writeback (where writeback
> > would convert delalloc across eof) _or_ release time, whichever happens
> > first on an inode with preallocation, might help mitigate this problem.
> 
> Not sure how you'd do that reliably - there's no serialisation with
> incoming writes, and no context as to what is changing EOF. Hence I
> don't know how we'd decide that trimming was required or not...
> 

That's not that difficult to work around. If writeback can't reliably
add/remove blocks, it can make decisions about how many blocks to
convert in particular situations. A writeback time trim could be
implemented as a trimmed allocation/conversion that would also allow (or
refuse) a subsequent release to still perform the actual post-eof trim.
IOW, consider the approach as more of having a smarter post-eof
writeback delalloc conversion policy rather than exclusively attempting
to address post-eof blocks at release time.

A simple example could be something like not converting post-eof blocks
if the current writeback cycle is attempting to convert the only
delalloc extent in the file and/or if the extent extends from offset
zero to i_size (or whatever heuristic best describes a write once file),
otherwise behave exactly as we do today. That filters out smallish,
write-once files without requiring any changes to the buffered write
time speculative prealloc algorithm. It also could eliminate the need
for a release time trim because unused post-eof blocks are unlikely to
ever turn into physical blocks for small write once files, but I'd have
to think about that angle a bit more.
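
To make that a bit more concrete, the policy could be expressed as a
predicate consulted when writeback goes to convert a delalloc extent.
The sketch below is purely hypothetical (the type and fields don't
correspond to real XFS structures); it just restates the heuristic
described above:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the extent writeback is converting. */
struct wb_extent {
	uint64_t start;		/* byte offset where the delalloc extent starts */
	uint64_t end;		/* byte offset just past its last block */
	bool	 only_delalloc;	/* the only delalloc extent in the file */
};

/*
 * Sketch of the heuristic above: if this looks like a small write-once
 * file (its only delalloc extent covers offset 0 through i_size), skip
 * converting the post-EOF delalloc blocks; otherwise convert them
 * exactly as writeback does today.
 */
static bool convert_posteof_delalloc(const struct wb_extent *ext,
				     uint64_t i_size)
{
	if (ext->only_delalloc && ext->start == 0 && ext->end >= i_size)
		return false;

	return true;
}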

A tradeoff would be potential for higher fragmentation in cases where
writeback occurs on a single delalloc extent file before the file
finishes being copied, but we're only talking an extra extent or so if
it's a one-time "writeback allocation trim" and it's likely the
file is large for that to occur in the first place. IOW, the extra
extent should be sized at or around the amount of data we were able
to ingest into pagecache before writeback touched the inode.

This is of course just an idea and still subject to prototyping and
testing/experimentation and whatnot...

> > > I'm not looking for perfect here, just "better with no obvious
> > > regressions". We can't predict every situation, so if it deals with
> > > all the problems we've had reported and a few similar cases we don't
> > > currently handle as well, then we should run with that and not really
> > > worry about the cases that it (or the existing code) does not
> > > solve until we have evidence that those workloads exist and are
> > > causing real world problems. It's a difficult enough issue to reason
> > > about without making it more complex by playing "what about" games..
> > > 
> > 
> > That's fair, but we have had users run into this situation. The whole
> > sparse inodes thing is partially a workaround for side effects of this
> > problem (free space fragmentation being so bad we can't allocate
> > inodes). Granted, some of those users may have also been able to avoid
> > that problem with better fs usage/configuration.
> 
> I think systems that required sparse inodes are orthogonal - those
> issues could be caused just with well packed small files and large
> inodes, and had nothing in common with the workload we've recently
> seen. Indeed, in the cases other than the specific small file
> workload gluster used to fragment free space to prevent inode
> allocation, we never got to the bottom of what caused the free space
> fragmentation. All we could do is make the fs more tolerant of
> freespace fragmentation.
> 
> This time, we have direct evidence of what caused freespace
> fragmentation on this specific system. It was caused by the app
> doing something bizarre and we can extrapolate several similar
> behaviours from that workload.
> 
> But beyond that, we're well into "whatabout" and "whatif" territory.
> 
> > > > We still would trim it. The one time write case is essentially
> > > > unaffected because the only advantage of that heuristic is to trim eof
> > > > blocks before they are converted. If the eof blocks are real, the
> > > > release time heuristic has already failed (i.e., it hasn't provided any
> > > > benefit that background trim doesn't already provide).
> > > 
> > > No, all it means is that the blocks were allocated before the fd was
> > > closed, not that the release time heuristic failed. The release time
> > > heuristic is deciding what to do /after/ the writes have been
> > > completed, whatever the post-eof situation is. It /can't fail/ if it
> > > hasn't been triggered before physical allocation has been done, it
> > > can only decide what to do about those extents once it is called...
> > > 
> > 
> > Confused. By "failed," I mean we physically allocated blocks that were
> > never intended to be used.
> 
> How can we know they are never intended to be used at writeback
> time?
> 

We never really know if the blocks will be used or not. That's not the
point. I'm using the term "failed" simply to refer to the case where
post-eof blocks are unused and end up converted to physical blocks
before they are trimmed. I chose the term because that's the scenario
where the current heuristic contributes to free space fragmentation.
This has been demonstrated by Darrick's test and my more simplistic test
above. Feel free to choose a different term for that scenario, but it's
clearly a case that could stand to improve in XFS (whether it be via
something like the writeback time trim, busy extent fixes, etc.).

> > This is basically referring to the negative
> > effect of delalloc conversion -> eof trim behavior on once written files
> > demonstrated above. If this negative effect didn't exist, we wouldn't
> > need the release time trim at all and could just rely on background
> > trim.
> 
> We also cannot predict what the application intends. Hence we have
> heuristics that trigger once the application signals that it is
> "done with this file". i.e. it has closed the fd.
> 

Of course, that doesn't mean we can't try to make it smarter or dampen
negative side effects of the "failure" case.

> > > > I eventually realized that this
> > > > had other effects (i.e., one copy doing size updates vs. the other not
> > > > doing so) and just compared a fixed, full size (1G) preallocation with a
> > > > fixed 4k preallocation to reproduce the boost provided by the former.
> > > 
> > > That only matters for *writeback overhead*, not ingest efficiency.
> > > Indeed, using a preallocated extent:
> > > 
> > 
> > Not sure how we got into physical preallocation here. Assume any
> > reference to "preallocation" in this thread by me refers to post-eof
> > speculative preallocation. The above preallocation tweaks were
> > controlled in my tests via the allocsize mount option, not physical
> > block preallocation.
> 
> I'm just using it to demonstrate the difference is in continually
> extending the delalloc extent in memory. I could have just done an
> overwrite - it's the same thing.
> 
> > > > FWIW, much of this
> > > > discussion also makes me wonder how appropriate the current size limit
> > > > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > > > availability), as opposed to something larger (on the order of hundreds
> > > > of MBs for example, perhaps 256-512MB).
> > > 
> > > RAM size is irrelevant. What matters is file size and the impact
> > > of allocation patterns on writeback IO patterns. i.e. the size limit
> > > is about optimising writeback, not preventing fragmentation or
> > > making more efficient use of memory, etc.
> > 
> > I'm just suggesting that the more RAM that is available, the more we're
> > able to write into cache before writeback starts and thus the larger
> > physical extents we're able to allocate independent of speculative
> > preallocation (perf issues notwithstanding).
> 
> ISTR I looked at that years ago and couldn't get it to work
> reliably. It works well for initial ingest, but once the writes go
> on for long enough dirty throttling starts chopping ingest up in
> smaller and smaller chunks as it rotors writeback bandwidth around
> all the processes dirtying the page cache. This chunking happens
> regardless of the size of the file being written. And so the more
> processes that are dirtying the page cache, the smaller the file
> fragments get because each file gets a smaller amount of the overall
> writeback bandwidth each time writeback occurs. i.e.
> fragmentation increases as memory pressure, load and concurrency
> increases, which are exactly the conditions we want to be avoiding
> fragmentation as much as possible...
> 

Ok, well that suggests less aggressive preallocation is at least worth
considering/investigating.

> The only way I found to prevent this in a fair and predictable
> manner is the auto-grow algorithm we have now.  There are very few
> real world corner cases where it breaks down, so we do not need
> fundamental changes here. We've found one corner case where it is
> defeated, so let's address that corner case with the minimal change
> that is necessary but otherwise leave the underlying algorithm
> alone so we can observe the longer term effects of the tweak we
> need to make....
> 

I agree that an incremental fix is fine for now. I don't agree that the
current implementation is only defeated by corner cases. The write+fsync
example above is a pretty straightforward usage pattern. Userspace
filesystems are also increasingly common and may implement things like
open file caches that distort the filesystem's perception of how it's
being used, and thus could defeat a release time trim heuristic in
exactly the same manner. I recall that gluster had something like this,
but I'd have to dig into it to see exactly if it applies here..

> > > i.e. when we have lots of small files, we want writeback to pack
> > > them so we get multiple-file sequentialisation of the write stream -
> > > this makes things like untarring a kernel tarball (which is a large
> > > number of small files) a sequential write workload rather than a
> > > seek-per-individual-file-write workload. That makes a massive
> > > difference to performance on spinning disks, and that's what the 64k
> > > threshold (and post-EOF block removal on close for larger files)
> > > tries to preserve.
> > 
> > Sure, but what's the downside to increasing that threshold to even
> > something on the order of MBs? Wouldn't that at least help us leave
> > larger free extents around in those workloads/patterns that do fragment
> > free space?
> 
> Because even for files in the "few MB" size range, worst case
> fragmentation is thousands of extents and IO performance that
> absolutely sucks. Especially on RAID5/6 devices. We have to ensure
> file fragmentation at its worst does not affect filesystem
> throughput and that means in the general case less than ~1MB sized
> extents is just not acceptable even for smallish files....
> 

Sure, fragmentation is bad. That doesn't answer my question though.
Under what conditions would you expect less aggressive speculative
preallocation in the form of, say, a 1MB threshold to manifest in a level
of fragmentation "where performance absolutely sucks?" Highly parallel
small file (<1MB) copies in a low memory environment perhaps? If so, how
much is low memory? I would think that if we had gobs of RAM, a good
number of 1MB file copies would make it to cache completely before files
currently being written start ever being affected by writeback, but I
could be mistaken about reclaim/writeback behavior..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/3] xfs: EOF blocks are not busy extents
  2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
                   ` (3 preceding siblings ...)
  2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
@ 2019-02-18  2:26 ` Dave Chinner
  2019-02-20 15:12   ` Brian Foster
  4 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2019-02-18  2:26 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Userdata extents are considered as "busy extents" when they are
freed to ensure that they are not reallocated and written to before
the transaction that frees the user data extent has been committed
to disk.

However, in the case of post EOF blocks, these blocks have
never been exposed to user applications and so don't contain valid
data. Hence they don't need to be considered "busy" when they've
been freed because there is no data in them that can be destroyed
if they are reallocated and have data written to them before the
free transaction is committed.

We already have XFS_BMAPI_NODISCARD to extent freeing that the data
extent has never been used so it doesn't need discards issued on it.
This new functionality is just an extension of that concept - the
extent is actually unused, so doesn't even need to be marked busy.

Hence fix this by adding XFS_BMAPI_UNUSED and update the EOF block
data extent truncate with XFS_BMAPI_UNUSED and propagate that all
the way through the various structures and use it to avoid inserting
the extent into the busy list.

[ Note: this seems like a bit of a hack, but I just don't see the
point of inserting it into the busy list, then adding a new busy
list unused flag, then allowing every allocation type to use it in
the _trim code, then have every caller have to call _reuse to remove
the range from the busy tree. It just seems like complexity that
doesn't need to exist because anyone can reallocate an
unused extent for immediate use. ]

This avoids the problem of free space fragmentation when multiple
files are written sequentially  via synchronous writes or post-write
fsync calls before the next file is written. Previously, the
post-eof blocks were marked busy and couldn't be immediately
reallocated, resulting in the files packing poorly and unnecessarily
leaving free space between them.

Freespace fragmentation from sequential multi-file synchronous write
workload:

Before:
  from      to extents  blocks    pct
      1       1       7       7   0.00
      2       3      34      80   0.00
      4       7      65     345   0.01
      8      15     208    2417   0.05
     16      31     147    2982   0.06
     32      63       1      49   0.00
1048576 1310720       4 5185064  99.89

After:
   from      to extents  blocks    pct
      1       1       3       3   0.00
      2       3       1       3   0.00
1048576 1310720       4 5190871 100.00

Much better.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c  | 7 +++----
 fs/xfs/libxfs/xfs_bmap.c   | 6 +++---
 fs/xfs/libxfs/xfs_bmap.h   | 8 ++++++--
 fs/xfs/xfs_bmap_util.c     | 2 +-
 fs/xfs/xfs_trans_extfree.c | 6 +++---
 5 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 659bb9133955..729136ef0ed1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3014,7 +3014,7 @@ __xfs_free_extent(
 	xfs_extlen_t			len,
 	const struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type		type,
-	bool				skip_discard)
+	bool				unused)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_buf			*agbp;
@@ -3045,9 +3045,8 @@ __xfs_free_extent(
 	if (error)
 		goto err;
 
-	if (skip_discard)
-		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
-	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
+	if (!unused)
+		xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
 	return 0;
 
 err:
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 332eefa2700b..ba0fd80eede4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -537,7 +537,7 @@ __xfs_bmap_add_free(
 	xfs_fsblock_t			bno,
 	xfs_filblks_t			len,
 	const struct xfs_owner_info	*oinfo,
-	bool				skip_discard)
+	bool				unused)
 {
 	struct xfs_extent_free_item	*new;		/* new element */
 #ifdef DEBUG
@@ -565,7 +565,7 @@ __xfs_bmap_add_free(
 		new->xefi_oinfo = *oinfo;
 	else
 		new->xefi_oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
-	new->xefi_skip_discard = skip_discard;
+	new->xefi_unused = unused;
 	trace_xfs_bmap_free_defer(tp->t_mountp,
 			XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0,
 			XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
@@ -5069,7 +5069,7 @@ xfs_bmap_del_extent_real(
 		} else {
 			__xfs_bmap_add_free(tp, del->br_startblock,
 					del->br_blockcount, NULL,
-					(bflags & XFS_BMAPI_NODISCARD) ||
+					(bflags & XFS_BMAPI_UNUSED) ||
 					del->br_state == XFS_EXT_UNWRITTEN);
 		}
 	}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 09d3ea97cc15..33fb95f84ea0 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -54,7 +54,7 @@ struct xfs_extent_free_item
 	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
 	struct list_head	xefi_list;
 	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
-	bool			xefi_skip_discard;
+	bool			xefi_unused;
 };
 
 #define	XFS_BMAP_MAX_NMAP	4
@@ -107,6 +107,9 @@ struct xfs_extent_free_item
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	0x2000
 
+/* Unused freed data extent, no need to mark busy */
+#define XFS_BMAPI_UNUSED	0x4000
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -120,7 +123,8 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_DELALLOC,	"DELALLOC" }, \
 	{ XFS_BMAPI_CONVERT_ONLY, "CONVERT_ONLY" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
-	{ XFS_BMAPI_NORMAP,	"NORMAP" }
+	{ XFS_BMAPI_NORMAP,	"NORMAP" }, \
+	{ XFS_BMAPI_UNUSED,	"UNUSED" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index af2e30d33794..f5a8a4385512 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -841,7 +841,7 @@ xfs_free_eofblocks(
 		 * may be full of holes (ie NULL files bug).
 		 */
 		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
-					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
+					XFS_ISIZE(ip), XFS_BMAPI_UNUSED);
 		if (error) {
 			/*
 			 * If we get an error at this point we simply don't
diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
index 0710434eb240..d06fb2cd6ffb 100644
--- a/fs/xfs/xfs_trans_extfree.c
+++ b/fs/xfs/xfs_trans_extfree.c
@@ -58,7 +58,7 @@ xfs_trans_free_extent(
 	xfs_fsblock_t			start_block,
 	xfs_extlen_t			ext_len,
 	const struct xfs_owner_info	*oinfo,
-	bool				skip_discard)
+	bool				unused)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent		*extp;
@@ -71,7 +71,7 @@ xfs_trans_free_extent(
 	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
 
 	error = __xfs_free_extent(tp, start_block, ext_len,
-				  oinfo, XFS_AG_RESV_NONE, skip_discard);
+				  oinfo, XFS_AG_RESV_NONE, unused);
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
 	 * transaction is aborted, which:
@@ -184,7 +184,7 @@ xfs_extent_free_finish_item(
 	error = xfs_trans_free_extent(tp, done_item,
 			free->xefi_startblock,
 			free->xefi_blockcount,
-			&free->xefi_oinfo, free->xefi_skip_discard);
+			&free->xefi_oinfo, free->xefi_unused);
 	kmem_free(free);
 	return error;
 }

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/3] xfs: EOF blocks are not busy extents
  2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
@ 2019-02-20 15:12   ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2019-02-20 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Feb 18, 2019 at 01:26:19PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Userdata extents are considered as "busy extents" when they are
> freed to ensure that they are not reallocated and written to before
> the transaction that frees the user data extent has been committed
> to disk.
> 
> However, in the case of post EOF blocks, these blocks have
> never been exposed to user applications and so don't contain valid
> data. Hence they don't need to be considered "busy" when they've
> been freed because there is no data in them that can be destroyed
> if they are reallocated and have data written to them before the
> free transaction is committed.
> 
> We already have XFS_BMAPI_NODISCARD to extent freeing that the data

"to extent freeing" ?

> extent has never been used so it doesn't need discards issued on it.
> This new functionality is just an extension of that concept - the
> extent is actually unused, so doesn't even need to be marked busy.
> 
> Hence fix this by adding XFS_BMAPI_UNUSED and update the EOF block
> data extent truncate with XFS_BMAPI_UNUSED and propagate that all
> the way through the various structures and use it to avoid inserting
> the extent into the busy list.
> 
> [ Note: this seems like a bit of a hack, but I just don't see the
> point of inserting it into the busy list, then adding a new busy
> list unused flag, then allowing every allocation type to use it in
> the _trim code, then have every caller have to call _reuse to remove
> the range from the busy tree. It just seems like complexity that
> doesn't need to exist because anyone can reallocate an
> unused extent for immediate use. ]
> 

Yeah, I'm not a fan of adding that kind of noop code to achieve behavior
that's equivalent to bypassing subsystems or removing unnecessary code.

> This avoids the problem of free space fragmentation when multiple
> files are written sequentially  via synchronous writes or post-write
> fsync calls before the next file is written. Previously, the
> post-eof blocks were marked busy and couldn't be immediately
> reallocated, resulting in the files packing poorly and unnecessarily
> leaving free space between them.
> 
> Freespace fragmentation from sequential multi-file synchronous write
> workload:
> 
> Before:
>   from      to extents  blocks    pct
>       1       1       7       7   0.00
>       2       3      34      80   0.00
>       4       7      65     345   0.01
>       8      15     208    2417   0.05
>      16      31     147    2982   0.06
>      32      63       1      49   0.00
> 1048576 1310720       4 5185064  99.89
> 
> After:
>    from      to extents  blocks    pct
>       1       1       3       3   0.00
>       2       3       1       3   0.00
> 1048576 1310720       4 5190871 100.00
> 
> Much better.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

This seems reasonable to me in principle. Some comments...

>  fs/xfs/libxfs/xfs_alloc.c  | 7 +++----
>  fs/xfs/libxfs/xfs_bmap.c   | 6 +++---
>  fs/xfs/libxfs/xfs_bmap.h   | 8 ++++++--
>  fs/xfs/xfs_bmap_util.c     | 2 +-
>  fs/xfs/xfs_trans_extfree.c | 6 +++---
>  5 files changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 659bb9133955..729136ef0ed1 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -3014,7 +3014,7 @@ __xfs_free_extent(
>  	xfs_extlen_t			len,
>  	const struct xfs_owner_info	*oinfo,
>  	enum xfs_ag_resv_type		type,
> -	bool				skip_discard)
> +	bool				unused)
>  {
>  	struct xfs_mount		*mp = tp->t_mountp;
>  	struct xfs_buf			*agbp;
> @@ -3045,9 +3045,8 @@ __xfs_free_extent(
>  	if (error)
>  		goto err;
>  
> -	if (skip_discard)
> -		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> -	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
> +	if (!unused)
> +		xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);

busy_flags now serves no purpose in this function (it's hardcoded to
zero).
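
i.e. presumably busy_flags could just be dropped and the call end up
as something like (a sketch of the quoted hunk only, not a tested
change):

	if (!unused)
		xfs_extent_busy_insert(tp, agno, agbno, len, 0);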

>  	return 0;
>  
>  err:
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 332eefa2700b..ba0fd80eede4 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -537,7 +537,7 @@ __xfs_bmap_add_free(
>  	xfs_fsblock_t			bno,
>  	xfs_filblks_t			len,
>  	const struct xfs_owner_info	*oinfo,
> -	bool				skip_discard)
> +	bool				unused)
>  {
>  	struct xfs_extent_free_item	*new;		/* new element */
>  #ifdef DEBUG
> @@ -565,7 +565,7 @@ __xfs_bmap_add_free(
>  		new->xefi_oinfo = *oinfo;
>  	else
>  		new->xefi_oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
> -	new->xefi_skip_discard = skip_discard;
> +	new->xefi_unused = unused;
>  	trace_xfs_bmap_free_defer(tp->t_mountp,
>  			XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0,
>  			XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
> @@ -5069,7 +5069,7 @@ xfs_bmap_del_extent_real(
>  		} else {
>  			__xfs_bmap_add_free(tp, del->br_startblock,
>  					del->br_blockcount, NULL,
> -					(bflags & XFS_BMAPI_NODISCARD) ||
> +					(bflags & XFS_BMAPI_UNUSED) ||
>  					del->br_state == XFS_EXT_UNWRITTEN);

Note that this does technically change unrelated behavior as we now
consider unwritten extents as unused (along with post-eof blocks). I
think that's still fine for similar reasons as for eofblocks, but this
should at least be noted in the commit log.

>  		}
>  	}
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 09d3ea97cc15..33fb95f84ea0 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -54,7 +54,7 @@ struct xfs_extent_free_item
>  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
>  	struct list_head	xefi_list;
>  	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
> -	bool			xefi_skip_discard;
> +	bool			xefi_unused;
>  };
>  
>  #define	XFS_BMAP_MAX_NMAP	4
> @@ -107,6 +107,9 @@ struct xfs_extent_free_item
>  /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>  #define XFS_BMAPI_NORMAP	0x2000
>  
> +/* Unused freed data extent, no need to mark busy */
> +#define XFS_BMAPI_UNUSED	0x4000
> +
>  #define XFS_BMAPI_FLAGS \
>  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> @@ -120,7 +123,8 @@ struct xfs_extent_free_item
>  	{ XFS_BMAPI_DELALLOC,	"DELALLOC" }, \
>  	{ XFS_BMAPI_CONVERT_ONLY, "CONVERT_ONLY" }, \
>  	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
> -	{ XFS_BMAPI_NORMAP,	"NORMAP" }
> +	{ XFS_BMAPI_NORMAP,	"NORMAP" }, \
> +	{ XFS_BMAPI_UNUSED,	"UNUSED" }

We can kill XFS_BMAPI_NODISCARD since _UNUSED replaces the only usage of
it. I suppose we might as well just rename it as you've done for the
related plumbing.

Brian

>  
>  
>  static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index af2e30d33794..f5a8a4385512 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -841,7 +841,7 @@ xfs_free_eofblocks(
>  		 * may be full of holes (ie NULL files bug).
>  		 */
>  		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
> -					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
> +					XFS_ISIZE(ip), XFS_BMAPI_UNUSED);
>  		if (error) {
>  			/*
>  			 * If we get an error at this point we simply don't
> diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> index 0710434eb240..d06fb2cd6ffb 100644
> --- a/fs/xfs/xfs_trans_extfree.c
> +++ b/fs/xfs/xfs_trans_extfree.c
> @@ -58,7 +58,7 @@ xfs_trans_free_extent(
>  	xfs_fsblock_t			start_block,
>  	xfs_extlen_t			ext_len,
>  	const struct xfs_owner_info	*oinfo,
> -	bool				skip_discard)
> +	bool				unused)
>  {
>  	struct xfs_mount		*mp = tp->t_mountp;
>  	struct xfs_extent		*extp;
> @@ -71,7 +71,7 @@ xfs_trans_free_extent(
>  	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
>  
>  	error = __xfs_free_extent(tp, start_block, ext_len,
> -				  oinfo, XFS_AG_RESV_NONE, skip_discard);
> +				  oinfo, XFS_AG_RESV_NONE, unused);
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
>  	 * transaction is aborted, which:
> @@ -184,7 +184,7 @@ xfs_extent_free_finish_item(
>  	error = xfs_trans_free_extent(tp, done_item,
>  			free->xefi_startblock,
>  			free->xefi_blockcount,
> -			&free->xefi_oinfo, free->xefi_skip_discard);
> +			&free->xefi_oinfo, free->xefi_unused);
>  	kmem_free(free);
>  	return error;
>  }

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-02-20 15:12 UTC | newest]

Thread overview: 24+ messages
2019-02-07  5:08 [RFC PATCH 0/3]: Extreme fragmentation ahoy! Dave Chinner
2019-02-07  5:08 ` [PATCH 1/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
2019-02-07  5:08 ` [PATCH 2/3] xfs: Don't free EOF blocks on close when extent size hints are set Dave Chinner
2019-02-07 15:51   ` Brian Foster
2019-02-07  5:08 ` [PATCH 3/3] xfs: Don't free EOF blocks on sync write close Dave Chinner
2019-02-07  5:19   ` Dave Chinner
2019-02-07  5:21 ` [RFC PATCH 0/3]: Extreme fragmentation ahoy! Darrick J. Wong
2019-02-07  5:39   ` Dave Chinner
2019-02-07 15:52     ` Brian Foster
2019-02-08  2:47       ` Dave Chinner
2019-02-08 12:34         ` Brian Foster
2019-02-12  1:13           ` Darrick J. Wong
2019-02-12 11:46             ` Brian Foster
2019-02-12 20:21               ` Dave Chinner
2019-02-13 13:50                 ` Brian Foster
2019-02-13 22:27                   ` Dave Chinner
2019-02-14 13:00                     ` Brian Foster
2019-02-14 21:51                       ` Dave Chinner
2019-02-15  2:35                         ` Brian Foster
2019-02-15  7:23                           ` Dave Chinner
2019-02-15 20:33                             ` Brian Foster
2019-02-08 16:29         ` Darrick J. Wong
2019-02-18  2:26 ` [PATCH 4/3] xfs: EOF blocks are not busy extents Dave Chinner
2019-02-20 15:12   ` Brian Foster
