linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space
@ 2020-03-27  1:45 Darrick J. Wong
  2020-03-27  2:27 ` Dave Chinner
  2020-03-27  9:08 ` Christoph Hellwig
  0 siblings, 2 replies; 6+ messages in thread
From: Darrick J. Wong @ 2020-03-27  1:45 UTC (permalink / raw)
  To: xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

A customer reported rcu stalls and softlockup warnings on a computer
with many CPU cores and many many more IO threads trying to write to a
filesystem that is totally out of space.  Subsequent analysis pointed to
the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
which causes a lot of wb_writeback_work to be queued.  The writeback
worker spends so much time trying to wake the many many threads waiting
for writeback completion that it trips the softlockup detector, and (in
this case) the system automatically reboots.

In addition, they complain that the lengthy xfs_flush_inodes scan traps
all of those threads in uninterruptible sleep, which hampers their
ability to kill the program or do anything else to escape the situation.

Fix this by replacing the full filesystem flush (which is offloaded to a
workqueue which we then have to wait for) with directly flushing the
file that we're trying to write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c |   29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b8a4a3f29b36..08f0aa7e9cea 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -651,14 +651,18 @@ xfs_file_buffered_aio_write(
 
 	/*
 	 * If we hit a space limit, try to free up some lingering preallocated
-	 * space before returning an error. In the case of ENOSPC, first try to
-	 * write back all dirty inodes to free up some of the excess reserved
-	 * metadata space. This reduces the chances that the eofblocks scan
-	 * waits on dirty mappings. Since xfs_flush_inodes() is serialized, this
-	 * also behaves as a filter to prevent too many eofblocks scans from
-	 * running at the same time.
+	 * space and delalloc reservations before returning an error.
 	 */
-	if (ret == -EDQUOT && !enospc) {
+	if ((ret == -EDQUOT || ret == -ENOSPC) && !enospc) {
+		/*
+		 * Flush the current file's dirty data to free up any delalloc
+		 * reservation blocks that might have been reserved for bmbt
+		 * expansion.  Ignore the return code because we don't want to
+		 * return EIO for a different write that failed.
+		 */
+		filemap_fdatawrite(mapping);
+		filemap_fdatawait_keep_errors(mapping);
+
 		xfs_iunlock(ip, iolock);
 		enospc = xfs_inode_free_quota_eofblocks(ip);
 		if (enospc)
@@ -667,17 +671,6 @@ xfs_file_buffered_aio_write(
 		if (enospc)
 			goto write_retry;
 		iolock = 0;
-	} else if (ret == -ENOSPC && !enospc) {
-		struct xfs_eofblocks eofb = {0};
-
-		enospc = 1;
-		xfs_flush_inodes(ip->i_mount);
-
-		xfs_iunlock(ip, iolock);
-		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
-		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
-		xfs_icache_free_cowblocks(ip->i_mount, &eofb);
-		goto write_retry;
 	}
 
 	current->backing_dev_info = NULL;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space
  2020-03-27  1:45 [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space Darrick J. Wong
@ 2020-03-27  2:27 ` Dave Chinner
  2020-03-27  2:51   ` Darrick J. Wong
  2020-03-27  9:08 ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2020-03-27  2:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Thu, Mar 26, 2020 at 06:45:58PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> A customer reported rcu stalls and softlockup warnings on a computer
> with many CPU cores and many many more IO threads trying to write to a
> filesystem that is totally out of space.  Subsequent analysis pointed to
> the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> which causes a lot of wb_writeback_work to be queued.  The writeback
> worker spends so much time trying to wake the many many threads waiting
> for writeback completion that it trips the softlockup detector, and (in
> this case) the system automatically reboots.

That doesn't sound right. Each writeback work that is queued via
sync_inodes_sb should only have a single process waiting on it's
completion. And how many threads do you actually have to need to
wake up for it to trigger a 10s soft-lockup timeout?

More detail, please?

> In addition, they complain that the lengthy xfs_flush_inodes scan traps
> all of those threads in uninterruptible sleep, which hampers their
> ability to kill the program or do anything else to escape the situation.
> 
> Fix this by replacing the full filesystem flush (which is offloaded to a
> workqueue which we then have to wait for) with directly flushing the
> file that we're trying to write.

Which does nothing to flush -other- outstanding delalloc
reservations and allow the eofblocks/cowblock scan to reclaim unused
post-EOF speculative preallocations.

That's the purpose of the xfs_flush_inodes() - without it we can get
very premature ENOSPC, especially on small filesystems when writing
largish files in the background. So I'm not sure that dropping the
sync is a viable solution. It is actually needed.

Perhaps we need to go back to the ancient code thatonly allowed XFS
to run a single xfs_flush_inodes() at a time - everything else
waited on the single flush to complete, then all returned at the
same time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space
  2020-03-27  2:27 ` Dave Chinner
@ 2020-03-27  2:51   ` Darrick J. Wong
  2020-03-27  4:50     ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2020-03-27  2:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Mar 27, 2020 at 01:27:14PM +1100, Dave Chinner wrote:
> On Thu, Mar 26, 2020 at 06:45:58PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > A customer reported rcu stalls and softlockup warnings on a computer
> > with many CPU cores and many many more IO threads trying to write to a
> > filesystem that is totally out of space.  Subsequent analysis pointed to
> > the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> > which causes a lot of wb_writeback_work to be queued.  The writeback
> > worker spends so much time trying to wake the many many threads waiting
> > for writeback completion that it trips the softlockup detector, and (in
> > this case) the system automatically reboots.
> 
> That doesn't sound right. Each writeback work that is queued via
> sync_inodes_sb should only have a single process waiting on it's
> completion. And how many threads do you actually have to need to
> wake up for it to trigger a 10s soft-lockup timeout?
> 
> More detail, please?

It's a two socket 64-core system with some sort of rdma/infiniband magic
and somewhere between 600-900 processes doing who knows what with the
magic.  Each of those threads *also* is writing trace data to its own
separate trace file (three private log files per process).  Hilariously
they never check the return code from write() so they keep pounding the
system forever.

(I don't know what the rdma/infiniband magic is, they won't tell me.)

When the filesystem fills up all the way (it's a 10G fs with 8,207
blocks free) they keep banging away on it until something finally dies.

I tried writing a dumb fstest to simulate the log writer part, but that
never succeeds in triggering the rcu stalls.

If you want the gory dmesg details I can send you some dmesg log.

> > In addition, they complain that the lengthy xfs_flush_inodes scan traps
> > all of those threads in uninterruptible sleep, which hampers their
> > ability to kill the program or do anything else to escape the situation.
> > 
> > Fix this by replacing the full filesystem flush (which is offloaded to a
> > workqueue which we then have to wait for) with directly flushing the
> > file that we're trying to write.
> 
> Which does nothing to flush -other- outstanding delalloc
> reservations and allow the eofblocks/cowblock scan to reclaim unused
> post-EOF speculative preallocations.
> 
> That's the purpose of the xfs_flush_inodes() - without it we can get
> very premature ENOSPC, especially on small filesystems when writing
> largish files in the background. So I'm not sure that dropping the
> sync is a viable solution. It is actually needed.

Yeah, I did kinda wonder about that...

> Perhaps we need to go back to the ancient code thatonly allowed XFS
> to run a single xfs_flush_inodes() at a time - everything else
> waited on the single flush to complete, then all returned at the
> same time...

That might work too.  Admittedly it's pretty silly to be running this
scan over and over and over considering that there's never going to be
any more free space.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space
  2020-03-27  2:51   ` Darrick J. Wong
@ 2020-03-27  4:50     ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2020-03-27  4:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Thu, Mar 26, 2020 at 07:51:53PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 27, 2020 at 01:27:14PM +1100, Dave Chinner wrote:
> > On Thu, Mar 26, 2020 at 06:45:58PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > A customer reported rcu stalls and softlockup warnings on a computer
> > > with many CPU cores and many many more IO threads trying to write to a
> > > filesystem that is totally out of space.  Subsequent analysis pointed to
> > > the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> > > which causes a lot of wb_writeback_work to be queued.  The writeback
> > > worker spends so much time trying to wake the many many threads waiting
> > > for writeback completion that it trips the softlockup detector, and (in
> > > this case) the system automatically reboots.
> > 
> > That doesn't sound right. Each writeback work that is queued via
> > sync_inodes_sb should only have a single process waiting on it's
> > completion. And how many threads do you actually have to need to
> > wake up for it to trigger a 10s soft-lockup timeout?
> > 
> > More detail, please?
> 
> It's a two socket 64-core system with some sort of rdma/infiniband magic
> and somewhere between 600-900 processes doing who knows what with the
> magic.  Each of those threads *also* is writing trace data to its own
> separate trace file (three private log files per process).  Hilariously
> they never check the return code from write() so they keep pounding the
> system forever.

<facepalm>

Ah, another game of ye olde "blame the filesystem because it's the
first to complain"...

> (I don't know what the rdma/infiniband magic is, they won't tell me.)
> 
> When the filesystem fills up all the way (it's a 10G fs with 8,207
> blocks free) they keep banging away on it until something finally dies.
> 
> I tried writing a dumb fstest to simulate the log writer part, but that
> never succeeds in triggering the rcu stalls.

Which means it probably requires a bunch of other RCU magic to be
done by other part of the system to trigger it...

> If you want the gory dmesg details I can send you some dmesg log.

No need, won't be able to read it anyway because facepalm...

> > > In addition, they complain that the lengthy xfs_flush_inodes scan traps
> > > all of those threads in uninterruptible sleep, which hampers their
> > > ability to kill the program or do anything else to escape the situation.
> > > 
> > > Fix this by replacing the full filesystem flush (which is offloaded to a
> > > workqueue which we then have to wait for) with directly flushing the
> > > file that we're trying to write.
> > 
> > Which does nothing to flush -other- outstanding delalloc
> > reservations and allow the eofblocks/cowblock scan to reclaim unused
> > post-EOF speculative preallocations.
> > 
> > That's the purpose of the xfs_flush_inodes() - without it we can get
> > very premature ENOSPC, especially on small filesystems when writing
> > largish files in the background. So I'm not sure that dropping the
> > sync is a viable solution. It is actually needed.
> 
> Yeah, I did kinda wonder about that...
> 
> > Perhaps we need to go back to the ancient code thatonly allowed XFS
> > to run a single xfs_flush_inodes() at a time - everything else
> > waited on the single flush to complete, then all returned at the
> > same time...
> 
> That might work too.  Admittedly it's pretty silly to be running this
> scan over and over and over considering that there's never going to be
> any more free space.

Actually, what if we just rate limit the calls? Once a second would
probably do the trick just fine - after the first few seconds
there'll be no space left to reclaim, anyway...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space
  2020-03-27  1:45 [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space Darrick J. Wong
  2020-03-27  2:27 ` Dave Chinner
@ 2020-03-27  9:08 ` Christoph Hellwig
  2020-03-27  9:09   ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2020-03-27  9:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

FYI, independ of the discussion what to flush, using
filemap_fdatawait_keep_errors is a very obvious and important bug fix.
Can you split that out and apply it with a Fixes tag so that it gets
backported ASAP?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space
  2020-03-27  9:08 ` Christoph Hellwig
@ 2020-03-27  9:09   ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2020-03-27  9:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Fri, Mar 27, 2020 at 02:08:00AM -0700, Christoph Hellwig wrote:
> FYI, independ of the discussion what to flush, using
> filemap_fdatawait_keep_errors is a very obvious and important bug fix.
> Can you split that out and apply it with a Fixes tag so that it gets
> backported ASAP?

Sorry, -ENOCOFFEE.  We didn't have a previous call that comsumed the
errors, so the above doesn't make any sense.  Please ignore it.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-03-27  9:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-27  1:45 [PATCH] xfs: don't flush the entire filesystem when a buffered write runs out of space Darrick J. Wong
2020-03-27  2:27 ` Dave Chinner
2020-03-27  2:51   ` Darrick J. Wong
2020-03-27  4:50     ` Dave Chinner
2020-03-27  9:08 ` Christoph Hellwig
2020-03-27  9:09   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).