* xfs ioend batching log reservation deadlock
@ 2021-03-26 15:39 Brian Foster
  2021-03-26 17:32 ` Darrick J. Wong
  2021-03-29  5:45 ` Christoph Hellwig
  0 siblings, 2 replies; 10+ messages in thread
From: Brian Foster @ 2021-03-26 15:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: Darrick J. Wong, Christoph Hellwig

Hi all,

We have a report of a workload that deadlocks on log reservation via
iomap_ioend completion batching. To start, the fs format is somewhat
unique in that the log is on the smaller side (35MB) and the log stripe
unit is 256k, but this is actually a default mkfs for the underlying
storage. I don't have much more information wrt the workload or
anything that contributes to the completion processing characteristics.

The overall scenario is that a workqueue task is executing in
xfs_end_io() and blocked on transaction reservation for an unwritten
extent conversion. Since this task began executing and pulled pending
items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
of which have append transactions. These append transactions account for
~520k of log reservation each due to the log stripe unit. All together
this consumes nearly all of the available log space, prevents allocation of
the aforementioned unwritten extent conversion transaction and thus
leaves the fs in a deadlocked state.
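To put rough numbers on that (a userspace back-of-envelope sketch; the helper
names are mine, only the figures come from the report):

```c
#include <assert.h>

/* Back-of-envelope model of the report above: how much log space do N
 * pending append transactions pin?  Only the figures (67 ioends, ~520k
 * of reservation each, 35MB log) come from the report; the helpers are
 * purely illustrative. */
static unsigned long long pinned_log_space(unsigned long long nr_append,
					   unsigned long long res_per_tx)
{
	return nr_append * res_per_tx;
}

/* The fs deadlocks once the pinned reservation plus what the blocked
 * unwritten conversion still needs exceeds the log size. */
static int log_reservation_stalls(unsigned long long log_size,
				  unsigned long long pinned,
				  unsigned long long want)
{
	return pinned + want > log_size;
}
```

67 append transactions at ~520k pin ~34MB of a 35MB log before the conversion
transaction even tries to reserve.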

I can think of different ways we could probably optimize this problem
away. One example is to transfer the append transaction to the inode at
bio completion time such that we retain only one per pending batch of
ioends. The workqueue task would then pull this append transaction from
the inode along with the ioend list and transfer it back to the last
non-unwritten/shared ioend in the sorted list.
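As a sketch of that idea (userspace model with made-up types; a real version
would hand off and cancel actual xfs_trans reservations rather than just
dropping pointers):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the idea above: at bio completion, donate the
 * ioend's append transaction to the inode so that at most one log
 * reservation is held per pending batch of ioends. */
struct tx { int reserved; };

struct m_ioend {
	struct m_ioend	*next;
	struct tx	*append_tx;
};

struct m_inode {
	struct m_ioend	*ioend_list;
	struct tx	*append_tx;	/* one tx for the whole batch */
};

static void queue_ioend(struct m_inode *ip, struct m_ioend *io)
{
	if (io->append_tx) {
		if (!ip->append_tx)
			ip->append_tx = io->append_tx;
		/* else: the surplus reservation would be cancelled here */
		io->append_tx = NULL;
	}
	io->next = ip->ioend_list;
	ip->ioend_list = io;
}
```

The workqueue side would then pull ip->append_tx along with the list and
reattach it to the last non-unwritten/shared ioend after sorting.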

That said, I'm not totally convinced this addresses the fundamental
problem of acquiring transaction reservation from a context that
essentially already owns outstanding reservation vs. just making it hard
to reproduce. I'm wondering if/why we need the append transaction at
all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
updates at I/O completion time") in v3.4, which converted the completion-time
on-disk size update from an unlogged update into a logged transaction. If we
continue to
send these potential append ioends to the workqueue for completion
processing, is there any reason we can't let the workqueue allocate the
transaction as it already does for unwritten conversion?

If that is reasonable, I'm thinking of a couple patches:

1. Optimize current append transaction processing with an inode field as
noted above.

2. Replace the submission side append transaction entirely with a flag
or some such on the ioend that allocates the transaction at completion
time, but otherwise preserves batching behavior instituted in patch 1.

Thoughts?

Brian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs ioend batching log reservation deadlock
  2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
@ 2021-03-26 17:32 ` Darrick J. Wong
  2021-03-26 18:13   ` Brian Foster
  2021-03-29  2:28   ` Dave Chinner
  2021-03-29  5:45 ` Christoph Hellwig
  1 sibling, 2 replies; 10+ messages in thread
From: Darrick J. Wong @ 2021-03-26 17:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Christoph Hellwig

On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> Hi all,
> 
> We have a report of a workload that deadlocks on log reservation via
> iomap_ioend completion batching. To start, the fs format is somewhat
> unique in that the log is on the smaller side (35MB) and the log stripe
> unit is 256k, but this is actually a default mkfs for the underlying
> storage. I don't have much more information wrt to the workload or
> anything that contributes to the completion processing characteristics.
> 
> The overall scenario is that a workqueue task is executing in
> xfs_end_io() and blocked on transaction reservation for an unwritten
> extent conversion. Since this task began executing and pulled pending
> items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> of which have append transactions. These append transactions account for
> ~520k of log reservation each due to the log stripe unit. All together
> this consumes nearly all of available log space, prevents allocation of
> the aforementioned unwritten extent conversion transaction and thus
> leaves the fs in a deadlocked state.
> 
> I can think of different ways we could probably optimize this problem
> away. One example is to transfer the append transaction to the inode at
> bio completion time such that we retain only one per pending batch of
> ioends. The workqueue task would then pull this append transaction from
> the inode along with the ioend list and transfer it back to the last
> non-unwritten/shared ioend in the sorted list.
> 
> That said, I'm not totally convinced this addresses the fundamental
> problem of acquiring transaction reservation from a context that
> essentially already owns outstanding reservation vs. just making it hard
> to reproduce. I'm wondering if/why we need the append transaction at
> all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> updates at I/O completion time") in v3.4 which changed the completion
> on-disk size update from being an unlogged update. If we continue to
> send these potential append ioends to the workqueue for completion
> processing, is there any reason we can't let the workqueue allocate the
> transaction as it already does for unwritten conversion?

Frankly I've never understood what benefit we get from preallocating a
transaction and letting it twist in the wind consuming log space while
writeback pushes data to the disk.  It's perfectly fine to delay ioend
processing while we wait for unwritten conversions and cow remapping to
take effect, so what's the harm in a slight delay for this?

I guess it's an optimization to reduce wait times?  It's a pity that
nobody left a comment justifying why it was done in that particular way,
what with the freeze protection lockdep weirdness too.

> If that is reasonable, I'm thinking of a couple patches:
> 
> 1. Optimize current append transaction processing with an inode field as
> noted above.
> 
> 2. Replace the submission side append transaction entirely with a flag
> or some such on the ioend that allocates the transaction at completion
> time, but otherwise preserves batching behavior instituted in patch 1.

What happens if you replace the call to xfs_setfilesize_ioend in
xfs_end_ioend with xfs_setfilesize, and skip the transaction
preallocation altogether?

--D

> Thoughts?
> 
> Brian
> 


* Re: xfs ioend batching log reservation deadlock
  2021-03-26 17:32 ` Darrick J. Wong
@ 2021-03-26 18:13   ` Brian Foster
  2021-03-29  2:28   ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Brian Foster @ 2021-03-26 18:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Christoph Hellwig

On Fri, Mar 26, 2021 at 10:32:44AM -0700, Darrick J. Wong wrote:
> On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > We have a report of a workload that deadlocks on log reservation via
> > iomap_ioend completion batching. To start, the fs format is somewhat
> > unique in that the log is on the smaller side (35MB) and the log stripe
> > unit is 256k, but this is actually a default mkfs for the underlying
> > storage. I don't have much more information wrt to the workload or
> > anything that contributes to the completion processing characteristics.
> > 
> > The overall scenario is that a workqueue task is executing in
> > xfs_end_io() and blocked on transaction reservation for an unwritten
> > extent conversion. Since this task began executing and pulled pending
> > items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> > of which have append transactions. These append transactions account for
> > ~520k of log reservation each due to the log stripe unit. All together
> > this consumes nearly all of available log space, prevents allocation of
> > the aforementioned unwritten extent conversion transaction and thus
> > leaves the fs in a deadlocked state.
> > 
> > I can think of different ways we could probably optimize this problem
> > away. One example is to transfer the append transaction to the inode at
> > bio completion time such that we retain only one per pending batch of
> > ioends. The workqueue task would then pull this append transaction from
> > the inode along with the ioend list and transfer it back to the last
> > non-unwritten/shared ioend in the sorted list.
> > 
> > That said, I'm not totally convinced this addresses the fundamental
> > problem of acquiring transaction reservation from a context that
> > essentially already owns outstanding reservation vs. just making it hard
> > to reproduce. I'm wondering if/why we need the append transaction at
> > all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> > updates at I/O completion time") in v3.4 which changed the completion
> > on-disk size update from being an unlogged update. If we continue to
> > send these potential append ioends to the workqueue for completion
> > processing, is there any reason we can't let the workqueue allocate the
> > transaction as it already does for unwritten conversion?
> 
> Frankly I've never understood what benefit we get from preallocating a
> transaction and letting it twist in the wind consuming log space while
> writeback pushes data to the disk.  It's perfectly fine to delay ioend
> processing while we wait for unwritten conversions and cow remapping to
> take effect, so what's the harm in a slight delay for this?
> 
> I guess it's an optimization to reduce wait times?  It's a pity that
> nobody left a comment justifying why it was done in that particular way,
> what with the freeze protection lockdep weirdness too.
> 

I thought it might have been to facilitate ioend completion in interrupt
(i.e. bio completion) context, but I'm really not sure. That's why I'm
asking. :) I'm hoping Christoph can provide some context since it
appears to be his original implementation.

> > If that is reasonable, I'm thinking of a couple patches:
> > 
> > 1. Optimize current append transaction processing with an inode field as
> > noted above.
> > 
> > 2. Replace the submission side append transaction entirely with a flag
> > or some such on the ioend that allocates the transaction at completion
> > time, but otherwise preserves batching behavior instituted in patch 1.
> 
> What happens if you replace the call to xfs_setfilesize_ioend in
> xfs_end_ioend with xfs_setfilesize, and skip the transaction
> preallocation altogether?
> 

That's pretty much what I'm referring to in step 2 above. I was just
thinking it might make sense to implement some sort of batching model
first to avoid scenarios where we have a bunch of discontiguous append
ioends in the completion list and really only need to update the file
size at the end. (Hence a flag or whatever to indicate the last ioend
must call xfs_setfilesize()).

That said, we do still have contiguous ioend merging happening already.
We could certainly just rely on that, update di_size as we process each
append ioend and worry about further optimization later. That might be
simpler, less code (and a safer first step)..
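Roughly, the merge criterion I mean is something like the following
(illustrative userspace model; the real logic lives in
iomap_ioend_try_merge(), and the field names here are simplified):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of contiguous ioend merging: two ioends can merge
 * when they are the same kind of completion and physically adjacent in
 * file offset, so a long sequential append only needs one on-disk size
 * update at the end. */
struct c_ioend {
	unsigned long long	offset;	/* file offset of the ioend */
	unsigned long long	size;	/* bytes covered */
	int			type;	/* e.g. unwritten vs. plain write */
	bool			shared;	/* cow/shared completion */
};

static bool can_merge(const struct c_ioend *a, const struct c_ioend *b)
{
	return a->type == b->type && a->shared == b->shared &&
	       a->offset + a->size == b->offset;
}
```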

Brian

> --D
> 
> > Thoughts?
> > 
> > Brian
> > 
> 



* Re: xfs ioend batching log reservation deadlock
  2021-03-26 17:32 ` Darrick J. Wong
  2021-03-26 18:13   ` Brian Foster
@ 2021-03-29  2:28   ` Dave Chinner
  2021-03-29 18:04     ` Brian Foster
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-03-29  2:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, Christoph Hellwig

On Fri, Mar 26, 2021 at 10:32:44AM -0700, Darrick J. Wong wrote:
> On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > We have a report of a workload that deadlocks on log reservation via
> > iomap_ioend completion batching. To start, the fs format is somewhat
> > unique in that the log is on the smaller side (35MB) and the log stripe
> > unit is 256k, but this is actually a default mkfs for the underlying
> > storage. I don't have much more information wrt to the workload or
> > anything that contributes to the completion processing characteristics.
> > 
> > The overall scenario is that a workqueue task is executing in
> > xfs_end_io() and blocked on transaction reservation for an unwritten
> > extent conversion. Since this task began executing and pulled pending
> > items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> > of which have append transactions. These append transactions account for
> > ~520k of log reservation each due to the log stripe unit. All together
> > this consumes nearly all of available log space, prevents allocation of
> > the aforementioned unwritten extent conversion transaction and thus
> > leaves the fs in a deadlocked state.
> > 
> > I can think of different ways we could probably optimize this problem
> > away. One example is to transfer the append transaction to the inode at
> > bio completion time such that we retain only one per pending batch of
> > ioends. The workqueue task would then pull this append transaction from
> > the inode along with the ioend list and transfer it back to the last
> > non-unwritten/shared ioend in the sorted list.
> > 
> > That said, I'm not totally convinced this addresses the fundamental
> > problem of acquiring transaction reservation from a context that
> > essentially already owns outstanding reservation vs. just making it hard
> > to reproduce. I'm wondering if/why we need the append transaction at
> > all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> > updates at I/O completion time") in v3.4 which changed the completion
> > on-disk size update from being an unlogged update. If we continue to
> > send these potential append ioends to the workqueue for completion
> > processing, is there any reason we can't let the workqueue allocate the
> > transaction as it already does for unwritten conversion?
> 
> Frankly I've never understood what benefit we get from preallocating a
> transaction and letting it twist in the wind consuming log space while
> writeback pushes data to the disk.  It's perfectly fine to delay ioend
> processing while we wait for unwritten conversions and cow remapping to
> take effect, so what's the harm in a slight delay for this?

The difference was that file size updates used to be far, far more
common than unwritten extent updates for buffered IO. When this code
was written, we almost never did buffered writes into unwritten
regions, but we almost always did sequential writes that required a
file size update.

Given that this code was replacing an un-synchronised size update,
the performance impact of reserving transaction space in IO
completion was significant. There was also the problem of XFS using
global workqueues - the series that introduced the append
transaction also introduced per-mount IO completion workqueues and
so there were concerns about blowing out the number of completion
workers when we have thousands of pending completions all waiting on
log space.

There was a bunch of considerations that probably don't exist
anymore, plus a bunch of new ones, such as the fact that we now
queue and merge ioends to process in a single context rather than
just spraying ioends to worker threads to deal with. The old code
would have worked just fine - the unwritten extent conversion would
not have been blocking all those other IO completions...

Wait, hold on - we're putting both unwritten conversion and size
updates onto the same queue?

Ah, yes, that's exactly what we are doing. We've punted all the size
extension to the unwritten workqueue, regardless of whether it's
just a size update or not. And then the work processes them one at a
time after sorting them. We don't complete/free the append
transactions until we try to merge them one at a time as we walk the
ioend list on the inode. Hence we try to reserve more log space
while we still hold an unknown amount of log space on the queue we
are processing...

IOWs, the completion queuing that mixes unwritten extent conversion and
size updates is *nesting transactions*. That's the deadlock - we
can't take a new transaction reservation while holding another
transaction reservation in a state where it cannot make progress....

> What happens if you replace the call to xfs_setfilesize_ioend in
> xfs_end_ioend with xfs_setfilesize, and skip the transaction
> preallocation altogether?

I expect that the deadlock will go away at the expense of increased
log reservation contention in IO completion because this unbounds
the amount of transaction reservation concurrency that can occur in
buffered writeback. IOWs, this might just hammer the log really,
really hard and that's exactly what we don't want data IO completion
to do....

I'd say the first thing to fix is the ordering problem on the
completion queue. XFS needs more than just offset based ordering, it
needs ioend type based ordering, too. All the size updates should be
processed before the unwritten extent conversions, hence removing
the nesting of transactions....
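i.e. a comparator shaped something like this (userspace sketch; the type
tags and their ranking are illustrative, not the actual iomap sort key):

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the ordering above: sort pending completions by ioend type
 * first, so size updates are processed before unwritten extent
 * conversions, then by offset within each type. */
enum m_type { MT_APPEND, MT_UNWRITTEN };

struct s_ioend {
	enum m_type		type;
	unsigned long long	offset;
};

static int ioend_cmp(const void *pa, const void *pb)
{
	const struct s_ioend *a = pa, *b = pb;

	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;	/* size updates first */
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```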

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs ioend batching log reservation deadlock
  2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
  2021-03-26 17:32 ` Darrick J. Wong
@ 2021-03-29  5:45 ` Christoph Hellwig
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2021-03-29  5:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig

On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> 1. Optimize current append transaction processing with an inode field as
> noted above.
> 
> 2. Replace the submission side append transaction entirely with a flag
> or some such on the ioend that allocates the transaction at completion
> time, but otherwise preserves batching behavior instituted in patch 1.

I'm pretty sure I had a patch to kill off the transaction reservation
a while ago and Dave objected, mostly on performance grounds, in that
we might have tons of reservations coming from the I/O completion
workqueue that would all be blocked on the transaction reservations.

This would probably be much better with the ioend merging, which should
significantly reduce the number of transaction reservations - probably to
no more than what common workloads using unwritten extents generate.

I'm all for killing the transaction pre-reservations as they have created
a lot of pain for us.


* Re: xfs ioend batching log reservation deadlock
  2021-03-29  2:28   ` Dave Chinner
@ 2021-03-29 18:04     ` Brian Foster
  2021-03-29 23:51       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-03-29 18:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> On Fri, Mar 26, 2021 at 10:32:44AM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > We have a report of a workload that deadlocks on log reservation via
> > > iomap_ioend completion batching. To start, the fs format is somewhat
> > > unique in that the log is on the smaller side (35MB) and the log stripe
> > > unit is 256k, but this is actually a default mkfs for the underlying
> > > storage. I don't have much more information wrt to the workload or
> > > anything that contributes to the completion processing characteristics.
> > > 
> > > The overall scenario is that a workqueue task is executing in
> > > xfs_end_io() and blocked on transaction reservation for an unwritten
> > > extent conversion. Since this task began executing and pulled pending
> > > items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> > > of which have append transactions. These append transactions account for
> > > ~520k of log reservation each due to the log stripe unit. All together
> > > this consumes nearly all of available log space, prevents allocation of
> > > the aforementioned unwritten extent conversion transaction and thus
> > > leaves the fs in a deadlocked state.
> > > 
> > > I can think of different ways we could probably optimize this problem
> > > away. One example is to transfer the append transaction to the inode at
> > > bio completion time such that we retain only one per pending batch of
> > > ioends. The workqueue task would then pull this append transaction from
> > > the inode along with the ioend list and transfer it back to the last
> > > non-unwritten/shared ioend in the sorted list.
> > > 
> > > That said, I'm not totally convinced this addresses the fundamental
> > > problem of acquiring transaction reservation from a context that
> > > essentially already owns outstanding reservation vs. just making it hard
> > > to reproduce. I'm wondering if/why we need the append transaction at
> > > all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> > > updates at I/O completion time") in v3.4 which changed the completion
> > > on-disk size update from being an unlogged update. If we continue to
> > > send these potential append ioends to the workqueue for completion
> > > processing, is there any reason we can't let the workqueue allocate the
> > > transaction as it already does for unwritten conversion?
> > 
> > Frankly I've never understood what benefit we get from preallocating a
> > transaction and letting it twist in the wind consuming log space while
> > writeback pushes data to the disk.  It's perfectly fine to delay ioend
> > processing while we wait for unwritten conversions and cow remapping to
> > take effect, so what's the harm in a slight delay for this?
> 
> The difference was that file size updates used to be far, far more
> common than unwritten extent updates for buffered IO. When this code
> was written, we almost never did buffered writes into unwritten
> regions, but we always do sequential writes that required a file
> size update.
> 
> Given that this code was replacing an un-synchronised size update,
> the performance impact of reserving transaction space in IO
> completion was significant. There was also the problem of XFS using
> global workqueues - the series that introduced the append
> transaction also introduced per-mount IO completion workqueues and
> so there were concerns about blowing out the number of completion
> workers when we have thousands of pending completions all waiting on
> log space.
> 
> There was a bunch of considerations that probably don't exist
> anymore, plus a bunch of new ones, such as the fact that we now
> queue and merge ioends to process in a single context rather than
> just spraying ioends to worker threads to deal with. The old code
> would have worked just fine - the unwritten extent conversion would
> not have been blocking all those other IO completions...
> 
> Wait, hold on - we're putting both unwritten conversion and size
> updates onto the same queue?
> 
> Ah, yes, that's exactly what we are doing. We've punted all the size
> extension to the unwritten workqueue, regardless of whether it's
> just a size update or not. And then the work processes them one at a
> time after sorting them. We don't complete/free the append
> transactions until we try to merge them one at a time as we walk the
> ioend list on the inode. Hence we try to reserve more log space
> while we still hold an unknown amount of log space on the queue we
> are processin...
> 
> IOws, the completion queuing mixing unwritten extent conversion and
> size updates is *nesting transactions*.  THat's the deadlock - we
> can't take a new transaction reservation while holding another
> transaction reservation in a state where it cannot make progress....
> 

Yes.

> > What happens if you replace the call to xfs_setfilesize_ioend in
> > xfs_end_ioend with xfs_setfilesize, and skip the transaction
> > preallocation altogether?
> 
> I expect that the deadlock will go away at the expense of increased
> log reservation contention in IO completion because this unbounds
> the amount of transaction reservation concurrency that can occur in
> buffered writeback. IOWs, this might just hammer the log really,
> really hard and that's exactly what we don't want data IO completion
> to do....
> 

The point of the original idea was to try and eliminate most of the
transaction overhead for di_size updates in the first place by doing an
update per batch (as opposed to per ioend). See the appended diff for a
crude hacked up example of what I mean.

> I'd say the first thing to fix is the ordering problem on the
> completion queue. XFS needs more than just offset based ordering, it
> needs ioend type based ordering, too. All the size updates should be
> processed before the unwritten extent conversions, hence removing
> the nesting of transactions....
> 

Generally speaking this makes sense if we need to retain the
preallocated size update transactions for some reason. One thing to note
is that we'd be putting on-disk size updates ahead of other ioend
completions on background writeback. I suspect that might not matter
much for unwritten ioends since we'd just expose unwritten extents after
a crash, but the effect on cow writeback completion is not as clear to
me.

For one, it looks like a cow ioend can both require a transaction
allocation for fork remap as well as have an append transaction already
attached, so we'd probably have to tweak how individual ioends are
processed as opposed to just ordering them differently. I also thought
cow blocks don't necessarily have to cover shared (or even existing)
blocks in the data fork due to preallocation, so we'd probably need to
investigate things like whether this makes it possible to put an on-disk
update ahead of a cow remap that lands somewhere in the range between
the in-core inode size and the (smaller) on-disk inode size, and if so,
whether that could result in problematic behavior. I'm not sure this is
worth the associated complexity if there's opportunity to remove the
need for most of these transactions in the first place. Hm?

Brian

--- 8< ---

Hacked up RFC to demonstrate batched completion side transaction
allocations for on-disk updates.

---

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1cc7c36d98e9..04d200e5e70d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -18,6 +18,9 @@
 #include "xfs_bmap_util.h"
 #include "xfs_reflink.h"
 
+/* XXX */
+#define IOMAP_F_APPEND	0x2000
+
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
 	unsigned int		data_seq;
@@ -182,12 +185,10 @@ xfs_end_ioend(
 		error = xfs_reflink_end_cow(ip, offset, size);
 	else if (ioend->io_type == IOMAP_UNWRITTEN)
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
-	else
-		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
 
 done:
-	if (ioend->io_private)
-		error = xfs_setfilesize_ioend(ioend, error);
+	if (ioend->io_flags & IOMAP_F_APPEND)
+		error = xfs_setfilesize(ip, offset, size);
 	iomap_finish_ioends(ioend, error);
 	memalloc_nofs_restore(nofs_flag);
 }
@@ -221,16 +222,28 @@ xfs_end_io(
 	struct iomap_ioend	*ioend;
 	struct list_head	tmp;
 	unsigned long		flags;
+	xfs_off_t		maxendoff;
 
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
 	list_replace_init(&ip->i_ioend_list, &tmp);
 	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
 
 	iomap_sort_ioends(&tmp);
+
+	/* XXX: track max endoff manually? */
+	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
+	if (((ioend->io_flags & IOMAP_F_SHARED) ||
+	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
+	    xfs_ioend_is_append(ioend)) {
+		ioend->io_flags |= IOMAP_F_APPEND;
+		maxendoff = ioend->io_offset + ioend->io_size;
+	}
+
 	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
 			io_list))) {
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
+		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
 		xfs_end_ioend(ioend);
 	}
 }
@@ -250,8 +263,6 @@ xfs_end_bio(
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
 	unsigned long		flags;
 
-	ASSERT(xfs_ioend_needs_workqueue(ioend));
-
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
 	if (list_empty(&ip->i_ioend_list))
 		WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
@@ -487,6 +498,7 @@ xfs_prepare_ioend(
 	int			status)
 {
 	unsigned int		nofs_flag;
+	bool			append = false;
 
 	/*
 	 * We can allocate memory here while doing writeback on behalf of
@@ -501,17 +513,17 @@ xfs_prepare_ioend(
 				ioend->io_offset, ioend->io_size);
 	}
 
-	/* Reserve log space if we might write beyond the on-disk inode size. */
+	/* XXX: quick hack to queue append ioends w/o transaction */
 	if (!status &&
 	    ((ioend->io_flags & IOMAP_F_SHARED) ||
 	     ioend->io_type != IOMAP_UNWRITTEN) &&
 	    xfs_ioend_is_append(ioend) &&
 	    !ioend->io_private)
-		status = xfs_setfilesize_trans_alloc(ioend);
+		append = true;
 
 	memalloc_nofs_restore(nofs_flag);
 
-	if (xfs_ioend_needs_workqueue(ioend))
+	if (xfs_ioend_needs_workqueue(ioend) || append)
 		ioend->io_bio->bi_end_io = xfs_end_bio;
 	return status;
 }



* Re: xfs ioend batching log reservation deadlock
  2021-03-29 18:04     ` Brian Foster
@ 2021-03-29 23:51       ` Dave Chinner
  2021-03-30 14:42         ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-03-29 23:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> > I'd say the first thing to fix is the ordering problem on the
> > completion queue. XFS needs more than just offset based ordering, it
> > needs ioend type based ordering, too. All the size updates should be
> > processed before the unwritten extent conversions, hence removing
> > the nesting of transactions....
> > 
> 
> Generally speaking this makes sense if we need to retain the
> preallocated size update transactions for some reason. One thing to note
> is that we'd be putting on-disk size updates ahead of other ioend
> completions on background writeback. I suspect that might not matter
> much for unwritten ioends since we'd just expose unwritten extents after
> a crash, but the effect on cow writeback completion is not as clear to
> me.

The old code (pre-queue+merge) completed completions in random order
(i.e. whatever the workqueue and scheduler resulted in) so I don't
think this is a concern.

My concern is that it is a fairly major change of behaviour for IO
completion, we know this is a performance sensitive path, and that
we shouldn't make sweeping architectural changes without having a
bunch of performance regression tests demonstrating that the change
does not have unexpected, adverse side effects.

Hence my suggestion that we fix the deadlock first, then do the
investigation work to determine if removing the pre-IO transaction
reservation is a benefit or not. We often do the simple, low risk
bug fix first, then do the more extensive rework that may have
unexpected side effects so that we have a less risky backport
candidate for stable distros. I don't see this as any different
here.

> For one, it looks like a cow ioend can both require a transaction
> allocation for fork remap as well as have an append transaction already
> attached, so we'd probably have to tweak how individual ioends are
> processed as opposed to just ordering them differently. I also thought
> cow blocks don't necessarily have to cover shared (or even existing)
> blocks in the data fork due to preallocation, so we'd probably need to
> investigate things like whether this makes it possible to put an on-disk
> update ahead of a cow remap that lands somewhere in the range between
> the in-core inode size and the (smaller) on-disk inode size, and if so,
> whether that could result in problematic behavior.

These things all used to happen before we (recently) added all the
per-inode queuing and merging code to IO completion. So there's no
real evidence that there is a problem completing them in different
orders. If someone runs fsync, then they'll all get completed before
fsync returns, so AFAICT this is just unnecessary complication of a
relatively simple mechanism.

> I'm not sure this is
> worth the associated complexity if there's opportunity to remove the
> need for most of these transactions in the first place. Hm?

Please keep in mind I'm not saying "we can't remove this code" as
Christoph has implied. Last time this came up, like the "use
unwritten extents for delalloc", I asked for performance/benchmark
results that indicate that the change doesn't introduce excessive
overhead or unexpected regressions. Again, I'm asking for that work
to be done *before* we make a significant change in behaviour because
that change in behaviour is not necessary to fix the bug.

> @@ -182,12 +185,10 @@ xfs_end_ioend(
>  		error = xfs_reflink_end_cow(ip, offset, size);
>  	else if (ioend->io_type == IOMAP_UNWRITTEN)
>  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> -	else
> -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
>  
>  done:
> -	if (ioend->io_private)
> -		error = xfs_setfilesize_ioend(ioend, error);
> +	if (ioend->io_flags & IOMAP_F_APPEND)
> +		error = xfs_setfilesize(ip, offset, size);
>  	iomap_finish_ioends(ioend, error);
>  	memalloc_nofs_restore(nofs_flag);
>  }
> @@ -221,16 +222,28 @@ xfs_end_io(
>  	struct iomap_ioend	*ioend;
>  	struct list_head	tmp;
>  	unsigned long		flags;
> +	xfs_off_t		maxendoff;
>  
>  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
>  	list_replace_init(&ip->i_ioend_list, &tmp);
>  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
>  
>  	iomap_sort_ioends(&tmp);
> +
> +	/* XXX: track max endoff manually? */
> +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> +	    xfs_ioend_is_append(ioend)) {
> +		ioend->io_flags |= IOMAP_F_APPEND;
> +		maxendoff = ioend->io_offset + ioend->io_size;
> +	}
> +
>  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
>  			io_list))) {
>  		list_del_init(&ioend->io_list);
>  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
>  		xfs_end_ioend(ioend);
>  	}
>  }

So now when I run a workload that is untarring a large tarball full
of small files, we have as many transaction reserve operations
running concurrently as there are IO completions queued.

Right now, the single threaded writeback (bdi-flusher) runs
reservations serially, so we are introducing a large amount of
concurrency to reservations here. IOWs, instead of throttling active
reservations before we submit the IO we end up with attempted
reservations only being bounded by the number of kworkers the
completion workqueue can throw at the system.  Then we might have
tens of active filesystems doing the same thing, each with their own
set of workqueues and kworkers...

Yes, we can make "lots of IO to a single inode" have less overhead,
but we do not want to do that at the expense of bad behaviour when
we have "single IOs to lots of inodes". That's the concern I have
here, and that's going to take a fair amount of work to characterise
and measure the impact, especially on large machines with slow
disks...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs ioend batching log reservation deadlock
  2021-03-29 23:51       ` Dave Chinner
@ 2021-03-30 14:42         ` Brian Foster
  2021-03-30 23:57           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-03-30 14:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
...
> 
> > @@ -182,12 +185,10 @@ xfs_end_ioend(
> >  		error = xfs_reflink_end_cow(ip, offset, size);
> >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > -	else
> > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> >  
> >  done:
> > -	if (ioend->io_private)
> > -		error = xfs_setfilesize_ioend(ioend, error);
> > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > +		error = xfs_setfilesize(ip, offset, size);
> >  	iomap_finish_ioends(ioend, error);
> >  	memalloc_nofs_restore(nofs_flag);
> >  }
> > @@ -221,16 +222,28 @@ xfs_end_io(
> >  	struct iomap_ioend	*ioend;
> >  	struct list_head	tmp;
> >  	unsigned long		flags;
> > +	xfs_off_t		maxendoff;
> >  
> >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> >  	list_replace_init(&ip->i_ioend_list, &tmp);
> >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> >  
> >  	iomap_sort_ioends(&tmp);
> > +
> > +	/* XXX: track max endoff manually? */
> > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > +	    xfs_ioend_is_append(ioend)) {
> > +		ioend->io_flags |= IOMAP_F_APPEND;
> > +		maxendoff = ioend->io_offset + ioend->io_size;
> > +	}
> > +
> >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> >  			io_list))) {
> >  		list_del_init(&ioend->io_list);
> >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> >  		xfs_end_ioend(ioend);
> >  	}
> >  }
> 
> So now when I run a workload that is untarring a large tarball full
> of small files, we have as many transaction reserve operations
> running concurrently as there are IO completions queued.
> 

This patch has pretty much no effect on a typical untar workload because
it is dominated by delalloc -> unwritten extent conversions that already
require completion time transaction reservations. Indeed, a quick test
to untar a kernel source tree produces no setfilesize events at all.
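FWIW, that quick test was just watching the setfilesize tracepoint during
the untar; roughly the following, assuming tracefs is mounted and the
tarball/mount paths are placeholders for whatever you have handy:

```shell
# Sketch only: requires root and a mounted tracefs; adjust paths.
# Count xfs_setfilesize events fired while untarring a kernel source
# tree onto an XFS filesystem.
cd /sys/kernel/tracing
echo 1 > events/xfs/xfs_setfilesize/enable
tar -xf /root/linux-src.tar -C /mnt/test
sync
grep -c xfs_setfilesize trace
echo 0 > events/xfs/xfs_setfilesize/enable
```

No hits here for this workload, consistent with it being dominated by
delalloc -> unwritten conversions.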

I'm not sure we have many situations upstream where append transactions
are used outside of perhaps cow completions (which already have a
completion time transaction allocation for fork remaps) or intra-block
file extending writes (that thus produce an inode size change within a
mapped, already converted block). Otherwise a truncate down should
always remove post-eof blocks and speculative prealloc originates from
delalloc, so afaict those should follow the same general sequence. Eh?

As it is, I think the performance concern is overstated but I'm happy to
run any tests to confirm or deny that if you want to make more concrete
suggestions. This patch is easy to test and pretty much survived an
overnight regression run (outside of one or two things I have to look
into..). I'm happy to adjust the approach from there, but I also think
if it proves necessary there are fallback options (based around the
original suggestion in my first mail) to preserve current submission
time (serial) append transaction reservation that don't require
categorizing/splitting or partially processing the pending ioend list.

Brian

> Right now, the single threaded writeback (bdi-flusher) runs
> reservations serially, so we are introducing a large amount of
> concurrency to reservations here. IOWs, instead of throttling active
> reservations before we submit the IO we end up with attempted
> reservations only being bounded by the number of kworkers the
> completion workqueue can throw at the system.  Then we might have
> tens of active filesystems doing the same thing, each with their own
> set of workqueues and kworkers...
> 
> Yes, we can make "lots of IO to a single inode" have less overhead,
> but we do not want to do that at the expense of bad behaviour when
> we have "single IOs to lots of inodes". That's the concern I have
> here, and that's going to take a fair amount of work to characterise
> and measure the impact, especially on large machines with slow
> disks...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: xfs ioend batching log reservation deadlock
  2021-03-30 14:42         ` Brian Foster
@ 2021-03-30 23:57           ` Dave Chinner
  2021-03-31 12:26             ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-03-30 23:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Tue, Mar 30, 2021 at 10:42:48AM -0400, Brian Foster wrote:
> On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> > On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> ...
> > 
> > > @@ -182,12 +185,10 @@ xfs_end_ioend(
> > >  		error = xfs_reflink_end_cow(ip, offset, size);
> > >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> > >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > > -	else
> > > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> > >  
> > >  done:
> > > -	if (ioend->io_private)
> > > -		error = xfs_setfilesize_ioend(ioend, error);
> > > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > > +		error = xfs_setfilesize(ip, offset, size);
> > >  	iomap_finish_ioends(ioend, error);
> > >  	memalloc_nofs_restore(nofs_flag);
> > >  }
> > > @@ -221,16 +222,28 @@ xfs_end_io(
> > >  	struct iomap_ioend	*ioend;
> > >  	struct list_head	tmp;
> > >  	unsigned long		flags;
> > > +	xfs_off_t		maxendoff;
> > >  
> > >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > >  	list_replace_init(&ip->i_ioend_list, &tmp);
> > >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> > >  
> > >  	iomap_sort_ioends(&tmp);
> > > +
> > > +	/* XXX: track max endoff manually? */
> > > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > > +	    xfs_ioend_is_append(ioend)) {
> > > +		ioend->io_flags |= IOMAP_F_APPEND;
> > > +		maxendoff = ioend->io_offset + ioend->io_size;
> > > +	}
> > > +
> > >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> > >  			io_list))) {
> > >  		list_del_init(&ioend->io_list);
> > >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> > >  		xfs_end_ioend(ioend);
> > >  	}
> > >  }
> > 
> > So now when I run a workload that is untarring a large tarball full
> > of small files, we have as many transaction reserve operations
> > running concurrently as there are IO completions queued.
> > 
> 
> This patch has pretty much no effect on a typical untar workload because
> it is dominated by delalloc -> unwritten extent conversions that already
> require completion time transaction reservations. Indeed, a quick test
> to untar a kernel source tree produces no setfilesize events at all.

Great, so it's not the obvious case because of the previous
delalloc->unwritten change. What you need to do now is find
out what workload it is that is generating so many setfilesize
completions despite delalloc->unwritten so we can understand what
workloads this will actually impact.

> I'm not sure we have many situations upstream where append transactions
> are used outside of perhaps cow completions (which already have a
> completion time transaction allocation for fork remaps) or intra-block
> file extending writes (that thus produce an inode size change within a
> mapped, already converted block). Otherwise a truncate down should
> always remove post-eof blocks and speculative prealloc originates from
> delalloc, so afaict those should follow the same general sequence. Eh?
> 
> As it is, I think the performance concern is overstated but I'm happy to
> run any tests to confirm or deny that if you want to make more concrete
> suggestions.

As I said: I'm happy to change the code as long as we understand
what workloads it will impact and by how much. We don't know what
workload is generating so many setfilesize transactions yet, so we
can't actually make educated guesses on the wider impact that this
change will have. We also don't have numbers from typical workloads,
just one data point, so nobody actually knows what the impact is.

> This patch is easy to test and pretty much survived an
> overnight regression run (outside of one or two things I have to look
> into..). I'm happy to adjust the approach from there, but I also think
> if it proves necessary there are fallback options (based around the
> original suggestion in my first mail) to preserve current submission
> time (serial) append transaction reservation that don't require
> categorizing/splitting or partially processing the pending ioend list.

IO path behaviour changes require more than functional regression
tests. There is an amount of documented performance regression
testing that is required, too. The argument being made here is that
"this won't affect performance", so all I'm asking for is to be
provided with the evidence that shows this assertion to be true.

This is part of the reason I suggested breaking this up into
separate bug fix and removal patches - a pure bug fix doesn't need
performance regression testing to be done. Further, having the bug
fix separate to changing the behaviour of the code mitigates the
risk of finding an unexpected performance regression from changing
behaviour. Combining the bug fix with a significant change of
behaviour doesn't provide us with a simple method of addressing such
a regression...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs ioend batching log reservation deadlock
  2021-03-30 23:57           ` Dave Chinner
@ 2021-03-31 12:26             ` Brian Foster
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2021-03-31 12:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Wed, Mar 31, 2021 at 10:57:22AM +1100, Dave Chinner wrote:
> On Tue, Mar 30, 2021 at 10:42:48AM -0400, Brian Foster wrote:
> > On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > > > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> > ...
> > > 
> > > > @@ -182,12 +185,10 @@ xfs_end_ioend(
> > > >  		error = xfs_reflink_end_cow(ip, offset, size);
> > > >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> > > >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > > > -	else
> > > > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> > > >  
> > > >  done:
> > > > -	if (ioend->io_private)
> > > > -		error = xfs_setfilesize_ioend(ioend, error);
> > > > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > > > +		error = xfs_setfilesize(ip, offset, size);
> > > >  	iomap_finish_ioends(ioend, error);
> > > >  	memalloc_nofs_restore(nofs_flag);
> > > >  }
> > > > @@ -221,16 +222,28 @@ xfs_end_io(
> > > >  	struct iomap_ioend	*ioend;
> > > >  	struct list_head	tmp;
> > > >  	unsigned long		flags;
> > > > +	xfs_off_t		maxendoff;
> > > >  
> > > >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > > >  	list_replace_init(&ip->i_ioend_list, &tmp);
> > > >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> > > >  
> > > >  	iomap_sort_ioends(&tmp);
> > > > +
> > > > +	/* XXX: track max endoff manually? */
> > > > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > > > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > > > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > > > +	    xfs_ioend_is_append(ioend)) {
> > > > +		ioend->io_flags |= IOMAP_F_APPEND;
> > > > +		maxendoff = ioend->io_offset + ioend->io_size;
> > > > +	}
> > > > +
> > > >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> > > >  			io_list))) {
> > > >  		list_del_init(&ioend->io_list);
> > > >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > > > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> > > >  		xfs_end_ioend(ioend);
> > > >  	}
> > > >  }
> > > 
> > > So now when I run a workload that is untarring a large tarball full
> > > of small files, we have as many transaction reserve operations
> > > running concurrently as there are IO completions queued.
> > > 
> > 
> > This patch has pretty much no effect on a typical untar workload because
> > it is dominated by delalloc -> unwritten extent conversions that already
> > require completion time transaction reservations. Indeed, a quick test
> > to untar a kernel source tree produces no setfilesize events at all.
> 
> Great, so it's not the obvious case because of the previous
> delalloc->unwritten change. What you need to do now is find
> out what workload it is that is generating so many setfilesize
> completions despite delalloc->unwritten so we can understand what
> workloads this will actually impact.
> 

I think the discrepancy here is just that the original problem occurred
on a downstream kernel where per-inode ioend batching exists but the
delalloc -> unwritten change does not. The latter was a relatively more
recent upstream change. The log reservation deadlock vector still exists
regardless (i.e., if there's another source of reservation pressure),
but that probably explains why the append ioends might be more
predominant in the user kernel (and in general why this might be more
likely between upstream v5.2 and v5.9) but less of a concern on current
upstream.

In any event, I think this clarifies that there's no longer much need
for submission side append reservation in current kernels. I'll look
into that change and deal with the downstream kernel separately.

Brian

> > I'm not sure we have many situations upstream where append transactions
> > are used outside of perhaps cow completions (which already have a
> > completion time transaction allocation for fork remaps) or intra-block
> > file extending writes (that thus produce an inode size change within a
> > mapped, already converted block). Otherwise a truncate down should
> > always remove post-eof blocks and speculative prealloc originates from
> > delalloc, so afaict those should follow the same general sequence. Eh?
> > 
> > As it is, I think the performance concern is overstated but I'm happy to
> > run any tests to confirm or deny that if you want to make more concrete
> > suggestions.
> 
> As I said: I'm happy to change the code as long as we understand
> what workloads it will impact and by how much. We don't know what
> workload is generating so many setfilesize transactions yet, so we
> can't actually make educated guesses on the wider impact that this
> change will have. We also don't have numbers from typical workloads,
> just one data point, so nobody actually knows what the impact is.
> 
> > This patch is easy to test and pretty much survived an
> > overnight regression run (outside of one or two things I have to look
> > into..). I'm happy to adjust the approach from there, but I also think
> > if it proves necessary there are fallback options (based around the
> > original suggestion in my first mail) to preserve current submission
> > time (serial) append transaction reservation that don't require
> > categorizing/splitting or partially processing the pending ioend list.
> 
> IO path behaviour changes require more than functional regression
> tests. There is an amount of documented performance regression
> testing that is required, too. The argument being made here is that
> "this won't affect performance", so all I'm asking for is to be
> provided with the evidence that shows this assertion to be true.
> 
> This is part of the reason I suggested breaking this up into
> separate bug fix and removal patches - a pure bug fix doesn't need
> performance regression testing to be done. Further, having the bug
> fix separate to changing the behaviour of the code mitigates the
> risk of finding an unexpected performance regression from changing
> behaviour. Combining the bug fix with a significant change of
> behaviour doesn't provide us with a simple method of addressing such
> a regression...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



end of thread, other threads:[~2021-03-31 12:34 UTC | newest]

Thread overview: 10+ messages
2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
2021-03-26 17:32 ` Darrick J. Wong
2021-03-26 18:13   ` Brian Foster
2021-03-29  2:28   ` Dave Chinner
2021-03-29 18:04     ` Brian Foster
2021-03-29 23:51       ` Dave Chinner
2021-03-30 14:42         ` Brian Foster
2021-03-30 23:57           ` Dave Chinner
2021-03-31 12:26             ` Brian Foster
2021-03-29  5:45 ` Christoph Hellwig
