* xfs ioend batching log reservation deadlock
@ 2021-03-26 15:39 Brian Foster
  2021-03-26 17:32 ` Darrick J. Wong
  2021-03-29  5:45 ` Christoph Hellwig
  0 siblings, 2 replies; 10+ messages in thread
From: Brian Foster @ 2021-03-26 15:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: Darrick J. Wong, Christoph Hellwig

Hi all,

We have a report of a workload that deadlocks on log reservation via
iomap_ioend completion batching. To start, the fs format is somewhat
unique in that the log is on the smaller side (35MB) and the log stripe
unit is 256k, but this is actually a default mkfs for the underlying
storage. I don't have much more information wrt the workload or
anything that contributes to the completion processing characteristics.

The overall scenario is that a workqueue task is executing in
xfs_end_io() and blocked on transaction reservation for an unwritten
extent conversion. Since this task began executing and pulled pending
items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
of which have append transactions. These append transactions account for
~520k of log reservation each due to the log stripe unit. All together
this consumes nearly all of the available log space, prevents allocation of
the aforementioned unwritten extent conversion transaction and thus
leaves the fs in a deadlocked state.
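To put rough numbers on that (a userspace back-of-envelope sketch; the helper
names are mine, only the figures come from the report):

```c
#include <assert.h>

/* Back-of-envelope model of the report above: how much log space do N
 * pending append transactions pin?  Only the figures (67 ioends, ~520k
 * of reservation each, 35MB log) come from the report; the helpers are
 * purely illustrative. */
static unsigned long long pinned_log_space(unsigned long long nr_append,
					   unsigned long long res_per_tx)
{
	return nr_append * res_per_tx;
}

/* The fs deadlocks once the pinned reservation plus what the blocked
 * unwritten conversion still needs exceeds the log size. */
static int log_reservation_stalls(unsigned long long log_size,
				  unsigned long long pinned,
				  unsigned long long want)
{
	return pinned + want > log_size;
}
```

67 append transactions at ~520k pin ~34MB of a 35MB log before the conversion
transaction even tries to reserve.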

I can think of different ways we could probably optimize this problem
away. One example is to transfer the append transaction to the inode at
bio completion time such that we retain only one per pending batch of
ioends. The workqueue task would then pull this append transaction from
the inode along with the ioend list and transfer it back to the last
non-unwritten/shared ioend in the sorted list.
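As a sketch of that idea (userspace model with made-up types; a real version
would hand off and cancel actual xfs_trans reservations rather than just
dropping pointers):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the idea above: at bio completion, donate the
 * ioend's append transaction to the inode so that at most one log
 * reservation is held per pending batch of ioends. */
struct tx { int reserved; };

struct m_ioend {
	struct m_ioend	*next;
	struct tx	*append_tx;
};

struct m_inode {
	struct m_ioend	*ioend_list;
	struct tx	*append_tx;	/* one tx for the whole batch */
};

static void queue_ioend(struct m_inode *ip, struct m_ioend *io)
{
	if (io->append_tx) {
		if (!ip->append_tx)
			ip->append_tx = io->append_tx;
		/* else: the surplus reservation would be cancelled here */
		io->append_tx = NULL;
	}
	io->next = ip->ioend_list;
	ip->ioend_list = io;
}
```

The workqueue side would then pull ip->append_tx along with the list and
reattach it to the last non-unwritten/shared ioend after sorting.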

That said, I'm not totally convinced this addresses the fundamental
problem of acquiring transaction reservation from a context that
essentially already owns outstanding reservation vs. just making it hard
to reproduce. I'm wondering if/why we need the append transaction at
all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
updates at I/O completion time") in v3.4, which converted the completion-time
on-disk size update from an unlogged update into a logged transaction. If we
continue to
send these potential append ioends to the workqueue for completion
processing, is there any reason we can't let the workqueue allocate the
transaction as it already does for unwritten conversion?

If that is reasonable, I'm thinking of a couple patches:

1. Optimize current append transaction processing with an inode field as
noted above.

2. Replace the submission side append transaction entirely with a flag
or some such on the ioend that allocates the transaction at completion
time, but otherwise preserves batching behavior instituted in patch 1.

Thoughts?

Brian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs ioend batching log reservation deadlock
  2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
@ 2021-03-26 17:32 ` Darrick J. Wong
  2021-03-26 18:13   ` Brian Foster
  2021-03-29  2:28   ` Dave Chinner
  2021-03-29  5:45 ` Christoph Hellwig
  1 sibling, 2 replies; 10+ messages in thread
From: Darrick J. Wong @ 2021-03-26 17:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Christoph Hellwig

On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> Hi all,
> 
> We have a report of a workload that deadlocks on log reservation via
> iomap_ioend completion batching. To start, the fs format is somewhat
> unique in that the log is on the smaller side (35MB) and the log stripe
> unit is 256k, but this is actually a default mkfs for the underlying
> storage. I don't have much more information wrt to the workload or
> anything that contributes to the completion processing characteristics.
> 
> The overall scenario is that a workqueue task is executing in
> xfs_end_io() and blocked on transaction reservation for an unwritten
> extent conversion. Since this task began executing and pulled pending
> items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> of which have append transactions. These append transactions account for
> ~520k of log reservation each due to the log stripe unit. All together
> this consumes nearly all of available log space, prevents allocation of
> the aforementioned unwritten extent conversion transaction and thus
> leaves the fs in a deadlocked state.
> 
> I can think of different ways we could probably optimize this problem
> away. One example is to transfer the append transaction to the inode at
> bio completion time such that we retain only one per pending batch of
> ioends. The workqueue task would then pull this append transaction from
> the inode along with the ioend list and transfer it back to the last
> non-unwritten/shared ioend in the sorted list.
> 
> That said, I'm not totally convinced this addresses the fundamental
> problem of acquiring transaction reservation from a context that
> essentially already owns outstanding reservation vs. just making it hard
> to reproduce. I'm wondering if/why we need the append transaction at
> all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> updates at I/O completion time") in v3.4 which changed the completion
> on-disk size update from being an unlogged update. If we continue to
> send these potential append ioends to the workqueue for completion
> processing, is there any reason we can't let the workqueue allocate the
> transaction as it already does for unwritten conversion?

Frankly I've never understood what benefit we get from preallocating a
transaction and letting it twist in the wind consuming log space while
writeback pushes data to the disk.  It's perfectly fine to delay ioend
processing while we wait for unwritten conversions and cow remapping to
take effect, so what's the harm in a slight delay for this?

I guess it's an optimization to reduce wait times?  It's a pity that
nobody left a comment justifying why it was done in that particular way,
what with the freeze protection lockdep weirdness too.

> If that is reasonable, I'm thinking of a couple patches:
> 
> 1. Optimize current append transaction processing with an inode field as
> noted above.
> 
> 2. Replace the submission side append transaction entirely with a flag
> or some such on the ioend that allocates the transaction at completion
> time, but otherwise preserves batching behavior instituted in patch 1.

What happens if you replace the call to xfs_setfilesize_ioend in
xfs_end_ioend with xfs_setfilesize, and skip the transaction
preallocation altogether?

--D

> Thoughts?
> 
> Brian
> 


* Re: xfs ioend batching log reservation deadlock
  2021-03-26 17:32 ` Darrick J. Wong
@ 2021-03-26 18:13   ` Brian Foster
  2021-03-29  2:28   ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Brian Foster @ 2021-03-26 18:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Christoph Hellwig

On Fri, Mar 26, 2021 at 10:32:44AM -0700, Darrick J. Wong wrote:
> On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > We have a report of a workload that deadlocks on log reservation via
> > iomap_ioend completion batching. To start, the fs format is somewhat
> > unique in that the log is on the smaller side (35MB) and the log stripe
> > unit is 256k, but this is actually a default mkfs for the underlying
> > storage. I don't have much more information wrt to the workload or
> > anything that contributes to the completion processing characteristics.
> > 
> > The overall scenario is that a workqueue task is executing in
> > xfs_end_io() and blocked on transaction reservation for an unwritten
> > extent conversion. Since this task began executing and pulled pending
> > items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> > of which have append transactions. These append transactions account for
> > ~520k of log reservation each due to the log stripe unit. All together
> > this consumes nearly all of available log space, prevents allocation of
> > the aforementioned unwritten extent conversion transaction and thus
> > leaves the fs in a deadlocked state.
> > 
> > I can think of different ways we could probably optimize this problem
> > away. One example is to transfer the append transaction to the inode at
> > bio completion time such that we retain only one per pending batch of
> > ioends. The workqueue task would then pull this append transaction from
> > the inode along with the ioend list and transfer it back to the last
> > non-unwritten/shared ioend in the sorted list.
> > 
> > That said, I'm not totally convinced this addresses the fundamental
> > problem of acquiring transaction reservation from a context that
> > essentially already owns outstanding reservation vs. just making it hard
> > to reproduce. I'm wondering if/why we need the append transaction at
> > all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> > updates at I/O completion time") in v3.4 which changed the completion
> > on-disk size update from being an unlogged update. If we continue to
> > send these potential append ioends to the workqueue for completion
> > processing, is there any reason we can't let the workqueue allocate the
> > transaction as it already does for unwritten conversion?
> 
> Frankly I've never understood what benefit we get from preallocating a
> transaction and letting it twist in the wind consuming log space while
> writeback pushes data to the disk.  It's perfectly fine to delay ioend
> processing while we wait for unwritten conversions and cow remapping to
> take effect, so what's the harm in a slight delay for this?
> 
> I guess it's an optimization to reduce wait times?  It's a pity that
> nobody left a comment justifying why it was done in that particular way,
> what with the freeze protection lockdep weirdness too.
> 

I thought it might have been to facilitate ioend completion in interrupt
(i.e. bio completion) context, but I'm really not sure. That's why I'm
asking. :) I'm hoping Christoph can provide some context since it
appears to be his original implementation.

> > If that is reasonable, I'm thinking of a couple patches:
> > 
> > 1. Optimize current append transaction processing with an inode field as
> > noted above.
> > 
> > 2. Replace the submission side append transaction entirely with a flag
> > or some such on the ioend that allocates the transaction at completion
> > time, but otherwise preserves batching behavior instituted in patch 1.
> 
> What happens if you replace the call to xfs_setfilesize_ioend in
> xfs_end_ioend with xfs_setfilesize, and skip the transaction
> preallocation altogether?
> 

That's pretty much what I'm referring to in step 2 above. I was just
thinking it might make sense to implement some sort of batching model
first to avoid scenarios where we have a bunch of discontiguous append
ioends in the completion list and really only need to update the file
size at the end. (Hence a flag or whatever to indicate the last ioend
must call xfs_setfilesize()).

That said, we do still have contiguous ioend merging happening already.
We could certainly just rely on that, update di_size as we process each
append ioend and worry about further optimization later. That might be
simpler, less code (and a safer first step)..
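Roughly, the merge criterion I mean is something like the following
(illustrative userspace model; the real logic lives in
iomap_ioend_try_merge(), and the field names here are simplified):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of contiguous ioend merging: two ioends can merge
 * when they are the same kind of completion and physically adjacent in
 * file offset, so a long sequential append only needs one on-disk size
 * update at the end. */
struct c_ioend {
	unsigned long long	offset;	/* file offset of the ioend */
	unsigned long long	size;	/* bytes covered */
	int			type;	/* e.g. unwritten vs. plain write */
	bool			shared;	/* cow/shared completion */
};

static bool can_merge(const struct c_ioend *a, const struct c_ioend *b)
{
	return a->type == b->type && a->shared == b->shared &&
	       a->offset + a->size == b->offset;
}
```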

Brian

> --D
> 
> > Thoughts?
> > 
> > Brian
> > 
> 



* Re: xfs ioend batching log reservation deadlock
  2021-03-26 17:32 ` Darrick J. Wong
  2021-03-26 18:13   ` Brian Foster
@ 2021-03-29  2:28   ` Dave Chinner
  2021-03-29 18:04     ` Brian Foster
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-03-29  2:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, Christoph Hellwig

On Fri, Mar 26, 2021 at 10:32:44AM -0700, Darrick J. Wong wrote:
> On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > We have a report of a workload that deadlocks on log reservation via
> > iomap_ioend completion batching. To start, the fs format is somewhat
> > unique in that the log is on the smaller side (35MB) and the log stripe
> > unit is 256k, but this is actually a default mkfs for the underlying
> > storage. I don't have much more information wrt to the workload or
> > anything that contributes to the completion processing characteristics.
> > 
> > The overall scenario is that a workqueue task is executing in
> > xfs_end_io() and blocked on transaction reservation for an unwritten
> > extent conversion. Since this task began executing and pulled pending
> > items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> > of which have append transactions. These append transactions account for
> > ~520k of log reservation each due to the log stripe unit. All together
> > this consumes nearly all of available log space, prevents allocation of
> > the aforementioned unwritten extent conversion transaction and thus
> > leaves the fs in a deadlocked state.
> > 
> > I can think of different ways we could probably optimize this problem
> > away. One example is to transfer the append transaction to the inode at
> > bio completion time such that we retain only one per pending batch of
> > ioends. The workqueue task would then pull this append transaction from
> > the inode along with the ioend list and transfer it back to the last
> > non-unwritten/shared ioend in the sorted list.
> > 
> > That said, I'm not totally convinced this addresses the fundamental
> > problem of acquiring transaction reservation from a context that
> > essentially already owns outstanding reservation vs. just making it hard
> > to reproduce. I'm wondering if/why we need the append transaction at
> > all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> > updates at I/O completion time") in v3.4 which changed the completion
> > on-disk size update from being an unlogged update. If we continue to
> > send these potential append ioends to the workqueue for completion
> > processing, is there any reason we can't let the workqueue allocate the
> > transaction as it already does for unwritten conversion?
> 
> Frankly I've never understood what benefit we get from preallocating a
> transaction and letting it twist in the wind consuming log space while
> writeback pushes data to the disk.  It's perfectly fine to delay ioend
> processing while we wait for unwritten conversions and cow remapping to
> take effect, so what's the harm in a slight delay for this?

The difference was that file size updates used to be far, far more
common than unwritten extent updates for buffered IO. When this code
was written, we almost never did buffered writes into unwritten
regions, but we almost always did sequential writes that required a
file size update.

Given that this code was replacing an un-synchronised size update,
the performance impact of reserving transaction space in IO
completion was significant. There was also the problem of XFS using
global workqueues - the series that introduced the append
transaction also introduced per-mount IO completion workqueues and
so there were concerns about blowing out the number of completion
workers when we have thousands of pending completions all waiting on
log space.

There was a bunch of considerations that probably don't exist
anymore, plus a bunch of new ones, such as the fact that we now
queue and merge ioends to process in a single context rather than
just spraying ioends to worker threads to deal with. The old code
would have worked just fine - the unwritten extent conversion would
not have been blocking all those other IO completions...

Wait, hold on - we're putting both unwritten conversion and size
updates onto the same queue?

Ah, yes, that's exactly what we are doing. We've punted all the size
extension to the unwritten workqueue, regardless of whether it's
just a size update or not. And then the work processes them one at a
time after sorting them. We don't complete/free the append
transactions until we try to merge them one at a time as we walk the
ioend list on the inode. Hence we try to reserve more log space
while we still hold an unknown amount of log space on the queue we
are processing...

IOWs, the completion queuing that mixes unwritten extent conversion and
size updates is *nesting transactions*. That's the deadlock - we
can't take a new transaction reservation while holding another
transaction reservation in a state where it cannot make progress....

> What happens if you replace the call to xfs_setfilesize_ioend in
> xfs_end_ioend with xfs_setfilesize, and skip the transaction
> preallocation altogether?

I expect that the deadlock will go away at the expense of increased
log reservation contention in IO completion because this unbounds
the amount of transaction reservation concurrency that can occur in
buffered writeback. IOWs, this might just hammer the log really,
really hard and that's exactly what we don't want data IO completion
to do....

I'd say the first thing to fix is the ordering problem on the
completion queue. XFS needs more than just offset based ordering, it
needs ioend type based ordering, too. All the size updates should be
processed before the unwritten extent conversions, hence removing
the nesting of transactions....
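i.e. a comparator shaped something like this (userspace sketch; the type
tags and their ranking are illustrative, not the actual iomap sort key):

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the ordering above: sort pending completions by ioend type
 * first, so size updates are processed before unwritten extent
 * conversions, then by offset within each type. */
enum m_type { MT_APPEND, MT_UNWRITTEN };

struct s_ioend {
	enum m_type		type;
	unsigned long long	offset;
};

static int ioend_cmp(const void *pa, const void *pb)
{
	const struct s_ioend *a = pa, *b = pb;

	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;	/* size updates first */
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```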

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs ioend batching log reservation deadlock
  2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
  2021-03-26 17:32 ` Darrick J. Wong
@ 2021-03-29  5:45 ` Christoph Hellwig
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2021-03-29  5:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Darrick J. Wong, Christoph Hellwig

On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> 1. Optimize current append transaction processing with an inode field as
> noted above.
> 
> 2. Replace the submission side append transaction entirely with a flag
> or some such on the ioend that allocates the transaction at completion
> time, but otherwise preserves batching behavior instituted in patch 1.

I'm pretty sure I had a patch to kill off the transaction reservation
a while ago and Dave objected, mostly on performance grounds, in that
we might have tons of reservations coming from the I/O completion
workqueue that would all be blocked on the transaction reservations.

This would probably be much better with the ioend merging, which should
significantly reduce the number of transaction reservations - probably to
no more than what common workloads using unwritten extents generate.

I'm all for killing the transaction pre-reservations as they have created
a lot of pain for us.


* Re: xfs ioend batching log reservation deadlock
  2021-03-29  2:28   ` Dave Chinner
@ 2021-03-29 18:04     ` Brian Foster
  2021-03-29 23:51       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-03-29 18:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> On Fri, Mar 26, 2021 at 10:32:44AM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > We have a report of a workload that deadlocks on log reservation via
> > > iomap_ioend completion batching. To start, the fs format is somewhat
> > > unique in that the log is on the smaller side (35MB) and the log stripe
> > > unit is 256k, but this is actually a default mkfs for the underlying
> > > storage. I don't have much more information wrt to the workload or
> > > anything that contributes to the completion processing characteristics.
> > > 
> > > The overall scenario is that a workqueue task is executing in
> > > xfs_end_io() and blocked on transaction reservation for an unwritten
> > > extent conversion. Since this task began executing and pulled pending
> > > items from ->i_ioend_list, the latter was repopulated with 90 ioends, 67
> > > of which have append transactions. These append transactions account for
> > > ~520k of log reservation each due to the log stripe unit. All together
> > > this consumes nearly all of available log space, prevents allocation of
> > > the aforementioned unwritten extent conversion transaction and thus
> > > leaves the fs in a deadlocked state.
> > > 
> > > I can think of different ways we could probably optimize this problem
> > > away. One example is to transfer the append transaction to the inode at
> > > bio completion time such that we retain only one per pending batch of
> > > ioends. The workqueue task would then pull this append transaction from
> > > the inode along with the ioend list and transfer it back to the last
> > > non-unwritten/shared ioend in the sorted list.
> > > 
> > > That said, I'm not totally convinced this addresses the fundamental
> > > problem of acquiring transaction reservation from a context that
> > > essentially already owns outstanding reservation vs. just making it hard
> > > to reproduce. I'm wondering if/why we need the append transaction at
> > > all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> > > updates at I/O completion time") in v3.4 which changed the completion
> > > on-disk size update from being an unlogged update. If we continue to
> > > send these potential append ioends to the workqueue for completion
> > > processing, is there any reason we can't let the workqueue allocate the
> > > transaction as it already does for unwritten conversion?
> > 
> > Frankly I've never understood what benefit we get from preallocating a
> > transaction and letting it twist in the wind consuming log space while
> > writeback pushes data to the disk.  It's perfectly fine to delay ioend
> > processing while we wait for unwritten conversions and cow remapping to
> > take effect, so what's the harm in a slight delay for this?
> 
> The difference was that file size updates used to be far, far more
> common than unwritten extent updates for buffered IO. When this code
> was written, we almost never did buffered writes into unwritten
> regions, but we always do sequential writes that required a file
> size update.
> 
> Given that this code was replacing an un-synchronised size update,
> the performance impact of reserving transaction space in IO
> completion was significant. There was also the problem of XFS using
> global workqueues - the series that introduced the append
> transaction also introduced per-mount IO completion workqueues and
> so there were concerns about blowing out the number of completion
> workers when we have thousands of pending completions all waiting on
> log space.
> 
> There was a bunch of considerations that probably don't exist
> anymore, plus a bunch of new ones, such as the fact that we now
> queue and merge ioends to process in a single context rather than
> just spraying ioends to worker threads to deal with. The old code
> would have worked just fine - the unwritten extent conversion would
> not have been blocking all those other IO completions...
> 
> Wait, hold on - we're putting both unwritten conversion and size
> updates onto the same queue?
> 
> Ah, yes, that's exactly what we are doing. We've punted all the size
> extension to the unwritten workqueue, regardless of whether it's
> just a size update or not. And then the work processes them one at a
> time after sorting them. We don't complete/free the append
> transactions until we try to merge them one at a time as we walk the
> ioend list on the inode. Hence we try to reserve more log space
> while we still hold an unknown amount of log space on the queue we
> are processin...
> 
> IOws, the completion queuing mixing unwritten extent conversion and
> size updates is *nesting transactions*.  THat's the deadlock - we
> can't take a new transaction reservation while holding another
> transaction reservation in a state where it cannot make progress....
> 

Yes.

> > What happens if you replace the call to xfs_setfilesize_ioend in
> > xfs_end_ioend with xfs_setfilesize, and skip the transaction
> > preallocation altogether?
> 
> I expect that the deadlock will go away at the expense of increased
> log reservation contention in IO completion because this unbounds
> the amount of transaction reservation concurrency that can occur in
> buffered writeback. IOWs, this might just hammer the log really,
> really hard and that's exactly what we don't want data IO completion
> to do....
> 

The point of the original idea was to try and eliminate most of the
transaction overhead for di_size updates in the first place by doing an
update per batch (as opposed to per ioend). See the appended diff for a
crude hacked up example of what I mean.

> I'd say the first thing to fix is the ordering problem on the
> completion queue. XFS needs more than just offset based ordering, it
> needs ioend type based ordering, too. All the size updates should be
> processed before the unwritten extent conversions, hence removing
> the nesting of transactions....
> 

Generally speaking this makes sense if we need to retain the
preallocated size update transactions for some reason. One thing to note
is that we'd be putting on-disk size updates ahead of other ioend
completions on background writeback. I suspect that might not matter
much for unwritten ioends since we'd just expose unwritten extents after
a crash, but the effect on cow writeback completion is not as clear to
me.

For one, it looks like a cow ioend can both require a transaction
allocation for fork remap as well as have an append transaction already
attached, so we'd probably have to tweak how individual ioends are
processed as opposed to just ordering them differently. I also thought
cow blocks don't necessarily have to cover shared (or even existing)
blocks in the data fork due to preallocation, so we'd probably need to
investigate things like whether this makes it possible to put an on-disk
update ahead of a cow remap that lands somewhere in the range between
the in-core inode size and the (smaller) on-disk inode size, and if so,
whether that could result in problematic behavior. I'm not sure this is
worth the associated complexity if there's opportunity to remove the
need for most of these transactions in the first place. Hm?

Brian

--- 8< ---

Hacked up RFC to demonstrate batched completion side transaction
allocations for on-disk updates.

---

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1cc7c36d98e9..04d200e5e70d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -18,6 +18,9 @@
 #include "xfs_bmap_util.h"
 #include "xfs_reflink.h"
 
+/* XXX */
+#define IOMAP_F_APPEND	0x2000
+
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
 	unsigned int		data_seq;
@@ -182,12 +185,10 @@ xfs_end_ioend(
 		error = xfs_reflink_end_cow(ip, offset, size);
 	else if (ioend->io_type == IOMAP_UNWRITTEN)
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
-	else
-		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
 
 done:
-	if (ioend->io_private)
-		error = xfs_setfilesize_ioend(ioend, error);
+	if (ioend->io_flags & IOMAP_F_APPEND)
+		error = xfs_setfilesize(ip, offset, size);
 	iomap_finish_ioends(ioend, error);
 	memalloc_nofs_restore(nofs_flag);
 }
@@ -221,16 +222,28 @@ xfs_end_io(
 	struct iomap_ioend	*ioend;
 	struct list_head	tmp;
 	unsigned long		flags;
+	xfs_off_t		maxendoff;
 
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
 	list_replace_init(&ip->i_ioend_list, &tmp);
 	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
 
 	iomap_sort_ioends(&tmp);
+
+	/* XXX: track max endoff manually? */
+	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
+	if (((ioend->io_flags & IOMAP_F_SHARED) ||
+	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
+	    xfs_ioend_is_append(ioend)) {
+		ioend->io_flags |= IOMAP_F_APPEND;
+		maxendoff = ioend->io_offset + ioend->io_size;
+	}
+
 	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
 			io_list))) {
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
+		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
 		xfs_end_ioend(ioend);
 	}
 }
@@ -250,8 +263,6 @@ xfs_end_bio(
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
 	unsigned long		flags;
 
-	ASSERT(xfs_ioend_needs_workqueue(ioend));
-
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
 	if (list_empty(&ip->i_ioend_list))
 		WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
@@ -487,6 +498,7 @@ xfs_prepare_ioend(
 	int			status)
 {
 	unsigned int		nofs_flag;
+	bool			append = false;
 
 	/*
 	 * We can allocate memory here while doing writeback on behalf of
@@ -501,17 +513,17 @@ xfs_prepare_ioend(
 				ioend->io_offset, ioend->io_size);
 	}
 
-	/* Reserve log space if we might write beyond the on-disk inode size. */
+	/* XXX: quick hack to queue append ioends w/o transaction */
 	if (!status &&
 	    ((ioend->io_flags & IOMAP_F_SHARED) ||
 	     ioend->io_type != IOMAP_UNWRITTEN) &&
 	    xfs_ioend_is_append(ioend) &&
 	    !ioend->io_private)
-		status = xfs_setfilesize_trans_alloc(ioend);
+		append = true;
 
 	memalloc_nofs_restore(nofs_flag);
 
-	if (xfs_ioend_needs_workqueue(ioend))
+	if (xfs_ioend_needs_workqueue(ioend) || append)
 		ioend->io_bio->bi_end_io = xfs_end_bio;
 	return status;
 }



* Re: xfs ioend batching log reservation deadlock
  2021-03-29 18:04     ` Brian Foster
@ 2021-03-29 23:51       ` Dave Chinner
  2021-03-30 14:42         ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-03-29 23:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> > I'd say the first thing to fix is the ordering problem on the
> > completion queue. XFS needs more than just offset based ordering, it
> > needs ioend type based ordering, too. All the size updates should be
> > processed before the unwritten extent conversions, hence removing
> > the nesting of transactions....
> > 
> 
> Generally speaking this makes sense if we need to retain the
> preallocated size update transactions for some reason. One thing to note
> is that we'd be putting on-disk size updates ahead of other ioend
> completions on background writeback. I suspect that might not matter
> much for unwritten ioends since we'd just expose unwritten extents after
> a crash, but the effect on cow writeback completion is not as clear to
> me.

The old code (pre-queue+merge) completed completions in random order
(i.e. whatever the workqueue and scheduler resulted in) so I don't
think this is a concern.

My concern is that it is a fairly major change of behaviour for IO
completion, we know this is a performance sensitive path, and that
we shouldn't make sweeping architectural changes without having a
bunch of performance regression tests demonstrating that the change
does not have unexpected, adverse side effects.

Hence my suggestion that we fix the deadlock first, then do the
investigation work to determine if removing the pre-IO transaction
reservation is a benefit or not. We often do the simple, low risk
bug fix first, then do the more extensive rework that may have
unexpected side effects so that we have a less risky backport
candidate for stable distros. I don't see this as any different
here.

> For one, it looks like a cow ioend can both require a transaction
> allocation for fork remap as well as have an append transaction already
> attached, so we'd probably have to tweak how individual ioends are
> processed as opposed to just ordering them differently. I also thought
> cow blocks don't necessarily have to cover shared (or even existing)
> blocks in the data fork due to preallocation, so we'd probably need to
> investigate things like whether this makes it possible to put an on-disk
> update ahead of a cow remap that lands somewhere in the range between
> the in-core inode size and the (smaller) on-disk inode size, and if so,
> whether that could result in problematic behavior.

These things all used to happen before we (recently) added all the
per-inode queuing and merging code to IO completion. So there's no
real evidence that there is a problem completing them in different
orders. If someone runs fsync, then they'll all get completed before
fsync returns, so AFAICT this is just unnecessary complication of a
relatively simple mechanism.

> I'm not sure this is
> worth the associated complexity if there's opportunity to remove the
> need for most of these transactions in the first place. Hm?

Please keep in mind I'm not saying "we can't remove this code" as
Christoph has implied. Last time this came up, like the "use
unwritten extents for delalloc", I asked for performance/benchmark
results that indicate that the change doesn't introduce excessive
overhead or unexpected regressions. Again, I'm asking for that work
to be done *before* we make a significant change in behaviour because
that change in behaviour is not necessary to fix the bug.

> @@ -182,12 +185,10 @@ xfs_end_ioend(
>  		error = xfs_reflink_end_cow(ip, offset, size);
>  	else if (ioend->io_type == IOMAP_UNWRITTEN)
>  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> -	else
> -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
>  
>  done:
> -	if (ioend->io_private)
> -		error = xfs_setfilesize_ioend(ioend, error);
> +	if (ioend->io_flags & IOMAP_F_APPEND)
> +		error = xfs_setfilesize(ip, offset, size);
>  	iomap_finish_ioends(ioend, error);
>  	memalloc_nofs_restore(nofs_flag);
>  }
> @@ -221,16 +222,28 @@ xfs_end_io(
>  	struct iomap_ioend	*ioend;
>  	struct list_head	tmp;
>  	unsigned long		flags;
> +	xfs_off_t		maxendoff;
>  
>  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
>  	list_replace_init(&ip->i_ioend_list, &tmp);
>  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
>  
>  	iomap_sort_ioends(&tmp);
> +
> +	/* XXX: track max endoff manually? */
> +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> +	    xfs_ioend_is_append(ioend)) {
> +		ioend->io_flags |= IOMAP_F_APPEND;
> +		maxendoff = ioend->io_offset + ioend->io_size;
> +	}
> +
>  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
>  			io_list))) {
>  		list_del_init(&ioend->io_list);
>  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
>  		xfs_end_ioend(ioend);
>  	}
>  }

So now when I run a workload that is untarring a large tarball full
of small files, we have as many transaction reserve operations
running concurrently as there are IO completions queued.

Right now, the single threaded writeback (bdi-flusher) runs
reservations serially, so we are introducing a large amount of
concurrency to reservations here. IOWs, instead of throttling active
reservations before we submit the IO we end up with attempted
reservations only being bounded by the number of kworkers the
completion workqueue can throw at the system.  Then we might have
tens of active filesystems doing the same thing, each with their own
set of workqueues and kworkers...

Yes, we can make "lots of IO to a single inode" have less overhead,
but we do not want to do that at the expense of bad behaviour when
we have "single IOs to lots of inodes". That's the concern I have
here, and that's going to take a fair amount of work to characterise
and measure the impact, especially on large machines with slow
disks...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs ioend batching log reservation deadlock
  2021-03-29 23:51       ` Dave Chinner
@ 2021-03-30 14:42         ` Brian Foster
  2021-03-30 23:57           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Foster @ 2021-03-30 14:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
...
> 
> > @@ -182,12 +185,10 @@ xfs_end_ioend(
> >  		error = xfs_reflink_end_cow(ip, offset, size);
> >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > -	else
> > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> >  
> >  done:
> > -	if (ioend->io_private)
> > -		error = xfs_setfilesize_ioend(ioend, error);
> > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > +		error = xfs_setfilesize(ip, offset, size);
> >  	iomap_finish_ioends(ioend, error);
> >  	memalloc_nofs_restore(nofs_flag);
> >  }
> > @@ -221,16 +222,28 @@ xfs_end_io(
> >  	struct iomap_ioend	*ioend;
> >  	struct list_head	tmp;
> >  	unsigned long		flags;
> > +	xfs_off_t		maxendoff;
> >  
> >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> >  	list_replace_init(&ip->i_ioend_list, &tmp);
> >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> >  
> >  	iomap_sort_ioends(&tmp);
> > +
> > +	/* XXX: track max endoff manually? */
> > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > +	    xfs_ioend_is_append(ioend)) {
> > +		ioend->io_flags |= IOMAP_F_APPEND;
> > +		maxendoff = ioend->io_offset + ioend->io_size;
> > +	}
> > +
> >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> >  			io_list))) {
> >  		list_del_init(&ioend->io_list);
> >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> >  		xfs_end_ioend(ioend);
> >  	}
> >  }
> 
> So now when I run a workload that is untarring a large tarball full
> of small files, we have as many transaction reserve operations
> running concurrently as there are IO completions queued.
> 

This patch has pretty much no effect on a typical untar workload because
it is dominated by delalloc -> unwritten extent conversions that already
require completion time transaction reservations. Indeed, a quick test
to untar a kernel source tree produces no setfilesize events at all.
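FWIW, that quick test was just watching the setfilesize tracepoint during
the untar; roughly the following, assuming tracefs is mounted and the
tarball/mount paths are placeholders for whatever you have handy:

```shell
# Sketch only: requires root and a mounted tracefs; adjust paths.
# Count xfs_setfilesize events fired while untarring a kernel source
# tree onto an XFS filesystem.
cd /sys/kernel/tracing
echo 1 > events/xfs/xfs_setfilesize/enable
tar -xf /root/linux-src.tar -C /mnt/test
sync
grep -c xfs_setfilesize trace
echo 0 > events/xfs/xfs_setfilesize/enable
```

No hits here for this workload, consistent with it being dominated by
delalloc -> unwritten conversions.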

I'm not sure we have many situations upstream where append transactions
are used outside of perhaps cow completions (which already have a
completion time transaction allocation for fork remaps) or intra-block
file extending writes (that thus produce an inode size change within a
mapped, already converted block). Otherwise a truncate down should
always remove post-eof blocks and speculative prealloc originates from
delalloc, so afaict those should follow the same general sequence. Eh?

As it is, I think the performance concern is overstated but I'm happy to
run any tests to confirm or deny that if you want to make more concrete
suggestions. This patch is easy to test and pretty much survived an
overnight regression run (outside of one or two things I have to look
into..). I'm happy to adjust the approach from there, but I also think
if it proves necessary there are fallback options (based around the
original suggestion in my first mail) to preserve current submission
time (serial) append transaction reservation that don't require
categorizing/splitting or partially processing the pending ioend list.

Brian

> Right now, the single threaded writeback (bdi-flusher) runs
> reservations serially, so we are introducing a large amount of
> concurrency to reservations here. IOWs, instead of throttling active
> reservations before we submit the IO we end up with attempted
> reservations only being bounded by the number of kworkers the
> completion workqueue can throw at the system.  Then we might have
> tens of active filesystems doing the same thing, each with their own
> set of workqueues and kworkers...
> 
> Yes, we can make "lots of IO to a single inode" have less overhead,
> but we do not want to do that at the expense of bad behaviour when
> we have "single IOs to lots of inodes". That's the concern I have
> here, and that's going to take a fair amount of work to characterise
> and measure the impact, especially on large machines with slow
> disks...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: xfs ioend batching log reservation deadlock
  2021-03-30 14:42         ` Brian Foster
@ 2021-03-30 23:57           ` Dave Chinner
  2021-03-31 12:26             ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-03-30 23:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Tue, Mar 30, 2021 at 10:42:48AM -0400, Brian Foster wrote:
> On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> > On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> ...
> > 
> > > @@ -182,12 +185,10 @@ xfs_end_ioend(
> > >  		error = xfs_reflink_end_cow(ip, offset, size);
> > >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> > >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > > -	else
> > > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> > >  
> > >  done:
> > > -	if (ioend->io_private)
> > > -		error = xfs_setfilesize_ioend(ioend, error);
> > > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > > +		error = xfs_setfilesize(ip, offset, size);
> > >  	iomap_finish_ioends(ioend, error);
> > >  	memalloc_nofs_restore(nofs_flag);
> > >  }
> > > @@ -221,16 +222,28 @@ xfs_end_io(
> > >  	struct iomap_ioend	*ioend;
> > >  	struct list_head	tmp;
> > >  	unsigned long		flags;
> > > +	xfs_off_t		maxendoff;
> > >  
> > >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > >  	list_replace_init(&ip->i_ioend_list, &tmp);
> > >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> > >  
> > >  	iomap_sort_ioends(&tmp);
> > > +
> > > +	/* XXX: track max endoff manually? */
> > > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > > +	    xfs_ioend_is_append(ioend)) {
> > > +		ioend->io_flags |= IOMAP_F_APPEND;
> > > +		maxendoff = ioend->io_offset + ioend->io_size;
> > > +	}
> > > +
> > >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> > >  			io_list))) {
> > >  		list_del_init(&ioend->io_list);
> > >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> > >  		xfs_end_ioend(ioend);
> > >  	}
> > >  }
> > 
> > So now when I run a workload that is untarring a large tarball full
> > of small files, we have as many transaction reserve operations
> > running concurrently as there are IO completions queued.
> > 
> 
> This patch has pretty much no effect on a typical untar workload because
> it is dominated by delalloc -> unwritten extent conversions that already
> require completion time transaction reservations. Indeed, a quick test
> to untar a kernel source tree produces no setfilesize events at all.

Great, so it's not the obvious case because of the previous
delalloc->unwritten change. What you need to do now is find
out what workload it is that is generating so many setfilesize
completions despite delalloc->unwritten so we can understand what
workloads this will actually impact.

> I'm not sure we have many situations upstream where append transactions
> are used outside of perhaps cow completions (which already have a
> completion time transaction allocation for fork remaps) or intra-block
> file extending writes (that thus produce an inode size change within a
> mapped, already converted block). Otherwise a truncate down should
> always remove post-eof blocks and speculative prealloc originates from
> delalloc, so afaict those should follow the same general sequence. Eh?
> 
> As it is, I think the performance concern is overstated but I'm happy to
> run any tests to confirm or deny that if you want to make more concrete
> suggestions.

As I said: I'm happy to change the code as long as we understand
what workloads it will impact and by how much. We don't know what
workload is generating so many setfilesize transactions yet, so we
can't actually make educated guesses on the wider impact that this
change will have. We also don't have numbers from typical workloads,
just one data point, so nobody actually knows what the impact is.

> This patch is easy to test and pretty much survived an
> overnight regression run (outside of one or two things I have to look
> into..). I'm happy to adjust the approach from there, but I also think
> if it proves necessary there are fallback options (based around the
> original suggestion in my first mail) to preserve current submission
> time (serial) append transaction reservation that don't require
> categorizing/splitting or partially processing the pending ioend list.

IO path behaviour changes require more than functional regression
tests. There is an amount of documented performance regression
testing that is required, too. The argument being made here is that
"this won't affect performance", so all I'm asking for is to be
provided with the evidence that shows this assertion to be true.

This is part of the reason I suggested breaking this up into
separate bug fix and removal patches - a pure bug fix doesn't need
performance regression testing to be done. Further, having the bug
fix separate to changing the behaviour of the code mitigates the
risk of finding an unexpected performance regression from changing
behaviour. Combining the bug fix with a significant change of
behaviour doesn't provide us with a simple method of addressing such
a regression...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs ioend batching log reservation deadlock
  2021-03-30 23:57           ` Dave Chinner
@ 2021-03-31 12:26             ` Brian Foster
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Foster @ 2021-03-31 12:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Wed, Mar 31, 2021 at 10:57:22AM +1100, Dave Chinner wrote:
> On Tue, Mar 30, 2021 at 10:42:48AM -0400, Brian Foster wrote:
> > On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> > > On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > > > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
> > ...
> > > 
> > > > @@ -182,12 +185,10 @@ xfs_end_ioend(
> > > >  		error = xfs_reflink_end_cow(ip, offset, size);
> > > >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> > > >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > > > -	else
> > > > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> > > >  
> > > >  done:
> > > > -	if (ioend->io_private)
> > > > -		error = xfs_setfilesize_ioend(ioend, error);
> > > > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > > > +		error = xfs_setfilesize(ip, offset, size);
> > > >  	iomap_finish_ioends(ioend, error);
> > > >  	memalloc_nofs_restore(nofs_flag);
> > > >  }
> > > > @@ -221,16 +222,28 @@ xfs_end_io(
> > > >  	struct iomap_ioend	*ioend;
> > > >  	struct list_head	tmp;
> > > >  	unsigned long		flags;
> > > > +	xfs_off_t		maxendoff;
> > > >  
> > > >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > > >  	list_replace_init(&ip->i_ioend_list, &tmp);
> > > >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> > > >  
> > > >  	iomap_sort_ioends(&tmp);
> > > > +
> > > > +	/* XXX: track max endoff manually? */
> > > > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > > > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > > > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > > > +	    xfs_ioend_is_append(ioend)) {
> > > > +		ioend->io_flags |= IOMAP_F_APPEND;
> > > > +		maxendoff = ioend->io_offset + ioend->io_size;
> > > > +	}
> > > > +
> > > >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> > > >  			io_list))) {
> > > >  		list_del_init(&ioend->io_list);
> > > >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > > > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> > > >  		xfs_end_ioend(ioend);
> > > >  	}
> > > >  }
> > > 
> > > So now when I run a workload that is untarring a large tarball full
> > > of small files, we have as many transaction reserve operations
> > > running concurrently as there are IO completions queued.
> > > 
> > 
> > This patch has pretty much no effect on a typical untar workload because
> > it is dominated by delalloc -> unwritten extent conversions that already
> > require completion time transaction reservations. Indeed, a quick test
> > to untar a kernel source tree produces no setfilesize events at all.
> 
> Great, so it's not the obvious case because of the previous
> delalloc->unwritten change. What you need to do now is find
> out what workload it is that is generating so many setfilesize
> completions despite delalloc->unwritten so we can understand what
> workloads this will actually impact.
> 

I think the discrepancy here is just that the original problem occurred
on a downstream kernel where per-inode ioend batching exists but the
delalloc -> unwritten change does not. The latter was a relatively more
recent upstream change. The log reservation deadlock vector still exists
regardless (i.e., if there's another source of reservation pressure),
but that probably explains why the append ioends might be more
predominant in the user kernel (and in general why this might be more
likely between upstream v5.2 and v5.9) but less of a concern on current
upstream.

In any event, I think this clarifies that there's no longer much need
for submission side append reservation in current kernels. I'll look
into that change and deal with the downstream kernel separately.

Brian

> > I'm not sure we have many situations upstream where append transactions
> > are used outside of perhaps cow completions (which already have a
> > completion time transaction allocation for fork remaps) or intra-block
> > file extending writes (that thus produce an inode size change within a
> > mapped, already converted block). Otherwise a truncate down should
> > always remove post-eof blocks and speculative prealloc originates from
> > delalloc, so afaict those should follow the same general sequence. Eh?
> > 
> > As it is, I think the performance concern is overstated but I'm happy to
> > run any tests to confirm or deny that if you want to make more concrete
> > suggestions.
> 
> As I said: I'm happy to change the code as long as we understand
> what workloads it will impact and by how much. We don't know what
> workload is generating so many setfilesize transactions yet, so we
> can't actually make educated guesses on the wider impact that this
> change will have. We also don't have numbers from typical workloads,
> just one data point, so nobody actually knows what the impact is.
> 
> > This patch is easy to test and pretty much survived an
> > overnight regression run (outside of one or two things I have to look
> > into..). I'm happy to adjust the approach from there, but I also think
> > if it proves necessary there are fallback options (based around the
> > original suggestion in my first mail) to preserve current submission
> > time (serial) append transaction reservation that don't require
> > categorizing/splitting or partially processing the pending ioend list.
> 
> IO path behaviour changes require more than functional regression
> tests. There is an amount of documented performance regression
> testing that is required, too. The argument being made here is that
> "this won't affect performance", so all I'm asking for is to be
> provided with the evidence that shows this assertion to be true.
> 
> This is part of the reason I suggested breaking this up into
> separate bug fix and removal patches - a pure bug fix doesn't need
> performance regression testing to be done. Further, having the bug
> fix separate to changing the behaviour of the code mitigates the
> risk of finding an unexpected performance regression from changing
> behaviour. Combining the bug fix with a significant change of
> behaviour doesn't provide us with a simple method of addressing such
> a regression...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



end of thread, other threads:[~2021-03-31 12:34 UTC | newest]

Thread overview: 10+ messages
2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
2021-03-26 17:32 ` Darrick J. Wong
2021-03-26 18:13   ` Brian Foster
2021-03-29  2:28   ` Dave Chinner
2021-03-29 18:04     ` Brian Foster
2021-03-29 23:51       ` Dave Chinner
2021-03-30 14:42         ` Brian Foster
2021-03-30 23:57           ` Dave Chinner
2021-03-31 12:26             ` Brian Foster
2021-03-29  5:45 ` Christoph Hellwig
