From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 08/13] xfs: elide the AIL lock on log item failure tracking
Date: Thu, 23 Apr 2020 10:36:37 -0400	[thread overview]
Message-ID: <20200423143637.GC43557@bfoster> (raw)
In-Reply-To: <20200423055919.GO27860@dread.disaster.area>

On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > The log item failed flag is used to indicate a log item was flushed
> > but the underlying buffer was not successfully written to disk. If
> > the error configuration allows for writeback retries, xfsaild uses
> > the flag to resubmit the underlying buffer without acquiring the
> > flush lock of the item.
> > 
> > The flag is currently set and cleared under the AIL lock and buffer
> > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > completion time. The flag is checked at xfsaild push time under AIL
> > lock and cleared under buffer lock before resubmission. If I/O
> > eventually succeeds, the flag is cleared in the _done() handler for
> > the associated item type, again under AIL lock and buffer lock.
> 
> Actually, I think you've missed the relevant lock here: the item's
> flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> state is protected by the flush lock. When the item has been flushed
> and attached to the buffer for completion callbacks, the flush lock
> context gets handed to the buffer.
> 
> i.e. the buffer owns the flush lock and so while that buffer is
> locked for IO we know that the item flush state (and hence the
> XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> lock.
> 

Right..

> (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> lock then walks the items on the buffer and changes the callback
> functions because those items are flush locked and hence holding the
> buffer lock gives exclusive access to the flush state of those
> items....)
> 
> > As far as I can tell, the only reason for holding the AIL lock
> > across sets/clears is to manage consistency between the log item
> > bitop state and the temporary buffer pointer that is attached to the
> > log item. The bit itself is used to manage consistency of the
> > attached buffer, but is not enough to guarantee the buffer is still
> > attached by the time xfsaild attempts to access it.
> 
> Correct. The guarantee that the buffer is still attached to the log
> item is what the AIL lock provides us with.
> 
> > However since
> > failure state is always set or cleared under buffer lock (either via
> > I/O completion or xfsaild), this particular case can be handled at
> > item push time by verifying failure state once under buffer lock.
> 
> In theory, yes, but there's a problem before you get that buffer
> lock. That is: what serialises access to lip->li_buf?
> 

That's partly what I was referring to above by the push time changes.
This patch was an attempt to replace reliance on ail_lock with a push
time sequence that would serialize access to a failed buffer
(essentially relying on the failed bit). Whether it's correct or not is
another matter... ;)

> The xfsaild does not hold a reference to the buffer and, without the
> AIL lock to provide synchronisation, the log item reference to the
> buffer can be dropped asynchronously by buffer IO completion. Hence
> the buffer could be freed between the xfsaild reading lip->li_buf
> and calling xfs_buf_trylock(bp). i.e. this introduces a
> use-after-free race condition on the initial buffer access.
> 

Hmm.. the log item holds a temporary reference to the buffer when the
failed state is set. On submission, xfsaild queues the failed buffer
(new ref for the delwri queue), clears the failed state and drops the
failure reference of every failed item that is attached. xfsaild is also
the only task that knows how to process LI_FAILED items, so I don't
think we'd ever race with a failed buffer becoming "unfailed" from
xfsaild (which is the scenario where a buffer could disappear from I/O
completion). In some sense, clearing LI_FAILED of an item is essentially
retaking ownership of the item flush lock and hold of the underlying
buffer, but the existing code is certainly not written that way.
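To make that reference ownership concrete, here is a toy, standalone model of the lifecycle described above (plain counters and flags standing in for the real buffer hold count and the XFS_LI_FAILED bit; none of these names are the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct xfs_buf / struct xfs_log_item. */
struct buf { int ref; };
struct item { int failed; struct buf *buf; };

static void buf_hold(struct buf *bp) { bp->ref++; }
static void buf_rele(struct buf *bp) { bp->ref--; }

/* Setting failed state takes a failure reference that pins the buffer. */
static void set_li_failed(struct item *lip, struct buf *bp)
{
	if (!lip->failed) {
		lip->failed = 1;
		lip->buf = bp;
		buf_hold(bp);
	}
}

/* Clearing failed state drops the failure reference and the pointer. */
static void clear_li_failed(struct item *lip)
{
	if (lip->failed) {
		struct buf *bp = lip->buf;

		lip->failed = 0;
		lip->buf = NULL;
		buf_rele(bp);
	}
}
```

The set/clear pair is idempotent, so the buffer stays held exactly once per failed item regardless of how many times the I/O fails before xfsaild resubmits it.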

The issue I was wrestling with in this patch was primarily
maintaining consistency between the bit and the assignment of the
buffer pointer on a failing I/O. E.g., the buffer pointer is set by the
task that sets the bit, but xfsaild checking the bit doesn't guarantee
the pointer has been set yet. I did add a post-buffer-lock LI_FAILED
check as well, but that probably could have been an assert.
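For illustration only, that bit-vs-pointer ordering problem (and the usual lock-free fix) can be modeled in standalone C11: publish the pointer first, then set the bit with release semantics, so any reader that observes the bit with acquire semantics is guaranteed to also observe the pointer. These helpers are hypothetical, not XFS code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a log item; bit 0 models XFS_LI_FAILED. */
struct item {
	_Atomic unsigned long	flags;
	void *_Atomic		buf;	/* models lip->li_buf */
};

#define LI_FAILED	(1UL << 0)

/*
 * Store the pointer before the flag becomes visible. The release on the
 * flag pairs with the acquire in the reader below, so a reader that sees
 * LI_FAILED set also sees the pointer assignment.
 */
static void item_set_failed(struct item *ip, void *bp)
{
	atomic_store_explicit(&ip->buf, bp, memory_order_relaxed);
	atomic_fetch_or_explicit(&ip->flags, LI_FAILED, memory_order_release);
}

/* Returns the buffer pointer iff the failed bit is observed set. */
static void *item_get_failed_buf(struct item *ip)
{
	if (!(atomic_load_explicit(&ip->flags, memory_order_acquire) & LI_FAILED))
		return NULL;
	return atomic_load_explicit(&ip->buf, memory_order_relaxed);
}
```

Without the release/acquire pairing (or an equivalent lock, which is the role the AIL lock plays today), the reader could see the bit set while still loading a stale NULL pointer.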

> IOWs, the xfsaild cannot access lip->li_buf safely unless the
> set/clear operations are protected by the same lock the xfsaild
> holds. The xfsaild holds neither the buffer lock, a buffer reference
> or an item flush lock, hence it's the AIL lock or nothing...
> 

Yeah, that was my impression from reading the code. I realize from this
discussion that this doesn't actually simplify the logic. That was the
primary goal, so I think I need to revisit the approach here. Even if
this is correct (which I'm still not totally sure of), it's fragile in
the sense that it partly relies on single-threadedness of xfsaild. I
initially thought about adding a log item spinlock for this kind of
thing, but it didn't (and still doesn't) seem appropriate to burden
every log item in the system with a spinlock for the rare case of buffer
I/O errors.

I mentioned in an earlier patch that it would be nice to consider
removing ->li_buf entirely but hadn't thought it through. That might
alleviate the need to serialize the pointer and the state in the first
place as well as remove the subtle ordering requirement from the
resubmit code to manage the buffer reference. Suppose we relied on
LI_FAILED to represent the buffer hold + flush lock, and then relied on
the hold to facilitate an in-core lookup from xfsaild. For example:

xfs_set_li_failed(lip, bp)
{
	/* keep the backing buffer in-core so long as the item is failed */
	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags))
		xfs_buf_hold(bp);
}

xfs_clear_li_failed(lip, bp)
{
	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags))
		xfs_buf_rele(bp);
}

/*
 * helper to get a mapping from a buffer backed log item and use it to
 * find an in-core buf
 */
xfs_item_to_bp(lip)
{
	if (inode item) {
		...
		blk = ip->imap->im_blkno;
		len = ip->imap->im_len;
	} else if (dquot item) {
		...
		blk = dqp->q_blkno;
		len = mp->m_quotainfo->qi_dqchunklen;
	}

	return xfs_buf_incore(mp->m_ddev_targp, blk, len, XBF_TRYLOCK);
}

xfsaild_resubmit_item(lip)
{
	struct xfs_log_item	*iter;
	...

	bp = xfs_item_to_bp(lip);
	if (!bp)
		return XFS_ITEM_LOCKED;
	if (!test_bit(XFS_LI_FAILED, &lip->li_flags)) {
		xfs_buf_relse(bp);
		return XFS_ITEM_LOCKED;
	}

	/* drop all failure refs (separate iterator, don't clobber lip) */
	list_for_each_entry(iter, &bp->b_li_list, li_bio_list)
		xfs_clear_li_failed(iter, bp);

	xfs_buf_delwri_queue(...);

	xfs_buf_relse(bp);
	return XFS_ITEM_SUCCESS;
}

I think that would push everything under the buffer lock (and remove
->li_buf) with the caveat that we'd have to lock the inode/dquot to get
the buffer. Any thoughts on something like that? Note that at this point
I might drop this patch from the series regardless due to the complexity
mismatch and consider it separately.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

