All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@infradead.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load
Date: Mon, 27 Nov 2023 17:50:41 -0800	[thread overview]
Message-ID: <20231128015041.GO2766956@frogsfrogsfrogs> (raw)
In-Reply-To: <ZWGK6Ig752wdBvwF@infradead.org>

On Fri, Nov 24, 2023 at 09:49:28PM -0800, Christoph Hellwig wrote:
> The code makes me feel we're fixing the wrong thing, but I have to
> admin I don't fully understand it. Let me go step by step.
> 
> > While stress-testing online repair of btrees, I noticed periodic
> > assertion failures from the buffer cache about buffer readers
> > encountering buffers with DELWRI_Q set, even though the btree bulk load
> > had already committed and the buffer itself wasn't on any delwri list.
> 
> What assert do these buffer reader hit?  The two I can spot that are in
> the read path in the broader sense are:
> 
>   1) in xfs_buf_find_lock for stale buffers.
>   2) in __xfs_buf_submit just before I/O submission

The second assert:

AIL:	Repair0:	Repair1:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

	stale buf X:
	clear DELWRI_Q
	does not clear b_list
	free space X
	commit

delwri_submit	# oops

Though there's more to this.

> > I traced this to a misunderstanding of how the delwri lists work,
> > particularly with regards to the AIL's buffer list.  If a buffer is
> > logged and committed, the buffer can end up on that AIL buffer list.  If
> > btree repairs are run twice in rapid succession, it's possible that the
> > first repair will invalidate the buffer and free it before the next time
> > the AIL wakes up.  This clears DELWRI_Q from the buffer state.
> 
> Where "this clears" is xfs_buf_stale called from xfs_btree_free_block
> via xfs_trans_binval?

Yes.  I'll rework the last sentence to read: "Marking the buffer stale
clears DELWRI_Q from the buffer state without removing the buffer from
its delwri list."

> > If the second repair allocates the same block, it will then recycle the
> > buffer to start writing the new btree block.
> 
> If my above theory is correct: how do we end up reusing a stale buffer?
> If not, what am I misunderstanding above?

If we free a metadata buffer, we'll mark it stale, update the bnobt, and
add an entry to the extent busy list.  If a later repair finds:

1. The same free space in the bnobt;
2. The free space exactly coincides with one extent busy list entry;
3. The entry isn't in the middle of discarding the block;

Then the allocator will remove the extent busy list entry and let us
have the space.  At that point we could have a stale buffer that's also
on one of the AIL's lists:

AIL:	Repair0:	Repair1:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

	stale buf X:
	clear DELWRI_Q
	does not clear b_list
	free space X
	commit

			find free space X
			get buffer
			rewrite buffer
			delwri_queue:
			set DELWRI_Q
			already on a list, do not add
			commit

			BAD: committed tree root before all blocks written

delwri_submit	# too late now

I could demonstrate this by injecting dmerror while delwri(ting) the new
btree blocks, and observing that some of the EIOs would not trigger the
rollback but instead would trip IO errors in the AIL.

> > wakes up and walks the buffer list, it will ignore the buffer because it
> > can't lock it, and go back to sleep.
> 
> And I think this is where the trouble starts - we have a buffer that
> is left on some delwri list, but with the _XBF_DELWRI_Q flag cleared,
> it is stale and we then reuse it.  I don't think we just need to kick
> it off the delwri list just for btree staging, but in general.

...but how to do that?  We don't know whose delwri list the buffer's on,
let alone how to lock the list so that we can remove the buffer from
that list.

(Oh, you have some suggestions below.)

> > When the second repair calls delwri_queue to put the buffer on the
> > list of buffers to write before committing the new btree, it will set
> > DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
> > buffer list, it won't add it to the bulkload buffer's list.
> >
> > This is incorrect, because the bulkload caller relies on delwri_submit
> > to ensure that all the buffers have been sent to disk /before/
> > committing the new btree root pointer.  This ordering requirement is
> > required for data consistency.
> >
> > Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
> > drop it, so the next thread to walk through the btree will trip over a
> > debug assertion on that flag.
> 
> Where do it finally drop it?

TBH, it's been so long that I can't remember anymore. :(

> > To fix this, create a new function that waits for the buffer to be
> > removed from any other delwri lists before adding the buffer to the
> > caller's delwri list.  By waiting for the buffer to clear both the
> > delwri list and any potential delwri wait list, we can be sure that
> > repair will initiate writes of all buffers and report all write errors
> > back to userspace instead of committing the new structure.
> 
> If my understanding above is correct this just papers over the bug
> that a buffer that is marked stale and can be reused for something
> else is left on a delwri list.

I'm not sure it's a bug for *most* code that encounters it.  Regular
transactions don't directly use the delwri lists, so it will never see
that at all.  If the buffer gets rewritten and logged, then it'll just
end up on the AIL's delwri buffer list again.

The other delwri_submit users don't seem to care if the buffer gets
written directly by delwri_submit or by the AIL.  In the first case, the
caller will get the EIO and force a shutdown; in the second case, the
AIL will shut down the log.

Weirdly, xfs_bwrite clears DELWRI_Q but doesn't necessary shut down the
fs if the write fails.

Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
it mean "b_list is active"?  Or does it merely mean "another thread will
write this buffer via b_list if nobody gets there first"?  I think it's
the second, since it's easy enough to list_empty().

It's only repair that requires this new behavior that all new buffers
have to be written through the delwri list, and only because it actually
/can/ cancel the transaction by finishing the autoreap EFIs.

> I've entirely thought about all the
> consequence, but here is what I'd try:
> 
>  - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q
>    call your new wait code instead of asserting (probably only
>    for the !trylock case)
>  - make sure we don't leak DELWRI_Q 

Way back when I first discovered this, my first impulse was to make
xfs_buf_stale wait for DELWRI_Q to clear.  That IIRC didn't work because
a transaction could hold an inode that the AIL will need to lock.  I
think xfs_buf_find_lock would have the same problem.

Seeing as repair is the only user with the requirement that it alone can
issue writes for the delwri list buffers, that's why I went with what's
in this patch.

Thank you for persevering so far! :D

--D

  reply	other threads:[~2023-11-28  1:50 UTC|newest]

Thread overview: 152+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong
2023-11-24 23:44 ` [PATCHSET v28.0 0/1] xfs: prevent livelocks in xchk_iget Darrick J. Wong
2023-11-24 23:46   ` [PATCH 1/1] xfs: make xchk_iget safer in the presence of corrupt inode btrees Darrick J. Wong
2023-11-25  4:57     ` Christoph Hellwig
2023-11-27 21:55       ` Darrick J. Wong
2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong
2023-11-24 23:47   ` [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects Darrick J. Wong
2023-11-25  5:04     ` Christoph Hellwig
2023-11-24 23:47   ` [PATCH 2/7] xfs: allow pausing of pending deferred work items Darrick J. Wong
2023-11-25  5:05     ` Christoph Hellwig
2023-11-24 23:47   ` [PATCH 3/7] xfs: remove __xfs_free_extent_later Darrick J. Wong
2023-11-25  5:05     ` Christoph Hellwig
2023-11-24 23:47   ` [PATCH 4/7] xfs: automatic freeing of freshly allocated unwritten space Darrick J. Wong
2023-11-25  5:06     ` Christoph Hellwig
2023-11-24 23:48   ` [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
2023-11-26 13:14     ` Christoph Hellwig
2023-11-27 22:34       ` Darrick J. Wong
2023-11-28  5:41         ` Christoph Hellwig
2023-11-28 17:02           ` Darrick J. Wong
2023-11-24 23:48   ` [PATCH 6/7] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong
2023-11-26 13:15     ` Christoph Hellwig
2023-11-24 23:48   ` [PATCH 7/7] xfs: force small EFIs for reaping btree extents Darrick J. Wong
2023-11-25  5:13     ` Christoph Hellwig
2023-11-27 22:46       ` Darrick J. Wong
2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong
2023-11-24 23:48   ` [PATCH 1/4] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2023-11-25  5:49     ` Christoph Hellwig
2023-11-28  1:50       ` Darrick J. Wong [this message]
2023-11-28  7:13         ` Christoph Hellwig
2023-11-28 15:18           ` Christoph Hellwig
2023-11-28 17:07             ` Darrick J. Wong
2023-11-30  4:33               ` Christoph Hellwig
2023-11-24 23:49   ` [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong
2023-11-25  5:50     ` Christoph Hellwig
2023-11-28  1:44       ` Darrick J. Wong
2023-11-28  5:42         ` Christoph Hellwig
2023-11-28 17:07           ` Darrick J. Wong
2023-11-24 23:49   ` [PATCH 3/4] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
2023-11-25  5:51     ` Christoph Hellwig
2023-11-24 23:49   ` [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
2023-11-25  5:53     ` Christoph Hellwig
2023-11-27 22:56       ` Darrick J. Wong
2023-11-28 20:11         ` Darrick J. Wong
2023-11-29  5:50           ` Christoph Hellwig
2023-11-29  5:57             ` Darrick J. Wong
2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
2023-11-24 23:50   ` [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps Darrick J. Wong
2023-11-25  5:57     ` Christoph Hellwig
2023-11-28  1:34       ` Darrick J. Wong
2023-11-28  5:43         ` Christoph Hellwig
2023-11-24 23:50   ` [PATCH 2/5] xfs: roll the scrub transaction after completing a repair Darrick J. Wong
2023-11-25  6:05     ` Christoph Hellwig
2023-11-28  1:29       ` Darrick J. Wong
2023-11-24 23:50   ` [PATCH 3/5] xfs: repair free space btrees Darrick J. Wong
2023-11-25  6:11     ` Christoph Hellwig
2023-11-28  1:05       ` Darrick J. Wong
2023-11-28 15:10     ` Christoph Hellwig
2023-11-28 21:13       ` Darrick J. Wong
2023-11-29  5:56         ` Christoph Hellwig
2023-11-29  6:18           ` Darrick J. Wong
2023-11-29  6:24             ` Christoph Hellwig
2023-11-29  6:26               ` Darrick J. Wong
2023-11-24 23:50   ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong
2023-11-25  6:12     ` Christoph Hellwig
2023-11-28  1:09       ` Darrick J. Wong
2023-11-28 15:57     ` Christoph Hellwig
2023-11-28 21:37       ` Darrick J. Wong
2023-11-24 23:51   ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong
2023-11-28 16:07     ` Christoph Hellwig
2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong
2023-11-24 23:51   ` [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong
2023-11-25  6:13     ` Christoph Hellwig
2023-11-24 23:51   ` [PATCH 2/7] xfs: try to attach dquots to files before repairing them Darrick J. Wong
2023-11-25  6:14     ` Christoph Hellwig
2023-11-24 23:51   ` [PATCH 3/7] xfs: repair inode records Darrick J. Wong
2023-11-28 17:08     ` Christoph Hellwig
2023-11-28 23:08       ` Darrick J. Wong
2023-11-29  6:02         ` Christoph Hellwig
2023-12-05 23:08           ` Darrick J. Wong
2023-12-06  5:16             ` Christoph Hellwig
2023-11-24 23:52   ` [PATCH 4/7] xfs: zap broken inode forks Darrick J. Wong
2023-11-30  4:44     ` Christoph Hellwig
2023-11-30 21:08       ` Darrick J. Wong
2023-12-04  4:39         ` Christoph Hellwig
2023-12-04 20:43           ` Darrick J. Wong
2023-12-05  4:28             ` Christoph Hellwig
2023-11-24 23:52   ` [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong
2023-11-30  4:47     ` Christoph Hellwig
2023-11-30 21:37       ` Darrick J. Wong
2023-12-04  4:41         ` Christoph Hellwig
2023-12-04 20:44           ` Darrick J. Wong
2023-11-24 23:52   ` [PATCH 6/7] xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped Darrick J. Wong
2023-11-24 23:52   ` [PATCH 7/7] xfs: repair obviously broken inode modes Darrick J. Wong
2023-11-30  4:49     ` Christoph Hellwig
2023-11-30 21:18       ` Darrick J. Wong
2023-12-04  4:42         ` Christoph Hellwig
2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
2023-11-24 23:53   ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong
2023-11-30  4:53     ` Christoph Hellwig
2023-11-30 21:48       ` Darrick J. Wong
2023-12-04  4:42         ` Christoph Hellwig
2023-11-24 23:53   ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong
2023-11-30  5:07     ` Christoph Hellwig
2023-12-01  1:38       ` Darrick J. Wong
2023-11-24 23:53   ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong
2023-11-28 14:20     ` Christoph Hellwig
2023-11-29  5:42       ` Darrick J. Wong
2023-11-29  6:03         ` Christoph Hellwig
2023-11-24 23:53   ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong
2023-11-28 13:59     ` Christoph Hellwig
2023-11-24 23:54   ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong
2023-11-30  5:10     ` Christoph Hellwig
2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong
2023-11-24 23:54   ` [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly Darrick J. Wong
2023-11-28 14:04     ` Christoph Hellwig
2023-11-28 23:27       ` Darrick J. Wong
2023-11-29  6:05         ` Christoph Hellwig
2023-11-29  6:20           ` Darrick J. Wong
2023-11-24 23:54   ` [PATCH 2/6] xfs: check rt summary " Darrick J. Wong
2023-11-28 14:05     ` Christoph Hellwig
2023-11-28 23:30       ` Darrick J. Wong
2023-11-29  1:23         ` Darrick J. Wong
2023-11-29  6:05         ` Christoph Hellwig
2023-11-29  6:21           ` Darrick J. Wong
2023-11-29  6:23             ` Christoph Hellwig
2023-11-30  0:10               ` Darrick J. Wong
2023-11-24 23:54   ` [PATCH 3/6] xfs: always check the rtbitmap and rtsummary files Darrick J. Wong
2023-11-28 14:06     ` Christoph Hellwig
2023-11-24 23:55   ` [PATCH 4/6] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong
2023-11-30  5:12     ` Christoph Hellwig
2023-11-24 23:55   ` [PATCH 5/6] xfs: create a new inode fork block unmap helper Darrick J. Wong
2023-11-25  6:17     ` Christoph Hellwig
2023-11-24 23:55   ` [PATCH 6/6] xfs: online repair of realtime bitmaps Darrick J. Wong
2023-11-30  5:16     ` Christoph Hellwig
2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong
2023-11-24 23:56   ` [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot Darrick J. Wong
2023-11-30  5:17     ` Christoph Hellwig
2023-11-24 23:56   ` [PATCH 2/5] xfs: check dquot resource timers Darrick J. Wong
2023-11-30  5:17     ` Christoph Hellwig
2023-11-24 23:56   ` [PATCH 3/5] xfs: pull xfs_qm_dqiterate back into scrub Darrick J. Wong
2023-11-30  5:22     ` Christoph Hellwig
2023-11-24 23:56   ` [PATCH 4/5] xfs: improve dquot iteration for scrub Darrick J. Wong
2023-11-30  5:25     ` Christoph Hellwig
2023-11-24 23:57   ` [PATCH 5/5] xfs: repair quotas Darrick J. Wong
2023-11-30  5:33     ` Christoph Hellwig
2023-11-30 22:10       ` Darrick J. Wong
2023-12-04  4:48         ` Christoph Hellwig
2023-12-04 20:52           ` Darrick J. Wong
2023-12-05  4:27             ` Christoph Hellwig
2023-12-05  5:20               ` Darrick J. Wong
2023-12-04  4:49         ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2023-09-26 23:29 [PATCHSET v27.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong
2023-09-26 23:33 ` [PATCH 1/4] xfs: force all buffers to be written during btree bulk load Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231128015041.GO2766956@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.