* [PATCHSET v28.1 0/6] xfs: prepare repair for bulk loading
From: Darrick J. Wong @ 2023-12-07  2:38 UTC
To: djwong; +Cc: Christoph Hellwig, linux-xfs

Hi all,

Before we start merging the online repair functions, let's improve the
bulk loading code a bit. First, we need to fix a misinteraction between
the AIL and the btree bulkloader wherein the delwri at the end of the
bulk load fails to queue a buffer for writeback if it happens to be on
the AIL list.

Second, we introduce a defer ops barrier object so that the process of
reaping blocks after a repair cannot queue more than two extents per EFI
log item. This increases our exposure to leaking blocks if the system
goes down during a reap, but also should prevent transaction overflows,
which result in the system going down.

Third, we change the bulkloader itself to copy multiple records into a
block if possible, and add some debugging knobs so that developers can
control the slack factors, just like they can do for xfs_repair.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-prep-for-bulk-loading
---
 fs/xfs/libxfs/xfs_btree.c         |    2 +
 fs/xfs/libxfs/xfs_btree.h         |    3 ++
 fs/xfs/libxfs/xfs_btree_staging.c |   72 ++++++++++++++++++++++++++-----------
 fs/xfs/libxfs/xfs_btree_staging.h |   25 ++++++++++---
 fs/xfs/scrub/newbt.c              |   12 +++++-
 fs/xfs/xfs_buf.c                  |   52 +++++++++++++++++++++++++--
 fs/xfs/xfs_buf.h                  |    1 +
 fs/xfs/xfs_globals.c              |   12 ++++++
 fs/xfs/xfs_sysctl.h               |    2 +
 fs/xfs/xfs_sysfs.c                |   54 ++++++++++++++++++++++++++++
 10 files changed, 200 insertions(+), 35 deletions(-)
* [PATCH 1/6] xfs: force all buffers to be written during btree bulk load
From: Darrick J. Wong @ 2023-12-07  2:38 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffers with incorrect
DELWRI_Q state. Looking further, I observed this race between the AIL
trying to write out a btree block and repair zapping a btree block after
the fact:

AIL:    Repair0:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

        stale buf X:
        clear DELWRI_Q
        does not clear b_list
        free space X
        commit

delwri_submit   # oops

Worse yet, I discovered that running the same repair over and over in a
tight loop can result in a second race that causes data integrity
problems with the repair:

AIL:    Repair0:        Repair1:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

        stale buf X:
        clear DELWRI_Q
        does not clear b_list
        free space X
        commit

                        find free space X
                        get buffer
                        rewrite buffer
                        delwri_queue:
                        set DELWRI_Q
                        already on a list, do not add
                        commit

                        BAD: committed tree root before all blocks written

delwri_submit   # too late now

I traced this to my own misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list. If a buffer is
logged and committed, the buffer can end up on that AIL buffer list. If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the
buffer state without removing the buffer from its delwri list. The
buffer doesn't know which list it's on, so it cannot know which lock to
take to protect the list for a removal.

If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block. Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.

When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.

This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer. This ordering requirement is
required for data consistency.

Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.

To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list. By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_staging.c |    4 +--
 fs/xfs/xfs_buf.c                  |   44 ++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_buf.h                  |    1 +
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index dd75e208b543e..29e3f8ccb1852 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf(
 	if (*bpp == NULL)
 		return;
 
-	if (!xfs_buf_delwri_queue(*bpp, buffers_list))
-		ASSERT(0);
-
+	xfs_buf_delwri_queue_here(*bpp, buffers_list);
 	xfs_buf_relse(*bpp);
 	*bpp = NULL;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 545c7991b9b58..ec4bd7a24d88c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2049,6 +2049,14 @@ xfs_alloc_buftarg(
 	return NULL;
 }
 
+static inline void
+xfs_buf_list_del(
+	struct xfs_buf		*bp)
+{
+	list_del_init(&bp->b_list);
+	wake_up_var(&bp->b_list);
+}
+
 /*
  * Cancel a delayed write list.
  *
@@ -2066,7 +2074,7 @@ xfs_buf_delwri_cancel(
 
 		xfs_buf_lock(bp);
 		bp->b_flags &= ~_XBF_DELWRI_Q;
-		list_del_init(&bp->b_list);
+		xfs_buf_list_del(bp);
 		xfs_buf_relse(bp);
 	}
 }
@@ -2119,6 +2127,34 @@ xfs_buf_delwri_queue(
 	return true;
 }
 
+/*
+ * Queue a buffer to this delwri list as part of a data integrity operation.
+ * If the buffer is on any other delwri list, we'll wait for that to clear
+ * so that the caller can submit the buffer for IO and wait for the result.
+ * Callers must ensure the buffer is not already on the list.
+ */
+void
+xfs_buf_delwri_queue_here(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	/*
+	 * We need this buffer to end up on the /caller's/ delwri list, not any
+	 * old list. This can happen if the buffer is marked stale (which
+	 * clears DELWRI_Q) after the AIL queues the buffer to its list but
+	 * before the AIL has a chance to submit the list.
+	 */
+	while (!list_empty(&bp->b_list)) {
+		xfs_buf_unlock(bp);
+		wait_var_event(&bp->b_list, list_empty(&bp->b_list));
+		xfs_buf_lock(bp);
+	}
+
+	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+
+	xfs_buf_delwri_queue(bp, buffer_list);
+}
+
 /*
  * Compare function is more complex than it needs to be because
  * the return value is only 32 bits and we are doing comparisons
@@ -2181,7 +2217,7 @@ xfs_buf_delwri_submit_buffers(
 		 * reference and remove it from the list here.
 		 */
 		if (!(bp->b_flags & _XBF_DELWRI_Q)) {
-			list_del_init(&bp->b_list);
+			xfs_buf_list_del(bp);
 			xfs_buf_relse(bp);
 			continue;
 		}
@@ -2201,7 +2237,7 @@ xfs_buf_delwri_submit_buffers(
 			list_move_tail(&bp->b_list, wait_list);
 		} else {
 			bp->b_flags |= XBF_ASYNC;
-			list_del_init(&bp->b_list);
+			xfs_buf_list_del(bp);
 		}
 		__xfs_buf_submit(bp, false);
 	}
@@ -2255,7 +2291,7 @@ xfs_buf_delwri_submit(
 	while (!list_empty(&wait_list)) {
 		bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-		list_del_init(&bp->b_list);
+		xfs_buf_list_del(bp);
 
 		/*
 		 * Wait on the locked buffer, check for errors and unlock and
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c86e164196568..b470de08a46ca 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -319,6 +319,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
 /* Delayed Write Buffer Routines */
 extern void xfs_buf_delwri_cancel(struct list_head *);
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
+void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
 extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
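The race described in the commit message above can be reproduced in a single-threaded userspace toy model. This is only a sketch: every `toy_*` name is hypothetical, the "AIL" is just another list, and the real code blocks in wait_var_event() where the toy simply drains the other list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct xfs_buf and a delwri list; not kernel code. */
struct toy_list;

struct toy_buf {
	bool		delwri_q;	/* models _XBF_DELWRI_Q */
	struct toy_list	*on_list;	/* models a non-empty b_list */
};

struct toy_list {
	struct toy_buf	*bufs[8];
	size_t		nr;
};

/* Models xfs_buf_delwri_queue: set the flag, add only if on no list. */
static bool toy_delwri_queue(struct toy_buf *bp, struct toy_list *list)
{
	if (bp->delwri_q)
		return false;
	bp->delwri_q = true;
	if (bp->on_list == NULL) {
		assert(list->nr < 8);
		list->bufs[list->nr++] = bp;
		bp->on_list = list;
	}
	return true;
}

/* Models marking a buffer stale: the flag is cleared, the list is not. */
static void toy_stale(struct toy_buf *bp)
{
	bp->delwri_q = false;
}

/* Models the list owner processing its list: every buffer leaves it. */
static void toy_submit(struct toy_list *list)
{
	for (size_t i = 0; i < list->nr; i++) {
		list->bufs[i]->delwri_q = false;
		list->bufs[i]->on_list = NULL;
	}
	list->nr = 0;
}

/* Models the fix: get off any other list first, then queue here. */
static void toy_delwri_queue_here(struct toy_buf *bp, struct toy_list *list)
{
	while (bp->on_list != NULL)
		toy_submit(bp->on_list);	/* stand-in for wait_var_event() */
	assert(!bp->delwri_q);
	toy_delwri_queue(bp, list);
}

/* Returns true if the buffer ends up on the repair list. */
static bool toy_race(bool use_fix)
{
	struct toy_list ail = { .nr = 0 }, repair = { .nr = 0 };
	struct toy_buf bp = { .delwri_q = false, .on_list = NULL };

	toy_delwri_queue(&bp, &ail);	/* AIL queues buffer X */
	toy_stale(&bp);			/* first repair frees block X */
	if (use_fix)
		toy_delwri_queue_here(&bp, &repair);
	else
		toy_delwri_queue(&bp, &repair);	/* second repair reuses X */
	return bp.on_list == &repair;
}
```

Calling `toy_race(false)` leaves the buffer on the toy AIL list even though the flag is set again, which is the "already on a list, do not add" step in the second diagram; `toy_race(true)` puts it on the caller's list.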
* Re: [PATCH 1/6] xfs: force all buffers to be written during btree bulk load
From: Christoph Hellwig @ 2023-12-07  5:25 UTC
To: Darrick J. Wong; +Cc: linux-xfs

So as discussed last round I don't particularly like how it is done
caller dependent, but it seems to be the only way forward without
completely reworking the delwri list handling, so:

Reviewed-by: Christoph Hellwig <hch@lst.de>
* [PATCH 2/6] xfs: set XBF_DONE on newly formatted btree block that are ready for writing
From: Darrick J. Wong @ 2023-12-07  2:38 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The btree bulkloading code calls xfs_buf_delwri_queue_here when it has
finished formatting a new btree block and wants to queue it to be
written to disk. Once the new btree root has been committed, the blocks
(and hence the buffers) will be accessible to the rest of the
filesystem. Mark each new buffer as DONE when adding it to the delwri
list so that the next btree traversal can skip reloading the contents
from disk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index ec4bd7a24d88c..702b3a1f9d1c4 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2152,6 +2152,14 @@ xfs_buf_delwri_queue_here(
 
 	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 
+	/*
+	 * The buffer is locked. The _delwri_queue below will bhold the buffer
+	 * so it cannot be reclaimed until the blocks are written to disk.
+	 * Mark this buffer XBF_DONE (i.e. uptodate) so that a subsequent
+	 * xfs_buf_read will not pointlessly reread the contents from the disk.
+	 */
+	bp->b_flags |= XBF_DONE;
+
 	xfs_buf_delwri_queue(bp, buffer_list);
 }
 
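To illustrate why the DONE flag matters here, the following is a minimal userspace sketch, assuming only that a cached read skips the disk when the buffer is already marked up to date. The `toy_*` names are hypothetical and the model ignores locking and reclaim entirely.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_buf {
	bool	done;		/* models XBF_DONE: contents are up to date */
	int	disk_reads;	/* how many times the "disk" was read */
};

/* Models a cached buffer read: only go to disk if not up to date. */
static void toy_buf_read(struct toy_buf *bp)
{
	if (bp->done)
		return;
	bp->disk_reads++;
	bp->done = true;
}

/*
 * Format a new block in memory, optionally mark it done as the patch
 * does, then read it the way a later btree traversal would.  Returns the
 * number of disk reads that the traversal needed.
 */
static int toy_reads_after_format(bool mark_done)
{
	struct toy_buf bp = { .done = false, .disk_reads = 0 };

	if (mark_done)
		bp.done = true;
	toy_buf_read(&bp);
	return bp.disk_reads;
}
```

With the flag set at queue time the traversal needs no disk read; without it the freshly written block is read straight back.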
* Re: [PATCH 2/6] xfs: set XBF_DONE on newly formatted btree block that are ready for writing
From: Christoph Hellwig @ 2023-12-07  5:26 UTC
To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 06, 2023 at 06:38:50PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> The btree bulkloading code calls xfs_buf_delwri_queue_here when it has
> finished formatting a new btree block and wants to queue it to be
> written to disk. Once the new btree root has been committed, the blocks
> (and hence the buffers) will be accessible to the rest of the
> filesystem. Mark each new buffer as DONE when adding it to the delwri
> list so that the next btree traversal can skip reloading the contents
> from disk.

This still seems like the wrong place to me - it really is the caller
that fills it out that should set the DONE flag, not a non-standard
delwri helper that should hopefully go away in the future.
* Re: [PATCH 2/6] xfs: set XBF_DONE on newly formatted btree block that are ready for writing
From: Darrick J. Wong @ 2023-12-11 18:55 UTC
To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Dec 06, 2023 at 09:26:06PM -0800, Christoph Hellwig wrote:
> On Wed, Dec 06, 2023 at 06:38:50PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > The btree bulkloading code calls xfs_buf_delwri_queue_here when it has
> > finished formatting a new btree block and wants to queue it to be
> > written to disk. Once the new btree root has been committed, the blocks
> > (and hence the buffers) will be accessible to the rest of the
> > filesystem. Mark each new buffer as DONE when adding it to the delwri
> > list so that the next btree traversal can skip reloading the contents
> > from disk.
>
> This still seems like the wrong place to me - it really is the caller
> that fills it out that should set the DONE flag, not a non-standard
> delwri helper that should hopefully go away in the future.

I'll move it to xfs_btree_bload_drop_buf then.

--D
* [PATCH 3/6] xfs: read leaf blocks when computing keys for bulkloading into node blocks
From: Darrick J. Wong @ 2023-12-07  2:39 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When constructing a new btree, xfs_btree_bload_node needs to read the
btree blocks for level N to compute the keyptrs for the blocks that will
be loaded into level N+1. The level N blocks must be formatted at that
point.

A subsequent patch will change the btree bulkloader to write new btree
blocks in 256K chunks to moderate memory consumption if the new btree is
very large. As a consequence of that, it's possible that the buffers
for lower level blocks might have been reclaimed by the time the node
builder comes back to the block.

Therefore, change xfs_btree_bload_node to read the lower level blocks
to handle the reclaimed buffer case. As a side effect, the read will
increase the LRU refs, which will bias towards keeping new btree buffers
in memory after the new btree commits.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c         |    2 +-
 fs/xfs/libxfs/xfs_btree.h         |    3 +++
 fs/xfs/libxfs/xfs_btree_staging.c |    7 ++++++-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 6a6503ab0cd76..c100e92140be1 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1330,7 +1330,7 @@ xfs_btree_get_buf_block(
  * Read in the buffer at the given ptr and return the buffer and
  * the block pointer within the buffer.
  */
-STATIC int
+int
 xfs_btree_read_buf_block(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_ptr	*ptr,
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 4d68a58be160c..e0875cec49392 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -700,6 +700,9 @@ void xfs_btree_set_ptr_null(struct xfs_btree_cur *cur,
 int xfs_btree_get_buf_block(struct xfs_btree_cur *cur,
 		const union xfs_btree_ptr *ptr, struct xfs_btree_block **block,
 		struct xfs_buf **bpp);
+int xfs_btree_read_buf_block(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr, int flags,
+		struct xfs_btree_block **block, struct xfs_buf **bpp);
 void xfs_btree_set_sibling(struct xfs_btree_cur *cur,
 		struct xfs_btree_block *block, const union xfs_btree_ptr *ptr,
 		int lr);
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index 29e3f8ccb1852..ee0594a4c3d32 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -483,7 +483,12 @@ xfs_btree_bload_node(
 
 		ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr));
 
-		ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block,
+		/*
+		 * Read the lower-level block in case the buffer for it has
+		 * been reclaimed. LRU refs will be set on the block, which is
+		 * desirable if the new btree commits.
+		 */
+		ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block,
 				&child_bp);
 		if (ret)
 			return ret;
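The difference between "get" and "read" after a buffer has been flushed and reclaimed can be shown with a small userspace toy cache. This is a sketch under loud assumptions: the `toy_*` names are hypothetical, a reclaimed buffer is modeled as coming back blank from "get" (the real contents would simply be unreliable), and one int stands in for a whole btree block.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS	4

struct toy_cache {
	int	disk[TOY_NBLOCKS];	/* contents written to "disk" */
	int	mem[TOY_NBLOCKS];	/* in-memory buffer contents */
	bool	cached[TOY_NBLOCKS];	/* false once a buffer is reclaimed */
};

/* Models a "get": hand back a buffer without reading from disk. */
static int toy_get(struct toy_cache *c, int blk)
{
	if (!c->cached[blk]) {
		c->mem[blk] = 0;	/* assumption: blank after reclaim */
		c->cached[blk] = true;
	}
	return c->mem[blk];
}

/* Models a "read": reload the contents from disk if they were reclaimed. */
static int toy_read(struct toy_cache *c, int blk)
{
	if (!c->cached[blk]) {
		c->mem[blk] = c->disk[blk];
		c->cached[blk] = true;
	}
	return c->mem[blk];
}

/* Models an early flush followed by memory reclaim of the buffer. */
static void toy_flush_and_reclaim(struct toy_cache *c, int blk)
{
	c->disk[blk] = c->mem[blk];
	c->cached[blk] = false;
}

/* Format a leaf with key 42, flush and reclaim it, then fetch the key. */
static int toy_key_after_reclaim(bool use_read)
{
	struct toy_cache c = { .disk = { 0 } };

	c.mem[1] = 42;
	c.cached[1] = true;
	toy_flush_and_reclaim(&c, 1);
	return use_read ? toy_read(&c, 1) : toy_get(&c, 1);
}
```

Only the "read" path recovers the key that the node builder needs once the leaf buffer is gone from memory.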
* Re: [PATCH 3/6] xfs: read leaf blocks when computing keys for bulkloading into node blocks
From: Christoph Hellwig @ 2023-12-07  5:26 UTC
To: Darrick J. Wong; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
* [PATCH 4/6] xfs: add debug knobs to control btree bulk load slack factors
From: Darrick J. Wong @ 2023-12-07  2:39 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add some debug knobs so that we can control the leaf and node block
slack when rebuilding btrees.

For developers, it might be useful to construct btrees of various
heights by crafting a filesystem with a certain number of records and
then using repair+knobs to rebuild the index with a certain shape.
Practically speaking, you'd only ever do that for extreme stress
testing of the runtime code or the btree generator.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/newbt.c |   11 +++++++---
 fs/xfs/xfs_globals.c |   12 +++++++++++
 fs/xfs/xfs_sysctl.h  |    2 ++
 fs/xfs/xfs_sysfs.c   |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
index 992cf34a13e70..46883606ad883 100644
--- a/fs/xfs/scrub/newbt.c
+++ b/fs/xfs/scrub/newbt.c
@@ -32,6 +32,7 @@
  * btree bulk loading code calculates for us. However, there are some
  * exceptions to this rule:
  *
+ * (0) If someone turned one of the debug knobs.
  * (1) If this is a per-AG btree and the AG has less than 10% space free.
  * (2) If this is an inode btree and the FS has less than 10% space free.
  *
@@ -47,9 +48,13 @@ xrep_newbt_estimate_slack(
 	uint64_t		free;
 	uint64_t		sz;
 
-	/* Let the btree code compute the default slack values. */
-	bload->leaf_slack = -1;
-	bload->node_slack = -1;
+	/*
+	 * The xfs_globals values are set to -1 (i.e. take the bload defaults)
+	 * unless someone has set them otherwise, so we just pull the values
+	 * here.
+	 */
+	bload->leaf_slack = xfs_globals.bload_leaf_slack;
+	bload->node_slack = xfs_globals.bload_node_slack;
 
 	if (sc->ops->type == ST_PERAG) {
 		free = sc->sa.pag->pagf_freeblks;
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index 9edc1f2bc9399..f18fec0adf666 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -44,4 +44,16 @@ struct xfs_globals xfs_globals = {
 	.pwork_threads		= -1,	/* automatic thread detection */
 	.larp			= false,	/* log attribute replay */
 #endif
+
+	/*
+	 * Leave this many record slots empty when bulk loading btrees. By
+	 * default we load new btree leaf blocks 75% full.
+	 */
+	.bload_leaf_slack	= -1,
+
+	/*
+	 * Leave this many key/ptr slots empty when bulk loading btrees. By
+	 * default we load new btree node blocks 75% full.
+	 */
+	.bload_node_slack	= -1,
 };
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index f78ad6b10ea58..276696a07040c 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -85,6 +85,8 @@ struct xfs_globals {
 	int	pwork_threads;		/* parallel workqueue threads */
 	bool	larp;			/* log attribute replay */
 #endif
+	int	bload_leaf_slack;	/* btree bulk load leaf slack */
+	int	bload_node_slack;	/* btree bulk load node slack */
 	int	log_recovery_delay;	/* log recovery delay (secs) */
 	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index 871f16a4a5d80..17485666b6723 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -262,6 +262,58 @@ larp_show(
 XFS_SYSFS_ATTR_RW(larp);
 #endif /* DEBUG */
 
+STATIC ssize_t
+bload_leaf_slack_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	xfs_globals.bload_leaf_slack = val;
+	return count;
+}
+
+STATIC ssize_t
+bload_leaf_slack_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.bload_leaf_slack);
+}
+XFS_SYSFS_ATTR_RW(bload_leaf_slack);
+
+STATIC ssize_t
+bload_node_slack_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	xfs_globals.bload_node_slack = val;
+	return count;
+}
+
+STATIC ssize_t
+bload_node_slack_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.bload_node_slack);
+}
+XFS_SYSFS_ATTR_RW(bload_node_slack);
+
 static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(bug_on_assert),
 	ATTR_LIST(log_recovery_delay),
@@ -271,6 +323,8 @@ static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(pwork_threads),
 	ATTR_LIST(larp),
 #endif
+	ATTR_LIST(bload_leaf_slack),
+	ATTR_LIST(bload_node_slack),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_dbg);
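As a rough illustration of what a slack value does to block loading, here is a userspace sketch of the arithmetic. It rests on assumptions that the patch only hints at: slack is taken as the number of slots left empty per block, a negative value is taken to mean "load halfway between minrecs and maxrecs" (which gives 75% when minrecs is half of maxrecs, matching the comments above), and slack is capped so a block never drops below minrecs. The `toy_*` helpers are hypothetical, not the kernel's geometry code.

```c
#include <assert.h>

/*
 * Records (or key/ptr pairs) loaded per block for a given slack.
 * Assumes maxrecs >= minrecs.
 */
static unsigned int toy_recs_per_block(unsigned int maxrecs,
				       unsigned int minrecs, int slack)
{
	unsigned int s;

	if (slack < 0)
		s = maxrecs - (maxrecs + minrecs) / 2;	/* assumed default */
	else
		s = (unsigned int)slack;

	/* Never leave a block with fewer than minrecs entries. */
	if (s > maxrecs - minrecs)
		s = maxrecs - minrecs;
	return maxrecs - s;
}

/* Leaf blocks needed to hold nr_records at that loading (ceiling). */
static unsigned int toy_leaf_blocks(unsigned int nr_records,
				    unsigned int maxrecs,
				    unsigned int minrecs, int slack)
{
	unsigned int npb = toy_recs_per_block(maxrecs, minrecs, slack);

	return (nr_records + npb - 1) / npb;
}
```

Under these assumptions, a knob value of 0 packs blocks full and a large value pushes every block down to minrecs, which is how a developer could force a taller tree from the same record count.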
* Re: [PATCH 4/6] xfs: add debug knobs to control btree bulk load slack factors
From: Christoph Hellwig @ 2023-12-07  5:26 UTC
To: Darrick J. Wong; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
* [PATCH 5/6] xfs: move btree bulkload record initialization to ->get_record implementations
From: Darrick J. Wong @ 2023-12-07  2:39 UTC
To: djwong; +Cc: Christoph Hellwig, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're performing a bulk load of a btree, move the code that
actually stores the btree record in the new btree block out of the
generic code and into the individual ->get_record implementations.
This is preparation for being able to store multiple records with a
single indirect call.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_btree_staging.c |   17 +++++++----------
 fs/xfs/libxfs/xfs_btree_staging.h |   15 ++++++++++-----
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index ee0594a4c3d32..a14be6f120600 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -434,22 +434,19 @@ STATIC int
 xfs_btree_bload_leaf(
 	struct xfs_btree_cur		*cur,
 	unsigned int			recs_this_block,
-	xfs_btree_bload_get_record_fn	get_record,
+	xfs_btree_bload_get_records_fn	get_records,
 	struct xfs_btree_block		*block,
 	void				*priv)
 {
-	unsigned int			j;
+	unsigned int			j = 1;
 	int				ret;
 
 	/* Fill the leaf block with records. */
-	for (j = 1; j <= recs_this_block; j++) {
-		union xfs_btree_rec	*block_rec;
-
-		ret = get_record(cur, priv);
-		if (ret)
+	while (j <= recs_this_block) {
+		ret = get_records(cur, j, block, recs_this_block - j + 1, priv);
+		if (ret < 0)
 			return ret;
-		block_rec = xfs_btree_rec_addr(cur, j, block);
-		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+		j += ret;
 	}
 
 	return 0;
@@ -792,7 +789,7 @@ xfs_btree_bload(
 		trace_xfs_btree_bload_block(cur, level, i, blocks, &ptr,
 				nr_this_block);
 
-		ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_record,
+		ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_records,
 				block, priv);
 		if (ret)
 			goto out;
diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h
index 5f638f711246e..bd5b3f004823a 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.h
+++ b/fs/xfs/libxfs/xfs_btree_staging.h
@@ -47,7 +47,9 @@ void xfs_btree_commit_ifakeroot(struct xfs_btree_cur *cur, struct xfs_trans *tp,
 		int whichfork, const struct xfs_btree_ops *ops);
 
 /* Bulk loading of staged btrees. */
-typedef int (*xfs_btree_bload_get_record_fn)(struct xfs_btree_cur *cur, void *priv);
+typedef int (*xfs_btree_bload_get_records_fn)(struct xfs_btree_cur *cur,
+		unsigned int idx, struct xfs_btree_block *block,
+		unsigned int nr_wanted, void *priv);
 typedef int (*xfs_btree_bload_claim_block_fn)(struct xfs_btree_cur *cur,
 		union xfs_btree_ptr *ptr, void *priv);
 typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur,
@@ -55,11 +57,14 @@ typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur,
 
 struct xfs_btree_bload {
 	/*
-	 * This function will be called nr_records times to load records into
-	 * the btree. The function does this by setting the cursor's bc_rec
-	 * field in in-core format. Records must be returned in sort order.
+	 * This function will be called to load @nr_wanted records into the
+	 * btree. The implementation does this by setting the cursor's bc_rec
+	 * field in in-core format and using init_rec_from_cur to set the
+	 * records in the btree block. Records must be returned in sort order.
+	 * The function must return the number of records loaded or the usual
+	 * negative errno.
 	 */
-	xfs_btree_bload_get_record_fn	get_record;
+	xfs_btree_bload_get_records_fn	get_records;
 
 	/*
 	 * This function will be called nr_blocks times to obtain a pointer
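The new callback contract (load up to nr_wanted records starting at a 1-based index, return how many were loaded or a negative error) can be exercised in userspace. The sketch below mirrors the shape of the loop in the diff, but the `toy_*` types, the plain int "records", the `batch` limit, and the bare -1 error value are all hypothetical stand-ins.

```c
#include <assert.h>

/* Hypothetical userspace analogue of xfs_btree_bload_get_records_fn. */
typedef int (*toy_get_records_fn)(int *block, unsigned int idx,
				  unsigned int nr_wanted, void *priv);

struct toy_source {
	const int	*recs;	/* sorted input records */
	unsigned int	nr;	/* number of input records */
	unsigned int	next;	/* next record to hand out */
	unsigned int	batch;	/* max records per call, must be >= 1 */
	unsigned int	calls;	/* number of indirect calls made */
};

/* Copies up to @batch records into the block; returns the count loaded. */
static int toy_get_records(int *block, unsigned int idx,
			   unsigned int nr_wanted, void *priv)
{
	struct toy_source *src = priv;
	unsigned int loaded = 0;

	src->calls++;
	while (loaded < nr_wanted && loaded < src->batch) {
		if (src->next >= src->nr)
			return -1;	/* stand-in for a negative errno */
		/* idx is 1-based, like btree record indices. */
		block[idx - 1 + loaded] = src->recs[src->next++];
		loaded++;
	}
	return (int)loaded;
}

/* Same loop shape as the patched leaf loader. */
static int toy_bload_leaf(int *block, unsigned int recs_this_block,
			  toy_get_records_fn get_records, void *priv)
{
	unsigned int j = 1;
	int ret;

	while (j <= recs_this_block) {
		ret = get_records(block, j, recs_this_block - j + 1, priv);
		if (ret < 0)
			return ret;
		j += ret;
	}
	return 0;
}

/* Loads five records; returns the number of callback calls, or -1. */
static int toy_demo_calls(unsigned int batch)
{
	static const int recs[] = { 10, 20, 30, 40, 50 };
	struct toy_source src = { recs, 5, 0, batch, 0 };
	int block[5] = { 0 };

	if (toy_bload_leaf(block, 5, toy_get_records, &src) != 0)
		return -1;
	for (unsigned int i = 0; i < 5; i++)
		if (block[i] != recs[i])
			return -1;
	return (int)src.calls;
}
```

A callback that returns 0 without an error would spin this loop forever, so an implementation has to load at least one record per call; the toy enforces that only by convention through `batch >= 1`.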
* [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree
From: Darrick J. Wong @ 2023-12-07  2:39 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Constrain the number of dirty buffers that are locked by the btree
staging code at any given time by establishing a threshold at which we
put them all on the delwri queue and push them to disk. This limits
memory consumption while writing out new btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_staging.c |   48 +++++++++++++++++++++++++++++--------
 fs/xfs/libxfs/xfs_btree_staging.h |   10 ++++++++
 fs/xfs/scrub/newbt.c              |    1 +
 3 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index a14be6f120600..9a935c8a51f91 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -333,18 +333,35 @@ xfs_btree_commit_ifakeroot(
 /*
  * Put a btree block that we're loading onto the ordered list and release it.
  * The btree blocks will be written to disk when bulk loading is finished.
+ * If we reach the dirty buffer threshold, flush them to disk before
+ * continuing.
  */
-static void
+static int
 xfs_btree_bload_drop_buf(
-	struct list_head	*buffers_list,
-	struct xfs_buf		**bpp)
+	struct xfs_btree_bload		*bbl,
+	struct list_head		*buffers_list,
+	struct xfs_buf			**bpp)
 {
-	if (*bpp == NULL)
-		return;
+	struct xfs_buf			*bp = *bpp;
+	int				error;
 
-	xfs_buf_delwri_queue_here(*bpp, buffers_list);
-	xfs_buf_relse(*bpp);
+	if (!bp)
+		return 0;
+
+	xfs_buf_delwri_queue_here(bp, buffers_list);
+	xfs_buf_relse(bp);
 	*bpp = NULL;
+	bbl->nr_dirty++;
+
+	if (!bbl->max_dirty || bbl->nr_dirty < bbl->max_dirty)
+		return 0;
+
+	error = xfs_buf_delwri_submit(buffers_list);
+	if (error)
+		return error;
+
+	bbl->nr_dirty = 0;
+	return 0;
 }
 
 /*
@@ -416,7 +433,10 @@ xfs_btree_bload_prep_block(
 	 */
 	if (*blockp)
 		xfs_btree_set_sibling(cur, *blockp, &new_ptr, XFS_BB_RIGHTSIB);
-	xfs_btree_bload_drop_buf(buffers_list, bpp);
+
+	ret = xfs_btree_bload_drop_buf(bbl, buffers_list, bpp);
+	if (ret)
+		return ret;
 
 	/* Initialize the new btree block. */
 	xfs_btree_init_block_cur(cur, new_bp, level, nr_this_block);
@@ -764,6 +784,7 @@ xfs_btree_bload(
 	cur->bc_nlevels = bbl->btree_height;
 	xfs_btree_set_ptr_null(cur, &child_ptr);
 	xfs_btree_set_ptr_null(cur, &ptr);
+	bbl->nr_dirty = 0;
 
 	xfs_btree_bload_level_geometry(cur, bbl, level, nr_this_level,
 			&avg_per_block, &blocks, &blocks_with_extra);
@@ -802,7 +823,10 @@ xfs_btree_bload(
 			xfs_btree_copy_ptrs(cur, &child_ptr, &ptr, 1);
 	}
 	total_blocks += blocks;
-	xfs_btree_bload_drop_buf(&buffers_list, &bp);
+
+	ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp);
+	if (ret)
+		goto out;
 
 	/* Populate the internal btree nodes. */
 	for (level = 1; level < cur->bc_nlevels; level++) {
@@ -844,7 +868,11 @@ xfs_btree_bload(
 				xfs_btree_copy_ptrs(cur, &first_ptr, &ptr, 1);
 		}
 		total_blocks += blocks;
-		xfs_btree_bload_drop_buf(&buffers_list, &bp);
+
+		ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp);
+		if (ret)
+			goto out;
+
 		xfs_btree_copy_ptrs(cur, &child_ptr, &first_ptr, 1);
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h
index bd5b3f004823a..f0a5007284ef1 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.h
+++ b/fs/xfs/libxfs/xfs_btree_staging.h
@@ -112,6 +112,16 @@ struct xfs_btree_bload {
 	 * height of the new btree.
 	 */
 	unsigned int			btree_height;
+
+	/*
+	 * Flush the new btree block buffer list to disk after this many blocks
+	 * have been formatted. Zero prohibits writing any buffers until all
+	 * blocks have been formatted.
+	 */
+	uint16_t			max_dirty;
+
+	/* Number of dirty buffers. */
+	uint16_t			nr_dirty;
 };
 
 int xfs_btree_bload_compute_geometry(struct xfs_btree_cur *cur,
diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
index 46883606ad883..81919eeabcdb8 100644
--- a/fs/xfs/scrub/newbt.c
+++ b/fs/xfs/scrub/newbt.c
@@ -94,6 +94,7 @@ xrep_newbt_init_ag(
 	xnr->alloc_hint = alloc_hint;
 	xnr->resv = resv;
 	INIT_LIST_HEAD(&xnr->resv_list);
+	xnr->bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */
 	xrep_newbt_estimate_slack(xnr);
 }
 
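The threshold logic in xfs_btree_bload_drop_buf can be checked with a small userspace counter model. This is a sketch only: the `toy_*` names are hypothetical, a "flush" is just a counter bump standing in for xfs_buf_delwri_submit(), and the conversion assumes XFS_B_TO_FSBT amounts to a truncating shift by the block-size log2.

```c
#include <assert.h>

struct toy_bload {
	unsigned int	max_dirty;	/* flush threshold, 0 = never early */
	unsigned int	nr_dirty;	/* buffers queued since last flush */
	unsigned int	flushes;	/* number of early flushes issued */
};

/* Same threshold test as the patched drop_buf, minus the real I/O. */
static void toy_drop_buf(struct toy_bload *bbl)
{
	bbl->nr_dirty++;

	if (!bbl->max_dirty || bbl->nr_dirty < bbl->max_dirty)
		return;

	bbl->flushes++;		/* stand-in for xfs_buf_delwri_submit() */
	bbl->nr_dirty = 0;
}

/* 256K expressed in filesystem blocks of (1 << blocklog) bytes. */
static unsigned int toy_max_dirty(unsigned int blocklog)
{
	return (256U << 10) >> blocklog;
}

/* Number of early flushes while formatting nr_blocks btree blocks. */
static unsigned int toy_flushes(unsigned int nr_blocks,
				unsigned int max_dirty)
{
	struct toy_bload bbl = { max_dirty, 0, 0 };

	for (unsigned int i = 0; i < nr_blocks; i++)
		toy_drop_buf(&bbl);
	return bbl.flushes;
}
```

For 4K blocks the threshold works out to 64 buffers, so a 200-block btree would be flushed three times during formatting, with the last eight buffers left for the final submit.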
* Re: [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree 2023-12-07 2:39 ` [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong @ 2023-12-07 5:27 ` Christoph Hellwig 0 siblings, 0 replies; 19+ messages in thread From: Christoph Hellwig @ 2023-12-07 5:27 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCHSET v28.2 0/6] xfs: prepare repair for bulk loading @ 2023-12-13 22:51 Darrick J. Wong 2023-12-13 22:51 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2023-12-13 22:51 UTC (permalink / raw) To: djwong, hch, chandanbabu; +Cc: linux-xfs Hi all, Before we start merging the online repair functions, let's improve the bulk loading code a bit. First, we need to fix a misinteraction between the AIL and the btree bulkloader wherein the delwri at the end of the bulk load fails to queue a buffer for writeback if it happens to be on the AIL list. Second, we introduce a defer ops barrier object so that the process of reaping blocks after a repair cannot queue more than two extents per EFI log item. This increases our exposure to leaking blocks if the system goes down during a reap, but also should prevent transaction overflows, which result in the system going down. Third, we change the bulkloader itself to copy multiple records into a block if possible, and add some debugging knobs so that developers can control the slack factors, just like they can do for xfs_repair. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading-6.8
---
 fs/xfs/libxfs/xfs_btree.c         |    2 -
 fs/xfs/libxfs/xfs_btree.h         |    3 +
 fs/xfs/libxfs/xfs_btree_staging.c |   78 +++++++++++++++++++++++++++----------
 fs/xfs/libxfs/xfs_btree_staging.h |   25 +++++++++---
 fs/xfs/scrub/newbt.c              |   12 ++++--
 fs/xfs/xfs_buf.c                  |   44 +++++++++++++++++++--
 fs/xfs/xfs_buf.h                  |    1
 fs/xfs/xfs_globals.c              |   12 ++++++
 fs/xfs/xfs_sysctl.h               |    2 +
 fs/xfs/xfs_sysfs.c                |   54 ++++++++++++++++++++++++++
 10 files changed, 198 insertions(+), 35 deletions(-)
* [PATCH 1/6] xfs: force all buffers to be written during btree bulk load 2023-12-13 22:51 [PATCHSET v28.2 0/6] xfs: prepare repair for bulk loading Darrick J. Wong @ 2023-12-13 22:51 ` Darrick J. Wong 0 siblings, 0 replies; 19+ messages in thread From: Darrick J. Wong @ 2023-12-13 22:51 UTC (permalink / raw) To: djwong, hch, chandanbabu; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While stress-testing online repair of btrees, I noticed periodic assertion failures from the buffer cache about buffers with incorrect DELWRI_Q state. Looking further, I observed this race between the AIL trying to write out a btree block and repair zapping a btree block after the fact:

AIL:                    Repair0:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

                        stale buf X:
                        clear DELWRI_Q
                        does not clear b_list
                        free space X
                        commit

delwri_submit   # oops

Worse yet, I discovered that running the same repair over and over in a tight loop can result in a second race that causes data integrity problems with the repair:

AIL:                    Repair0:                Repair1:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

                        stale buf X:
                        clear DELWRI_Q
                        does not clear b_list
                        free space X
                        commit

                                                find free space X
                                                get buffer
                                                rewrite buffer
                                                delwri_queue:
                                                set DELWRI_Q
                                                already on a list, do not add
                                                commit

                                                BAD: committed tree root before all blocks written

delwri_submit   # too late now

I traced this to my own misunderstanding of how the delwri lists work, particularly with regard to the AIL's buffer list. If a buffer is logged and committed, the buffer can end up on that AIL buffer list. If btree repairs are run twice in rapid succession, it's possible that the first repair will invalidate the buffer and free it before the next time the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the buffer state without removing the buffer from its delwri list. The buffer doesn't know which list it's on, so it cannot know which lock to take to protect the list for a removal.
If the second repair allocates the same block, it will then recycle the buffer to start writing the new btree block. Meanwhile, if the AIL wakes up and walks the buffer list, it will ignore the buffer because it can't lock it, and go back to sleep. When the second repair calls delwri_queue to put the buffer on the list of buffers to write before committing the new btree, it will set DELWRI_Q again, but since the buffer hasn't been removed from the AIL's buffer list, it won't add it to the bulkload buffer's list. This is incorrect, because the bulkload caller relies on delwri_submit to ensure that all the buffers have been sent to disk /before/ committing the new btree root pointer. This ordering requirement is required for data consistency. Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally drop it, so the next thread to walk through the btree will trip over a debug assertion on that flag. To fix this, create a new function that waits for the buffer to be removed from any other delwri lists before adding the buffer to the caller's delwri list. By waiting for the buffer to clear both the delwri list and any potential delwri wait list, we can be sure that repair will initiate writes of all buffers and report all write errors back to userspace instead of committing the new structure. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_btree_staging.c | 4 +-- fs/xfs/xfs_buf.c | 44 ++++++++++++++++++++++++++++++++++--- fs/xfs/xfs_buf.h | 1 + 3 files changed, 42 insertions(+), 7 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c index dd75e208b543..29e3f8ccb185 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.c +++ b/fs/xfs/libxfs/xfs_btree_staging.c @@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf( if (*bpp == NULL) return; - if (!xfs_buf_delwri_queue(*bpp, buffers_list)) - ASSERT(0); - + xfs_buf_delwri_queue_here(*bpp, buffers_list); xfs_buf_relse(*bpp); *bpp = NULL; } diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 545c7991b9b5..ec4bd7a24d88 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -2049,6 +2049,14 @@ xfs_alloc_buftarg( return NULL; } +static inline void +xfs_buf_list_del( + struct xfs_buf *bp) +{ + list_del_init(&bp->b_list); + wake_up_var(&bp->b_list); +} + /* * Cancel a delayed write list. * @@ -2066,7 +2074,7 @@ xfs_buf_delwri_cancel( xfs_buf_lock(bp); bp->b_flags &= ~_XBF_DELWRI_Q; - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); xfs_buf_relse(bp); } } @@ -2119,6 +2127,34 @@ xfs_buf_delwri_queue( return true; } +/* + * Queue a buffer to this delwri list as part of a data integrity operation. + * If the buffer is on any other delwri list, we'll wait for that to clear + * so that the caller can submit the buffer for IO and wait for the result. + * Callers must ensure the buffer is not already on the list. + */ +void +xfs_buf_delwri_queue_here( + struct xfs_buf *bp, + struct list_head *buffer_list) +{ + /* + * We need this buffer to end up on the /caller's/ delwri list, not any + * old list. This can happen if the buffer is marked stale (which + * clears DELWRI_Q) after the AIL queues the buffer to its list but + * before the AIL has a chance to submit the list. 
+ */ + while (!list_empty(&bp->b_list)) { + xfs_buf_unlock(bp); + wait_var_event(&bp->b_list, list_empty(&bp->b_list)); + xfs_buf_lock(bp); + } + + ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)); + + xfs_buf_delwri_queue(bp, buffer_list); +} + /* * Compare function is more complex than it needs to be because * the return value is only 32 bits and we are doing comparisons @@ -2181,7 +2217,7 @@ xfs_buf_delwri_submit_buffers( * reference and remove it from the list here. */ if (!(bp->b_flags & _XBF_DELWRI_Q)) { - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); xfs_buf_relse(bp); continue; } @@ -2201,7 +2237,7 @@ xfs_buf_delwri_submit_buffers( list_move_tail(&bp->b_list, wait_list); } else { bp->b_flags |= XBF_ASYNC; - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); } __xfs_buf_submit(bp, false); } @@ -2255,7 +2291,7 @@ xfs_buf_delwri_submit( while (!list_empty(&wait_list)) { bp = list_first_entry(&wait_list, struct xfs_buf, b_list); - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); /* * Wait on the locked buffer, check for errors and unlock and diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index c86e16419656..b470de08a46c 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -319,6 +319,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); /* Delayed Write Buffer Routines */ extern void xfs_buf_delwri_cancel(struct list_head *); extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); +void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl); extern int xfs_buf_delwri_submit(struct list_head *); extern int xfs_buf_delwri_submit_nowait(struct list_head *); extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); ^ permalink raw reply related [flat|nested] 19+ messages in thread
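The second race described in the commit message can be replayed deterministically in a single-threaded userspace model. What follows is a sketch, not kernel code: `buf_model`, `model_delwri_queue()`, `model_stale()` and the small list helpers are hypothetical names, and the queueing rule (set the flag, add to the list only if the buffer is on no list, return true) is my reading of the behaviour the commit message describes for xfs_buf_delwri_queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DELWRI_Q	(1u << 0)	/* stand-in for _XBF_DELWRI_Q */

/* Minimal circular doubly linked list, in the style of struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *n) { n->prev = n->next = n; }
static bool list_is_empty(const struct list_node *n) { return n->next == n; }

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static bool list_contains(const struct list_node *head, const struct list_node *n)
{
	for (const struct list_node *p = head->next; p != head; p = p->next)
		if (p == n)
			return true;
	return false;
}

struct buf_model {
	unsigned int		flags;
	struct list_node	b_list;
};

/* Assumed queueing rule: the flag is always set, the list add is conditional. */
static bool model_delwri_queue(struct buf_model *bp, struct list_node *list)
{
	if (bp->flags & MODEL_DELWRI_Q)
		return false;
	bp->flags |= MODEL_DELWRI_Q;
	if (list_is_empty(&bp->b_list))
		list_add_tail_node(&bp->b_list, list);
	return true;
}

/* Marking the buffer stale clears the flag but leaves it on its list. */
static void model_stale(struct buf_model *bp)
{
	bp->flags &= ~MODEL_DELWRI_Q;
}

/* Replay AIL -> Repair0 -> Repair1; report whether repair's list got the buffer. */
static bool model_replay_race(void)
{
	struct list_node ail_list, repair_list;
	struct buf_model bp = { 0 };
	bool queued;

	list_init(&ail_list);
	list_init(&repair_list);
	list_init(&bp.b_list);

	model_delwri_queue(&bp, &ail_list);		/* AIL queues buffer X */
	model_stale(&bp);				/* Repair0 stales and frees X */
	queued = model_delwri_queue(&bp, &repair_list);	/* Repair1 reuses X */

	assert(queued);	/* the call reports success... */
	return list_contains(&repair_list, &bp.b_list);	/* ...but X is still on the AIL list */
}
```

In this model the replay returns false: the second queue call succeeds and sets the flag, yet the buffer never reaches the repair list, so a submit of that list would not write it. That is the gap the new queueing function closes by waiting for b_list to empty first.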
* [PATCHSET v26.0 0/6] xfs: prepare repair for bulk loading @ 2023-07-27 22:18 Darrick J. Wong 2023-07-27 22:24 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2023-07-27 22:18 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, Before we start merging the online repair functions, let's improve the bulk loading code a bit. First, we need to fix a misinteraction between the AIL and the btree bulkloader wherein the delwri at the end of the bulk load fails to queue a buffer for writeback if it happens to be on the AIL list. Second, we introduce EFIs in the btree bulkloader block allocator to guarantee that staging blocks are freed if the filesystem goes down before committing the new btree. Third, we change the bulkloader itself to copy multiple records into a block if possible, and add some debugging knobs so that developers can control the slack factors, just like they can do for xfs_repair. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. 
--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-prep-for-bulk-loading
---
 fs/xfs/Makefile                   |    1
 fs/xfs/libxfs/xfs_btree.c         |    2
 fs/xfs/libxfs/xfs_btree.h         |    3
 fs/xfs/libxfs/xfs_btree_staging.c |   67 +++-
 fs/xfs/libxfs/xfs_btree_staging.h |   32 +-
 fs/xfs/scrub/agheader_repair.c    |    1
 fs/xfs/scrub/common.c             |    1
 fs/xfs/scrub/newbt.c              |  629 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/newbt.h              |   66 ++++
 fs/xfs/scrub/repair.c             |   10 +
 fs/xfs/scrub/repair.h             |    1
 fs/xfs/scrub/scrub.c              |    2
 fs/xfs/scrub/trace.h              |   37 ++
 fs/xfs/xfs_buf.c                  |   47 +++
 fs/xfs/xfs_buf.h                  |    1
 fs/xfs/xfs_globals.c              |   12 +
 fs/xfs/xfs_sysctl.h               |    2
 fs/xfs/xfs_sysfs.c                |   54 +++
 18 files changed, 931 insertions(+), 37 deletions(-)
 create mode 100644 fs/xfs/scrub/newbt.c
 create mode 100644 fs/xfs/scrub/newbt.h
* [PATCH 1/6] xfs: force all buffers to be written during btree bulk load 2023-07-27 22:18 [PATCHSET v26.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong @ 2023-07-27 22:24 ` Darrick J. Wong 0 siblings, 0 replies; 19+ messages in thread From: Darrick J. Wong @ 2023-07-27 22:24 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> While stress-testing online repair of btrees, I noticed periodic assertion failures from the buffer cache about buffer readers encountering buffers with DELWRI_Q set, even though the btree bulk load had already committed and the buffer itself wasn't on any delwri list. I traced this to a misunderstanding of how the delwri lists work, particularly with regards to the AIL's buffer list. If a buffer is logged and committed, the buffer can end up on that AIL buffer list. If btree repairs are run twice in rapid succession, it's possible that the first repair will invalidate the buffer and free it before the next time the AIL wakes up. This clears DELWRI_Q from the buffer state. If the second repair allocates the same block, it will then recycle the buffer to start writing the new btree block. Meanwhile, if the AIL wakes up and walks the buffer list, it will ignore the buffer because it can't lock it, and go back to sleep. When the second repair calls delwri_queue to put the buffer on the list of buffers to write before committing the new btree, it will set DELWRI_Q again, but since the buffer hasn't been removed from the AIL's buffer list, it won't add it to the bulkload buffer's list. This is incorrect, because the bulkload caller relies on delwri_submit to ensure that all the buffers have been sent to disk /before/ committing the new btree root pointer. This ordering requirement is required for data consistency. Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally drop it, so the next thread to walk through the btree will trip over a debug assertion on that flag. 
To fix this, create a new function that waits for the buffer to be removed from any other delwri lists before adding the buffer to the caller's delwri list. By waiting for the buffer to clear both the delwri list and any potential delwri wait list, we can be sure that repair will initiate writes of all buffers and report all write errors back to userspace instead of committing the new structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_btree_staging.c | 4 +-- fs/xfs/xfs_buf.c | 47 ++++++++++++++++++++++++++++++++++--- fs/xfs/xfs_buf.h | 1 + 3 files changed, 45 insertions(+), 7 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c index dd75e208b543e..29e3f8ccb1852 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.c +++ b/fs/xfs/libxfs/xfs_btree_staging.c @@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf( if (*bpp == NULL) return; - if (!xfs_buf_delwri_queue(*bpp, buffers_list)) - ASSERT(0); - + xfs_buf_delwri_queue_here(*bpp, buffers_list); xfs_buf_relse(*bpp); *bpp = NULL; } diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index fa392c43ba166..683f07c929681 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -2046,6 +2046,14 @@ xfs_alloc_buftarg( return NULL; } +static inline void +xfs_buf_list_del( + struct xfs_buf *bp) +{ + list_del_init(&bp->b_list); + wake_up_var(&bp->b_list); +} + /* * Cancel a delayed write list. * @@ -2063,7 +2071,7 @@ xfs_buf_delwri_cancel( xfs_buf_lock(bp); bp->b_flags &= ~_XBF_DELWRI_Q; - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); xfs_buf_relse(bp); } } @@ -2116,6 +2124,37 @@ xfs_buf_delwri_queue( return true; } +/* + * Queue a buffer to this delwri list as part of a data integrity operation. + * If the buffer is on any other delwri list, we'll wait for that to clear + * so that the caller can submit the buffer for IO and wait for the result. + * Callers must ensure the buffer is not already on the list. 
+ */ +void +xfs_buf_delwri_queue_here( + struct xfs_buf *bp, + struct list_head *buffer_list) +{ + /* + * We need this buffer to end up on the /caller's/ delwri list, not any + * old list. This can happen if the buffer is marked stale (which + * clears DELWRI_Q) after the AIL queues the buffer to its list but + * before the AIL has a chance to submit the list. + */ + while (!list_empty(&bp->b_list)) { + xfs_buf_unlock(bp); + wait_var_event(&bp->b_list, list_empty(&bp->b_list)); + xfs_buf_lock(bp); + } + + ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)); + + /* This buffer is uptodate; don't let it get reread. */ + bp->b_flags |= XBF_DONE; + + xfs_buf_delwri_queue(bp, buffer_list); +} + /* * Compare function is more complex than it needs to be because * the return value is only 32 bits and we are doing comparisons @@ -2178,7 +2217,7 @@ xfs_buf_delwri_submit_buffers( * reference and remove it from the list here. */ if (!(bp->b_flags & _XBF_DELWRI_Q)) { - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); xfs_buf_relse(bp); continue; } @@ -2198,7 +2237,7 @@ xfs_buf_delwri_submit_buffers( list_move_tail(&bp->b_list, wait_list); } else { bp->b_flags |= XBF_ASYNC; - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); } __xfs_buf_submit(bp, false); } @@ -2252,7 +2291,7 @@ xfs_buf_delwri_submit( while (!list_empty(&wait_list)) { bp = list_first_entry(&wait_list, struct xfs_buf, b_list); - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); /* * Wait on the locked buffer, check for errors and unlock and diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index df8f47953bb4e..5896b58c5f4db 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -318,6 +318,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); /* Delayed Write Buffer Routines */ extern void xfs_buf_delwri_cancel(struct list_head *); extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); +void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl); extern int xfs_buf_delwri_submit(struct 
list_head *); extern int xfs_buf_delwri_submit_nowait(struct list_head *); extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); ^ permalink raw reply related [flat|nested] 19+ messages in thread
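The commit message argues that repair has to submit its own buffers so that write errors are reported back to it. A toy userspace model of that argument follows. Everything in it is hypothetical (`io_buf_model`, `io_list_model`, `io_list_submit()`, `io_repair_sees()`); it only assumes that a submitter learns about the first error among the buffers on its own list and learns nothing about buffers on someone else's list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical toy buffer: only the outcome of the write matters here. */
struct io_buf_model {
	int write_error;	/* 0 on success, negative errno on failure */
};

/* Toy delwri list: a small fixed array of buffers owned by one submitter. */
struct io_list_model {
	struct io_buf_model	*bufs[4];
	size_t			nr;
};

static void io_list_add(struct io_list_model *l, struct io_buf_model *bp)
{
	assert(l->nr < 4);
	l->bufs[l->nr++] = bp;
}

/* "Write" every buffer on the list; return the first error seen, then empty the list. */
static int io_list_submit(struct io_list_model *l)
{
	int error = 0;

	for (size_t i = 0; i < l->nr; i++)
		if (!error && l->bufs[i]->write_error)
			error = l->bufs[i]->write_error;
	l->nr = 0;
	return error;
}

/*
 * One good and one failing buffer belong to the new btree.  Report the
 * error that repair's submit returns, depending on whose list holds the
 * failing buffer.
 */
static int io_repair_sees(int bad_buf_on_repair_list)
{
	struct io_buf_model good = { 0 }, bad = { -EIO };
	struct io_list_model ail = { .nr = 0 }, repair = { .nr = 0 };

	io_list_add(&repair, &good);
	io_list_add(bad_buf_on_repair_list ? &repair : &ail, &bad);

	(void)io_list_submit(&ail);	/* whatever the AIL learns is not passed to repair */
	return io_list_submit(&repair);
}
```

In this model, repair's submit returns 0 when the failing buffer sits on the other list, and -EIO only when the buffer is on repair's own list. That is the reason for waiting until the buffer has left every other list before queueing it.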
* [PATCHSET v25.0 0/6] xfs: prepare repair for bulk loading @ 2023-05-26 0:28 Darrick J. Wong 2023-05-26 0:45 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2023-05-26 0:28 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, Before we start merging the online repair functions, let's improve the bulk loading code a bit. First, we need to fix a misinteraction between the AIL and the btree bulkloader wherein the delwri at the end of the bulk load fails to queue a buffer for writeback if it happens to be on the AIL list. Second, we introduce EFIs in the btree bulkloader block allocator to guarantee that staging blocks are freed if the filesystem goes down before committing the new btree. Third, we change the bulkloader itself to copy multiple records into a block if possible, and add some debugging knobs so that developers can control the slack factors, just like they can do for xfs_repair. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. 
--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-prep-for-bulk-loading
---
 fs/xfs/Makefile                   |    1
 fs/xfs/libxfs/xfs_btree.c         |    2
 fs/xfs/libxfs/xfs_btree.h         |    3
 fs/xfs/libxfs/xfs_btree_staging.c |   67 +++-
 fs/xfs/libxfs/xfs_btree_staging.h |   32 +-
 fs/xfs/scrub/agheader_repair.c    |    1
 fs/xfs/scrub/common.c             |    1
 fs/xfs/scrub/newbt.c              |  622 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/newbt.h              |   66 ++++
 fs/xfs/scrub/repair.c             |   10 +
 fs/xfs/scrub/repair.h             |    1
 fs/xfs/scrub/scrub.c              |    2
 fs/xfs/scrub/trace.h              |   37 ++
 fs/xfs/xfs_buf.c                  |   31 ++
 fs/xfs/xfs_buf.h                  |    1
 fs/xfs/xfs_globals.c              |   12 +
 fs/xfs/xfs_sysctl.h               |    2
 fs/xfs/xfs_sysfs.c                |   54 +++
 18 files changed, 912 insertions(+), 33 deletions(-)
 create mode 100644 fs/xfs/scrub/newbt.c
 create mode 100644 fs/xfs/scrub/newbt.h
* [PATCH 1/6] xfs: force all buffers to be written during btree bulk load 2023-05-26 0:28 [PATCHSET v25.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong @ 2023-05-26 0:45 ` Darrick J. Wong 2023-06-21 2:05 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2023-05-26 0:45 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> While stress-testing online repair of btrees, I noticed periodic assertion failures from the buffer cache about buffer readers encountering buffers with DELWRI_Q set, even though the btree bulk load had already committed and the buffer itself wasn't on any delwri list. I traced this to a misunderstanding of how the delwri lists work, particularly with regards to the AIL's buffer list. If a buffer is logged and committed, the buffer can end up on that AIL buffer list. If btree repairs are run twice in rapid succession, it's possible that the first repair will invalidate the buffer and free it before the next time the AIL wakes up. This clears DELWRI_Q from the buffer state. If the second repair allocates the same block, it will then recycle the buffer to start writing the new btree block. Meanwhile, if the AIL wakes up and walks the buffer list, it will ignore the buffer because it can't lock it, and go back to sleep. When the second repair calls delwri_queue to put the buffer on the list of buffers to write before committing the new btree, it will set DELWRI_Q again, but since the buffer hasn't been removed from the AIL's buffer list, it won't add it to the bulkload buffer's list. This is incorrect, because the bulkload caller relies on delwri_submit to ensure that all the buffers have been sent to disk /before/ committing the new btree root pointer. This ordering requirement is required for data consistency. 
Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally drop it, so the next thread to walk through the btree will trip over a debug assertion on that flag. To fix this, create a new function that waits for the buffer to be removed from any other delwri lists before adding the buffer to the caller's delwri list. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_btree_staging.c | 4 +--- fs/xfs/xfs_buf.c | 31 +++++++++++++++++++++++++++++++ fs/xfs/xfs_buf.h | 1 + 3 files changed, 33 insertions(+), 3 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c index dd75e208b543..29e3f8ccb185 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.c +++ b/fs/xfs/libxfs/xfs_btree_staging.c @@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf( if (*bpp == NULL) return; - if (!xfs_buf_delwri_queue(*bpp, buffers_list)) - ASSERT(0); - + xfs_buf_delwri_queue_here(*bpp, buffers_list); xfs_buf_relse(*bpp); *bpp = NULL; } diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index b31e6d09a056..2a1a641c2b87 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -2112,6 +2112,37 @@ xfs_buf_delwri_queue( return true; } +/* + * Queue a buffer to this delwri list as part of a data integrity operation. + * If the buffer is on any other delwri list, we'll wait for that to clear + * so that the caller can submit the buffer for IO and wait for the result. + * Callers must ensure the buffer is not already on the list. + */ +void +xfs_buf_delwri_queue_here( + struct xfs_buf *bp, + struct list_head *buffer_list) +{ + /* + * We need this buffer to end up on the /caller's/ delwri list, not any + * old list. This can happen if the buffer is marked stale (which + * clears DELWRI_Q) after the AIL queues the buffer to its list but + * before the AIL has a chance to submit the list. 
+ */ + while (!list_empty(&bp->b_list)) { + xfs_buf_unlock(bp); + delay(1); + xfs_buf_lock(bp); + } + + ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)); + + /* This buffer is uptodate; don't let it get reread. */ + bp->b_flags |= XBF_DONE; + + xfs_buf_delwri_queue(bp, buffer_list); +} + /* * Compare function is more complex than it needs to be because * the return value is only 32 bits and we are doing comparisons diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index d6e8c3bab9f6..467ddb2e2f0d 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -315,6 +315,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); /* Delayed Write Buffer Routines */ extern void xfs_buf_delwri_cancel(struct list_head *); extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); +void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl); extern int xfs_buf_delwri_submit(struct list_head *); extern int xfs_buf_delwri_submit_nowait(struct list_head *); extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 1/6] xfs: force all buffers to be written during btree bulk load 2023-05-26 0:45 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong @ 2023-06-21 2:05 ` Dave Chinner 2023-07-05 23:37 ` Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2023-06-21 2:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Thu, May 25, 2023 at 05:45:35PM -0700, Darrick J. Wong wrote: > @@ -2112,6 +2112,37 @@ xfs_buf_delwri_queue( > return true; > } > > +/* > + * Queue a buffer to this delwri list as part of a data integrity operation. > + * If the buffer is on any other delwri list, we'll wait for that to clear > + * so that the caller can submit the buffer for IO and wait for the result. > + * Callers must ensure the buffer is not already on the list. > + */ > +void > +xfs_buf_delwri_queue_here( This is more of an "exclusive" queuing semantic. i.e. queue this buffer exclusively on the list provided, rather than just ensuring it is queued on some delwri list.... > + struct xfs_buf *bp, > + struct list_head *buffer_list) > +{ > + /* > + * We need this buffer to end up on the /caller's/ delwri list, not any > + * old list. This can happen if the buffer is marked stale (which > + * clears DELWRI_Q) after the AIL queues the buffer to its list but > + * before the AIL has a chance to submit the list. > + */ > + while (!list_empty(&bp->b_list)) { > + xfs_buf_unlock(bp); > + delay(1); > + xfs_buf_lock(bp); > + } Not a big fan of this as it the buffer can be on the AIL buffer list for some time (e.g. AIL might have a hundred thousand buffers to push). 
This seems more like a case for: while (!list_empty(&bp->b_list)) { xfs_buf_unlock(bp); wait_event_var(bp->b_flags, !(bp->b_flags & _XBF_DELWRI_Q)); xfs_buf_lock(bp); } And a wrapper: void xfs_buf_remove_delwri( struct xfs_buf *bp) { list_del(&bp->b_list); bp->b_flags &= ~_XBF_DELWRI_Q; wake_up_var(bp->b_flags); } And we replace all the places where the buffer is taken off the delwri list with calls to xfs_buf_remove_delwri()... This will greatly reduce the number of context switches during a wait cycle, and reduce the latency of waiting for buffers that are queued for delwri... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/6] xfs: force all buffers to be written during btree bulk load 2023-06-21 2:05 ` Dave Chinner @ 2023-07-05 23:37 ` Darrick J. Wong 0 siblings, 0 replies; 19+ messages in thread From: Darrick J. Wong @ 2023-07-05 23:37 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Jun 21, 2023 at 12:05:32PM +1000, Dave Chinner wrote: > On Thu, May 25, 2023 at 05:45:35PM -0700, Darrick J. Wong wrote: > > @@ -2112,6 +2112,37 @@ xfs_buf_delwri_queue( > > return true; > > } > > > > +/* > > + * Queue a buffer to this delwri list as part of a data integrity operation. > > + * If the buffer is on any other delwri list, we'll wait for that to clear > > + * so that the caller can submit the buffer for IO and wait for the result. > > + * Callers must ensure the buffer is not already on the list. > > + */ > > +void > > +xfs_buf_delwri_queue_here( > > This is more of an "exclusive" queuing semantic. i.e. queue this > buffer exclusively on the list provided, rather than just ensuring > it is queued on some delwri list.... > > > + struct xfs_buf *bp, > > + struct list_head *buffer_list) > > +{ > > + /* > > + * We need this buffer to end up on the /caller's/ delwri list, not any > > + * old list. This can happen if the buffer is marked stale (which > > + * clears DELWRI_Q) after the AIL queues the buffer to its list but > > + * before the AIL has a chance to submit the list. > > + */ > > + while (!list_empty(&bp->b_list)) { > > + xfs_buf_unlock(bp); > > + delay(1); > > + xfs_buf_lock(bp); > > + } > > Not a big fan of this as it the buffer can be on the AIL buffer list > for some time (e.g. AIL might have a hundred thousand buffers to > push). 
> > This seems more like a case for: > > while (!list_empty(&bp->b_list)) { > xfs_buf_unlock(bp); > wait_event_var(bp->b_flags, !(bp->b_flags & _XBF_DELWRI_Q)); > xfs_buf_lock(bp); > } > > And a wrapper: > > void xfs_buf_remove_delwri( > struct xfs_buf *bp) > { > list_del(&bp->b_list); > bp->b_flags &= ~_XBF_DELWRI_Q; > wake_up_var(bp->b_flags); > } > > And we replace all the places where the buffer is taken off the > delwri list with calls to xfs_buf_remove_delwri()... > > This will greatly reduce the number of context switches during a > wait cycle, and reduce the latency of waiting for buffers that are > queued for delwri... The thing is, we're really waiting for the buffer to clear /all/ delwri-related lists. This could be the actual buffer list, but it could also be the xfs_buf_delwri_submit's wait_list. It's not sufficient for repair to allow some other (probably the AIL) thread to write the new structure's buffer because that caller could see an EIO/ENOSPC error, and repair needs to return the specific condition to the caller. That said, I suppose we could spring a wait_var_event on bp->b_list.next to look for list_empty(&bp->b_list). I'll try that out tonight. --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading @ 2022-12-30 22:12 Darrick J. Wong 2022-12-30 22:12 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, Before we start merging the online repair functions, let's improve the bulk loading code a bit. First, we need to fix a misinteraction between the AIL and the btree bulkloader wherein the delwri at the end of the bulk load fails to queue a buffer for writeback if it happens to be on the AIL list. Second, we introduce EFIs in the btree bulkloader block allocator to guarantee that staging blocks are freed if the filesystem goes down before committing the new btree. Third, we change the bulkloader itself to copy multiple records into a block if possible, and add some debugging knobs so that developers can control the slack factors, just like they can do for xfs_repair. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. 
--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-prep-for-bulk-loading
---
 fs/xfs/Makefile                   |    1
 fs/xfs/libxfs/xfs_btree.c         |    2
 fs/xfs/libxfs/xfs_btree.h         |    3
 fs/xfs/libxfs/xfs_btree_staging.c |   67 +++-
 fs/xfs/libxfs/xfs_btree_staging.h |   32 +-
 fs/xfs/scrub/agheader_repair.c    |    1
 fs/xfs/scrub/common.c             |    1
 fs/xfs/scrub/newbt.c              |  567 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/newbt.h              |   68 ++++
 fs/xfs/scrub/repair.c             |   10 +
 fs/xfs/scrub/repair.h             |    1
 fs/xfs/scrub/scrub.c              |    2
 fs/xfs/scrub/trace.h              |   36 ++
 fs/xfs/xfs_buf.c                  |   31 ++
 fs/xfs/xfs_buf.h                  |    1
 fs/xfs/xfs_globals.c              |   12 +
 fs/xfs/xfs_sysctl.h               |    2
 fs/xfs/xfs_sysfs.c                |   54 ++++
 18 files changed, 858 insertions(+), 33 deletions(-)
 create mode 100644 fs/xfs/scrub/newbt.c
 create mode 100644 fs/xfs/scrub/newbt.h
* [PATCH 1/6] xfs: force all buffers to be written during btree bulk load
  2022-12-30 22:12 [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffer readers
encountering buffers with DELWRI_Q set, even though the btree bulk load
had already committed and the buffer itself wasn't on any delwri list.

I traced this to a misunderstanding of how the delwri lists work,
particularly with regard to the AIL's buffer list.  If a buffer is
logged and committed, the buffer can end up on that AIL buffer list.  If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up.  This clears DELWRI_Q from the buffer state.

If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block.  Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.

When the second repair calls delwri_queue to put the buffer on the list
of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.

This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer.  This ordering is required for
data consistency.

Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.
To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_staging.c |    4 +---
 fs/xfs/xfs_buf.c                  |   31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf.h                  |    1 +
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index dd75e208b543..29e3f8ccb185 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf(
 	if (*bpp == NULL)
 		return;
 
-	if (!xfs_buf_delwri_queue(*bpp, buffers_list))
-		ASSERT(0);
-
+	xfs_buf_delwri_queue_here(*bpp, buffers_list);
 	xfs_buf_relse(*bpp);
 	*bpp = NULL;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a538501b652b..2bea2c3f9ead 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2113,6 +2113,37 @@ xfs_buf_delwri_queue(
 	return true;
 }
 
+/*
+ * Queue a buffer to this delwri list as part of a data integrity operation.
+ * If the buffer is on any other delwri list, we'll wait for that to clear
+ * so that the caller can submit the buffer for IO and wait for the result.
+ * Callers must ensure the buffer is not already on the list.
+ */
+void
+xfs_buf_delwri_queue_here(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	/*
+	 * We need this buffer to end up on the /caller's/ delwri list, not any
+	 * old list.  This can happen if the buffer is marked stale (which
+	 * clears DELWRI_Q) after the AIL queues the buffer to its list but
+	 * before the AIL has a chance to submit the list.
+	 */
+	while (!list_empty(&bp->b_list)) {
+		xfs_buf_unlock(bp);
+		delay(1);
+		xfs_buf_lock(bp);
+	}
+
+	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+
+	/* This buffer is uptodate; don't let it get reread. */
+	bp->b_flags |= XBF_DONE;
+
+	xfs_buf_delwri_queue(bp, buffer_list);
+}
+
 /*
  * Compare function is more complex than it needs to be because
  * the return value is only 32 bits and we are doing comparisons
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d6e8c3bab9f6..467ddb2e2f0d 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -315,6 +315,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
 /* Delayed Write Buffer Routines */
 extern void xfs_buf_delwri_cancel(struct list_head *);
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
+void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
 extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
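The failure mode from the commit message can be replayed in a small single-threaded userspace model. This is a sketch, not kernel code: `toy_buf`, `toy_list` and every `toy_*` function are hypothetical stand-ins. The queueing rule in `toy_delwri_queue` (set the flag, but link the buffer only if it is on no list) is an assumption drawn from the description in this patch. The patch's unlock/delay/relock loop is replaced by a callback that simulates the other list owner dropping the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DELWRI_Q 0x1u	/* stand-in for _XBF_DELWRI_Q */

struct toy_list { int nr; };	/* a delwri list, reduced to a member count */

struct toy_buf {
	unsigned int	flags;
	struct toy_list	*on_list;	/* list the buffer is linked on, or NULL */
};

/* Assumed rule: set the flag, but link only a buffer that is on no list. */
static bool toy_delwri_queue(struct toy_buf *bp, struct toy_list *list)
{
	if (bp->flags & TOY_DELWRI_Q)
		return false;
	bp->flags |= TOY_DELWRI_Q;
	if (bp->on_list == NULL) {
		bp->on_list = list;
		list->nr++;
	}
	return true;
}

/* Marking a buffer stale clears the flag but leaves the list linkage. */
static void toy_stale(struct toy_buf *bp)
{
	bp->flags &= ~TOY_DELWRI_Q;
}

/* Hypothetical: the other list owner finally drops the buffer. */
static void toy_owner_drops(struct toy_buf *bp)
{
	bp->on_list->nr--;
	bp->on_list = NULL;
}

/* Model of the fix: wait until no other list holds the buffer, then queue. */
static void toy_delwri_queue_here(struct toy_buf *bp, struct toy_list *list,
				  void (*wait_step)(struct toy_buf *))
{
	while (bp->on_list != NULL)
		wait_step(bp);	/* the patch unlocks, delays, and relocks here */
	assert(!(bp->flags & TOY_DELWRI_Q));
	toy_delwri_queue(bp, list);
}

/* Returns true if the buffer ends up on the repair list. */
bool toy_scenario(bool use_fix)
{
	struct toy_list ail = { 0 }, repair = { 0 };
	struct toy_buf bp = { 0, NULL };

	toy_delwri_queue(&bp, &ail);	/* AIL queues the logged buffer */
	toy_stale(&bp);			/* first repair frees the block */
	if (use_fix)
		toy_delwri_queue_here(&bp, &repair, toy_owner_drops);
	else
		toy_delwri_queue(&bp, &repair);
	return bp.on_list == &repair;
}
```

In this model `toy_scenario(false)` returns false: the flag is set again, but the buffer stays linked on the AIL's list, so a submit of the repair list would never see it. `toy_scenario(true)` returns true because the queueing call waits for the stale linkage to go away first.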
end of thread, other threads:[~2023-12-13 22:51 UTC | newest]

Thread overview: 19+ messages
2023-12-07  2:38 [PATCHSET v28.1 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
2023-12-07  2:38 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2023-12-07  5:25 ` Christoph Hellwig
2023-12-07  2:38 ` [PATCH 2/6] xfs: set XBF_DONE on newly formatted btree block that are ready for writing Darrick J. Wong
2023-12-07  5:26 ` Christoph Hellwig
2023-12-11 18:55 ` Darrick J. Wong
2023-12-07  2:39 ` [PATCH 3/6] xfs: read leaf blocks when computing keys for bulkloading into node blocks Darrick J. Wong
2023-12-07  5:26 ` Christoph Hellwig
2023-12-07  2:39 ` [PATCH 4/6] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong
2023-12-07  5:26 ` Christoph Hellwig
2023-12-07  2:39 ` [PATCH 5/6] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
2023-12-07  2:39 ` [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
2023-12-07  5:27 ` Christoph Hellwig
-- strict thread matches above, loose matches on Subject: below --
2023-12-13 22:51 [PATCHSET v28.2 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
2023-12-13 22:51 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2023-07-27 22:18 [PATCHSET v26.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
2023-07-27 22:24 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2023-05-26  0:28 [PATCHSET v25.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
2023-05-26  0:45 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2023-06-21  2:05 ` Dave Chinner
2023-07-05 23:37 ` Darrick J. Wong
2022-12-30 22:12 [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
2022-12-30 22:12 ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong