* [MEGAPATCHSET v28] xfs: online repair, second part of part 1
@ 2023-11-24 23:39 Darrick J. Wong
  2023-11-24 23:44 ` [PATCHSET v28.0 0/1] xfs: prevent livelocks in xchk_iget Darrick J. Wong
                   ` (7 more replies)
  0 siblings, 8 replies; 156+ messages in thread
From: Darrick J. Wong @ 2023-11-24 23:39 UTC (permalink / raw)
To: Dave Chinner, Chandan Babu R
Cc: xfs, linux-fsdevel, Carlos Maiolino, Catherine Hoang

Hi everyone,

[this time not as a reply to v27]
[[this time really not as a reply to v27]]
[[[***** email, this way is insane]]]

I've rebased the online fsck development branches atop 6.7, applied the
changes requested during the review of v27, reworked the automatic space
reaping code to avoid open-coding EFI log item handling, and cleaned up
a few other things.  In other words, I'm formally submitting part 1 for
inclusion in 6.8.

Just like the last several submissions, I would like people to focus on
the following:

- Are the major subsystems sufficiently documented that you could figure
  out what the code does?

- Do you see any problems that are severe enough to cause long term
  support hassles?  (e.g. bad API design, writing weird metadata to disk)

- Can you spot mis-interactions between the subsystems?

- What were my blind spots in devising this feature?

- Are there missing pieces that you'd like to help build?

- Can I just merge all of this?

The one thing that is /not/ in scope for this review is requests for
more refactoring of existing subsystems.

I'm still running QA round the clock.  To spare vger, I'm only sending a
few patchsets this time.  I will of course stress test the new mailing
infrastructure on 31 Dec with a full posting, like I always do.

--D

^ permalink raw reply	[flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/1] xfs: prevent livelocks in xchk_iget
  2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong
@ 2023-11-24 23:44 ` Darrick J. Wong
  2023-11-24 23:46   ` [PATCH 1/1] xfs: make xchk_iget safer in the presence of corrupt inode btrees Darrick J. Wong
  2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 156+ messages in thread
From: Darrick J. Wong @ 2023-11-24 23:44 UTC (permalink / raw)
To: djwong; +Cc: Dave Chinner, linux-xfs

Hi all,

Prevent scrub from livelocking in xchk_iget if there's a cycle in the
inobt by allocating an empty transaction.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.  This has been running on the
djcloud for months with no problems.  Enjoy!

Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-livelock-prevention
---
 fs/xfs/scrub/common.c |    6 ++++--
 fs/xfs/scrub/common.h |   19 +++++++++++++++++++
 fs/xfs/scrub/inode.c  |    4 ++--
 3 files changed, 25 insertions(+), 4 deletions(-)

^ permalink raw reply	[flat|nested] 156+ messages in thread
* [PATCH 1/1] xfs: make xchk_iget safer in the presence of corrupt inode btrees 2023-11-24 23:44 ` [PATCHSET v28.0 0/1] xfs: prevent livelocks in xchk_iget Darrick J. Wong @ 2023-11-24 23:46 ` Darrick J. Wong 2023-11-25 4:57 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:46 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> When scrub is trying to iget an inode, ensure that it won't end up deadlocked on a cycle in the inode btree by using an empty transaction to store all the buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/scrub/common.c | 6 ++++-- fs/xfs/scrub/common.h | 19 +++++++++++++++++++ fs/xfs/scrub/inode.c | 4 ++-- 3 files changed, 25 insertions(+), 4 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index de24532fe0830..23944fcc1a6ca 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -733,6 +733,8 @@ xchk_iget( xfs_ino_t inum, struct xfs_inode **ipp) { + ASSERT(sc->tp != NULL); + return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp); } @@ -882,8 +884,8 @@ xchk_iget_for_scrubbing( if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino)) return -ENOENT; - /* Try a regular untrusted iget. */ - error = xchk_iget(sc, sc->sm->sm_ino, &ip); + /* Try a safe untrusted iget. 
*/ + error = xchk_iget_safe(sc, sc->sm->sm_ino, &ip); if (!error) return xchk_install_handle_inode(sc, ip); if (error == -ENOENT) diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index cabdc0e16838c..a39dbe6be1e59 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -157,6 +157,25 @@ int xchk_iget_agi(struct xfs_scrub *sc, xfs_ino_t inum, void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip); int xchk_install_handle_inode(struct xfs_scrub *sc, struct xfs_inode *ip); +/* + * Safe version of (untrusted) xchk_iget that uses an empty transaction to + * avoid deadlocking on loops in the inobt. + */ +static inline int +xchk_iget_safe(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp) +{ + int error; + + ASSERT(sc->tp == NULL); + + error = xchk_trans_alloc(sc, 0); + if (error) + return error; + error = xchk_iget(sc, inum, ipp); + xchk_trans_cancel(sc); + return error; +} + /* * Don't bother cross-referencing if we already found corruption or cross * referencing discrepancies. diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 889f556bc98f6..b7a93380a1ab0 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -95,8 +95,8 @@ xchk_setup_inode( if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino)) return -ENOENT; - /* Try a regular untrusted iget. */ - error = xchk_iget(sc, sc->sm->sm_ino, &ip); + /* Try a safe untrusted iget. */ + error = xchk_iget_safe(sc, sc->sm->sm_ino, &ip); if (!error) return xchk_install_handle_iscrub(sc, ip); if (error == -ENOENT) ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 1/1] xfs: make xchk_iget safer in the presence of corrupt inode btrees
  2023-11-24 23:46 ` [PATCH 1/1] xfs: make xchk_iget safer in the presence of corrupt inode btrees Darrick J. Wong
@ 2023-11-25  4:57 ` Christoph Hellwig
  2023-11-27 21:55   ` Darrick J. Wong
  0 siblings, 1 reply; 156+ messages in thread
From: Christoph Hellwig @ 2023-11-25  4:57 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Fri, Nov 24, 2023 at 03:46:54PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> When scrub is trying to iget an inode, ensure that it won't end up
> deadlocked on a cycle in the inode btree by using an empty transaction
> to store all the buffers.

My only concern here is how I'm supposed to know when to use the _safe
version or not.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 156+ messages in thread
* Re: [PATCH 1/1] xfs: make xchk_iget safer in the presence of corrupt inode btrees
  2023-11-25  4:57 ` Christoph Hellwig
@ 2023-11-27 21:55 ` Darrick J. Wong
  0 siblings, 0 replies; 156+ messages in thread
From: Darrick J. Wong @ 2023-11-27 21:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Fri, Nov 24, 2023 at 08:57:57PM -0800, Christoph Hellwig wrote:
> On Fri, Nov 24, 2023 at 03:46:54PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > When scrub is trying to iget an inode, ensure that it won't end up
> > deadlocked on a cycle in the inode btree by using an empty transaction
> > to store all the buffers.
>
> My only concern here is how I'm supposed to know when to use the _safe
> version or not.

For xchk_iget_safe, I'll amend the comment to read:

/*
 * Safe version of (untrusted) xchk_iget that uses an empty transaction to
 * avoid deadlocking on loops in the inobt.  This should only be used in a
 * scrub or repair setup routine, and only prior to grabbing a transaction.
 */

and add a comment for xchk_iget that reads:

/*
 * Grab the inode at @inum.  The caller must have created a scrub transaction
 * so that we can confirm the inumber by walking the inobt and not deadlock on
 * a loop in the inobt.
 */

> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

--D

^ permalink raw reply	[flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs
  2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong
  2023-11-24 23:44 ` [PATCHSET v28.0 0/1] xfs: prevent livelocks in xchk_iget Darrick J. Wong
@ 2023-11-24 23:45 ` Darrick J. Wong
  2023-11-24 23:47   ` [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects Darrick J. Wong
                     ` (6 more replies)
  2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong
                   ` (5 subsequent siblings)
  7 siblings, 7 replies; 156+ messages in thread
From: Darrick J. Wong @ 2023-11-24 23:45 UTC (permalink / raw)
To: djwong; +Cc: Dave Chinner, linux-xfs

Hi all,

Online repair fixes metadata structures by writing a new copy out to
disk and atomically committing the new structure into the filesystem.
For this to work, we need to reserve all the space we're going to need
ahead of time so that the atomic commit transaction is as small as
possible.  We also require the reserved space to be freed if the system
goes down, or if we decide not to commit the repair, or if we reserve
too much space.

To keep the atomic commit transaction as small as possible, we would
like to allocate some space and simultaneously schedule automatic
reaping of the reserved space, even on log recovery.  EFIs are the
mechanism to get us there, but we need to use them in a novel manner.
Once we allocate the space, we want to hold on to the EFI (relogging as
necessary) until we can commit or cancel the repair.  EFIs for written
committed blocks need to go away, but unwritten or uncommitted blocks
can be freed like normal.

Earlier versions of this patchset directly manipulated the log items,
but Dave thought that to be a layering violation.  For v27, I've
modified the defer ops handling code to be capable of pausing a
deferred work item.  Log intent items are created as they always have
been, but paused items are pushed onto a side list when finishing
deferred work items, and pushed back onto the transaction after that.
Log intent done items are not created for paused work.

The second part adds a "stale" flag to the EFI so that the repair
reservation code can dispose of an EFI the normal way, but without the
space actually being freed.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.  This has been running on the
djcloud for months with no problems.  Enjoy!

Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-auto-reap-space-reservations

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-auto-reap-space-reservations
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_ag.c             |    2 
 fs/xfs/libxfs/xfs_alloc.c          |  104 +++++++
 fs/xfs/libxfs/xfs_alloc.h          |   22 +-
 fs/xfs/libxfs/xfs_bmap.c           |    4 
 fs/xfs/libxfs/xfs_bmap_btree.c     |    2 
 fs/xfs/libxfs/xfs_btree_staging.h  |    7 
 fs/xfs/libxfs/xfs_defer.c          |  229 ++++++++++++++--
 fs/xfs/libxfs/xfs_defer.h          |   20 +
 fs/xfs/libxfs/xfs_ialloc.c         |    5 
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    2 
 fs/xfs/libxfs/xfs_refcount.c       |    6 
 fs/xfs/libxfs/xfs_refcount_btree.c |    2 
 fs/xfs/scrub/agheader_repair.c     |    1 
 fs/xfs/scrub/common.c              |    1 
 fs/xfs/scrub/newbt.c               |  510 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/newbt.h               |   65 +++++
 fs/xfs/scrub/reap.c                |    7 
 fs/xfs/scrub/scrub.c               |    2 
 fs/xfs/scrub/trace.h               |   37 +++
 fs/xfs/xfs_extfree_item.c          |   12 -
 fs/xfs/xfs_reflink.c               |    2 
 fs/xfs/xfs_trace.h                 |   13 +
 23 files changed, 990 insertions(+), 66 deletions(-)
 create mode 100644 fs/xfs/scrub/newbt.c
 create mode 100644 fs/xfs/scrub/newbt.h

^ permalink raw reply	[flat|nested] 156+ messages in thread
* [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong @ 2023-11-24 23:47 ` Darrick J. Wong 2023-11-25 5:04 ` Christoph Hellwig 2023-11-24 23:47 ` [PATCH 2/7] xfs: allow pausing of pending deferred work items Darrick J. Wong ` (5 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:47 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> When someone tries to add a deferred work item to xfs_defer_add, it will try to attach the work item to the most recently added xfs_defer_pending object attached to the transaction. However, it doesn't check if the pending object has a log intent item attached to it. This is incorrect behavior because we cannot add more work to an object that has already been committed to the ondisk log. Therefore, change the behavior not to append to pending items with a non null dfp_intent. In practice this has not been an issue because the only way xfs_defer_add gets called after log intent items have been committed is from the defer ops ->finish_item functions themselves, and the @dop_pending isolation in xfs_defer_finish_noroll protects the pending items that have already been logged. However, the next patch will add the ability to pause a deferred extent free object during online btree rebuilding, and any new extfree work items need to have their own pending event. While we're at it, hoist the predicate to its own static inline function for readability. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_defer.c | 48 ++++++++++++++++++++++++++++++++++----------- 1 file changed, 36 insertions(+), 12 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index f71679ce23b95..6c283b30ea054 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -624,6 +624,40 @@ xfs_defer_cancel( xfs_defer_cancel_list(mp, &tp->t_dfops); } +/* + * Decide if we can add a deferred work item to the last dfops item attached + * to the transaction. + */ +static inline struct xfs_defer_pending * +xfs_defer_try_append( + struct xfs_trans *tp, + enum xfs_defer_ops_type type, + const struct xfs_defer_op_type *ops) +{ + struct xfs_defer_pending *dfp = NULL; + + /* No dfops at all? */ + if (list_empty(&tp->t_dfops)) + return NULL; + + dfp = list_last_entry(&tp->t_dfops, struct xfs_defer_pending, + dfp_list); + + /* Wrong type? */ + if (dfp->dfp_type != type) + return NULL; + + /* Already logged? */ + if (dfp->dfp_intent) + return NULL; + + /* Already full? */ + if (ops->max_items && dfp->dfp_count >= ops->max_items) + return NULL; + + return dfp; +} + /* Add an item for later deferred processing. */ void xfs_defer_add( @@ -637,19 +671,9 @@ xfs_defer_add( ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX); - /* - * Add the item to a pending item at the end of the intake list. - * If the last pending item has the same type, reuse it. Else, - * create a new pending item at the end of the intake list. - */ - if (!list_empty(&tp->t_dfops)) { - dfp = list_last_entry(&tp->t_dfops, - struct xfs_defer_pending, dfp_list); - if (dfp->dfp_type != type || - (ops->max_items && dfp->dfp_count >= ops->max_items)) - dfp = NULL; - } + dfp = xfs_defer_try_append(tp, type, ops); if (!dfp) { + /* Create a new pending item at the end of the intake list. 
*/ dfp = kmem_cache_zalloc(xfs_defer_pending_cache, GFP_NOFS | __GFP_NOFAIL); dfp->dfp_type = type; ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects
  2023-11-24 23:47 ` [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects Darrick J. Wong
@ 2023-11-25  5:04 ` Christoph Hellwig
  0 siblings, 0 replies; 156+ messages in thread
From: Christoph Hellwig @ 2023-11-25  5:04 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

This looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

To make the code nicer for the later addition of the barrier defer ops
I'd fold the hunk below to split xfs_defer_try_append, but we could also
do that later:

index 6c283b30ea054a..7be2f9063e0ded 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -624,17 +624,12 @@ xfs_defer_cancel(
 	xfs_defer_cancel_list(mp, &tp->t_dfops);
 }
 
-/*
- * Decide if we can add a deferred work item to the last dfops item attached
- * to the transaction.
- */
 static inline struct xfs_defer_pending *
-xfs_defer_try_append(
+xfs_defer_find(
 	struct xfs_trans		*tp,
-	enum xfs_defer_ops_type		type,
-	const struct xfs_defer_op_type	*ops)
+	enum xfs_defer_ops_type		type)
 {
-	struct xfs_defer_pending	*dfp = NULL;
+	struct xfs_defer_pending	*dfp;
 
 	/* No dfops at all? */
 	if (list_empty(&tp->t_dfops))
@@ -646,16 +641,25 @@ xfs_defer_try_append(
 	/* Wrong type? */
 	if (dfp->dfp_type != type)
 		return NULL;
+	return dfp;
+}
 
+/*
+ * Decide if we can add a deferred work item to the last dfops item attached
+ * to the transaction.
+ */
+static inline bool
+xfs_defer_can_append(
+	struct xfs_defer_pending	*dfp,
+	const struct xfs_defer_op_type	*ops)
+{
 	/* Already logged? */
 	if (dfp->dfp_intent)
-		return NULL;
-
+		return false;
 	/* Already full? */
 	if (ops->max_items && dfp->dfp_count >= ops->max_items)
 		return NULL;
-
-	return dfp;
+	return true;
 }
 
 /* Add an item for later deferred processing. */
@@ -671,8 +675,8 @@ xfs_defer_add(
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 	BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX);
 
-	dfp = xfs_defer_try_append(tp, type, ops);
-	if (!dfp) {
+	dfp = xfs_defer_find(tp, type);
+	if (!dfp || !xfs_defer_can_append(dfp, ops)) {
 		/* Create a new pending item at the end of the intake list. */
 		dfp = kmem_cache_zalloc(xfs_defer_pending_cache,
 				GFP_NOFS | __GFP_NOFAIL);

^ permalink raw reply related	[flat|nested] 156+ messages in thread
* [PATCH 2/7] xfs: allow pausing of pending deferred work items 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong 2023-11-24 23:47 ` [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects Darrick J. Wong @ 2023-11-24 23:47 ` Darrick J. Wong 2023-11-25 5:05 ` Christoph Hellwig 2023-11-24 23:47 ` [PATCH 3/7] xfs: remove __xfs_free_extent_later Darrick J. Wong ` (4 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:47 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Traditionally, all pending deferred work attached to a transaction is finished when one of the xfs_defer_finish* functions is called. However, online repair wants to be able to allocate space for a new data structure, format a new metadata structure into the allocated space, and commit that into the filesystem. As a hedge against system crashes during repairs, we also want to log some EFI items for the allocated space speculatively, and cancel them if we elect to commit the new data structure. Therefore, introduce the idea of pausing a pending deferred work item. Log intent items are still created for paused items and relogged as necessary. However, paused items are pushed onto a side list before we start calling ->finish_item, and the whole list is reattach to the transaction afterwards. New work items are never attached to paused pending items. Modify xfs_defer_cancel to clean up pending deferred work items holding a log intent item but not a log intent done item, since that is now possible. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_defer.c | 98 +++++++++++++++++++++++++++++++++++++++------ fs/xfs/libxfs/xfs_defer.h | 17 +++++++- fs/xfs/xfs_trace.h | 13 +++++- 3 files changed, 112 insertions(+), 16 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index 6c283b30ea054..6604eb50058ba 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -417,7 +417,7 @@ xfs_defer_cancel_list( * done item to release the intent item; and then log a new intent item. * The caller should provide a fresh transaction and roll it after we're done. */ -static int +static void xfs_defer_relog( struct xfs_trans **tpp, struct list_head *dfops) @@ -458,10 +458,6 @@ xfs_defer_relog( XFS_STATS_INC((*tpp)->t_mountp, defer_relog); dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp); } - - if ((*tpp)->t_flags & XFS_TRANS_DIRTY) - return xfs_defer_trans_roll(tpp); - return 0; } /* @@ -517,6 +513,24 @@ xfs_defer_finish_one( return error; } +/* Move all paused deferred work from @tp to @paused_list. */ +static void +xfs_defer_isolate_paused( + struct xfs_trans *tp, + struct list_head *paused_list) +{ + struct xfs_defer_pending *dfp; + struct xfs_defer_pending *pli; + + list_for_each_entry_safe(dfp, pli, &tp->t_dfops, dfp_list) { + if (!(dfp->dfp_flags & XFS_DEFER_PAUSED)) + continue; + + list_move_tail(&dfp->dfp_list, paused_list); + trace_xfs_defer_isolate_paused(tp->t_mountp, dfp); + } +} + /* * Finish all the pending work. 
This involves logging intent items for * any work items that wandered in since the last transaction roll (if @@ -532,6 +546,7 @@ xfs_defer_finish_noroll( struct xfs_defer_pending *dfp = NULL; int error = 0; LIST_HEAD(dop_pending); + LIST_HEAD(dop_paused); ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES); @@ -550,6 +565,8 @@ xfs_defer_finish_noroll( */ int has_intents = xfs_defer_create_intents(*tp); + xfs_defer_isolate_paused(*tp, &dop_paused); + list_splice_init(&(*tp)->t_dfops, &dop_pending); if (has_intents < 0) { @@ -562,22 +579,33 @@ xfs_defer_finish_noroll( goto out_shutdown; /* Relog intent items to keep the log moving. */ - error = xfs_defer_relog(tp, &dop_pending); - if (error) - goto out_shutdown; + xfs_defer_relog(tp, &dop_pending); + xfs_defer_relog(tp, &dop_paused); + + if ((*tp)->t_flags & XFS_TRANS_DIRTY) { + error = xfs_defer_trans_roll(tp); + if (error) + goto out_shutdown; + } } - dfp = list_first_entry(&dop_pending, struct xfs_defer_pending, - dfp_list); + dfp = list_first_entry_or_null(&dop_pending, + struct xfs_defer_pending, dfp_list); + if (!dfp) + break; error = xfs_defer_finish_one(*tp, dfp); if (error && error != -EAGAIN) goto out_shutdown; } + /* Requeue the paused items in the outgoing transaction. */ + list_splice_tail_init(&dop_paused, &(*tp)->t_dfops); + trace_xfs_defer_finish_done(*tp, _RET_IP_); return 0; out_shutdown: + list_splice_tail_init(&dop_paused, &dop_pending); xfs_defer_trans_abort(*tp, &dop_pending); xfs_force_shutdown((*tp)->t_mountp, SHUTDOWN_CORRUPT_INCORE); trace_xfs_defer_finish_error(*tp, error); @@ -590,6 +618,9 @@ int xfs_defer_finish( struct xfs_trans **tp) { +#ifdef DEBUG + struct xfs_defer_pending *dfp; +#endif int error; /* @@ -609,7 +640,10 @@ xfs_defer_finish( } /* Reset LOWMODE now that we've finished all the dfops. 
*/ - ASSERT(list_empty(&(*tp)->t_dfops)); +#ifdef DEBUG + list_for_each_entry(dfp, &(*tp)->t_dfops, dfp_list) + ASSERT(dfp->dfp_flags & XFS_DEFER_PAUSED); +#endif (*tp)->t_flags &= ~XFS_TRANS_LOWMODE; return 0; } @@ -621,6 +655,7 @@ xfs_defer_cancel( struct xfs_mount *mp = tp->t_mountp; trace_xfs_defer_cancel(tp, _RET_IP_); + xfs_defer_trans_abort(tp, &tp->t_dfops); xfs_defer_cancel_list(mp, &tp->t_dfops); } @@ -651,6 +686,10 @@ xfs_defer_try_append( if (dfp->dfp_intent) return NULL; + /* Paused items cannot absorb more work */ + if (dfp->dfp_flags & XFS_DEFER_PAUSED) + return NULL; + /* Already full? */ if (ops->max_items && dfp->dfp_count >= ops->max_items) return NULL; @@ -659,7 +698,7 @@ xfs_defer_try_append( } /* Add an item for later deferred processing. */ -void +struct xfs_defer_pending * xfs_defer_add( struct xfs_trans *tp, enum xfs_defer_ops_type type, @@ -687,6 +726,8 @@ xfs_defer_add( list_add_tail(li, &dfp->dfp_work); trace_xfs_defer_add_item(tp->t_mountp, dfp, li); dfp->dfp_count++; + + return dfp; } /* @@ -962,3 +1003,36 @@ xfs_defer_destroy_item_caches(void) xfs_rmap_intent_destroy_cache(); xfs_defer_destroy_cache(); } + +/* + * Mark a deferred work item so that it will be requeued indefinitely without + * being finished. Caller must ensure there are no data dependencies on this + * work item in the meantime. + */ +void +xfs_defer_item_pause( + struct xfs_trans *tp, + struct xfs_defer_pending *dfp) +{ + ASSERT(!(dfp->dfp_flags & XFS_DEFER_PAUSED)); + + dfp->dfp_flags |= XFS_DEFER_PAUSED; + + trace_xfs_defer_item_pause(tp->t_mountp, dfp); +} + +/* + * Release a paused deferred work item so that it will be finished during the + * next transaction roll. 
+ */ +void +xfs_defer_item_unpause( + struct xfs_trans *tp, + struct xfs_defer_pending *dfp) +{ + ASSERT(dfp->dfp_flags & XFS_DEFER_PAUSED); + + dfp->dfp_flags &= ~XFS_DEFER_PAUSED; + + trace_xfs_defer_item_unpause(tp->t_mountp, dfp); +} diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 8788ad5f6a731..094ff9062b251 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -34,11 +34,24 @@ struct xfs_defer_pending { struct xfs_log_item *dfp_intent; /* log intent item */ struct xfs_log_item *dfp_done; /* log done item */ unsigned int dfp_count; /* # extent items */ + unsigned int dfp_flags; enum xfs_defer_ops_type dfp_type; }; -void xfs_defer_add(struct xfs_trans *tp, enum xfs_defer_ops_type type, - struct list_head *h); +/* + * Create a log intent item for this deferred item, but don't actually finish + * the work. Caller must clear this before the final transaction commit. + */ +#define XFS_DEFER_PAUSED (1U << 0) + +#define XFS_DEFER_PENDING_STRINGS \ + { XFS_DEFER_PAUSED, "paused" } + +void xfs_defer_item_pause(struct xfs_trans *tp, struct xfs_defer_pending *dfp); +void xfs_defer_item_unpause(struct xfs_trans *tp, struct xfs_defer_pending *dfp); + +struct xfs_defer_pending *xfs_defer_add(struct xfs_trans *tp, + enum xfs_defer_ops_type type, struct list_head *h); int xfs_defer_finish_noroll(struct xfs_trans **tp); int xfs_defer_finish(struct xfs_trans **tp); void xfs_defer_cancel(struct xfs_trans *); diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 3926cf7f2a6ed..514095b6ba2bd 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -2551,6 +2551,7 @@ DECLARE_EVENT_CLASS(xfs_defer_pending_class, __field(dev_t, dev) __field(int, type) __field(void *, intent) + __field(unsigned int, flags) __field(char, committed) __field(int, nr) ), @@ -2558,13 +2559,15 @@ DECLARE_EVENT_CLASS(xfs_defer_pending_class, __entry->dev = mp ? 
mp->m_super->s_dev : 0; __entry->type = dfp->dfp_type; __entry->intent = dfp->dfp_intent; + __entry->flags = dfp->dfp_flags; __entry->committed = dfp->dfp_done != NULL; __entry->nr = dfp->dfp_count; ), - TP_printk("dev %d:%d optype %d intent %p committed %d nr %d", + TP_printk("dev %d:%d optype %d intent %p flags %s committed %d nr %d", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->type, __entry->intent, + __print_flags(__entry->flags, "|", XFS_DEFER_PENDING_STRINGS), __entry->committed, __entry->nr) ) @@ -2675,6 +2678,9 @@ DEFINE_DEFER_PENDING_EVENT(xfs_defer_cancel_list); DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish); DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort); DEFINE_DEFER_PENDING_EVENT(xfs_defer_relog_intent); +DEFINE_DEFER_PENDING_EVENT(xfs_defer_isolate_paused); +DEFINE_DEFER_PENDING_EVENT(xfs_defer_item_pause); +DEFINE_DEFER_PENDING_EVENT(xfs_defer_item_unpause); #define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer); @@ -2692,6 +2698,7 @@ DECLARE_EVENT_CLASS(xfs_defer_pending_item_class, __field(void *, intent) __field(void *, item) __field(char, committed) + __field(unsigned int, flags) __field(int, nr) ), TP_fast_assign( @@ -2700,13 +2707,15 @@ DECLARE_EVENT_CLASS(xfs_defer_pending_item_class, __entry->intent = dfp->dfp_intent; __entry->item = item; __entry->committed = dfp->dfp_done != NULL; + __entry->flags = dfp->dfp_flags; __entry->nr = dfp->dfp_count; ), - TP_printk("dev %d:%d optype %d intent %p item %p committed %d nr %d", + TP_printk("dev %d:%d optype %d intent %p item %p flags %s committed %d nr %d", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->type, __entry->intent, __entry->item, + __print_flags(__entry->flags, "|", XFS_DEFER_PENDING_STRINGS), __entry->committed, __entry->nr) ) ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 2/7] xfs: allow pausing of pending deferred work items
  2023-11-24 23:47 ` [PATCH 2/7] xfs: allow pausing of pending deferred work items Darrick J. Wong
@ 2023-11-25  5:05 ` Christoph Hellwig
  0 siblings, 0 replies; 156+ messages in thread
From: Christoph Hellwig @ 2023-11-25  5:05 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 156+ messages in thread
* [PATCH 3/7] xfs: remove __xfs_free_extent_later 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong 2023-11-24 23:47 ` [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects Darrick J. Wong 2023-11-24 23:47 ` [PATCH 2/7] xfs: allow pausing of pending deferred work items Darrick J. Wong @ 2023-11-24 23:47 ` Darrick J. Wong 2023-11-25 5:05 ` Christoph Hellwig 2023-11-24 23:47 ` [PATCH 4/7] xfs: automatic freeing of freshly allocated unwritten space Darrick J. Wong ` (3 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:47 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> xfs_free_extent_later is a trivial helper, so remove it to reduce the amount of thinking required to understand the deferred freeing interface. This will make it easier to introduce automatic reaping of speculative allocations in the next patch. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_ag.c | 2 +- fs/xfs/libxfs/xfs_alloc.c | 2 +- fs/xfs/libxfs/xfs_alloc.h | 14 +------------- fs/xfs/libxfs/xfs_bmap.c | 4 ++-- fs/xfs/libxfs/xfs_bmap_btree.c | 2 +- fs/xfs/libxfs/xfs_ialloc.c | 5 +++-- fs/xfs/libxfs/xfs_ialloc_btree.c | 2 +- fs/xfs/libxfs/xfs_refcount.c | 6 +++--- fs/xfs/libxfs/xfs_refcount_btree.c | 2 +- fs/xfs/scrub/reap.c | 2 +- fs/xfs/xfs_extfree_item.c | 2 +- fs/xfs/xfs_reflink.c | 2 +- 12 files changed, 17 insertions(+), 28 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index f9f4d694640d0..f62ff125a50ac 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -984,7 +984,7 @@ xfs_ag_shrink_space( if (err2 != -ENOSPC) goto resv_err; - err2 = __xfs_free_extent_later(*tpp, args.fsbno, delta, NULL, + err2 = xfs_free_extent_later(*tpp, args.fsbno, delta, NULL, XFS_AG_RESV_NONE, true); if (err2) goto resv_err; diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 100ab5931b313..c35224ad9428a 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2523,7 +2523,7 @@ xfs_defer_agfl_block( * The list is maintained sorted (by block number). 
*/ int -__xfs_free_extent_later( +xfs_free_extent_later( struct xfs_trans *tp, xfs_fsblock_t bno, xfs_filblks_t len, diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 6bb8d295c321d..6b95d1d8a8537 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -231,7 +231,7 @@ xfs_buf_to_agfl_bno( return bp->b_addr; } -int __xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno, +int xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno, xfs_filblks_t len, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type, bool skip_discard); @@ -256,18 +256,6 @@ void xfs_extent_free_get_group(struct xfs_mount *mp, #define XFS_EFI_ATTR_FORK (1U << 1) /* freeing attr fork block */ #define XFS_EFI_BMBT_BLOCK (1U << 2) /* freeing bmap btree block */ -static inline int -xfs_free_extent_later( - struct xfs_trans *tp, - xfs_fsblock_t bno, - xfs_filblks_t len, - const struct xfs_owner_info *oinfo, - enum xfs_ag_resv_type type) -{ - return __xfs_free_extent_later(tp, bno, len, oinfo, type, false); -} - - extern struct kmem_cache *xfs_extfree_item_cache; int __init xfs_extfree_intent_init_cache(void); diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index be62acffad6cc..68be1dd4f0f26 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -575,7 +575,7 @@ xfs_bmap_btree_to_extents( xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork); error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo, - XFS_AG_RESV_NONE); + XFS_AG_RESV_NONE, false); if (error) return error; @@ -5218,7 +5218,7 @@ xfs_bmap_del_extent_real( if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) { xfs_refcount_decrease_extent(tp, del); } else { - error = __xfs_free_extent_later(tp, del->br_startblock, + error = xfs_free_extent_later(tp, del->br_startblock, del->br_blockcount, NULL, XFS_AG_RESV_NONE, ((bflags & XFS_BMAPI_NODISCARD) || diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c index 
bf3f1b36fdd23..8360256cff168 100644 --- a/fs/xfs/libxfs/xfs_bmap_btree.c +++ b/fs/xfs/libxfs/xfs_bmap_btree.c @@ -272,7 +272,7 @@ xfs_bmbt_free_block( xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_ino.whichfork); error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo, - XFS_AG_RESV_NONE); + XFS_AG_RESV_NONE, false); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index b83e54c709069..d61d03e5b853b 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -1854,7 +1854,7 @@ xfs_difree_inode_chunk( return xfs_free_extent_later(tp, XFS_AGB_TO_FSB(mp, agno, sagbno), M_IGEO(mp)->ialloc_blks, &XFS_RMAP_OINFO_INODES, - XFS_AG_RESV_NONE); + XFS_AG_RESV_NONE, false); } /* holemask is only 16-bits (fits in an unsigned long) */ @@ -1900,7 +1900,8 @@ xfs_difree_inode_chunk( ASSERT(contigblk % mp->m_sb.sb_spino_align == 0); error = xfs_free_extent_later(tp, XFS_AGB_TO_FSB(mp, agno, agbno), contigblk, - &XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE); + &XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE, + false); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index 9258f01c0015e..42a5e1f227a05 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -161,7 +161,7 @@ __xfs_inobt_free_block( xfs_inobt_mod_blockcount(cur, -1); fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)); return xfs_free_extent_later(cur->bc_tp, fsbno, 1, - &XFS_RMAP_OINFO_INOBT, resv); + &XFS_RMAP_OINFO_INOBT, resv, false); } STATIC int diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index 646b3fa362ad0..3702b4a071100 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -1153,7 +1153,7 @@ xfs_refcount_adjust_extents( tmp.rc_startblock); error = xfs_free_extent_later(cur->bc_tp, fsbno, tmp.rc_blockcount, NULL, - XFS_AG_RESV_NONE); + XFS_AG_RESV_NONE, false); if (error) goto out_error; } @@ -1215,7 +1215,7 @@ 
xfs_refcount_adjust_extents( ext.rc_startblock); error = xfs_free_extent_later(cur->bc_tp, fsbno, ext.rc_blockcount, NULL, - XFS_AG_RESV_NONE); + XFS_AG_RESV_NONE, false); if (error) goto out_error; } @@ -1985,7 +1985,7 @@ xfs_refcount_recover_cow_leftovers( /* Free the block. */ error = xfs_free_extent_later(tp, fsb, rr->rr_rrec.rc_blockcount, NULL, - XFS_AG_RESV_NONE); + XFS_AG_RESV_NONE, false); if (error) goto out_trans; diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index 5c3987d8dc242..3fa795e2488dd 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -112,7 +112,7 @@ xfs_refcountbt_free_block( be32_add_cpu(&agf->agf_refcount_blocks, -1); xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS); return xfs_free_extent_later(cur->bc_tp, fsbno, 1, - &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA, false); } STATIC int diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c index 86a62420e02c6..78c9f2085db46 100644 --- a/fs/xfs/scrub/reap.c +++ b/fs/xfs/scrub/reap.c @@ -410,7 +410,7 @@ xreap_agextent_iter( * Use deferred frees to get rid of the old btree blocks to try to * minimize the window in which we could crash and lose the old blocks. 
*/ - error = __xfs_free_extent_later(sc->tp, fsbno, *aglenp, rs->oinfo, + error = xfs_free_extent_later(sc->tp, fsbno, *aglenp, rs->oinfo, rs->resv, true); if (error) return error; diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 3fa8789820ad9..9e7b58f3566c0 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -717,7 +717,7 @@ xfs_efi_item_recover( error = xfs_free_extent_later(tp, fake.xefi_startblock, fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER, - fake.xefi_agresv); + fake.xefi_agresv, false); if (!error) { requeue_only = true; continue; diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index e5b62dc284664..d5ca8bcae65b6 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -618,7 +618,7 @@ xfs_reflink_cancel_cow_blocks( error = xfs_free_extent_later(*tpp, del.br_startblock, del.br_blockcount, NULL, - XFS_AG_RESV_NONE); + XFS_AG_RESV_NONE, false); if (error) break; ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 3/7] xfs: remove __xfs_free_extent_later 2023-11-24 23:47 ` [PATCH 3/7] xfs: remove __xfs_free_extent_later Darrick J. Wong @ 2023-11-25 5:05 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 4/7] xfs: automatic freeing of freshly allocated unwritten space 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:47 ` [PATCH 3/7] xfs: remove __xfs_free_extent_later Darrick J. Wong @ 2023-11-24 23:47 ` Darrick J. Wong 2023-11-25 5:06 ` Christoph Hellwig 2023-11-24 23:48 ` [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong ` (2 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:47 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> As mentioned in the previous commit, online repair wants to allocate space to write out a new metadata structure, and it also wants to hedge against system crashes during repairs by logging (and later cancelling) EFIs to free the space if we crash before committing the new data structure. Therefore, create a trio of functions to schedule automatic reaping of freshly allocated unwritten space. xfs_alloc_schedule_autoreap creates a paused EFI representing the space we just allocated. Once the allocations are made and the autoreaps scheduled, we can start writing to disk. If the writes succeed, xfs_alloc_cancel_autoreap marks the EFI work items as stale and unpauses the pending deferred work item. Assuming that's done in the same transaction that commits the new structure into the filesystem, we guarantee that either the new object is fully visible, or that all the space gets reclaimed. If the writes succeed but only part of an extent was used, repair must call the same _cancel_autoreap function to kill the first EFI and then log a new EFI to free the unused space. The first EFI is already committed, so it cannot be changed. For full extents that aren't used, xfs_alloc_commit_autoreap will unpause the EFI, which results in the space being freed during the next _defer_finish cycle. 
Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_alloc.c | 104 +++++++++++++++++++++++++++++++++++++++++++-- fs/xfs/libxfs/xfs_alloc.h | 12 +++++ fs/xfs/xfs_extfree_item.c | 10 +++- 3 files changed, 118 insertions(+), 8 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index c35224ad9428a..4940f9377f21a 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2522,14 +2522,15 @@ xfs_defer_agfl_block( * Add the extent to the list of extents to be free at transaction end. * The list is maintained sorted (by block number). */ -int -xfs_free_extent_later( +static int +xfs_defer_extent_free( struct xfs_trans *tp, xfs_fsblock_t bno, xfs_filblks_t len, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type, - bool skip_discard) + bool skip_discard, + struct xfs_defer_pending **dfpp) { struct xfs_extent_free_item *xefi; struct xfs_mount *mp = tp->t_mountp; @@ -2577,10 +2578,105 @@ xfs_free_extent_later( XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len); xfs_extent_free_get_group(mp, xefi); - xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list); + *dfpp = xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list); return 0; } +int +xfs_free_extent_later( + struct xfs_trans *tp, + xfs_fsblock_t bno, + xfs_filblks_t len, + const struct xfs_owner_info *oinfo, + enum xfs_ag_resv_type type, + bool skip_discard) +{ + struct xfs_defer_pending *dontcare = NULL; + + return xfs_defer_extent_free(tp, bno, len, oinfo, type, skip_discard, + &dontcare); +} + +/* + * Set up automatic freeing of unwritten space in the filesystem. + * + * This function attached a paused deferred extent free item to the + * transaction. Pausing means that the EFI will be logged in the next + * transaction commit, but the pending EFI will not be finished until the + * pending item is unpaused. 
+ * + * If the system goes down after the EFI has been persisted to the log but + * before the pending item is unpaused, log recovery will find the EFI, fail to + * find the EFD, and free the space. + * + * If the pending item is unpaused, the next transaction commit will log an EFD + * without freeing the space. + * + * Caller must ensure that the tp, fsbno, len, oinfo, and resv flags of the + * @args structure are set to the relevant values. + */ +int +xfs_alloc_schedule_autoreap( + const struct xfs_alloc_arg *args, + bool skip_discard, + struct xfs_alloc_autoreap *aarp) +{ + int error; + + error = xfs_defer_extent_free(args->tp, args->fsbno, args->len, + &args->oinfo, args->resv, skip_discard, &aarp->dfp); + if (error) + return error; + + xfs_defer_item_pause(args->tp, aarp->dfp); + return 0; +} + +/* + * Cancel automatic freeing of unwritten space in the filesystem. + * + * Earlier, we created a paused deferred extent free item and attached it to + * this transaction so that we could automatically roll back a new space + * allocation if the system went down. Now we want to cancel the paused work + * item by marking the EFI stale so we don't actually free the space, unpausing + * the pending item and logging an EFD. + * + * The caller generally should have already mapped the space into the ondisk + * filesystem. If the reserved space was partially used, the caller must call + * xfs_free_extent_later to create a new EFI to free the unused space. + */ +void +xfs_alloc_cancel_autoreap( + struct xfs_trans *tp, + struct xfs_alloc_autoreap *aarp) +{ + struct xfs_defer_pending *dfp = aarp->dfp; + struct xfs_extent_free_item *xefi; + + if (!dfp) + return; + + list_for_each_entry(xefi, &dfp->dfp_work, xefi_list) + xefi->xefi_flags |= XFS_EFI_CANCELLED; + + xfs_defer_item_unpause(tp, dfp); +} + +/* + * Commit automatic freeing of unwritten space in the filesystem. + * + * This unpauses an earlier _schedule_autoreap and commits to freeing the + * allocated space. 
Call this if none of the reserved space was used. + */ +void +xfs_alloc_commit_autoreap( + struct xfs_trans *tp, + struct xfs_alloc_autoreap *aarp) +{ + if (aarp->dfp) + xfs_defer_item_unpause(tp, aarp->dfp); +} + #ifdef DEBUG /* * Check if an AGF has a free extent record whose length is equal to diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 6b95d1d8a8537..851cafbd64494 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -255,6 +255,18 @@ void xfs_extent_free_get_group(struct xfs_mount *mp, #define XFS_EFI_SKIP_DISCARD (1U << 0) /* don't issue discard */ #define XFS_EFI_ATTR_FORK (1U << 1) /* freeing attr fork block */ #define XFS_EFI_BMBT_BLOCK (1U << 2) /* freeing bmap btree block */ +#define XFS_EFI_CANCELLED (1U << 3) /* dont actually free the space */ + +struct xfs_alloc_autoreap { + struct xfs_defer_pending *dfp; +}; + +int xfs_alloc_schedule_autoreap(const struct xfs_alloc_arg *args, + bool skip_discard, struct xfs_alloc_autoreap *aarp); +void xfs_alloc_cancel_autoreap(struct xfs_trans *tp, + struct xfs_alloc_autoreap *aarp); +void xfs_alloc_commit_autoreap(struct xfs_trans *tp, + struct xfs_alloc_autoreap *aarp); extern struct kmem_cache *xfs_extfree_item_cache; diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 9e7b58f3566c0..4af5b3b338b19 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -381,7 +381,7 @@ xfs_trans_free_extent( uint next_extent; xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock); - int error; + int error = 0; oinfo.oi_owner = xefi->xefi_owner; if (xefi->xefi_flags & XFS_EFI_ATTR_FORK) @@ -392,9 +392,11 @@ xfs_trans_free_extent( trace_xfs_bmap_free_deferred(tp->t_mountp, xefi->xefi_pag->pag_agno, 0, agbno, xefi->xefi_blockcount); - error = __xfs_free_extent(tp, xefi->xefi_pag, agbno, - xefi->xefi_blockcount, &oinfo, xefi->xefi_agresv, - xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); + if (!(xefi->xefi_flags & XFS_EFI_CANCELLED)) + error 
= __xfs_free_extent(tp, xefi->xefi_pag, agbno, + xefi->xefi_blockcount, &oinfo, + xefi->xefi_agresv, + xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); /* * Mark the transaction dirty, even on error. This ensures the ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 4/7] xfs: automatic freeing of freshly allocated unwritten space 2023-11-24 23:47 ` [PATCH 4/7] xfs: automatic freeing of freshly allocated unwritten space Darrick J. Wong @ 2023-11-25 5:06 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:06 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong ` (3 preceding siblings ...) 2023-11-24 23:47 ` [PATCH 4/7] xfs: automatic freeing of freshly allocated unwritten space Darrick J. Wong @ 2023-11-24 23:48 ` Darrick J. Wong 2023-11-26 13:14 ` Christoph Hellwig 2023-11-24 23:48 ` [PATCH 6/7] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong 2023-11-24 23:48 ` [PATCH 7/7] xfs: force small EFIs for reaping btree extents Darrick J. Wong 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:48 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a new xrep_newbt structure to encapsulate a fake root for creating a staged btree cursor as well as to track all the blocks that we need to reserve in order to build that btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_btree_staging.h | 7 - fs/xfs/scrub/agheader_repair.c | 1 fs/xfs/scrub/common.c | 1 fs/xfs/scrub/newbt.c | 492 +++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/newbt.h | 62 +++++ fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 37 +++ 8 files changed, 598 insertions(+), 5 deletions(-) create mode 100644 fs/xfs/scrub/newbt.c create mode 100644 fs/xfs/scrub/newbt.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 7762c01a85cfb..1537d66e5ab01 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -181,6 +181,7 @@ xfs-$(CONFIG_XFS_QUOTA) += scrub/quota.o ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) xfs-y += $(addprefix scrub/, \ agheader_repair.o \ + newbt.o \ reap.o \ repair.o \ ) diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h index f0d2976050aea..d6dea3f0088c6 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.h +++ 
b/fs/xfs/libxfs/xfs_btree_staging.h @@ -38,11 +38,8 @@ struct xbtree_ifakeroot { /* Number of bytes available for this fork in the inode. */ unsigned int if_fork_size; - /* Fork format. */ - unsigned int if_format; - - /* Number of records. */ - unsigned int if_extents; + /* Which fork is this btree being built for? */ + int if_whichfork; }; /* Cursor interactions with fake roots for inode-rooted btrees. */ diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index 876a2f41b0637..36c511f96b004 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -10,6 +10,7 @@ #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_btree.h" +#include "xfs_btree_staging.h" #include "xfs_log_format.h" #include "xfs_trans.h" #include "xfs_sb.h" diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 23944fcc1a6ca..4bba3c49f8c59 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -10,6 +10,7 @@ #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_btree.h" +#include "xfs_btree_staging.h" #include "xfs_log_format.h" #include "xfs_trans.h" #include "xfs_inode.h" diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c new file mode 100644 index 0000000000000..4e8d6637426e4 --- /dev/null +++ b/fs/xfs/scrub/newbt.c @@ -0,0 +1,492 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_alloc.h" +#include "xfs_rmap.h" +#include "xfs_ag.h" +#include "xfs_defer.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/newbt.h" + +/* + * Estimate proper slack values for a btree that's being reloaded. + * + * Under most circumstances, we'll take whatever default loading value the + * btree bulk loading code calculates for us. However, there are some + * exceptions to this rule: + * + * (1) If someone turned one of the debug knobs. + * (2) If this is a per-AG btree and the AG has less than ~9% space free. + * (3) If this is an inode btree and the FS has less than ~9% space free. + * + * Note that we actually use 3/32 for the comparison to avoid division. + */ +static void +xrep_newbt_estimate_slack( + struct xrep_newbt *xnr) +{ + struct xfs_scrub *sc = xnr->sc; + struct xfs_btree_bload *bload = &xnr->bload; + uint64_t free; + uint64_t sz; + + /* Let the btree code compute the default slack values. */ + bload->leaf_slack = -1; + bload->node_slack = -1; + + if (sc->ops->type == ST_PERAG) { + free = sc->sa.pag->pagf_freeblks; + sz = xfs_ag_block_count(sc->mp, sc->sa.pag->pag_agno); + } else { + free = percpu_counter_sum(&sc->mp->m_fdblocks); + sz = sc->mp->m_sb.sb_dblocks; + } + + /* No further changes if there's more than 3/32ths space left. */ + if (free >= ((sz * 3) >> 5)) + return; + + /* We're low on space; load the btrees as tightly as possible. */ + if (bload->leaf_slack < 0) + bload->leaf_slack = 0; + if (bload->node_slack < 0) + bload->node_slack = 0; +} + +/* Initialize accounting resources for staging a new AG btree. 
*/ +void +xrep_newbt_init_ag( + struct xrep_newbt *xnr, + struct xfs_scrub *sc, + const struct xfs_owner_info *oinfo, + xfs_fsblock_t alloc_hint, + enum xfs_ag_resv_type resv) +{ + memset(xnr, 0, sizeof(struct xrep_newbt)); + xnr->sc = sc; + xnr->oinfo = *oinfo; /* structure copy */ + xnr->alloc_hint = alloc_hint; + xnr->resv = resv; + INIT_LIST_HEAD(&xnr->resv_list); + xrep_newbt_estimate_slack(xnr); +} + +/* Initialize accounting resources for staging a new inode fork btree. */ +int +xrep_newbt_init_inode( + struct xrep_newbt *xnr, + struct xfs_scrub *sc, + int whichfork, + const struct xfs_owner_info *oinfo) +{ + struct xfs_ifork *ifp; + + ifp = kmem_cache_zalloc(xfs_ifork_cache, XCHK_GFP_FLAGS); + if (!ifp) + return -ENOMEM; + + xrep_newbt_init_ag(xnr, sc, oinfo, + XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino), + XFS_AG_RESV_NONE); + xnr->ifake.if_fork = ifp; + xnr->ifake.if_fork_size = xfs_inode_fork_size(sc->ip, whichfork); + xnr->ifake.if_whichfork = whichfork; + return 0; +} + +/* + * Initialize accounting resources for staging a new btree. Callers are + * expected to add their own reservations (and clean them up) manually. + */ +void +xrep_newbt_init_bare( + struct xrep_newbt *xnr, + struct xfs_scrub *sc) +{ + xrep_newbt_init_ag(xnr, sc, &XFS_RMAP_OINFO_ANY_OWNER, NULLFSBLOCK, + XFS_AG_RESV_NONE); +} + +/* + * Designate specific blocks to be used to build our new btree. @pag must be + * a passive reference. 
+ */ +STATIC int +xrep_newbt_add_blocks( + struct xrep_newbt *xnr, + struct xfs_perag *pag, + const struct xfs_alloc_arg *args) +{ + struct xfs_mount *mp = xnr->sc->mp; + struct xrep_newbt_resv *resv; + + resv = kmalloc(sizeof(struct xrep_newbt_resv), XCHK_GFP_FLAGS); + if (!resv) + return -ENOMEM; + + INIT_LIST_HEAD(&resv->list); + resv->agbno = XFS_FSB_TO_AGBNO(mp, args->fsbno); + resv->len = args->len; + resv->used = 0; + resv->pag = xfs_perag_hold(pag); + + list_add_tail(&resv->list, &xnr->resv_list); + return 0; +} + +/* Don't let our allocation hint take us beyond this AG */ +static inline void +xrep_newbt_validate_ag_alloc_hint( + struct xrep_newbt *xnr) +{ + struct xfs_scrub *sc = xnr->sc; + xfs_agnumber_t agno = XFS_FSB_TO_AGNO(sc->mp, xnr->alloc_hint); + + if (agno == sc->sa.pag->pag_agno && + xfs_verify_fsbno(sc->mp, xnr->alloc_hint)) + return; + + xnr->alloc_hint = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno, + XFS_AGFL_BLOCK(sc->mp) + 1); +} + +/* Allocate disk space for a new per-AG btree. 
*/ +STATIC int +xrep_newbt_alloc_ag_blocks( + struct xrep_newbt *xnr, + uint64_t nr_blocks) +{ + struct xfs_scrub *sc = xnr->sc; + struct xfs_mount *mp = sc->mp; + int error = 0; + + ASSERT(sc->sa.pag != NULL); + + while (nr_blocks > 0) { + struct xfs_alloc_arg args = { + .tp = sc->tp, + .mp = mp, + .oinfo = xnr->oinfo, + .minlen = 1, + .maxlen = nr_blocks, + .prod = 1, + .resv = xnr->resv, + }; + xfs_agnumber_t agno; + + xrep_newbt_validate_ag_alloc_hint(xnr); + + error = xfs_alloc_vextent_near_bno(&args, xnr->alloc_hint); + if (error) + return error; + if (args.fsbno == NULLFSBLOCK) + return -ENOSPC; + + agno = XFS_FSB_TO_AGNO(mp, args.fsbno); + + trace_xrep_newbt_alloc_ag_blocks(mp, agno, + XFS_FSB_TO_AGBNO(mp, args.fsbno), args.len, + xnr->oinfo.oi_owner); + + if (agno != sc->sa.pag->pag_agno) { + ASSERT(agno == sc->sa.pag->pag_agno); + return -EFSCORRUPTED; + } + + error = xrep_newbt_add_blocks(xnr, sc->sa.pag, &args); + if (error) + return error; + + nr_blocks -= args.len; + xnr->alloc_hint = args.fsbno + args.len; + + error = xrep_defer_finish(sc); + if (error) + return error; + } + + return 0; +} + +/* Don't let our allocation hint take us beyond EOFS */ +static inline void +xrep_newbt_validate_file_alloc_hint( + struct xrep_newbt *xnr) +{ + struct xfs_scrub *sc = xnr->sc; + + if (xfs_verify_fsbno(sc->mp, xnr->alloc_hint)) + return; + + xnr->alloc_hint = XFS_AGB_TO_FSB(sc->mp, 0, XFS_AGFL_BLOCK(sc->mp) + 1); +} + +/* Allocate disk space for our new file-based btree. 
*/ +STATIC int +xrep_newbt_alloc_file_blocks( + struct xrep_newbt *xnr, + uint64_t nr_blocks) +{ + struct xfs_scrub *sc = xnr->sc; + struct xfs_mount *mp = sc->mp; + int error = 0; + + while (nr_blocks > 0) { + struct xfs_alloc_arg args = { + .tp = sc->tp, + .mp = mp, + .oinfo = xnr->oinfo, + .minlen = 1, + .maxlen = nr_blocks, + .prod = 1, + .resv = xnr->resv, + }; + struct xfs_perag *pag; + xfs_agnumber_t agno; + + xrep_newbt_validate_file_alloc_hint(xnr); + + error = xfs_alloc_vextent_start_ag(&args, xnr->alloc_hint); + if (error) + return error; + if (args.fsbno == NULLFSBLOCK) + return -ENOSPC; + + agno = XFS_FSB_TO_AGNO(mp, args.fsbno); + + trace_xrep_newbt_alloc_file_blocks(mp, agno, + XFS_FSB_TO_AGBNO(mp, args.fsbno), args.len, + xnr->oinfo.oi_owner); + + pag = xfs_perag_get(mp, agno); + if (!pag) { + ASSERT(0); + return -EFSCORRUPTED; + } + + error = xrep_newbt_add_blocks(xnr, pag, &args); + xfs_perag_put(pag); + if (error) + return error; + + nr_blocks -= args.len; + xnr->alloc_hint = args.fsbno + args.len; + + error = xrep_defer_finish(sc); + if (error) + return error; + } + + return 0; +} + +/* Allocate disk space for our new btree. */ +int +xrep_newbt_alloc_blocks( + struct xrep_newbt *xnr, + uint64_t nr_blocks) +{ + if (xnr->sc->ip) + return xrep_newbt_alloc_file_blocks(xnr, nr_blocks); + return xrep_newbt_alloc_ag_blocks(xnr, nr_blocks); +} + +/* + * Free the unused part of a space extent that was reserved for a new ondisk + * structure. Returns the number of EFIs logged or a negative errno. + */ +STATIC int +xrep_newbt_free_extent( + struct xrep_newbt *xnr, + struct xrep_newbt_resv *resv, + bool btree_committed) +{ + struct xfs_scrub *sc = xnr->sc; + xfs_agblock_t free_agbno = resv->agbno; + xfs_extlen_t free_aglen = resv->len; + xfs_fsblock_t fsbno; + int error; + + if (!btree_committed || resv->used == 0) { + /* + * If we're not committing a new btree or we didn't use the + * space reservation, free the entire space extent. 
+ */ + goto free; + } + + /* + * We used space and committed the btree. Remove the written blocks + * from the reservation and possibly log a new EFI to free any unused + * reservation space. + */ + free_agbno += resv->used; + free_aglen -= resv->used; + + if (free_aglen == 0) + return 0; + + trace_xrep_newbt_free_blocks(sc->mp, resv->pag->pag_agno, free_agbno, + free_aglen, xnr->oinfo.oi_owner); + + ASSERT(xnr->resv != XFS_AG_RESV_AGFL); + +free: + /* + * Use EFIs to free the reservations. This reduces the chance + * that we leak blocks if the system goes down. + */ + fsbno = XFS_AGB_TO_FSB(sc->mp, resv->pag->pag_agno, free_agbno); + error = xfs_free_extent_later(sc->tp, fsbno, free_aglen, &xnr->oinfo, + xnr->resv, true); + if (error) + return error; + + return 1; +} + +/* Free all the accounting info and disk space we reserved for a new btree. */ +STATIC int +xrep_newbt_free( + struct xrep_newbt *xnr, + bool btree_committed) +{ + struct xfs_scrub *sc = xnr->sc; + struct xrep_newbt_resv *resv, *n; + unsigned int freed = 0; + int error = 0; + + /* + * If the filesystem already went down, we can't free the blocks. Skip + * ahead to freeing the incore metadata because we can't fix anything. + */ + if (xfs_is_shutdown(sc->mp)) + goto junkit; + + list_for_each_entry_safe(resv, n, &xnr->resv_list, list) { + int ret; + + ret = xrep_newbt_free_extent(xnr, resv, btree_committed); + list_del(&resv->list); + xfs_perag_put(resv->pag); + kfree(resv); + if (ret < 0) { + error = ret; + goto junkit; + } + + freed += ret; + if (freed >= XREP_MAX_ITRUNCATE_EFIS) { + error = xrep_defer_finish(sc); + if (error) + goto junkit; + freed = 0; + } + } + + if (freed) + error = xrep_defer_finish(sc); + +junkit: + /* + * If we still have reservations attached to @newbt, cleanup must have + * failed and the filesystem is about to go down. Clean up the incore + * reservations. 
+ */ + list_for_each_entry_safe(resv, n, &xnr->resv_list, list) { + list_del(&resv->list); + xfs_perag_put(resv->pag); + kfree(resv); + } + + if (sc->ip) { + kmem_cache_free(xfs_ifork_cache, xnr->ifake.if_fork); + xnr->ifake.if_fork = NULL; + } + + return error; +} + +/* + * Free all the accounting info and unused disk space allocations after + * committing a new btree. + */ +int +xrep_newbt_commit( + struct xrep_newbt *xnr) +{ + return xrep_newbt_free(xnr, true); +} + +/* + * Free all the accounting info and all of the disk space we reserved for a new + * btree that we're not going to commit. We want to try to roll things back + * cleanly for things like ENOSPC midway through allocation. + */ +void +xrep_newbt_cancel( + struct xrep_newbt *xnr) +{ + xrep_newbt_free(xnr, false); +} + +/* Feed one of the reserved btree blocks to the bulk loader. */ +int +xrep_newbt_claim_block( + struct xfs_btree_cur *cur, + struct xrep_newbt *xnr, + union xfs_btree_ptr *ptr) +{ + struct xrep_newbt_resv *resv; + struct xfs_mount *mp = cur->bc_mp; + xfs_agblock_t agbno; + + /* + * The first item in the list should always have a free block unless + * we're completely out. + */ + resv = list_first_entry(&xnr->resv_list, struct xrep_newbt_resv, list); + if (resv->used == resv->len) + return -ENOSPC; + + /* + * Peel off a block from the start of the reservation. We allocate + * blocks in order to place blocks on disk in increasing record or key + * order. The block reservations tend to end up on the list in + * decreasing order, which hopefully results in leaf blocks ending up + * together. + */ + agbno = resv->agbno + resv->used; + resv->used++; + + /* If we used all the blocks in this reservation, move it to the end. 
*/ + if (resv->used == resv->len) + list_move_tail(&resv->list, &xnr->resv_list); + + trace_xrep_newbt_claim_block(mp, resv->pag->pag_agno, agbno, 1, + xnr->oinfo.oi_owner); + + if (cur->bc_flags & XFS_BTREE_LONG_PTRS) + ptr->l = cpu_to_be64(XFS_AGB_TO_FSB(mp, resv->pag->pag_agno, + agbno)); + else + ptr->s = cpu_to_be32(agbno); + return 0; +} diff --git a/fs/xfs/scrub/newbt.h b/fs/xfs/scrub/newbt.h new file mode 100644 index 0000000000000..ca53271f3a4c6 --- /dev/null +++ b/fs/xfs/scrub/newbt.h @@ -0,0 +1,62 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_NEWBT_H__ +#define __XFS_SCRUB_NEWBT_H__ + +struct xrep_newbt_resv { + /* Link to list of extents that we've reserved. */ + struct list_head list; + + struct xfs_perag *pag; + + /* AG block of the extent we reserved. */ + xfs_agblock_t agbno; + + /* Length of the reservation. */ + xfs_extlen_t len; + + /* How much of this reservation has been used. */ + xfs_extlen_t used; +}; + +struct xrep_newbt { + struct xfs_scrub *sc; + + /* List of extents that we've reserved. */ + struct list_head resv_list; + + /* Fake root for new btree. 
*/ + union { + struct xbtree_afakeroot afake; + struct xbtree_ifakeroot ifake; + }; + + /* rmap owner of these blocks */ + struct xfs_owner_info oinfo; + + /* btree geometry for the bulk loader */ + struct xfs_btree_bload bload; + + /* Allocation hint */ + xfs_fsblock_t alloc_hint; + + /* per-ag reservation type */ + enum xfs_ag_resv_type resv; +}; + +void xrep_newbt_init_bare(struct xrep_newbt *xnr, struct xfs_scrub *sc); +void xrep_newbt_init_ag(struct xrep_newbt *xnr, struct xfs_scrub *sc, + const struct xfs_owner_info *oinfo, xfs_fsblock_t alloc_hint, + enum xfs_ag_resv_type resv); +int xrep_newbt_init_inode(struct xrep_newbt *xnr, struct xfs_scrub *sc, + int whichfork, const struct xfs_owner_info *oinfo); +int xrep_newbt_alloc_blocks(struct xrep_newbt *xnr, uint64_t nr_blocks); +void xrep_newbt_cancel(struct xrep_newbt *xnr); +int xrep_newbt_commit(struct xrep_newbt *xnr); +int xrep_newbt_claim_block(struct xfs_btree_cur *cur, struct xrep_newbt *xnr, + union xfs_btree_ptr *ptr); + +#endif /* __XFS_SCRUB_NEWBT_H__ */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 4849efcaa33ae..474f4c4a9cd3b 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -17,6 +17,8 @@ #include "xfs_errortag.h" #include "xfs_error.h" #include "xfs_scrub.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 4a8bc6f3c8f2e..aa76830753196 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1332,6 +1332,43 @@ TRACE_EVENT(xrep_ialloc_insert, __entry->freemask) ) +DECLARE_EVENT_CLASS(xrep_newbt_extent_class, + TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, + xfs_agblock_t agbno, xfs_extlen_t len, + int64_t owner), + TP_ARGS(mp, agno, agbno, len, owner), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agblock_t, agbno) + __field(xfs_extlen_t, len) + __field(int64_t, 
owner) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->agno = agno; + __entry->agbno = agbno; + __entry->len = len; + __entry->owner = owner; + ), + TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x owner 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->agbno, + __entry->len, + __entry->owner) +); +#define DEFINE_NEWBT_EXTENT_EVENT(name) \ +DEFINE_EVENT(xrep_newbt_extent_class, name, \ + TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \ + xfs_agblock_t agbno, xfs_extlen_t len, \ + int64_t owner), \ + TP_ARGS(mp, agno, agbno, len, owner)) +DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_ag_blocks); +DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_file_blocks); +DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_free_blocks); +DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_claim_block); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging 2023-11-24 23:48 ` [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong @ 2023-11-26 13:14 ` Christoph Hellwig 2023-11-27 22:34 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-26 13:14 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs The data structure and support code look fine to me, but I do have some nitpicky comments and questions: > - /* Fork format. */ > - unsigned int if_format; > - > - /* Number of records. */ > - unsigned int if_extents; > + /* Which fork is this btree being built for? */ > + int if_whichfork; The two removed fields seem to be unused even before this patch. Should they have been in a separate removal patch? > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c > index 876a2f41b0637..36c511f96b004 100644 > --- a/fs/xfs/scrub/agheader_repair.c > +++ b/fs/xfs/scrub/agheader_repair.c > @@ -10,6 +10,7 @@ > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_btree.h" > +#include "xfs_btree_staging.h" > #include "xfs_log_format.h" > #include "xfs_trans.h" > #include "xfs_sb.h" I also don't think all the #include churn belongs in this patch, as the only existing header touched by it is xfs_btree_staging.h, which means that anything that didn't need it before still won't need it with the changes. > +/* > + * Estimate proper slack values for a btree that's being reloaded. > + * > + * Under most circumstances, we'll take whatever default loading value the > + * btree bulk loading code calculates for us. However, there are some > + * exceptions to this rule: > + * > + * (1) If someone turned one of the debug knobs. > + * (2) If this is a per-AG btree and the AG has less than ~9% space free. > + * (3) If this is an inode btree and the FS has less than ~9% space free. Where does this ~9% number come from? 
Obviously it is a low-space condition of some sort, but I wonder what the criteria are. It would be nice to document that here, even if the answer is "out of thin air". > + * Note that we actually use 3/32 for the comparison to avoid division. > + */ > +static void > + /* No further changes if there's more than 3/32ths space left. */ > + if (free >= ((sz * 3) >> 5)) > + return; Is this code really in a path so critical that a division (or relying on the compiler to do the right thing) is out of the question? Because these shifts by magic numbers are really annoying to read (unlike, say, the normal SECTOR_SHIFT or PAGE_SHIFT ones, which are fairly easy to read). ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging 2023-11-26 13:14 ` Christoph Hellwig @ 2023-11-27 22:34 ` Darrick J. Wong 2023-11-28 5:41 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-27 22:34 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Sun, Nov 26, 2023 at 05:14:53AM -0800, Christoph Hellwig wrote: > The data structure and support code looks fine to me, but I do have > some nitpicky comments and questions: > > > - /* Fork format. */ > > - unsigned int if_format; > > - > > - /* Number of records. */ > > - unsigned int if_extents; > > + /* Which fork is this btree being built for? */ > > + int if_whichfork; > > The two removed fields seems to be unused even before this patch. > Should they have been in a separate removal patch? They should have been in the patch that moved if_{format,extents} into xfs_inode_fork: daf83964a3681 ("xfs: move the per-fork nextents fields into struct xfs_ifork") f7e67b20ecbbc ("xfs: move the fork format fields into struct xfs_ifork") but I think it just got lost in the review of all that back in May 2020. Since then the design has changed enough that I don't even think the if_whichfork field is in use anywhere: $ git grep if_whichfork fs/xfs/libxfs/xfs_bmap_btree.c:664: cur = xfs_bmbt_init_common(mp, NULL, ip, ifake->if_whichfork); fs/xfs/libxfs/xfs_btree_staging.h:42: int if_whichfork; fs/xfs/scrub/newbt.c:117: xnr->ifake.if_whichfork = whichfork; fs/xfs/scrub/newbt.c:156: xnr->ifake.if_whichfork = XFS_DATA_FORK; $ cd ../xfsprogs/ $ git grep if_whichfork db/bmap_inflate.c:367: ifake.if_whichfork = XFS_DATA_FORK; db/bmap_inflate.c:421: ifake.if_whichfork = XFS_DATA_FORK; libxfs/xfs_bmap_btree.c:662: cur = xfs_bmbt_init_common(mp, NULL, ip, ifake->if_whichfork); libxfs/xfs_btree_staging.h:42: int if_whichfork; repair/bulkload.c:38: bkl->ifake.if_whichfork = whichfork; So that can all go away. 
> > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c > > index 876a2f41b0637..36c511f96b004 100644 > > --- a/fs/xfs/scrub/agheader_repair.c > > +++ b/fs/xfs/scrub/agheader_repair.c > > @@ -10,6 +10,7 @@ > > #include "xfs_trans_resv.h" > > #include "xfs_mount.h" > > #include "xfs_btree.h" > > +#include "xfs_btree_staging.h" > > #include "xfs_log_format.h" > > #include "xfs_trans.h" > > #include "xfs_sb.h" > > I also don't think all the #include churn belongs into this patch, > as the only existing header touched by it is xfs_btree_staging.h, > which means that anything that didn't need it before still won't > need it with the changes. Hmm yeah. Not sure when this detritus started accumulating here. :( > > +/* > > + * Estimate proper slack values for a btree that's being reloaded. > > + * > > + * Under most circumstances, we'll take whatever default loading value the > > + * btree bulk loading code calculates for us. However, there are some > > + * exceptions to this rule: > > + * > > + * (1) If someone turned one of the debug knobs. > > + * (2) If this is a per-AG btree and the AG has less than ~9% space free. > > + * (3) If this is an inode btree and the FS has less than ~9% space free. > > Where does this ~9% number come from? Obviously it is a low-space > condition of some sort, but I wonder what are the criteria. It would > be nice to document that here, even if the answer is > answer is "out of thin air". It comes from xfs_repair: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=for-next#n114 Before xfs_btree_staging.[ch] came along, it was open coded in repair/phase5.c in a most unglorious fashion: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349 At that point the slack factors were arbitrary quantities per btree. The rmapbt automatically left 10 slots free; everything else left zero. 
That had a noticeable effect on performance straight after mounting because touching /any/ btree would result in splits. IIRC Dave and I decided that repair should generate btree blocks that were 75% full unless space was tight. We defined tight as ~10% free to avoid repair failures and settled on 3/32 to avoid div64. IOWs, we mostly pulled it out of thin air. ;) OFC the other weird thing is that originally I thought that online repair would land sooner than the retrofit of xfs_repair to the new btree bulk loading code. > > + * Note that we actually use 3/32 for the comparison to avoid division. > > + */ > > +static void > > > + /* No further changes if there's more than 3/32ths space left. */ > > + if (free >= ((sz * 3) >> 5)) > > + return; > > Is this code really in a path so critical that a division (or relying > on the compiler to do the right thing) is out of the question? Because > these shifts by magic numbers are really annoying to read (unlike, > say, the normal SECTOR_SHIFT or PAGE_SHIFT ones, which are fairly easy to > read). Nah, it's not performance critical. Collecting records and formatting blocks is a much bigger strain on the system than a few divisions. I'll change it to div_u64(sz, 10). Hmm, now that I look at it, I've also noticed that even the lowspace btree rebuild in userspace will leave two open slots per block. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging 2023-11-27 22:34 ` Darrick J. Wong @ 2023-11-28 5:41 ` Christoph Hellwig 2023-11-28 17:02 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 5:41 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs On Mon, Nov 27, 2023 at 02:34:51PM -0800, Darrick J. Wong wrote: > That had a noticeable effect on performance straight after mounting > because touching /any/ btree would result in splits. IIRC Dave and I > decided that repair should generate btree blocks that were 75% full > unless space was tight. We defined tight as ~10% free to avoid repair > failures and settled on 3/32 to avoid div64. > > IOWs, we mostly pulled it out of thin air. ;) Maybe throw a little comment about this in. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging 2023-11-28 5:41 ` Christoph Hellwig @ 2023-11-28 17:02 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 17:02 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Mon, Nov 27, 2023 at 09:41:28PM -0800, Christoph Hellwig wrote: > On Mon, Nov 27, 2023 at 02:34:51PM -0800, Darrick J. Wong wrote: > > That had a noticeable effect on performance straight after mounting > > because touching /any/ btree would result in splits. IIRC Dave and I > > decided that repair should generate btree blocks that were 75% full > > unless space was tight. We defined tight as ~10% free to avoid repair > > failures and settled on 3/32 to avoid div64. > > > > IOWs, we mostly pulled it out of thin air. ;) > > Maybe throw a little comment about this in. Yeah, I'll paste the sordid history into the commit message when I change the code to use div_u64. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 6/7] xfs: log EFIs for all btree blocks being used to stage a btree 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong ` (4 preceding siblings ...) 2023-11-24 23:48 ` [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong @ 2023-11-24 23:48 ` Darrick J. Wong 2023-11-26 13:15 ` Christoph Hellwig 2023-11-24 23:48 ` [PATCH 7/7] xfs: force small EFIs for reaping btree extents Darrick J. Wong 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:48 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> We need to log EFIs for every extent that we allocate for the purpose of staging a new btree so that if we fail then the blocks will be freed during log recovery. Use the autoreaping mechanism provided by the previous patch to attach paused freeing work to the scrub transaction. We can then mark the EFIs stale if we decide to commit the new btree, or we can unpause the EFIs if we decide to abort the repair. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/scrub/newbt.c | 34 ++++++++++++++++++++++++++-------- fs/xfs/scrub/newbt.h | 3 +++ 2 files changed, 29 insertions(+), 8 deletions(-) diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c index 4e8d6637426e4..2932fd317ab23 100644 --- a/fs/xfs/scrub/newbt.c +++ b/fs/xfs/scrub/newbt.c @@ -136,6 +136,7 @@ xrep_newbt_add_blocks( { struct xfs_mount *mp = xnr->sc->mp; struct xrep_newbt_resv *resv; + int error; resv = kmalloc(sizeof(struct xrep_newbt_resv), XCHK_GFP_FLAGS); if (!resv) @@ -147,8 +148,18 @@ xrep_newbt_add_blocks( resv->used = 0; resv->pag = xfs_perag_hold(pag); + ASSERT(xnr->oinfo.oi_offset == 0); + + error = xfs_alloc_schedule_autoreap(args, true, &resv->autoreap); + if (error) + goto out_pag; + list_add_tail(&resv->list, &xnr->resv_list); return 0; +out_pag: + xfs_perag_put(resv->pag); + kfree(resv); + return error; } /* Don't let our allocation hint take us beyond this AG */ @@ -327,16 +338,21 @@ xrep_newbt_free_extent( if (!btree_committed || resv->used == 0) { /* * If we're not committing a new btree or we didn't use the - * space reservation, free the entire space extent. + * space reservation, let the existing EFI free the entire + * space extent. */ - goto free; + trace_xrep_newbt_free_blocks(sc->mp, resv->pag->pag_agno, + free_agbno, free_aglen, xnr->oinfo.oi_owner); + xfs_alloc_commit_autoreap(sc->tp, &resv->autoreap); + return 1; } /* - * We used space and committed the btree. Remove the written blocks - * from the reservation and possibly log a new EFI to free any unused - * reservation space. + * We used space and committed the btree. Cancel the autoreap, remove + * the written blocks from the reservation, and possibly log a new EFI + * to free any unused reservation space. 
*/ + xfs_alloc_cancel_autoreap(sc->tp, &resv->autoreap); free_agbno += resv->used; free_aglen -= resv->used; @@ -348,7 +364,6 @@ xrep_newbt_free_extent( ASSERT(xnr->resv != XFS_AG_RESV_AGFL); -free: /* * Use EFIs to free the reservations. This reduces the chance * that we leak blocks if the system goes down. @@ -408,9 +423,10 @@ xrep_newbt_free( /* * If we still have reservations attached to @newbt, cleanup must have * failed and the filesystem is about to go down. Clean up the incore - * reservations. + * reservations and try to commit to freeing the space we used. */ list_for_each_entry_safe(resv, n, &xnr->resv_list, list) { + xfs_alloc_commit_autoreap(sc->tp, &resv->autoreap); list_del(&resv->list); xfs_perag_put(resv->pag); kfree(resv); @@ -488,5 +504,7 @@ xrep_newbt_claim_block( agbno)); else ptr->s = cpu_to_be32(agbno); - return 0; + + /* Relog all the EFIs. */ + return xrep_defer_finish(xnr->sc); } diff --git a/fs/xfs/scrub/newbt.h b/fs/xfs/scrub/newbt.h index ca53271f3a4c6..d2baffa17b1ae 100644 --- a/fs/xfs/scrub/newbt.h +++ b/fs/xfs/scrub/newbt.h @@ -12,6 +12,9 @@ struct xrep_newbt_resv { struct xfs_perag *pag; + /* Auto-freeing this reservation if we don't commit. */ + struct xfs_alloc_autoreap autoreap; + /* AG block of the extent we reserved. */ xfs_agblock_t agbno; ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 6/7] xfs: log EFIs for all btree blocks being used to stage a btree 2023-11-24 23:48 ` [PATCH 6/7] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong @ 2023-11-26 13:15 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-26 13:15 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 7/7] xfs: force small EFIs for reaping btree extents 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong ` (5 preceding siblings ...) 2023-11-24 23:48 ` [PATCH 6/7] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong @ 2023-11-24 23:48 ` Darrick J. Wong 2023-11-25 5:13 ` Christoph Hellwig 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:48 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Introduce the concept of a defer ops barrier to separate consecutively queued pending work items of the same type. With a barrier in place, the two work items will be tracked separately, and receive separate log intent items. The goal here is to prevent reaping of old metadata blocks from creating unnecessarily huge EFIs that could then run the risk of overflowing the scrub transaction. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_defer.c | 83 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_defer.h | 3 ++ fs/xfs/scrub/reap.c | 5 +++ 3 files changed, 91 insertions(+) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index 6604eb50058ba..6b0d4c2e844b0 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -181,6 +181,58 @@ static struct kmem_cache *xfs_defer_pending_cache; * Note that the continuation requested between t2 and t3 is likely to * reoccur. 
*/ +STATIC struct xfs_log_item * +xfs_defer_barrier_create_intent( + struct xfs_trans *tp, + struct list_head *items, + unsigned int count, + bool sort) +{ + return NULL; +} + +STATIC void +xfs_defer_barrier_abort_intent( + struct xfs_log_item *intent) +{ + /* empty */ +} + +STATIC struct xfs_log_item * +xfs_defer_barrier_create_done( + struct xfs_trans *tp, + struct xfs_log_item *intent, + unsigned int count) +{ + return NULL; +} + +STATIC int +xfs_defer_barrier_finish_item( + struct xfs_trans *tp, + struct xfs_log_item *done, + struct list_head *item, + struct xfs_btree_cur **state) +{ + ASSERT(0); + return -EFSCORRUPTED; +} + +STATIC void +xfs_defer_barrier_cancel_item( + struct list_head *item) +{ + ASSERT(0); +} + +static const struct xfs_defer_op_type xfs_barrier_defer_type = { + .max_items = 1, + .create_intent = xfs_defer_barrier_create_intent, + .abort_intent = xfs_defer_barrier_abort_intent, + .create_done = xfs_defer_barrier_create_done, + .finish_item = xfs_defer_barrier_finish_item, + .cancel_item = xfs_defer_barrier_cancel_item, +}; static const struct xfs_defer_op_type *defer_op_types[] = { [XFS_DEFER_OPS_TYPE_BMAP] = &xfs_bmap_update_defer_type, @@ -189,6 +241,7 @@ static const struct xfs_defer_op_type *defer_op_types[] = { [XFS_DEFER_OPS_TYPE_FREE] = &xfs_extent_free_defer_type, [XFS_DEFER_OPS_TYPE_AGFL_FREE] = &xfs_agfl_free_defer_type, [XFS_DEFER_OPS_TYPE_ATTR] = &xfs_attr_defer_type, + [XFS_DEFER_OPS_TYPE_BARRIER] = &xfs_barrier_defer_type, }; /* @@ -1036,3 +1089,33 @@ xfs_defer_item_unpause( trace_xfs_defer_item_unpause(tp->t_mountp, dfp); } + +/* + * Add a defer ops barrier to force two otherwise adjacent deferred work items + * to be tracked separately and have separate log items. + */ +void +xfs_defer_add_barrier( + struct xfs_trans *tp) +{ + struct xfs_defer_pending *dfp; + + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); + + /* If the last defer op added was a barrier, we're done. 
*/ + if (!list_empty(&tp->t_dfops)) { + dfp = list_last_entry(&tp->t_dfops, + struct xfs_defer_pending, dfp_list); + if (dfp->dfp_type == XFS_DEFER_OPS_TYPE_BARRIER) + return; + } + + dfp = kmem_cache_zalloc(xfs_defer_pending_cache, + GFP_NOFS | __GFP_NOFAIL); + dfp->dfp_type = XFS_DEFER_OPS_TYPE_BARRIER; + INIT_LIST_HEAD(&dfp->dfp_work); + list_add_tail(&dfp->dfp_list, &tp->t_dfops); + + trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL); + dfp->dfp_count++; +} diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 094ff9062b251..0112678a8856b 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -20,6 +20,7 @@ enum xfs_defer_ops_type { XFS_DEFER_OPS_TYPE_FREE, XFS_DEFER_OPS_TYPE_AGFL_FREE, XFS_DEFER_OPS_TYPE_ATTR, + XFS_DEFER_OPS_TYPE_BARRIER, XFS_DEFER_OPS_TYPE_MAX, }; @@ -141,4 +142,6 @@ void xfs_defer_resources_rele(struct xfs_defer_resources *dres); int __init xfs_defer_init_item_caches(void); void xfs_defer_destroy_item_caches(void); +void xfs_defer_add_barrier(struct xfs_trans *tp); + #endif /* __XFS_DEFER_H__ */ diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c index 78c9f2085db46..ee26fcb500b78 100644 --- a/fs/xfs/scrub/reap.c +++ b/fs/xfs/scrub/reap.c @@ -31,6 +31,7 @@ #include "xfs_da_btree.h" #include "xfs_attr.h" #include "xfs_attr_remote.h" +#include "xfs_defer.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -409,6 +410,8 @@ xreap_agextent_iter( /* * Use deferred frees to get rid of the old btree blocks to try to * minimize the window in which we could crash and lose the old blocks. + * Add a defer ops barrier every other extent to avoid stressing the + * system with large EFIs. */ error = xfs_free_extent_later(sc->tp, fsbno, *aglenp, rs->oinfo, rs->resv, true); @@ -416,6 +419,8 @@ xreap_agextent_iter( return error; rs->deferred++; + if (rs->deferred % 2 == 0) + xfs_defer_add_barrier(sc->tp); return 0; } ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 7/7] xfs: force small EFIs for reaping btree extents 2023-11-24 23:48 ` [PATCH 7/7] xfs: force small EFIs for reaping btree extents Darrick J. Wong @ 2023-11-25 5:13 ` Christoph Hellwig 2023-11-27 22:46 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:13 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs This generally looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> But as mentioned earlier I think it would be nice to share some code between the normal xfs_defer_add and xfs_defer_add_barrier. This would be the fold for this patch on top of the one previously posted and a light merge fix for the pausing between: diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index e70dc347074dc5..5d621e445e8ab9 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -753,6 +753,22 @@ xfs_defer_can_append( return true; } +/* Create a new pending item at the end of the intake list. */ +static struct xfs_defer_pending * +xfs_defer_alloc( + struct xfs_trans *tp, + enum xfs_defer_ops_type type) +{ + struct xfs_defer_pending *dfp; + + dfp = kmem_cache_zalloc(xfs_defer_pending_cache, + GFP_NOFS | __GFP_NOFAIL); + dfp->dfp_type = type; + INIT_LIST_HEAD(&dfp->dfp_work); + list_add_tail(&dfp->dfp_list, &tp->t_dfops); + return dfp; +}; + /* Add an item for later deferred processing. */ struct xfs_defer_pending * xfs_defer_add( @@ -760,29 +776,18 @@ xfs_defer_add( enum xfs_defer_ops_type type, struct list_head *li) { - struct xfs_defer_pending *dfp = NULL; const struct xfs_defer_op_type *ops = defer_op_types[type]; + struct xfs_defer_pending *dfp; ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX); dfp = xfs_defer_find(tp, type); - if (!dfp || !xfs_defer_can_append(dfp, ops)) { - /* Create a new pending item at the end of the intake list. 
*/ - dfp = kmem_cache_zalloc(xfs_defer_pending_cache, - GFP_NOFS | __GFP_NOFAIL); - dfp->dfp_type = type; - dfp->dfp_intent = NULL; - dfp->dfp_done = NULL; - dfp->dfp_count = 0; - INIT_LIST_HEAD(&dfp->dfp_work); - list_add_tail(&dfp->dfp_list, &tp->t_dfops); - } - + if (!dfp || !xfs_defer_can_append(dfp, ops)) + dfp = xfs_defer_alloc(tp, type); list_add_tail(li, &dfp->dfp_work); trace_xfs_defer_add_item(tp->t_mountp, dfp, li); dfp->dfp_count++; - return dfp; } @@ -1106,19 +1111,10 @@ xfs_defer_add_barrier( ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); /* If the last defer op added was a barrier, we're done. */ - if (!list_empty(&tp->t_dfops)) { - dfp = list_last_entry(&tp->t_dfops, - struct xfs_defer_pending, dfp_list); - if (dfp->dfp_type == XFS_DEFER_OPS_TYPE_BARRIER) - return; - } - - dfp = kmem_cache_zalloc(xfs_defer_pending_cache, - GFP_NOFS | __GFP_NOFAIL); - dfp->dfp_type = XFS_DEFER_OPS_TYPE_BARRIER; - INIT_LIST_HEAD(&dfp->dfp_work); - list_add_tail(&dfp->dfp_list, &tp->t_dfops); - + dfp = xfs_defer_find(tp, XFS_DEFER_OPS_TYPE_BARRIER); + if (dfp) + return; + dfp = xfs_defer_alloc(tp, XFS_DEFER_OPS_TYPE_BARRIER); trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL); dfp->dfp_count++; } ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 7/7] xfs: force small EFIs for reaping btree extents 2023-11-25 5:13 ` Christoph Hellwig @ 2023-11-27 22:46 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-27 22:46 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Fri, Nov 24, 2023 at 09:13:22PM -0800, Christoph Hellwig wrote: > This generally looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > But as mentioned earlier I think it would be nice to share some > code between the normal xfs_defer_add and xfs_defer_add_barrier. > > This would be the fold for this patch on top of the one previously > posted and a light merge fix for the pausing between: Applied. Thanks for the quick cleanup! --D > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > index e70dc347074dc5..5d621e445e8ab9 100644 > --- a/fs/xfs/libxfs/xfs_defer.c > +++ b/fs/xfs/libxfs/xfs_defer.c > @@ -753,6 +753,22 @@ xfs_defer_can_append( > return true; > } > > +/* Create a new pending item at the end of the intake list. */ > +static struct xfs_defer_pending * > +xfs_defer_alloc( > + struct xfs_trans *tp, > + enum xfs_defer_ops_type type) > +{ > + struct xfs_defer_pending *dfp; > + > + dfp = kmem_cache_zalloc(xfs_defer_pending_cache, > + GFP_NOFS | __GFP_NOFAIL); > + dfp->dfp_type = type; > + INIT_LIST_HEAD(&dfp->dfp_work); > + list_add_tail(&dfp->dfp_list, &tp->t_dfops); > + return dfp; > +}; > + > /* Add an item for later deferred processing. 
*/ > struct xfs_defer_pending * > xfs_defer_add( > @@ -760,29 +776,18 @@ xfs_defer_add( > enum xfs_defer_ops_type type, > struct list_head *li) > { > - struct xfs_defer_pending *dfp = NULL; > const struct xfs_defer_op_type *ops = defer_op_types[type]; > + struct xfs_defer_pending *dfp; > > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX); > > dfp = xfs_defer_find(tp, type); > - if (!dfp || !xfs_defer_can_append(dfp, ops)) { > - /* Create a new pending item at the end of the intake list. */ > - dfp = kmem_cache_zalloc(xfs_defer_pending_cache, > - GFP_NOFS | __GFP_NOFAIL); > - dfp->dfp_type = type; > - dfp->dfp_intent = NULL; > - dfp->dfp_done = NULL; > - dfp->dfp_count = 0; > - INIT_LIST_HEAD(&dfp->dfp_work); > - list_add_tail(&dfp->dfp_list, &tp->t_dfops); > - } > - > + if (!dfp || !xfs_defer_can_append(dfp, ops)) > + dfp = xfs_defer_alloc(tp, type); > list_add_tail(li, &dfp->dfp_work); > trace_xfs_defer_add_item(tp->t_mountp, dfp, li); > dfp->dfp_count++; > - > return dfp; > } > > @@ -1106,19 +1111,10 @@ xfs_defer_add_barrier( > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > /* If the last defer op added was a barrier, we're done. */ > - if (!list_empty(&tp->t_dfops)) { > - dfp = list_last_entry(&tp->t_dfops, > - struct xfs_defer_pending, dfp_list); > - if (dfp->dfp_type == XFS_DEFER_OPS_TYPE_BARRIER) > - return; > - } > - > - dfp = kmem_cache_zalloc(xfs_defer_pending_cache, > - GFP_NOFS | __GFP_NOFAIL); > - dfp->dfp_type = XFS_DEFER_OPS_TYPE_BARRIER; > - INIT_LIST_HEAD(&dfp->dfp_work); > - list_add_tail(&dfp->dfp_list, &tp->t_dfops); > - > + dfp = xfs_defer_find(tp, XFS_DEFER_OPS_TYPE_BARRIER); > + if (dfp) > + return; > + dfp = xfs_defer_alloc(tp, XFS_DEFER_OPS_TYPE_BARRIER); > trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL); > dfp->dfp_count++; > } > ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading 2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong 2023-11-24 23:44 ` [PATCHSET v28.0 0/1] xfs: prevent livelocks in xchk_iget Darrick J. Wong 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong @ 2023-11-24 23:45 ` Darrick J. Wong 2023-11-24 23:48 ` [PATCH 1/4] xfs: force all buffers to be written during btree bulk load Darrick J. Wong ` (3 more replies) 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong ` (4 subsequent siblings) 7 siblings, 4 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:45 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, Before we start merging the online repair functions, let's improve the bulk loading code a bit. First, we need to fix a misinteraction between the AIL and the btree bulkloader wherein the delwri submission at the end of the bulk load fails to queue a buffer for writeback if it happens to be on the AIL list. Second, we introduce a defer ops barrier object so that the process of reaping blocks after a repair cannot queue more than two extents per EFI log item. This increases our exposure to leaking blocks if the system goes down during a reap, but also should prevent transaction overflows, which result in the system going down. Third, we change the bulkloader itself to copy multiple records into a block if possible, and add some debugging knobs so that developers can control the slack factors, just like they can do for xfs_repair. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-prep-for-bulk-loading --- fs/xfs/libxfs/xfs_btree.c | 2 + fs/xfs/libxfs/xfs_btree.h | 3 ++ fs/xfs/libxfs/xfs_btree_staging.c | 67 +++++++++++++++++++++++++------------ fs/xfs/libxfs/xfs_btree_staging.h | 25 +++++++++++--- fs/xfs/scrub/newbt.c | 11 ++++-- fs/xfs/xfs_buf.c | 47 ++++++++++++++++++++++++-- fs/xfs/xfs_buf.h | 1 + fs/xfs/xfs_globals.c | 12 +++++++ fs/xfs/xfs_sysctl.h | 2 + fs/xfs/xfs_sysfs.c | 54 ++++++++++++++++++++++++++++++ 10 files changed, 189 insertions(+), 35 deletions(-) ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 1/4] xfs: force all buffers to be written during btree bulk load 2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong @ 2023-11-24 23:48 ` Darrick J. Wong 2023-11-25 5:49 ` Christoph Hellwig 2023-11-24 23:49 ` [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong ` (2 subsequent siblings) 3 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:48 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> While stress-testing online repair of btrees, I noticed periodic assertion failures from the buffer cache about buffer readers encountering buffers with DELWRI_Q set, even though the btree bulk load had already committed and the buffer itself wasn't on any delwri list. I traced this to a misunderstanding of how the delwri lists work, particularly with regards to the AIL's buffer list. If a buffer is logged and committed, the buffer can end up on that AIL buffer list. If btree repairs are run twice in rapid succession, it's possible that the first repair will invalidate the buffer and free it before the next time the AIL wakes up. This clears DELWRI_Q from the buffer state. If the second repair allocates the same block, it will then recycle the buffer to start writing the new btree block. Meanwhile, if the AIL wakes up and walks the buffer list, it will ignore the buffer because it can't lock it, and go back to sleep. When the second repair calls delwri_queue to put the buffer on the list of buffers to write before committing the new btree, it will set DELWRI_Q again, but since the buffer hasn't been removed from the AIL's buffer list, it won't add it to the bulkload buffer's list. This is incorrect, because the bulkload caller relies on delwri_submit to ensure that all the buffers have been sent to disk /before/ committing the new btree root pointer. This ordering requirement is required for data consistency. 
Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally drop it, so the next thread to walk through the btree will trip over a debug assertion on that flag. To fix this, create a new function that waits for the buffer to be removed from any other delwri lists before adding the buffer to the caller's delwri list. By waiting for the buffer to clear both the delwri list and any potential delwri wait list, we can be sure that repair will initiate writes of all buffers and report all write errors back to userspace instead of committing the new structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_btree_staging.c | 4 +-- fs/xfs/xfs_buf.c | 47 ++++++++++++++++++++++++++++++++++--- fs/xfs/xfs_buf.h | 1 + 3 files changed, 45 insertions(+), 7 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c index dd75e208b543e..29e3f8ccb1852 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.c +++ b/fs/xfs/libxfs/xfs_btree_staging.c @@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf( if (*bpp == NULL) return; - if (!xfs_buf_delwri_queue(*bpp, buffers_list)) - ASSERT(0); - + xfs_buf_delwri_queue_here(*bpp, buffers_list); xfs_buf_relse(*bpp); *bpp = NULL; } diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 545c7991b9b58..f88b1334b420c 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -2049,6 +2049,14 @@ xfs_alloc_buftarg( return NULL; } +static inline void +xfs_buf_list_del( + struct xfs_buf *bp) +{ + list_del_init(&bp->b_list); + wake_up_var(&bp->b_list); +} + /* * Cancel a delayed write list. * @@ -2066,7 +2074,7 @@ xfs_buf_delwri_cancel( xfs_buf_lock(bp); bp->b_flags &= ~_XBF_DELWRI_Q; - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); xfs_buf_relse(bp); } } @@ -2119,6 +2127,37 @@ xfs_buf_delwri_queue( return true; } +/* + * Queue a buffer to this delwri list as part of a data integrity operation. 
+ * If the buffer is on any other delwri list, we'll wait for that to clear + * so that the caller can submit the buffer for IO and wait for the result. + * Callers must ensure the buffer is not already on the list. + */ +void +xfs_buf_delwri_queue_here( + struct xfs_buf *bp, + struct list_head *buffer_list) +{ + /* + * We need this buffer to end up on the /caller's/ delwri list, not any + * old list. This can happen if the buffer is marked stale (which + * clears DELWRI_Q) after the AIL queues the buffer to its list but + * before the AIL has a chance to submit the list. + */ + while (!list_empty(&bp->b_list)) { + xfs_buf_unlock(bp); + wait_var_event(&bp->b_list, list_empty(&bp->b_list)); + xfs_buf_lock(bp); + } + + ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)); + + /* This buffer is uptodate; don't let it get reread. */ + bp->b_flags |= XBF_DONE; + + xfs_buf_delwri_queue(bp, buffer_list); +} + /* * Compare function is more complex than it needs to be because * the return value is only 32 bits and we are doing comparisons @@ -2181,7 +2220,7 @@ xfs_buf_delwri_submit_buffers( * reference and remove it from the list here. 
*/ if (!(bp->b_flags & _XBF_DELWRI_Q)) { - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); xfs_buf_relse(bp); continue; } @@ -2201,7 +2240,7 @@ xfs_buf_delwri_submit_buffers( list_move_tail(&bp->b_list, wait_list); } else { bp->b_flags |= XBF_ASYNC; - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); } __xfs_buf_submit(bp, false); } @@ -2255,7 +2294,7 @@ xfs_buf_delwri_submit( while (!list_empty(&wait_list)) { bp = list_first_entry(&wait_list, struct xfs_buf, b_list); - list_del_init(&bp->b_list); + xfs_buf_list_del(bp); /* * Wait on the locked buffer, check for errors and unlock and diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index c86e164196568..b470de08a46ca 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -319,6 +319,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); /* Delayed Write Buffer Routines */ extern void xfs_buf_delwri_cancel(struct list_head *); extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); +void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl); extern int xfs_buf_delwri_submit(struct list_head *); extern int xfs_buf_delwri_submit_nowait(struct list_head *); extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); ^ permalink raw reply related [flat|nested] 156+ messages in thread
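The ordering trick at the core of this patch -- wait until the buffer has left every other delwri list, then queue it on ours -- can be modeled in a few lines of userspace C. This is a toy under heavy simplification: locking, hold counts, and the wait_var_event() sleep are elided, and b_list membership is reduced to a single owner tag.

```c
#include <assert.h>
#include <stdbool.h>

#define DELWRI_Q (1u << 0)
#define DONE     (1u << 1)

enum { LIST_NONE = 0, LIST_AIL, LIST_REPAIR };

/* Toy buffer: 'owner' says which delwri list b_list is linked on. */
struct buf {
    unsigned int flags;
    int          owner;
};

/*
 * Mirrors the shape of xfs_buf_delwri_queue(): it reports success, but
 * only links the buffer if b_list is currently empty.  A stale buffer
 * still parked on the AIL's list therefore returns true without ever
 * joining the caller's list.
 */
static bool delwri_queue(struct buf *bp, int list)
{
    if (bp->flags & DELWRI_Q)
        return false;                  /* already queued somewhere */
    bp->flags |= DELWRI_Q;
    if (bp->owner == LIST_NONE)        /* list_empty(&bp->b_list) */
        bp->owner = list;
    return true;
}

/* The AIL eventually submits its list and drops the buffer. */
static void ail_drop(struct buf *bp)
{
    bp->owner = LIST_NONE;             /* xfs_buf_list_del() + wakeup */
}

/*
 * Mirrors xfs_buf_delwri_queue_here(): drain the stale list membership
 * first, then queue.  The kernel sleeps in wait_var_event() until the
 * other list owner drops the buffer; the toy forces the drop inline.
 */
static void delwri_queue_here(struct buf *bp, int list)
{
    while (bp->owner != LIST_NONE)
        ail_drop(bp);                  /* stand-in for waiting */
    bp->flags |= DONE;                 /* uptodate; don't reread it */
    delwri_queue(bp, list);
}
```

Note how plain delwri_queue() claims success while the stale buffer quietly stays on the old list; queue_here closes exactly that gap.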
* Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load 2023-11-24 23:48 ` [PATCH 1/4] xfs: force all buffers to be written during btree bulk load Darrick J. Wong @ 2023-11-25 5:49 ` Christoph Hellwig 2023-11-28 1:50 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:49 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs The code makes me feel we're fixing the wrong thing, but I have to admit I don't fully understand it. Let me go step by step. > While stress-testing online repair of btrees, I noticed periodic > assertion failures from the buffer cache about buffer readers > encountering buffers with DELWRI_Q set, even though the btree bulk load > had already committed and the buffer itself wasn't on any delwri list. What assert do these buffer readers hit? The two I can spot that are in the read path in the broader sense are: 1) in xfs_buf_find_lock for stale buffers. 2) in __xfs_buf_submit just before I/O submission > I traced this to a misunderstanding of how the delwri lists work, > particularly with regards to the AIL's buffer list. If a buffer is > logged and committed, the buffer can end up on that AIL buffer list. If > btree repairs are run twice in rapid succession, it's possible that the > first repair will invalidate the buffer and free it before the next time > the AIL wakes up. This clears DELWRI_Q from the buffer state. Where "this clears" is xfs_buf_stale called from xfs_btree_free_block via xfs_trans_binval? > If the second repair allocates the same block, it will then recycle the > buffer to start writing the new btree block. If my above theory is correct: how do we end up reusing a stale buffer? If not, what am I misunderstanding above? > Meanwhile, if the AIL > wakes up and walks the buffer list, it will ignore the buffer because it > can't lock it, and go back to sleep.
And I think this is where the trouble starts - we have a buffer that is left on some delwri list, but with the _XBF_DELWRI_Q flag cleared, it is stale and we then reuse it. I think we need to kick it off the delwri list not just for btree staging, but in general. > > When the second repair calls delwri_queue to put the buffer on the > list of buffers to write before committing the new btree, it will set > DELWRI_Q again, but since the buffer hasn't been removed from the AIL's > buffer list, it won't add it to the bulkload buffer's list. > > This is incorrect, because the bulkload caller relies on delwri_submit > to ensure that all the buffers have been sent to disk /before/ > committing the new btree root pointer. This ordering requirement is > required for data consistency. > > Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally > drop it, so the next thread to walk through the btree will trip over a > debug assertion on that flag. Where does it finally drop it? > To fix this, create a new function that waits for the buffer to be > removed from any other delwri lists before adding the buffer to the > caller's delwri list. By waiting for the buffer to clear both the > delwri list and any potential delwri wait list, we can be sure that > repair will initiate writes of all buffers and report all write errors > back to userspace instead of committing the new structure. If my understanding above is correct this just papers over the bug that a buffer that is marked stale and can be reused for something else is left on a delwri list. I haven't entirely thought through all the consequences, but here is what I'd try: - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q call your new wait code instead of asserting (probably only for the !trylock case) - make sure we don't leak DELWRI_Q ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load 2023-11-25 5:49 ` Christoph Hellwig @ 2023-11-28 1:50 ` Darrick J. Wong 2023-11-28 7:13 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 1:50 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Fri, Nov 24, 2023 at 09:49:28PM -0800, Christoph Hellwig wrote: > The code makes me feel we're fixing the wrong thing, but I have to > admin I don't fully understand it. Let me go step by step. > > > While stress-testing online repair of btrees, I noticed periodic > > assertion failures from the buffer cache about buffer readers > > encountering buffers with DELWRI_Q set, even though the btree bulk load > > had already committed and the buffer itself wasn't on any delwri list. > > What assert do these buffer reader hit? The two I can spot that are in > the read path in the broader sense are: > > 1) in xfs_buf_find_lock for stale buffers. > 2) in __xfs_buf_submit just before I/O submission The second assert:

AIL:                    Repair0:                Repair1:

pin buffer X
delwri_queue:
  set DELWRI_Q
  add to delwri list

                        stale buf X:
                          clear DELWRI_Q
                          does not clear b_list
                        free space X
                        commit

delwri_submit   # oops

Though there's more to this. > > I traced this to a misunderstanding of how the delwri lists work, > > particularly with regards to the AIL's buffer list. If a buffer is > > logged and committed, the buffer can end up on that AIL buffer list. If > > btree repairs are run twice in rapid succession, it's possible that the > > first repair will invalidate the buffer and free it before the next time > > the AIL wakes up. This clears DELWRI_Q from the buffer state. > > Where "this clears" is xfs_buf_stale called from xfs_btree_free_block > via xfs_trans_binval? Yes. I'll rework the last sentence to read: "Marking the buffer stale clears DELWRI_Q from the buffer state without removing the buffer from its delwri list."
> > If the second repair allocates the same block, it will then recycle the > > buffer to start writing the new btree block. > > If my above theory is correct: how do we end up reusing a stale buffer? > If not, what am I misunderstanding above? If we free a metadata buffer, we'll mark it stale, update the bnobt, and add an entry to the extent busy list. If a later repair finds:

1. The same free space in the bnobt;
2. The free space exactly coincides with one extent busy list entry;
3. The entry isn't in the middle of discarding the block;

Then the allocator will remove the extent busy list entry and let us have the space. At that point we could have a stale buffer that's also on one of the AIL's lists:

AIL:                    Repair0:                Repair1:

pin buffer X
delwri_queue:
  set DELWRI_Q
  add to delwri list

                        stale buf X:
                          clear DELWRI_Q
                          does not clear b_list
                        free space X
                        commit

                                                find free space X
                                                get buffer
                                                rewrite buffer
                                                delwri_queue:
                                                  set DELWRI_Q
                                                  already on a list, do not add
                                                commit
                                                BAD: committed tree root before
                                                     all blocks written
                                                delwri_submit   # too late now

I could demonstrate this by injecting dmerror while delwri(ting) the new btree blocks, and observing that some of the EIOs would not trigger the rollback but instead would trip IO errors in the AIL. > > wakes up and walks the buffer list, it will ignore the buffer because it > > can't lock it, and go back to sleep. > > And I think this is where the trouble starts - we have a buffer that > is left on some delwri list, but with the _XBF_DELWRI_Q flag cleared, > it is stale and we then reuse it. I don't think we just need to kick > it off the delwri list just for btree staging, but in general. ...but how to do that? We don't know whose delwri list the buffer's on, let alone how to lock the list so that we can remove the buffer from that list. (Oh, you have some suggestions below.)
> > When the second repair calls delwri_queue to put the buffer on the > > list of buffers to write before committing the new btree, it will set > > DELWRI_Q again, but since the buffer hasn't been removed from the AIL's > > buffer list, it won't add it to the bulkload buffer's list. > > > > This is incorrect, because the bulkload caller relies on delwri_submit > > to ensure that all the buffers have been sent to disk /before/ > > committing the new btree root pointer. This ordering requirement is > > required for data consistency. > > > > Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally > > drop it, so the next thread to walk through the btree will trip over a > > debug assertion on that flag. > > Where do it finally drop it? TBH, it's been so long that I can't remember anymore. :( > > To fix this, create a new function that waits for the buffer to be > > removed from any other delwri lists before adding the buffer to the > > caller's delwri list. By waiting for the buffer to clear both the > > delwri list and any potential delwri wait list, we can be sure that > > repair will initiate writes of all buffers and report all write errors > > back to userspace instead of committing the new structure. > > If my understanding above is correct this just papers over the bug > that a buffer that is marked stale and can be reused for something > else is left on a delwri list. I'm not sure it's a bug for *most* code that encounters it. Regular transactions don't directly use the delwri lists, so it will never see that at all. If the buffer gets rewritten and logged, then it'll just end up on the AIL's delwri buffer list again. The other delwri_submit users don't seem to care if the buffer gets written directly by delwri_submit or by the AIL. In the first case, the caller will get the EIO and force a shutdown; in the second case, the AIL will shut down the log. 
Weirdly, xfs_bwrite clears DELWRI_Q but doesn't necessarily shut down the fs if the write fails. Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does it mean "b_list is active"? Or does it merely mean "another thread will write this buffer via b_list if nobody gets there first"? I think it's the second, since it's easy enough to list_empty(). It's only repair that requires this new behavior that all new buffers have to be written through the delwri list, and only because it actually /can/ cancel the transaction by finishing the autoreap EFIs. > I've entirely thought about all the > consequence, but here is what I'd try: > > - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q > call your new wait code instead of asserting (probably only > for the !trylock case) > - make sure we don't leak DELWRI_Q Way back when I first discovered this, my first impulse was to make xfs_buf_stale wait for DELWRI_Q to clear. That IIRC didn't work because a transaction could hold an inode that the AIL will need to lock. I think xfs_buf_find_lock would have the same problem. Seeing as repair is the only user with the requirement that it alone can issue writes for the delwri list buffers, that's why I went with what's in this patch. Thank you for persevering so far! :D --D ^ permalink raw reply [flat|nested] 156+ messages in thread
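The three extent-busy conditions Darrick enumerates earlier in the thread can be condensed into a predicate. This is a hedged sketch with made-up field and function names; the real xfs_extent_busy code also deals with partial overlaps, trimming, and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy extent busy entry (the real one is struct xfs_extent_busy). */
struct busy_extent {
    uint64_t bno;        /* first block of the freed extent */
    uint64_t len;
    bool     discarding; /* a discard is in flight for this extent */
};

/*
 * Sketch of the gate: a later allocation may take busy space only when
 * the free-space record coincides exactly with one busy entry and no
 * discard is in flight.
 */
static bool can_reuse_busy(const struct busy_extent *be,
                           uint64_t want_bno, uint64_t want_len)
{
    if (be->discarding)
        return false;
    return be->bno == want_bno && be->len == want_len;
}
```

Only when this gate opens can a second repair get back the very blocks -- and hence the very stale buffer -- that the first repair freed.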
* Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load 2023-11-28 1:50 ` Darrick J. Wong @ 2023-11-28 7:13 ` Christoph Hellwig 2023-11-28 15:18 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 7:13 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs > > If my understanding above is correct this just papers over the bug > > that a buffer that is marked stale and can be reused for something > > else is left on a delwri list. > > I'm not sure it's a bug for *most* code that encounters it. Regular > transactions don't directly use the delwri lists, so it will never see > that at all. If the buffer gets rewritten and logged, then it'll just > end up on the AIL's delwri buffer list again. Where "regular transactions" means data I/O, yes. All normal metadata I/O eventually ends up on a delwri list. And we want it on that delwri list after actually updating the contents. > Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does > it mean "b_list is active"? Or does it merely mean "another thread will > write this buffer via b_list if nobody gets there first"? I think it's > the second, since it's easy enough to list_empty(). I think it is misnamed and not clearly defined, and we should probably fix that. Note that at least the current list_empty() usage isn't quite correct either without a lock held by the delwri list owner. We at least need a list_empty_careful for a racy but still safe check. Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely. Except for various asserts we mostly use it in reverse polarity, that is to cancel delwri writeout for stale buffers. How about just skipping XBF_STALE buffers on the delwri list directly and not bother with DELWRI_Q? With that we can then support multiple delwri lists that don't get into each other's way using say an allocating xarray instead of the invasive list and avoid this whole mess.
Let me try to prototype this.. > It's only repair that requires this new behavior that all new buffers > have to be written through the delwri list, and only because it actually > /can/ cancel the transaction by finishing the autoreap EFIs. Right now yes. But I think the delwri behavior is a land mine, and this just happens to be the first user to trigger it. Btw, while looking through the DELWRI_Q usage I noticed xfs_qm_flush_one, which seems to deal with what is at least a somewhat related issue based on the comments there. > Way back when I first discovered this, my first impulse was to make > xfs_buf_stale wait for DELWRI_Q to clear. That IIRC didn't work because > a transaction could hold an inode that the AIL will need to lock. I > think xfs_buf_find_lock would have the same problem. Yes, that makes sense. Would be great to document such details in the commit message.. ^ permalink raw reply [flat|nested] 156+ messages in thread
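Christoph's "skip stale buffers at submit time" idea might look like this in miniature -- a userspace sketch of the concept only, not actual kernel code and not his prototype:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define XBF_STALE (1u << 2)

struct buf {
    unsigned int flags;
    bool         submitted;
};

/*
 * Toy model of the idea: instead of tracking a DELWRI_Q bit that
 * xfs_buf_stale() must remember to clear, the submit path itself drops
 * stale buffers from the list.  Returns the number of buffers written.
 */
static int delwri_submit_skip_stale(struct buf *bufs, size_t n)
{
    int submitted = 0;

    for (size_t i = 0; i < n; i++) {
        if (bufs[i].flags & XBF_STALE)
            continue;                  /* freed; never write it back */
        bufs[i].submitted = true;
        submitted++;
    }
    return submitted;
}
```

With the flag gone, staling a buffer needs no bookkeeping on whatever list the buffer happens to sit on, which is what makes multiple concurrent delwri lists plausible.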
* Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load 2023-11-28 7:13 ` Christoph Hellwig @ 2023-11-28 15:18 ` Christoph Hellwig 2023-11-28 17:07 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 15:18 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Mon, Nov 27, 2023 at 11:13:40PM -0800, Christoph Hellwig wrote: > > Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does > > it mean "b_list is active"? Or does it merely mean "another thread will > > write this buffer via b_list if nobody gets there first"? I think it's > > the second, since it's easy enough to list_empty(). > > I think it is misnamed and not clearly defined, and we should probably > fix that. Note that least the current list_empty() usage isn't quite > correct either without a lock held by the delwri list owner. We at > least need a list_empty_careful for a racey but still save check. > > Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely. > Except for various asserts we mostly use it in reverse polarity, that is > to cancel delwri writeout for stale buffers. How about just skipping > XBF_STALE buffers on the delwri list directly and not bother with > DELWRI_Q? With that we can then support multiple delwri lists that > don't get into each others way using say an allocating xarray instead > of the invase list and avoid this whole mess. > > Let me try to prototype this.. Ok, I spent half the day prototyping this and it passes basic sanity checks: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/xfs-multiple-delwri-lists My conclusions from that are: 1) it works 2) I think it is the right thing to do 3) it needs a lot more work 4) we can't block the online repair work on it so I guess we'll need to go with the approach in this patch for now, maybe with a better commit log, and I'll look into finishing this work some time in the future.
^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load 2023-11-28 15:18 ` Christoph Hellwig @ 2023-11-28 17:07 ` Darrick J. Wong 2023-11-30 4:33 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 17:07 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 07:18:22AM -0800, Christoph Hellwig wrote: > On Mon, Nov 27, 2023 at 11:13:40PM -0800, Christoph Hellwig wrote: > > > Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does > > > it mean "b_list is active"? Or does it merely mean "another thread will > > > write this buffer via b_list if nobody gets there first"? I think it's > > > the second, since it's easy enough to list_empty(). > > > > I think it is misnamed and not clearly defined, and we should probably > > fix that. Note that least the current list_empty() usage isn't quite > > correct either without a lock held by the delwri list owner. We at > > least need a list_empty_careful for a racey but still save check. > > > > Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely. > > Except for various asserts we mostly use it in reverse polarity, that is > > to cancel delwri writeout for stale buffers. How about just skipping > > XBF_STALE buffers on the delwri list directly and not bother with > > DELWRI_Q? With that we can then support multiple delwri lists that > > don't get into each others way using say an allocating xarray instead > > of the invase list and avoid this whole mess. > > > > Let me try to prototype this.. 
> > Ok, I spent half the day prototyping this and it passes basic sanity > checks: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/xfs-multiple-delwri-lists > > My conclusions from that is: > > 1) it works > 2) I think it is the right to do > 3) it needs a lot more work > 4) we can't block the online repair work on it > > so I guess we'll need to go with the approach in this patch for now, > maybe with a better commit log, and I'll look into finishing this work > some time in the future. <nod> I think an xarray version of this would be less clunky than xfs_delwri_buf with three pointers. Also note Dave's comments on the v25 posting of this series: https://lore.kernel.org/linux-xfs/ZJJa7Cnni0mb%2F9sN@dread.disaster.area/ --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/4] xfs: force all buffers to be written during btree bulk load 2023-11-28 17:07 ` Darrick J. Wong @ 2023-11-30 4:33 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 4:33 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Nov 28, 2023 at 09:07:34AM -0800, Darrick J. Wong wrote: > > so I guess we'll need to go with the approach in this patch for now, > > maybe with a better commit log, and I'll look into finishing this work > > some time in the future. > > <nod> I think an xarray version of this would be less clunky than > xfs_delwri_buf with three pointers. It would, but xarray indices are unsigned longs, so we couldn't simply use the block number for it. So unless we decided to give up on building XFS on 32-bit kernels (which would be nice for other reasons) it would need something more complicated. ^ permalink raw reply [flat|nested] 156+ messages in thread
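To make the 32-bit concern concrete, here is one purely hypothetical way to split a 64-bit block number across two unsigned-long keys, e.g. to drive a two-level lookup structure; the thread itself does not settle on any scheme, so treat this as illustration of the arithmetic only:

```c
#include <assert.h>
#include <stdint.h>

/*
 * An xarray index is an unsigned long, which is 32 bits on a 32-bit
 * kernel, so a 64-bit disk block number cannot be used directly.  One
 * conceivable workaround: the high half selects an inner structure and
 * the low half indexes within it.
 */
static inline unsigned long key_hi(uint64_t bno)
{
    return (unsigned long)(bno >> 32);
}

static inline unsigned long key_lo(uint64_t bno)
{
    return (unsigned long)(bno & 0xffffffffULL);
}
```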
* [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors 2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong 2023-11-24 23:48 ` [PATCH 1/4] xfs: force all buffers to be written during btree bulk load Darrick J. Wong @ 2023-11-24 23:49 ` Darrick J. Wong 2023-11-25 5:50 ` Christoph Hellwig 2023-11-24 23:49 ` [PATCH 3/4] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong 2023-11-24 23:49 ` [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong 3 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:49 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add some debug knobs so that we can control the leaf and node block slack when rebuilding btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/newbt.c | 10 ++++++--- fs/xfs/xfs_globals.c | 12 +++++++++++ fs/xfs/xfs_sysctl.h | 2 ++ fs/xfs/xfs_sysfs.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 75 insertions(+), 3 deletions(-) diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c index 2932fd317ab23..2c388c647d37f 100644 --- a/fs/xfs/scrub/newbt.c +++ b/fs/xfs/scrub/newbt.c @@ -47,9 +47,13 @@ xrep_newbt_estimate_slack( uint64_t free; uint64_t sz; - /* Let the btree code compute the default slack values. */ - bload->leaf_slack = -1; - bload->node_slack = -1; + /* + * The xfs_globals values are set to -1 (i.e. take the bload defaults) + * unless someone has set them otherwise, so we just pull the values + * here. 
+ */ + bload->leaf_slack = xfs_globals.bload_leaf_slack; + bload->node_slack = xfs_globals.bload_node_slack; if (sc->ops->type == ST_PERAG) { free = sc->sa.pag->pagf_freeblks; diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c index 9edc1f2bc9399..f18fec0adf666 100644 --- a/fs/xfs/xfs_globals.c +++ b/fs/xfs/xfs_globals.c @@ -44,4 +44,16 @@ struct xfs_globals xfs_globals = { .pwork_threads = -1, /* automatic thread detection */ .larp = false, /* log attribute replay */ #endif + + /* + * Leave this many record slots empty when bulk loading btrees. By + * default we load new btree leaf blocks 75% full. + */ + .bload_leaf_slack = -1, + + /* + * Leave this many key/ptr slots empty when bulk loading btrees. By + * default we load new btree node blocks 75% full. + */ + .bload_node_slack = -1, }; diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h index f78ad6b10ea58..276696a07040c 100644 --- a/fs/xfs/xfs_sysctl.h +++ b/fs/xfs/xfs_sysctl.h @@ -85,6 +85,8 @@ struct xfs_globals { int pwork_threads; /* parallel workqueue threads */ bool larp; /* log attribute replay */ #endif + int bload_leaf_slack; /* btree bulk load leaf slack */ + int bload_node_slack; /* btree bulk load node slack */ int log_recovery_delay; /* log recovery delay (secs) */ int mount_delay; /* mount setup delay (secs) */ bool bug_on_assert; /* BUG() the kernel on assert failure */ diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c index a3c6b15487237..4eaa0507ec287 100644 --- a/fs/xfs/xfs_sysfs.c +++ b/fs/xfs/xfs_sysfs.c @@ -253,6 +253,58 @@ larp_show( XFS_SYSFS_ATTR_RW(larp); #endif /* DEBUG */ +STATIC ssize_t +bload_leaf_slack_store( + struct kobject *kobject, + const char *buf, + size_t count) +{ + int ret; + int val; + + ret = kstrtoint(buf, 0, &val); + if (ret) + return ret; + + xfs_globals.bload_leaf_slack = val; + return count; +} + +STATIC ssize_t +bload_leaf_slack_show( + struct kobject *kobject, + char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.bload_leaf_slack); +} 
+XFS_SYSFS_ATTR_RW(bload_leaf_slack); + +STATIC ssize_t +bload_node_slack_store( + struct kobject *kobject, + const char *buf, + size_t count) +{ + int ret; + int val; + + ret = kstrtoint(buf, 0, &val); + if (ret) + return ret; + + xfs_globals.bload_node_slack = val; + return count; +} + +STATIC ssize_t +bload_node_slack_show( + struct kobject *kobject, + char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.bload_node_slack); +} +XFS_SYSFS_ATTR_RW(bload_node_slack); + static struct attribute *xfs_dbg_attrs[] = { ATTR_LIST(bug_on_assert), ATTR_LIST(log_recovery_delay), @@ -262,6 +314,8 @@ static struct attribute *xfs_dbg_attrs[] = { ATTR_LIST(pwork_threads), ATTR_LIST(larp), #endif + ATTR_LIST(bload_leaf_slack), + ATTR_LIST(bload_node_slack), NULL, }; ATTRIBUTE_GROUPS(xfs_dbg); ^ permalink raw reply related [flat|nested] 156+ messages in thread
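For reference, the way a -1 knob value falls back to the loader's default can be sketched like this. It is modeled on the slack-defaulting logic in xfs_btree_bload (the 75%-full default mentioned in the xfs_globals.c comments above); the function name resolve_slack is invented for the sketch:

```c
#include <assert.h>

/*
 * Sketch: a negative knob asks for blocks loaded roughly halfway
 * between minrecs and maxrecs -- about 75% full when minrecs is half
 * of maxrecs -- and an explicit value is clamped so that blocks never
 * drop below minrecs records.
 */
static int resolve_slack(int knob, int maxrecs, int minrecs)
{
    int slack = knob;

    if (slack < 0)
        slack = maxrecs - ((maxrecs + minrecs) >> 1);
    if (slack > maxrecs - minrecs)
        slack = maxrecs - minrecs;
    return slack;
}
```

With the patch applied, writing a non-negative value to the new sysfs attributes overrides that default, and writing -1 restores it.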
* Re: [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors 2023-11-24 23:49 ` [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong @ 2023-11-25 5:50 ` Christoph Hellwig 2023-11-28 1:44 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:50 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:49:15PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Add some debug knobs so that we can control the leaf and node block > slack when rebuilding btrees. Maybe expand a bit on what these "debug knobs" are useful for and how they work? ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors 2023-11-25 5:50 ` Christoph Hellwig @ 2023-11-28 1:44 ` Darrick J. Wong 2023-11-28 5:42 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 1:44 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Fri, Nov 24, 2023 at 09:50:23PM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:49:15PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Add some debug knobs so that we can control the leaf and node block > > slack when rebuilding btrees. > > Maybe expand a bit on what these "debug knows" are useful for and how > they work? They're not generally useful for users, obviously. For developers, it might be useful to construct btrees of various heights by crafting a filesystem with a certain number of records and then using repair+knobs to rebuild the index with a certain shape. Practically speaking, you'd only ever do that for extreme stress testing. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors 2023-11-28 1:44 ` Darrick J. Wong @ 2023-11-28 5:42 ` Christoph Hellwig 2023-11-28 17:07 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 5:42 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Mon, Nov 27, 2023 at 05:44:37PM -0800, Darrick J. Wong wrote: > They're not generally useful for users, obviously. For developers, it > might be useful to construct btrees of various heights by crafting a > filesystem with a certain number of records and then using repair+knobs > to rebuild the index with a certain shape. > > Practically speaking, you'd only ever do that for extreme stress > testing. Please add this to the commit message. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors 2023-11-28 5:42 ` Christoph Hellwig @ 2023-11-28 17:07 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 17:07 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Mon, Nov 27, 2023 at 09:42:37PM -0800, Christoph Hellwig wrote: > On Mon, Nov 27, 2023 at 05:44:37PM -0800, Darrick J. Wong wrote: > > They're not generally useful for users, obviously. For developers, it > > might be useful to construct btrees of various heights by crafting a > > filesystem with a certain number of records and then using repair+knobs > > to rebuild the index with a certain shape. > > > > Practically speaking, you'd only ever do that for extreme stress > > testing. > > Please add this to the commit message. Done. --D > > ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 3/4] xfs: move btree bulkload record initialization to ->get_record implementations 2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong 2023-11-24 23:48 ` [PATCH 1/4] xfs: force all buffers to be written during btree bulk load Darrick J. Wong 2023-11-24 23:49 ` [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong @ 2023-11-24 23:49 ` Darrick J. Wong 2023-11-25 5:51 ` Christoph Hellwig 2023-11-24 23:49 ` [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong 3 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:49 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> When we're performing a bulk load of a btree, move the code that actually stores the btree record in the new btree block out of the generic code and into the individual ->get_record implementations. This is preparation for being able to store multiple records with a single indirect call. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_btree_staging.c | 17 +++++++---------- fs/xfs/libxfs/xfs_btree_staging.h | 15 ++++++++++----- 2 files changed, 17 insertions(+), 15 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c index 29e3f8ccb1852..369965cacc8c5 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.c +++ b/fs/xfs/libxfs/xfs_btree_staging.c @@ -434,22 +434,19 @@ STATIC int xfs_btree_bload_leaf( struct xfs_btree_cur *cur, unsigned int recs_this_block, - xfs_btree_bload_get_record_fn get_record, + xfs_btree_bload_get_records_fn get_records, struct xfs_btree_block *block, void *priv) { - unsigned int j; + unsigned int j = 1; int ret; /* Fill the leaf block with records. 
*/ - for (j = 1; j <= recs_this_block; j++) { - union xfs_btree_rec *block_rec; - - ret = get_record(cur, priv); - if (ret) + while (j <= recs_this_block) { + ret = get_records(cur, j, block, recs_this_block - j + 1, priv); + if (ret < 0) return ret; - block_rec = xfs_btree_rec_addr(cur, j, block); - cur->bc_ops->init_rec_from_cur(cur, block_rec); + j += ret; } return 0; @@ -787,7 +784,7 @@ xfs_btree_bload( trace_xfs_btree_bload_block(cur, level, i, blocks, &ptr, nr_this_block); - ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_record, + ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_records, block, priv); if (ret) goto out; diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h index d6dea3f0088c6..82a3a8ef0f125 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.h +++ b/fs/xfs/libxfs/xfs_btree_staging.h @@ -50,7 +50,9 @@ void xfs_btree_commit_ifakeroot(struct xfs_btree_cur *cur, struct xfs_trans *tp, int whichfork, const struct xfs_btree_ops *ops); /* Bulk loading of staged btrees. */ -typedef int (*xfs_btree_bload_get_record_fn)(struct xfs_btree_cur *cur, void *priv); +typedef int (*xfs_btree_bload_get_records_fn)(struct xfs_btree_cur *cur, + unsigned int idx, struct xfs_btree_block *block, + unsigned int nr_wanted, void *priv); typedef int (*xfs_btree_bload_claim_block_fn)(struct xfs_btree_cur *cur, union xfs_btree_ptr *ptr, void *priv); typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur, @@ -58,11 +60,14 @@ typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur, struct xfs_btree_bload { /* - * This function will be called nr_records times to load records into - * the btree. The function does this by setting the cursor's bc_rec - * field in in-core format. Records must be returned in sort order. + * This function will be called to load @nr_wanted records into the + * btree. 
The implementation does this by setting the cursor's bc_rec + * field in in-core format and using init_rec_from_cur to set the + * records in the btree block. Records must be returned in sort order. + * The function must return the number of records loaded or the usual + * negative errno. */ - xfs_btree_bload_get_record_fn get_record; + xfs_btree_bload_get_records_fn get_records; /* * This function will be called nr_blocks times to obtain a pointer ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 3/4] xfs: move btree bulkload record initialization to ->get_record implementations 2023-11-24 23:49 ` [PATCH 3/4] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong @ 2023-11-25 5:51 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:51 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree 2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:49 ` [PATCH 3/4] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong @ 2023-11-24 23:49 ` Darrick J. Wong 2023-11-25 5:53 ` Christoph Hellwig 3 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:49 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Constrain the number of dirty buffers that are locked by the btree staging code at any given time by establishing a threshold at which we put them all on the delwri queue and push them to disk. This limits memory consumption while writing out new btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_btree.c | 2 + fs/xfs/libxfs/xfs_btree.h | 3 ++ fs/xfs/libxfs/xfs_btree_staging.c | 50 +++++++++++++++++++++++++++++-------- fs/xfs/libxfs/xfs_btree_staging.h | 10 +++++++ fs/xfs/scrub/newbt.c | 1 + 5 files changed, 54 insertions(+), 12 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index 6a6503ab0cd76..c100e92140be1 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -1330,7 +1330,7 @@ xfs_btree_get_buf_block( * Read in the buffer at the given ptr and return the buffer and * the block pointer within the buffer. 
*/ -STATIC int +int xfs_btree_read_buf_block( struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr, diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 4d68a58be160c..e0875cec49392 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -700,6 +700,9 @@ void xfs_btree_set_ptr_null(struct xfs_btree_cur *cur, int xfs_btree_get_buf_block(struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr, struct xfs_btree_block **block, struct xfs_buf **bpp); +int xfs_btree_read_buf_block(struct xfs_btree_cur *cur, + const union xfs_btree_ptr *ptr, int flags, + struct xfs_btree_block **block, struct xfs_buf **bpp); void xfs_btree_set_sibling(struct xfs_btree_cur *cur, struct xfs_btree_block *block, const union xfs_btree_ptr *ptr, int lr); diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c index 369965cacc8c5..6fd6ea8e6fbd7 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.c +++ b/fs/xfs/libxfs/xfs_btree_staging.c @@ -333,18 +333,35 @@ xfs_btree_commit_ifakeroot( /* * Put a btree block that we're loading onto the ordered list and release it. * The btree blocks will be written to disk when bulk loading is finished. + * If we reach the dirty buffer threshold, flush them to disk before + * continuing. 
*/ -static void +static int xfs_btree_bload_drop_buf( - struct list_head *buffers_list, - struct xfs_buf **bpp) + struct xfs_btree_bload *bbl, + struct list_head *buffers_list, + struct xfs_buf **bpp) { - if (*bpp == NULL) - return; + struct xfs_buf *bp = *bpp; + int error; - xfs_buf_delwri_queue_here(*bpp, buffers_list); - xfs_buf_relse(*bpp); + if (!bp) + return 0; + + xfs_buf_delwri_queue_here(bp, buffers_list); + xfs_buf_relse(bp); *bpp = NULL; + bbl->nr_dirty++; + + if (!bbl->max_dirty || bbl->nr_dirty < bbl->max_dirty) + return 0; + + error = xfs_buf_delwri_submit(buffers_list); + if (error) + return error; + + bbl->nr_dirty = 0; + return 0; } /* @@ -416,7 +433,10 @@ xfs_btree_bload_prep_block( */ if (*blockp) xfs_btree_set_sibling(cur, *blockp, &new_ptr, XFS_BB_RIGHTSIB); - xfs_btree_bload_drop_buf(buffers_list, bpp); + + ret = xfs_btree_bload_drop_buf(bbl, buffers_list, bpp); + if (ret) + return ret; /* Initialize the new btree block. */ xfs_btree_init_block_cur(cur, new_bp, level, nr_this_block); @@ -480,7 +500,7 @@ xfs_btree_bload_node( ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr)); - ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block, + ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block, &child_bp); if (ret) return ret; @@ -759,6 +779,7 @@ xfs_btree_bload( cur->bc_nlevels = bbl->btree_height; xfs_btree_set_ptr_null(cur, &child_ptr); xfs_btree_set_ptr_null(cur, &ptr); + bbl->nr_dirty = 0; xfs_btree_bload_level_geometry(cur, bbl, level, nr_this_level, &avg_per_block, &blocks, &blocks_with_extra); @@ -797,7 +818,10 @@ xfs_btree_bload( xfs_btree_copy_ptrs(cur, &child_ptr, &ptr, 1); } total_blocks += blocks; - xfs_btree_bload_drop_buf(&buffers_list, &bp); + + ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp); + if (ret) + goto out; /* Populate the internal btree nodes. 
*/ for (level = 1; level < cur->bc_nlevels; level++) { @@ -839,7 +863,11 @@ xfs_btree_bload( xfs_btree_copy_ptrs(cur, &first_ptr, &ptr, 1); } total_blocks += blocks; - xfs_btree_bload_drop_buf(&buffers_list, &bp); + + ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp); + if (ret) + goto out; + xfs_btree_copy_ptrs(cur, &child_ptr, &first_ptr, 1); } diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h index 82a3a8ef0f125..d2eaf4fdc6032 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.h +++ b/fs/xfs/libxfs/xfs_btree_staging.h @@ -115,6 +115,16 @@ struct xfs_btree_bload { * height of the new btree. */ unsigned int btree_height; + + /* + * Flush the new btree block buffer list to disk after this many blocks + * have been formatted. Zero prohibits writing any buffers until all + * blocks have been formatted. + */ + uint16_t max_dirty; + + /* Number of dirty buffers. */ + uint16_t nr_dirty; }; int xfs_btree_bload_compute_geometry(struct xfs_btree_cur *cur, diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c index 2c388c647d37f..73e21a9e5e929 100644 --- a/fs/xfs/scrub/newbt.c +++ b/fs/xfs/scrub/newbt.c @@ -89,6 +89,7 @@ xrep_newbt_init_ag( xnr->alloc_hint = alloc_hint; xnr->resv = resv; INIT_LIST_HEAD(&xnr->resv_list); + xnr->bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */ xrep_newbt_estimate_slack(xnr); } ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree 2023-11-24 23:49 ` [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong @ 2023-11-25 5:53 ` Christoph Hellwig 2023-11-27 22:56 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:53 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs > @@ -480,7 +500,7 @@ xfs_btree_bload_node( > > ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr)); > > - ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block, > + ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block, > &child_bp); How is this (and making xfs_btree_read_buf_block outside of xfs_buf.c) related to the dirty limit? ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree 2023-11-25 5:53 ` Christoph Hellwig @ 2023-11-27 22:56 ` Darrick J. Wong 2023-11-28 20:11 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-27 22:56 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Fri, Nov 24, 2023 at 09:53:38PM -0800, Christoph Hellwig wrote: > > @@ -480,7 +500,7 @@ xfs_btree_bload_node( > > > > ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr)); > > > > - ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block, > > + ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block, > > &child_bp); > > How is this (and making xfs_btree_read_buf_block outside of xfs_buf.c) > related to the dirty limit? Oh! Looking through my notes, I wanted the /new/ btree block buffers to have the same lru reference count the old ones did. I probably should have exported xfs_btree_set_refs instead of reading whatever is on disk into a buffer, only to blow away the contents anyway. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree 2023-11-27 22:56 ` Darrick J. Wong @ 2023-11-28 20:11 ` Darrick J. Wong 2023-11-29 5:50 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 20:11 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Mon, Nov 27, 2023 at 02:56:31PM -0800, Darrick J. Wong wrote: > On Fri, Nov 24, 2023 at 09:53:38PM -0800, Christoph Hellwig wrote: > > > @@ -480,7 +500,7 @@ xfs_btree_bload_node( > > > > > > ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr)); > > > > > > - ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block, > > > + ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block, > > > &child_bp); > > > > How is this (and making xfs_btree_read_buf_block outside of xfs_buf.c) > > related to the dirty limit? > > Oh! Looking through my notes, I wanted the /new/ btree block buffers > have the same lru reference count the old ones did. I probably should > have exported xfs_btree_set_refs instead of reading whatever is on disk > into a buffer, only to blow away the contents anyway. And now that I've dug further through my notes, I've realized that there's a better reason for this unexplained _get_buf -> _read_buf transition and the setting of XBF_DONE in _delwri_queue_here. This patch introduces the behavior that we flush the delwri list to disk every 256k. Flushing the buffers releases them, which means that reclaim could free the buffer before xfs_btree_bload_node needs it again to build the next level up. If that's the case, then _get_buf will get us a !DONE buffer with zeroes instead of reading the (freshly written) buffer back in from disk. We'll then end up formatting garbage keys into the node block, which is bad. /* * Read the lower-level block in case the buffer for it has * been reclaimed. LRU refs will be set on the block, which is * desirable if the new btree commits. 
*/ ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block, &child_bp); The behavior of setting XBF_DONE in xfs_buf_delwri_queue_here is an optimization if _delwri_submit releases the buffer and it is /not/ reclaimed. In that case, xfs_btree_read_buf_block will find the buffer without the DONE flag set and reread the contents from disk, which is unnecessary. I'll split those changes into separate patches with fuller explanations of what's going on. --D > --D > ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree 2023-11-28 20:11 ` Darrick J. Wong @ 2023-11-29 5:50 ` Christoph Hellwig 2023-11-29 5:57 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 5:50 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Nov 28, 2023 at 12:11:33PM -0800, Darrick J. Wong wrote: > And now that I've dug further through my notes, I've realized that > there's a better reason for this unexplained _get_buf -> _read_buf > transition and the setting of XBF_DONE in _delwri_queue_here. > > This patch introduces the behavior that we flush the delwri list to disk > every 256k. Where "the delwri list" is the one used for writing stage btrees I think. > Flushing the buffers releases them, which means that > reclaim could free the buffer before xfs_btree_bload_node needs it again > to build the next level up. Oh, indeed. > If that's the case, then _get_buf will get > us a !DONE buffer with zeroes instead of reading the (freshly written) > buffer back in from disk. We'll then end up formatting garbage keys > into the node block, which is bad. Yeah. > /* > * Read the lower-level block in case the buffer for it has > * been reclaimed. LRU refs will be set on the block, which is > * desirable if the new btree commits. > */ > ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block, > &child_bp); > > The behavior of setting XBF_DONE in xfs_buf_delwri_queue_here is an > optimization if _delwri_submit releases the buffer and it is /not/ > reclaimed. In that case, xfs_btree_read_buf_block will find the buffer > without the DONE flag set and reread the contents from disk, which is > unnecessary. Yeah. I still find it weird to set it in the delwri_submit_here helper, but maybe that's a discussion for the other thread. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree 2023-11-29 5:50 ` Christoph Hellwig @ 2023-11-29 5:57 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-29 5:57 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 09:50:35PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 12:11:33PM -0800, Darrick J. Wong wrote: > > And now that I've dug further through my notes, I've realized that > > there's a better reason for this unexplained _get_buf -> _read_buf > > transition and the setting of XBF_DONE in _delwri_queue_here. > > > > This patch introduces the behavior that we flush the delwri list to disk > > every 256k. > > Where "the delwri list" is the one used for writing stage btrees I > think. Correct. > > Flushing the buffers releases them, which means that > > reclaim could free the buffer before xfs_btree_bload_node needs it again > > to build the next level up. > > Oh, indeed. > > > If that's the case, then _get_buf will get > > us a !DONE buffer with zeroes instead of reading the (freshly written) > > buffer back in from disk. We'll then end up formatting garbage keys > > into the node block, which is bad. > > Yeah. > > > /* > > * Read the lower-level block in case the buffer for it has > > * been reclaimed. LRU refs will be set on the block, which is > > * desirable if the new btree commits. > > */ > > ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block, > > &child_bp); > > > > The behavior of setting XBF_DONE in xfs_buf_delwri_queue_here is an > > optimization if _delwri_submit releases the buffer and it is /not/ > > reclaimed. In that case, xfs_btree_read_buf_block will find the buffer > > without the DONE flag set and reread the contents from disk, which is > > unnecessary. > > Yeah. I still find it weird to set it in the delwri_submit_here helper, > but maybe that's a discussion for the other thread. D'Oh! 
XBF_DONE exists in userspace too, because libxfs uses it. Well then, the proper place for it is at the top of xfs_btree_bload_drop_buf. I'll go change that tomorrow. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/5] xfs: online repair of AG btrees 2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong @ 2023-11-24 23:45 ` Darrick J. Wong 2023-11-24 23:50 ` [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps Darrick J. Wong ` (4 more replies) 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong ` (3 subsequent siblings) 7 siblings, 5 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:45 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs Hi all, Now that we've spent a lot of time reworking common code in online fsck, we're ready to start rebuilding the AG space btrees. This series implements repair functions for the free space, inode, and refcount btrees. Rebuilding the reverse mapping btree is much more intense and is left for a subsequent patchset. The fstests counterpart of this patchset implements stress testing of repair. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-ag-btrees --- fs/xfs/Makefile | 3 fs/xfs/libxfs/xfs_ag.h | 10 fs/xfs/libxfs/xfs_ag_resv.c | 2 fs/xfs/libxfs/xfs_alloc.c | 18 - fs/xfs/libxfs/xfs_alloc.h | 2 fs/xfs/libxfs/xfs_alloc_btree.c | 13 - fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_ialloc.c | 41 +- fs/xfs/libxfs/xfs_ialloc.h | 3 fs/xfs/libxfs/xfs_refcount.c | 18 - fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 - fs/xfs/libxfs/xfs_types.h | 7 fs/xfs/scrub/agheader_repair.c | 18 - fs/xfs/scrub/alloc.c | 16 + fs/xfs/scrub/alloc_repair.c | 914 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/bitmap.c | 404 +++++++++++++--- fs/xfs/scrub/bitmap.h | 70 ++- fs/xfs/scrub/common.c | 1 fs/xfs/scrub/common.h | 19 + fs/xfs/scrub/ialloc_repair.c | 874 ++++++++++++++++++++++++++++++++++ fs/xfs/scrub/newbt.c | 48 ++ fs/xfs/scrub/newbt.h | 6 fs/xfs/scrub/reap.c | 5 fs/xfs/scrub/refcount_repair.c | 793 +++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.c | 128 +++++ fs/xfs/scrub/repair.h | 43 ++ fs/xfs/scrub/scrub.c | 22 + fs/xfs/scrub/scrub.h | 9 fs/xfs/scrub/trace.h | 112 +++- fs/xfs/scrub/xfarray.h | 22 + fs/xfs/xfs_extent_busy.c | 13 + fs/xfs/xfs_extent_busy.h | 2 34 files changed, 3493 insertions(+), 186 deletions(-) create mode 100644 fs/xfs/scrub/alloc_repair.c create mode 100644 fs/xfs/scrub/ialloc_repair.c create mode 100644 fs/xfs/scrub/refcount_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong @ 2023-11-24 23:50 ` Darrick J. Wong 2023-11-25 5:57 ` Christoph Hellwig 2023-11-24 23:50 ` [PATCH 2/5] xfs: roll the scrub transaction after completing a repair Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:50 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a version of the xbitmap that handles 32-bit integer intervals and adapt the xfs_agblock_t bitmap to use it. This reduces the size of the interval tree nodes from 48 to 36 bytes and enables us to use a more efficient slab (:0000040 instead of :0000048) which allows us to pack more nodes into a single slab page (102 vs 85). As a side effect, the users of these bitmaps no longer have to convert between u32 and u64 quantities just to use the bitmap; and the hairy overflow checking code in xagb_bitmap_test goes away. Later in this patchset we're going to add bitmaps for xfs_agino_t, xfs_rgblock_t, and xfs_dablk_t, so the increase in code size (5622 vs. 9959 bytes) seems worth it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/agheader_repair.c | 9 - fs/xfs/scrub/bitmap.c | 404 ++++++++++++++++++++++++++++++++++------ fs/xfs/scrub/bitmap.h | 70 ++++--- fs/xfs/scrub/reap.c | 5 4 files changed, 390 insertions(+), 98 deletions(-) diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index 36c511f96b004..9d541d2087da8 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -495,12 +495,11 @@ xrep_agfl_walk_rmap( /* Strike out the blocks that are cross-linked according to the rmapbt. 
*/ STATIC int xrep_agfl_check_extent( - uint64_t start, - uint64_t len, + uint32_t agbno, + uint32_t len, void *priv) { struct xrep_agfl *ra = priv; - xfs_agblock_t agbno = start; xfs_agblock_t last_agbno = agbno + len - 1; int error; @@ -648,8 +647,8 @@ struct xrep_agfl_fill { /* Fill the AGFL with whatever blocks are in this extent. */ static int xrep_agfl_fill( - uint64_t start, - uint64_t len, + uint32_t start, + uint32_t len, void *priv) { struct xrep_agfl_fill *af = priv; diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index e0c89a9a0ca07..1b5add26c1d8d 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -16,6 +16,8 @@ #include <linux/interval_tree_generic.h> +/* u64 bitmap */ + struct xbitmap_node { struct rb_node bn_rbnode; @@ -228,6 +230,345 @@ xbitmap_disunion( return 0; } +/* How many bits are set in this bitmap? */ +uint64_t +xbitmap_hweight( + struct xbitmap *bitmap) +{ + struct xbitmap_node *bn; + uint64_t ret = 0; + + for_each_xbitmap_extent(bn, bitmap) + ret += bn->bn_last - bn->bn_start + 1; + + return ret; +} + +/* Call a function for every run of set bits in this bitmap. */ +int +xbitmap_walk( + struct xbitmap *bitmap, + xbitmap_walk_fn fn, + void *priv) +{ + struct xbitmap_node *bn; + int error = 0; + + for_each_xbitmap_extent(bn, bitmap) { + error = fn(bn->bn_start, bn->bn_last - bn->bn_start + 1, priv); + if (error) + break; + } + + return error; +} + +/* Does this bitmap have no bits set at all? */ +bool +xbitmap_empty( + struct xbitmap *bitmap) +{ + return bitmap->xb_root.rb_root.rb_node == NULL; +} + +/* Is the start of the range set or clear? And for how long? 
*/ +bool +xbitmap_test( + struct xbitmap *bitmap, + uint64_t start, + uint64_t *len) +{ + struct xbitmap_node *bn; + uint64_t last = start + *len - 1; + + bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last); + if (!bn) + return false; + if (bn->bn_start <= start) { + if (bn->bn_last < last) + *len = bn->bn_last - start + 1; + return true; + } + *len = bn->bn_start - start; + return false; +} + +/* u32 bitmap */ + +struct xbitmap32_node { + struct rb_node bn_rbnode; + + /* First set bit of this interval and subtree. */ + uint32_t bn_start; + + /* Last set bit of this interval. */ + uint32_t bn_last; + + /* Last set bit of this subtree. Do not touch this. */ + uint32_t __bn_subtree_last; +}; + +/* Define our own interval tree type with uint32_t parameters. */ + +/* + * These functions are defined by the INTERVAL_TREE_DEFINE macro, but we'll + * forward-declare them anyway for clarity. + */ +static inline void +xbitmap32_tree_insert(struct xbitmap32_node *node, struct rb_root_cached *root); + +static inline void +xbitmap32_tree_remove(struct xbitmap32_node *node, struct rb_root_cached *root); + +static inline struct xbitmap32_node * +xbitmap32_tree_iter_first(struct rb_root_cached *root, uint32_t start, + uint32_t last); + +static inline struct xbitmap32_node * +xbitmap32_tree_iter_next(struct xbitmap32_node *node, uint32_t start, + uint32_t last); + +INTERVAL_TREE_DEFINE(struct xbitmap32_node, bn_rbnode, uint32_t, + __bn_subtree_last, START, LAST, static inline, xbitmap32_tree) + +/* Iterate each interval of a bitmap. Do not change the bitmap. */ +#define for_each_xbitmap32_extent(bn, bitmap) \ + for ((bn) = rb_entry_safe(rb_first(&(bitmap)->xb_root.rb_root), \ + struct xbitmap32_node, bn_rbnode); \ + (bn) != NULL; \ + (bn) = rb_entry_safe(rb_next(&(bn)->bn_rbnode), \ + struct xbitmap32_node, bn_rbnode)) + +/* Clear a range of this bitmap. 
*/ +int +xbitmap32_clear( + struct xbitmap32 *bitmap, + uint32_t start, + uint32_t len) +{ + struct xbitmap32_node *bn; + struct xbitmap32_node *new_bn; + uint32_t last = start + len - 1; + + while ((bn = xbitmap32_tree_iter_first(&bitmap->xb_root, start, last))) { + if (bn->bn_start < start && bn->bn_last > last) { + uint32_t old_last = bn->bn_last; + + /* overlaps with the entire clearing range */ + xbitmap32_tree_remove(bn, &bitmap->xb_root); + bn->bn_last = start - 1; + xbitmap32_tree_insert(bn, &bitmap->xb_root); + + /* add an extent */ + new_bn = kmalloc(sizeof(struct xbitmap32_node), + XCHK_GFP_FLAGS); + if (!new_bn) + return -ENOMEM; + new_bn->bn_start = last + 1; + new_bn->bn_last = old_last; + xbitmap32_tree_insert(new_bn, &bitmap->xb_root); + } else if (bn->bn_start < start) { + /* overlaps with the left side of the clearing range */ + xbitmap32_tree_remove(bn, &bitmap->xb_root); + bn->bn_last = start - 1; + xbitmap32_tree_insert(bn, &bitmap->xb_root); + } else if (bn->bn_last > last) { + /* overlaps with the right side of the clearing range */ + xbitmap32_tree_remove(bn, &bitmap->xb_root); + bn->bn_start = last + 1; + xbitmap32_tree_insert(bn, &bitmap->xb_root); + break; + } else { + /* in the middle of the clearing range */ + xbitmap32_tree_remove(bn, &bitmap->xb_root); + kfree(bn); + } + } + + return 0; +} + +/* Set a range of this bitmap. */ +int +xbitmap32_set( + struct xbitmap32 *bitmap, + uint32_t start, + uint32_t len) +{ + struct xbitmap32_node *left; + struct xbitmap32_node *right; + uint32_t last = start + len - 1; + int error; + + /* Is this whole range already set? */ + left = xbitmap32_tree_iter_first(&bitmap->xb_root, start, last); + if (left && left->bn_start <= start && left->bn_last >= last) + return 0; + + /* Clear out everything in the range we want to set. */ + error = xbitmap32_clear(bitmap, start, len); + if (error) + return error; + + /* Do we have a left-adjacent extent? 
*/ + left = xbitmap32_tree_iter_first(&bitmap->xb_root, start - 1, start - 1); + ASSERT(!left || left->bn_last + 1 == start); + + /* Do we have a right-adjacent extent? */ + right = xbitmap32_tree_iter_first(&bitmap->xb_root, last + 1, last + 1); + ASSERT(!right || right->bn_start == last + 1); + + if (left && right) { + /* combine left and right adjacent extent */ + xbitmap32_tree_remove(left, &bitmap->xb_root); + xbitmap32_tree_remove(right, &bitmap->xb_root); + left->bn_last = right->bn_last; + xbitmap32_tree_insert(left, &bitmap->xb_root); + kfree(right); + } else if (left) { + /* combine with left extent */ + xbitmap32_tree_remove(left, &bitmap->xb_root); + left->bn_last = last; + xbitmap32_tree_insert(left, &bitmap->xb_root); + } else if (right) { + /* combine with right extent */ + xbitmap32_tree_remove(right, &bitmap->xb_root); + right->bn_start = start; + xbitmap32_tree_insert(right, &bitmap->xb_root); + } else { + /* add an extent */ + left = kmalloc(sizeof(struct xbitmap32_node), XCHK_GFP_FLAGS); + if (!left) + return -ENOMEM; + left->bn_start = start; + left->bn_last = last; + xbitmap32_tree_insert(left, &bitmap->xb_root); + } + + return 0; +} + +/* Free everything related to this bitmap. */ +void +xbitmap32_destroy( + struct xbitmap32 *bitmap) +{ + struct xbitmap32_node *bn; + + while ((bn = xbitmap32_tree_iter_first(&bitmap->xb_root, 0, -1U))) { + xbitmap32_tree_remove(bn, &bitmap->xb_root); + kfree(bn); + } +} + +/* Set up a per-AG block bitmap. */ +void +xbitmap32_init( + struct xbitmap32 *bitmap) +{ + bitmap->xb_root = RB_ROOT_CACHED; +} + +/* + * Remove all the blocks mentioned in @sub from the extents in @bitmap. + * + * The intent is that callers will iterate the rmapbt for all of its records + * for a given owner to generate @bitmap; and iterate all the blocks of the + * metadata structures that are not being rebuilt and have the same rmapbt + * owner to generate @sub. 
This routine subtracts all the extents + * mentioned in sub from all the extents linked in @bitmap, which leaves + * @bitmap as the list of blocks that are not accounted for, which we assume + * are the dead blocks of the old metadata structure. The blocks mentioned in + * @bitmap can be reaped. + * + * This is the logical equivalent of bitmap &= ~sub. + */ +int +xbitmap32_disunion( + struct xbitmap32 *bitmap, + struct xbitmap32 *sub) +{ + struct xbitmap32_node *bn; + int error; + + if (xbitmap32_empty(bitmap) || xbitmap32_empty(sub)) + return 0; + + for_each_xbitmap32_extent(bn, sub) { + error = xbitmap32_clear(bitmap, bn->bn_start, + bn->bn_last - bn->bn_start + 1); + if (error) + return error; + } + + return 0; +} + +/* How many bits are set in this bitmap? */ +uint32_t +xbitmap32_hweight( + struct xbitmap32 *bitmap) +{ + struct xbitmap32_node *bn; + uint32_t ret = 0; + + for_each_xbitmap32_extent(bn, bitmap) + ret += bn->bn_last - bn->bn_start + 1; + + return ret; +} + +/* Call a function for every run of set bits in this bitmap. */ +int +xbitmap32_walk( + struct xbitmap32 *bitmap, + xbitmap32_walk_fn fn, + void *priv) +{ + struct xbitmap32_node *bn; + int error = 0; + + for_each_xbitmap32_extent(bn, bitmap) { + error = fn(bn->bn_start, bn->bn_last - bn->bn_start + 1, priv); + if (error) + break; + } + + return error; +} + +/* Does this bitmap have no bits set at all? */ +bool +xbitmap32_empty( + struct xbitmap32 *bitmap) +{ + return bitmap->xb_root.rb_root.rb_node == NULL; +} + +/* Is the start of the range set or clear? And for how long? 
*/ +bool +xbitmap32_test( + struct xbitmap32 *bitmap, + uint32_t start, + uint32_t *len) +{ + struct xbitmap32_node *bn; + uint32_t last = start + *len - 1; + + bn = xbitmap32_tree_iter_first(&bitmap->xb_root, start, last); + if (!bn) + return false; + if (bn->bn_start <= start) { + if (bn->bn_last < last) + *len = bn->bn_last - start + 1; + return true; + } + *len = bn->bn_start - start; + return false; +} + +/* xfs_agblock_t bitmap */ + /* * Record all btree blocks seen while iterating all records of a btree. * @@ -316,66 +657,3 @@ xagb_bitmap_set_btcur_path( return 0; } - -/* How many bits are set in this bitmap? */ -uint64_t -xbitmap_hweight( - struct xbitmap *bitmap) -{ - struct xbitmap_node *bn; - uint64_t ret = 0; - - for_each_xbitmap_extent(bn, bitmap) - ret += bn->bn_last - bn->bn_start + 1; - - return ret; -} - -/* Call a function for every run of set bits in this bitmap. */ -int -xbitmap_walk( - struct xbitmap *bitmap, - xbitmap_walk_fn fn, - void *priv) -{ - struct xbitmap_node *bn; - int error = 0; - - for_each_xbitmap_extent(bn, bitmap) { - error = fn(bn->bn_start, bn->bn_last - bn->bn_start + 1, priv); - if (error) - break; - } - - return error; -} - -/* Does this bitmap have no bits set at all? */ -bool -xbitmap_empty( - struct xbitmap *bitmap) -{ - return bitmap->xb_root.rb_root.rb_node == NULL; -} - -/* Is the start of the range set or clear? And for how long? 
*/ -bool -xbitmap_test( - struct xbitmap *bitmap, - uint64_t start, - uint64_t *len) -{ - struct xbitmap_node *bn; - uint64_t last = start + *len - 1; - - bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last); - if (!bn) - return false; - if (bn->bn_start <= start) { - if (bn->bn_last < last) - *len = bn->bn_last - start + 1; - return true; - } - *len = bn->bn_start - start; - return false; -} diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 4fe58bad67345..9cdc41b6cb02d 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -6,6 +6,8 @@ #ifndef __XFS_SCRUB_BITMAP_H__ #define __XFS_SCRUB_BITMAP_H__ +/* u64 bitmap */ + struct xbitmap { struct rb_root_cached xb_root; }; @@ -32,72 +34,86 @@ int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn, bool xbitmap_empty(struct xbitmap *bitmap); bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len); +/* u32 bitmap */ + +struct xbitmap32 { + struct rb_root_cached xb_root; +}; + +void xbitmap32_init(struct xbitmap32 *bitmap); +void xbitmap32_destroy(struct xbitmap32 *bitmap); + +int xbitmap32_clear(struct xbitmap32 *bitmap, uint32_t start, uint32_t len); +int xbitmap32_set(struct xbitmap32 *bitmap, uint32_t start, uint32_t len); +int xbitmap32_disunion(struct xbitmap32 *bitmap, struct xbitmap32 *sub); +uint32_t xbitmap32_hweight(struct xbitmap32 *bitmap); + +/* + * Return codes for the bitmap iterator functions are 0 to continue iterating, + * and non-zero to stop iterating. Any non-zero value will be passed up to the + * iteration caller. The special value -ECANCELED can be used to stop + * iteration, because neither bitmap iterator ever generates that error code on + * its own. Callers must not modify the bitmap while walking it. 
+ */ +typedef int (*xbitmap32_walk_fn)(uint32_t start, uint32_t len, void *priv); +int xbitmap32_walk(struct xbitmap32 *bitmap, xbitmap32_walk_fn fn, + void *priv); + +bool xbitmap32_empty(struct xbitmap32 *bitmap); +bool xbitmap32_test(struct xbitmap32 *bitmap, uint32_t start, uint32_t *len); + /* Bitmaps, but for type-checked for xfs_agblock_t */ struct xagb_bitmap { - struct xbitmap agbitmap; + struct xbitmap32 agbitmap; }; static inline void xagb_bitmap_init(struct xagb_bitmap *bitmap) { - xbitmap_init(&bitmap->agbitmap); + xbitmap32_init(&bitmap->agbitmap); } static inline void xagb_bitmap_destroy(struct xagb_bitmap *bitmap) { - xbitmap_destroy(&bitmap->agbitmap); + xbitmap32_destroy(&bitmap->agbitmap); } static inline int xagb_bitmap_clear(struct xagb_bitmap *bitmap, xfs_agblock_t start, xfs_extlen_t len) { - return xbitmap_clear(&bitmap->agbitmap, start, len); + return xbitmap32_clear(&bitmap->agbitmap, start, len); } static inline int xagb_bitmap_set(struct xagb_bitmap *bitmap, xfs_agblock_t start, xfs_extlen_t len) { - return xbitmap_set(&bitmap->agbitmap, start, len); + return xbitmap32_set(&bitmap->agbitmap, start, len); } -static inline bool -xagb_bitmap_test( - struct xagb_bitmap *bitmap, - xfs_agblock_t start, - xfs_extlen_t *len) +static inline bool xagb_bitmap_test(struct xagb_bitmap *bitmap, + xfs_agblock_t start, xfs_extlen_t *len) { - uint64_t biglen = *len; - bool ret; - - ret = xbitmap_test(&bitmap->agbitmap, start, &biglen); - - if (start + biglen >= UINT_MAX) { - ASSERT(0); - biglen = UINT_MAX - start; - } - - *len = biglen; - return ret; + return xbitmap32_test(&bitmap->agbitmap, start, len); } static inline int xagb_bitmap_disunion(struct xagb_bitmap *bitmap, struct xagb_bitmap *sub) { - return xbitmap_disunion(&bitmap->agbitmap, &sub->agbitmap); + return xbitmap32_disunion(&bitmap->agbitmap, &sub->agbitmap); } static inline uint32_t xagb_bitmap_hweight(struct xagb_bitmap *bitmap) { - return xbitmap_hweight(&bitmap->agbitmap); + return 
xbitmap32_hweight(&bitmap->agbitmap); } static inline bool xagb_bitmap_empty(struct xagb_bitmap *bitmap) { - return xbitmap_empty(&bitmap->agbitmap); + return xbitmap32_empty(&bitmap->agbitmap); } static inline int xagb_bitmap_walk(struct xagb_bitmap *bitmap, - xbitmap_walk_fn fn, void *priv) + xbitmap32_walk_fn fn, void *priv) { - return xbitmap_walk(&bitmap->agbitmap, fn, priv); + return xbitmap32_walk(&bitmap->agbitmap, fn, priv); } int xagb_bitmap_set_btblocks(struct xagb_bitmap *bitmap, diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c index ee26fcb500b78..c8c8e3f9bc7a4 100644 --- a/fs/xfs/scrub/reap.c +++ b/fs/xfs/scrub/reap.c @@ -430,13 +430,12 @@ xreap_agextent_iter( */ STATIC int xreap_agmeta_extent( - uint64_t fsbno, - uint64_t len, + uint32_t agbno, + uint32_t len, void *priv) { struct xreap_state *rs = priv; struct xfs_scrub *sc = rs->sc; - xfs_agblock_t agbno = fsbno; xfs_agblock_t agbno_next = agbno + len; int error = 0; ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps 2023-11-24 23:50 ` [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps Darrick J. Wong @ 2023-11-25 5:57 ` Christoph Hellwig 2023-11-28 1:34 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 5:57 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:50:02PM -0800, Darrick J. Wong wrote: > Create a version of the xbitmap that handles 32-bit integer intervals > and adapt the xfs_agblock_t bitmap to use it. This reduces the size of > the interval tree nodes from 48 to 36 bytes and enables us to use a more > efficient slab (:0000040 instead of :0000048) which allows us to pack > more nodes into a single slab page (102 vs 85). The changes themselves look good: Reviewed-by: Christoph Hellwig <hch@lst.de> Q: should we rename the existing xbitmap to xbitmap64 for consistency? Also why are the agb_bitmap* wrappers in bitmap.h? Following our usual code organization I'd expect bitmap.[ch] to just be the library code and have users outside of that. Maybe for later.. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps 2023-11-25 5:57 ` Christoph Hellwig @ 2023-11-28 1:34 ` Darrick J. Wong 2023-11-28 5:43 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 1:34 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Fri, Nov 24, 2023 at 09:57:35PM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:50:02PM -0800, Darrick J. Wong wrote: > > Create a version of the xbitmap that handles 32-bit integer intervals > > and adapt the xfs_agblock_t bitmap to use it. This reduces the size of > > the interval tree nodes from 48 to 36 bytes and enables us to use a more > > efficient slab (:0000040 instead of :0000048) which allows us to pack > > more nodes into a single slab page (102 vs 85). > > The changes themsleves looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > Q: should we rename the existing xbitmap to xbitmap64 for consistency? Yes. Done. > Also why are the agb_bitmap* wrappers in bitmap.h? Following our > usual code organization I'd expect bitmap.[ch] to just be the > library code and have users outside of that. Maybe for later.. Those wrappers are trivial except for the enhanced typechecking, so I didn't think it was a big deal to cram them into bitmap.h. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps 2023-11-28 1:34 ` Darrick J. Wong @ 2023-11-28 5:43 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 5:43 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Mon, Nov 27, 2023 at 05:34:44PM -0800, Darrick J. Wong wrote: > > Also why are the agb_bitmap* wrappers in bitmap.h? Following our > > usual code organization I'd expect bitmap.[ch] to just be the > > library code and have users outside of that. Maybe for later.. > > Those wrappers are trivial except for the enhanced typechecking, so I > didn't think it was a big deal to cram them into bitmap.h. I find that kind of code structure a bit confusing to be honest. If you prefer it I can live with it (at least for now :)) of course. ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 2/5] xfs: roll the scrub transaction after completing a repair 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong 2023-11-24 23:50 ` [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps Darrick J. Wong @ 2023-11-24 23:50 ` Darrick J. Wong 2023-11-25 6:05 ` Christoph Hellwig 2023-11-24 23:50 ` [PATCH 3/5] xfs: repair free space btrees Darrick J. Wong ` (2 subsequent siblings) 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:50 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> When we've finished repairing an AG header, roll the scrub transaction. This ensures that any failures caused by defer ops failing are captured by the xrep_done tracepoint and that any stacktraces that occur will point to the repair code that caused it, instead of xchk_teardown. Going forward, repair functions should commit the transaction if they're going to return success. Usually the space reaping functions that run after a successful atomic commit of the new metadata will take care of that for us. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/agheader_repair.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index 9d541d2087da8..5dbf82705eab7 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -73,7 +73,7 @@ xrep_superblock( /* Write this to disk. */ xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF); xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1); - return error; + return 0; } /* AGF */ @@ -342,7 +342,7 @@ xrep_agf_commit_new( pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level); set_bit(XFS_AGSTATE_AGF_INIT, &pag->pag_opstate); - return 0; + return xrep_roll_ag_trans(sc); } /* Repair the AGF. v5 filesystems only. */ @@ -789,6 +789,9 @@ xrep_agfl( /* Dump any AGFL overflow.
*/ error = xrep_reap_agblocks(sc, &agfl_extents, &XFS_RMAP_OINFO_AG, XFS_AG_RESV_AGFL); + if (error) + goto err; + err: xagb_bitmap_destroy(&agfl_extents); return error; @@ -962,7 +965,7 @@ xrep_agi_commit_new( pag->pagi_freecount = be32_to_cpu(agi->agi_freecount); set_bit(XFS_AGSTATE_AGI_INIT, &pag->pag_opstate); - return 0; + return xrep_roll_ag_trans(sc); } /* Repair the AGI. */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 2/5] xfs: roll the scrub transaction after completing a repair 2023-11-24 23:50 ` [PATCH 2/5] xfs: roll the scrub transaction after completing a repair Darrick J. Wong @ 2023-11-25 6:05 ` Christoph Hellwig 2023-11-28 1:29 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 6:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:50:17PM -0800, Darrick J. Wong wrote: > Going forward, repair functions should commit the transaction if they're > going to return success. Usually the space reaping functions that run > after a successful atomic commit of the new metadata will take care of > that for us. Generally looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> A random comment on a pre-existing function from reading the code, and a nitpick on the patch itself below: > +++ b/fs/xfs/scrub/agheader_repair.c > @@ -73,7 +73,7 @@ xrep_superblock( > /* Write this to disk. */ > xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF); > xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1); > - return error; > + return 0; After looking through the code this is obviously fine, error must be 0 here because the last place touching it is xchk_should_terminate, which only sets the error if it returns true. But the calling conventions for xchk_should_terminate really make me scratch my head as they are so hard to reason about. I did a quick look over most callers and most of them get there with error always set to 0. So just making xchk_should_terminate return the error would seem a lot better to me - any caller with a previous error would need a second error2, but that seems better than what we have there right now. > /* Repair the AGF. v5 filesystems only. */ > @@ -789,6 +789,9 @@ xrep_agfl( > /* Dump any AGFL overflow.
*/ > error = xrep_reap_agblocks(sc, &agfl_extents, &XFS_RMAP_OINFO_AG, > XFS_AG_RESV_AGFL); > + if (error) > + goto err; > + > err: This seems rather pointless and doesn't change anything.. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/5] xfs: roll the scrub transaction after completing a repair 2023-11-25 6:05 ` Christoph Hellwig @ 2023-11-28 1:29 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 1:29 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Fri, Nov 24, 2023 at 10:05:38PM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:50:17PM -0800, Darrick J. Wong wrote: > > Going forward, repair functions should commit the transaction if they're > > going to return success. Usually the space reaping functions that run > > after a successful atomic commit of the new metadata will take care of > > that for us. > > Generally looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > A random comment on a pre-existing function from reading the code, and > a nitpick on the patch itself below: > > > +++ b/fs/xfs/scrub/agheader_repair.c > > @@ -73,7 +73,7 @@ xrep_superblock( > > /* Write this to disk. */ > > xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF); > > xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1); > > - return error; > > + return 0; > > After looking through the code this is obviously fine, error must > be 0 here because the last patch touching it is xchk_should_terminate, > which only sets the error if it returns true. <nod> > But the calling conventions for xchk_should_terminate really make me > scratch my head as they are so hard to reason about. I did quick > look over must caller and most of them get there with error always > set to 0. So just making xchk_should_terminate return the error > would seem a lot better to me - any caller with a previous error > would need a second error2, but that seems better than what we have > there right now. 
Agreed, the callsites would be a bit more obvious if they looked like: error = xchk_should_terminate(sc); if (error) break; Though I'm working on some tweaks of that function, since it was pointed out to me that cond_resched() and fatal_signal_pending() aren't entirely free. What I've been testing out the last three weeks is: unsigned long now = jiffies; if (time_after(now, sc->next_poke)) { sc->next_poke = now + (HZ / 10); cond_resched(); if (fatal_signal_pending(current)) return -EINTR; } return 0; So far I haven't seen much improvement, but the callsite change is something that I think I could promote to the end of online repair part 2. > > /* Repair the AGF. v5 filesystems only. */ > > @@ -789,6 +789,9 @@ xrep_agfl( > > /* Dump any AGFL overflow. */ > > error = xrep_reap_agblocks(sc, &agfl_extents, &XFS_RMAP_OINFO_AG, > > XFS_AG_RESV_AGFL); > > + if (error) > > + goto err; > > + > > err: > > This seems rather pointless and doesn't change anything.. Oops, lemme get rid of that dead code... --D > ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 3/5] xfs: repair free space btrees 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong 2023-11-24 23:50 ` [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps Darrick J. Wong 2023-11-24 23:50 ` [PATCH 2/5] xfs: roll the scrub transaction after completing a repair Darrick J. Wong @ 2023-11-24 23:50 ` Darrick J. Wong 2023-11-25 6:11 ` Christoph Hellwig 2023-11-28 15:10 ` Christoph Hellwig 2023-11-24 23:50 ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong 2023-11-24 23:51 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 4 siblings, 2 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:50 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Rebuild the free space btrees from the gaps in the rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_ag.h | 9 fs/xfs/libxfs/xfs_ag_resv.c | 2 fs/xfs/libxfs/xfs_alloc.c | 18 + fs/xfs/libxfs/xfs_alloc.h | 2 fs/xfs/libxfs/xfs_alloc_btree.c | 13 + fs/xfs/libxfs/xfs_types.h | 7 fs/xfs/scrub/alloc.c | 16 + fs/xfs/scrub/alloc_repair.c | 914 +++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/common.h | 19 + fs/xfs/scrub/newbt.c | 48 ++ fs/xfs/scrub/newbt.h | 6 fs/xfs/scrub/repair.c | 69 +++ fs/xfs/scrub/repair.h | 24 + fs/xfs/scrub/scrub.c | 14 - fs/xfs/scrub/scrub.h | 8 fs/xfs/scrub/trace.h | 24 + fs/xfs/scrub/xfarray.h | 22 + fs/xfs/xfs_extent_busy.c | 13 + fs/xfs/xfs_extent_busy.h | 2 20 files changed, 1212 insertions(+), 19 deletions(-) create mode 100644 fs/xfs/scrub/alloc_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 1537d66e5ab01..026591681937d 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -181,6 +181,7 @@ xfs-$(CONFIG_XFS_QUOTA) += scrub/quota.o ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) xfs-y += $(addprefix scrub/, \ agheader_repair.o \ + alloc_repair.o \ newbt.o \ 
reap.o \ repair.o \ diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 2e0aef87d633e..686f4eadd5743 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -80,6 +80,15 @@ struct xfs_perag { */ uint16_t pag_checked; uint16_t pag_sick; + +#ifdef CONFIG_XFS_ONLINE_REPAIR + /* + * Alternate btree heights so that online repair won't trip the write + * verifiers while rebuilding the AG btrees. + */ + uint8_t pagf_alt_levels[XFS_BTNUM_AGF]; +#endif + spinlock_t pag_state_lock; spinlock_t pagb_lock; /* lock for pagb_tree */ diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c index 7fd1fea95552f..da1057bd0e606 100644 --- a/fs/xfs/libxfs/xfs_ag_resv.c +++ b/fs/xfs/libxfs/xfs_ag_resv.c @@ -411,6 +411,8 @@ xfs_ag_resv_free_extent( fallthrough; case XFS_AG_RESV_NONE: xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len); + fallthrough; + case XFS_AG_RESV_IGNORE: return; } diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 4940f9377f21a..f33914455bf83 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -243,14 +243,11 @@ xfs_alloc_btrec_to_irec( irec->ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount); } -/* Simple checks for free space records. */ -xfs_failaddr_t -xfs_alloc_check_irec( - struct xfs_btree_cur *cur, +inline xfs_failaddr_t +xfs_alloc_check_perag_irec( + struct xfs_perag *pag, const struct xfs_alloc_rec_incore *irec) { - struct xfs_perag *pag = cur->bc_ag.pag; - if (irec->ar_blockcount == 0) return __this_address; @@ -261,6 +258,15 @@ xfs_alloc_check_irec( return NULL; } +/* Simple checks for free space records. 
*/ +xfs_failaddr_t +xfs_alloc_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_alloc_rec_incore *irec) +{ + return xfs_alloc_check_perag_irec(cur->bc_ag.pag, irec); +} + static inline int xfs_alloc_complain_bad_rec( struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 851cafbd64494..68ffe058a420f 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -185,6 +185,8 @@ xfs_alloc_get_rec( union xfs_btree_rec; void xfs_alloc_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_alloc_rec_incore *irec); +xfs_failaddr_t xfs_alloc_check_perag_irec(struct xfs_perag *pag, + const struct xfs_alloc_rec_incore *irec); xfs_failaddr_t xfs_alloc_check_irec(struct xfs_btree_cur *cur, const struct xfs_alloc_rec_incore *irec); diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c index c65228efed4ae..90c7cb8c54ab0 100644 --- a/fs/xfs/libxfs/xfs_alloc_btree.c +++ b/fs/xfs/libxfs/xfs_alloc_btree.c @@ -323,7 +323,18 @@ xfs_allocbt_verify( if (bp->b_ops->magic[0] == cpu_to_be32(XFS_ABTC_MAGIC)) btnum = XFS_BTNUM_CNTi; if (pag && xfs_perag_initialised_agf(pag)) { - if (level >= pag->pagf_levels[btnum]) + unsigned int maxlevel = pag->pagf_levels[btnum]; + +#ifdef CONFIG_XFS_ONLINE_REPAIR + /* + * Online repair could be rewriting the free space btrees, so + * we'll validate against the larger of either tree while this + * is going on. + */ + maxlevel = max_t(unsigned int, maxlevel, + pag->pagf_alt_levels[btnum]); +#endif + if (level >= maxlevel) return __this_address; } else if (level >= mp->m_alloc_maxlevels) return __this_address; diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h index 533200c4ccc25..035bf703d719a 100644 --- a/fs/xfs/libxfs/xfs_types.h +++ b/fs/xfs/libxfs/xfs_types.h @@ -208,6 +208,13 @@ enum xfs_ag_resv_type { XFS_AG_RESV_AGFL, XFS_AG_RESV_METADATA, XFS_AG_RESV_RMAPBT, + + /* + * Don't increase fdblocks when freeing extent. 
This is a pony for + * the bnobt repair functions to re-free the free space without + * altering fdblocks. If you think you need this you're wrong. + */ + XFS_AG_RESV_IGNORE, }; /* Results of scanning a btree keyspace to check occupancy. */ diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 279af72b1671d..964089e24ca6d 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -9,13 +9,16 @@ #include "xfs_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" #include "xfs_btree.h" #include "xfs_alloc.h" #include "xfs_rmap.h" +#include "xfs_ag.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" -#include "xfs_ag.h" +#include "scrub/repair.h" /* * Set us up to scrub free space btrees. @@ -24,10 +27,19 @@ int xchk_setup_ag_allocbt( struct xfs_scrub *sc) { + int error; + if (xchk_need_intent_drain(sc)) xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); - return xchk_setup_ag_btree(sc, false); + error = xchk_setup_ag_btree(sc, false); + if (error) + return error; + + if (xchk_could_repair(sc)) + return xrep_setup_ag_allocbt(sc); + + return 0; } /* Free space btree scrubber. */ diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c new file mode 100644 index 0000000000000..917307e08bc16 --- /dev/null +++ b/fs/xfs/scrub/alloc_repair.c @@ -0,0 +1,914 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_alloc.h" +#include "xfs_alloc_btree.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_inode.h" +#include "xfs_refcount.h" +#include "xfs_extent_busy.h" +#include "xfs_health.h" +#include "xfs_bmap.h" +#include "xfs_ialloc.h" +#include "xfs_ag.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/newbt.h" +#include "scrub/reap.h" + +/* + * Free Space Btree Repair + * ======================= + * + * The reverse mappings are supposed to record all space usage for the entire + * AG. Therefore, we can recalculate the free extents in an AG by looking for + * gaps in the physical extents recorded in the rmapbt. On a reflink + * filesystem this is a little more tricky in that we have to be aware that + * the rmap records are allowed to overlap. + * + * We derive which blocks belonged to the old bnobt/cntbt by recording all the + * OWN_AG extents and subtracting out the blocks owned by all other OWN_AG + * metadata: the rmapbt blocks visited while iterating the reverse mappings + * and the AGFL blocks. + * + * Once we have both of those pieces, we can reconstruct the bnobt and cntbt + * by blowing out the free block state and freeing all the extents that we + * found. This adds the requirement that we can't have any busy extents in + * the AG because the busy code cannot handle duplicate records. 
+ * + * Note that we can only rebuild both free space btrees at the same time + * because the regular extent freeing infrastructure loads both btrees at the + * same time. + * + * We use the prefix 'xrep_abt' here because we regenerate both free space + * allocation btrees at the same time. + */ + +struct xrep_abt { + /* Blocks owned by the rmapbt or the agfl. */ + struct xagb_bitmap not_allocbt_blocks; + + /* All OWN_AG blocks. */ + struct xagb_bitmap old_allocbt_blocks; + + /* + * New bnobt information. All btree block reservations are added to + * the reservation list in new_bnobt. + */ + struct xrep_newbt new_bnobt; + + /* new cntbt information */ + struct xrep_newbt new_cntbt; + + /* Free space extents. */ + struct xfarray *free_records; + + struct xfs_scrub *sc; + + /* Number of non-null records in @free_records. */ + uint64_t nr_real_records; + + /* get_records()'s position in the free space record array. */ + xfarray_idx_t array_cur; + + /* + * Next block we anticipate seeing in the rmap records. If the next + * rmap record is greater than next_agbno, we have found unused space. + */ + xfs_agblock_t next_agbno; + + /* Number of free blocks in this AG. */ + xfs_agblock_t nr_blocks; + + /* Longest free extent we found in the AG. */ + xfs_agblock_t longest; +}; + +/* Set up to repair AG free space btrees. */ +int +xrep_setup_ag_allocbt( + struct xfs_scrub *sc) +{ + unsigned int busy_gen; + + /* + * Make sure the busy extent list is clear because we can't put extents + * on there twice. + */ + busy_gen = READ_ONCE(sc->sa.pag->pagb_gen); + if (xfs_extent_busy_list_empty(sc->sa.pag)) + return 0; + + return xfs_extent_busy_flush(sc->tp, sc->sa.pag, busy_gen, 0); +} + +/* Check for any obvious conflicts in the free extent. 
*/ +STATIC int +xrep_abt_check_free_ext( + struct xfs_scrub *sc, + const struct xfs_alloc_rec_incore *rec) +{ + enum xbtree_recpacking outcome; + int error; + + if (xfs_alloc_check_perag_irec(sc->sa.pag, rec) != NULL) + return -EFSCORRUPTED; + + /* Must not be an inode chunk. */ + error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur, + rec->ar_startblock, rec->ar_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + /* Must not be shared or CoW staging. */ + if (sc->sa.refc_cur) { + error = xfs_refcount_has_records(sc->sa.refc_cur, + XFS_REFC_DOMAIN_SHARED, rec->ar_startblock, + rec->ar_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + error = xfs_refcount_has_records(sc->sa.refc_cur, + XFS_REFC_DOMAIN_COW, rec->ar_startblock, + rec->ar_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + } + + return 0; +} + +/* + * Stash a free space record for all the space since the last bno we found + * all the way up to @end. + */ +static int +xrep_abt_stash( + struct xrep_abt *ra, + xfs_agblock_t end) +{ + struct xfs_alloc_rec_incore arec = { + .ar_startblock = ra->next_agbno, + .ar_blockcount = end - ra->next_agbno, + }; + struct xfs_scrub *sc = ra->sc; + int error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + error = xrep_abt_check_free_ext(ra->sc, &arec); + if (error) + return error; + + trace_xrep_abt_found(sc->mp, sc->sa.pag->pag_agno, &arec); + + error = xfarray_append(ra->free_records, &arec); + if (error) + return error; + + ra->nr_blocks += arec.ar_blockcount; + return 0; +} + +/* Record extents that aren't in use from gaps in the rmap records. */ +STATIC int +xrep_abt_walk_rmap( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + struct xrep_abt *ra = priv; + int error; + + /* Record all the OWN_AG blocks... 
*/ + if (rec->rm_owner == XFS_RMAP_OWN_AG) { + error = xagb_bitmap_set(&ra->old_allocbt_blocks, + rec->rm_startblock, rec->rm_blockcount); + if (error) + return error; + } + + /* ...and all the rmapbt blocks... */ + error = xagb_bitmap_set_btcur_path(&ra->not_allocbt_blocks, cur); + if (error) + return error; + + /* ...and all the free space. */ + if (rec->rm_startblock > ra->next_agbno) { + error = xrep_abt_stash(ra, rec->rm_startblock); + if (error) + return error; + } + + /* + * rmap records can overlap on reflink filesystems, so project + * next_agbno as far out into the AG space as we currently know about. + */ + ra->next_agbno = max_t(xfs_agblock_t, ra->next_agbno, + rec->rm_startblock + rec->rm_blockcount); + return 0; +} + +/* Collect an AGFL block for the not-to-release list. */ +static int +xrep_abt_walk_agfl( + struct xfs_mount *mp, + xfs_agblock_t agbno, + void *priv) +{ + struct xrep_abt *ra = priv; + + return xagb_bitmap_set(&ra->not_allocbt_blocks, agbno, 1); +} + +/* + * Compare two free space extents by block number. We want to sort in order of + * increasing block number. + */ +static int +xrep_bnobt_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_alloc_rec_incore *ap = a; + const struct xfs_alloc_rec_incore *bp = b; + + if (ap->ar_startblock > bp->ar_startblock) + return 1; + else if (ap->ar_startblock < bp->ar_startblock) + return -1; + return 0; +} + +/* + * Re-sort the free extents by block number so that we can put the records into + * the bnobt in the correct order. Make sure the records do not overlap in + * physical space. 
+ */ +STATIC int +xrep_bnobt_sort_records( + struct xrep_abt *ra) +{ + struct xfs_alloc_rec_incore arec; + xfarray_idx_t cur = XFARRAY_CURSOR_INIT; + xfs_agblock_t next_agbno = 0; + int error; + + error = xfarray_sort(ra->free_records, xrep_bnobt_extent_cmp, 0); + if (error) + return error; + + while ((error = xfarray_iter(ra->free_records, &cur, &arec)) == 1) { + if (arec.ar_startblock < next_agbno) + return -EFSCORRUPTED; + + next_agbno = arec.ar_startblock + arec.ar_blockcount; + } + + return error; +} + +/* + * Compare two free space extents by length and then block number. We want + * to sort first in order of increasing length and then in order of increasing + * block number. + */ +static int +xrep_cntbt_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_alloc_rec_incore *ap = a; + const struct xfs_alloc_rec_incore *bp = b; + + if (ap->ar_blockcount > bp->ar_blockcount) + return 1; + else if (ap->ar_blockcount < bp->ar_blockcount) + return -1; + return xrep_bnobt_extent_cmp(a, b); +} + +/* + * Sort the free extents by length so that we can put the records into the + * cntbt in the correct order. Don't let userspace kill us if we're resorting + * after allocating btree blocks. + */ +STATIC int +xrep_cntbt_sort_records( + struct xrep_abt *ra, + bool is_resort) +{ + return xfarray_sort(ra->free_records, xrep_cntbt_extent_cmp, + is_resort ? 0 : XFARRAY_SORT_KILLABLE); +} + +/* + * Iterate all reverse mappings to find (1) the gaps between rmap records (all + * unowned space), (2) the OWN_AG extents (which encompass the free space + * btrees, the rmapbt, and the agfl), (3) the rmapbt blocks, and (4) the AGFL + * blocks. The free space is (1) + (2) - (3) - (4).
+ */ +STATIC int +xrep_abt_find_freespace( + struct xrep_abt *ra) +{ + struct xfs_scrub *sc = ra->sc; + struct xfs_mount *mp = sc->mp; + struct xfs_agf *agf = sc->sa.agf_bp->b_addr; + struct xfs_buf *agfl_bp; + xfs_agblock_t agend; + int error; + + xagb_bitmap_init(&ra->not_allocbt_blocks); + + xrep_ag_btcur_init(sc, &sc->sa); + + /* + * Iterate all the reverse mappings to find gaps in the physical + * mappings, all the OWN_AG blocks, and all the rmapbt extents. + */ + error = xfs_rmap_query_all(sc->sa.rmap_cur, xrep_abt_walk_rmap, ra); + if (error) + goto err; + + /* Insert a record for space between the last rmap and EOAG. */ + agend = be32_to_cpu(agf->agf_length); + if (ra->next_agbno < agend) { + error = xrep_abt_stash(ra, agend); + if (error) + goto err; + } + + /* Collect all the AGFL blocks. */ + error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp); + if (error) + goto err; + + error = xfs_agfl_walk(mp, agf, agfl_bp, xrep_abt_walk_agfl, ra); + if (error) + goto err_agfl; + + /* Compute the old bnobt/cntbt blocks. */ + error = xagb_bitmap_disunion(&ra->old_allocbt_blocks, + &ra->not_allocbt_blocks); + if (error) + goto err_agfl; + + ra->nr_real_records = xfarray_length(ra->free_records); +err_agfl: + xfs_trans_brelse(sc->tp, agfl_bp); +err: + xchk_ag_btcur_free(&sc->sa); + xagb_bitmap_destroy(&ra->not_allocbt_blocks); + return error; +} + +/* + * We're going to use the observed free space records to reserve blocks for the + * new free space btrees, so we play an iterative game where we try to converge + * on the number of blocks we need: + * + * 1. Estimate how many blocks we'll need to store the records. + * 2. If the first free record has more blocks than we need, we're done. + * We will have to re-sort the records prior to building the cntbt. + * 3. If that record has exactly the number of blocks we need, null out the + * record. We're done. + * 4. Otherwise, we still need more blocks. 
Null out the record, subtract its + * length from the number of blocks we need, and go back to step 1. + * + * Fortunately, we don't have to do any transaction work to play this game, so + * we don't have to tear down the staging cursors. + */ +STATIC int +xrep_abt_reserve_space( + struct xrep_abt *ra, + struct xfs_btree_cur *bno_cur, + struct xfs_btree_cur *cnt_cur, + bool *needs_resort) +{ + struct xfs_scrub *sc = ra->sc; + xfarray_idx_t record_nr; + unsigned int allocated = 0; + int error = 0; + + record_nr = xfarray_length(ra->free_records) - 1; + do { + struct xfs_alloc_rec_incore arec; + uint64_t required; + unsigned int desired; + unsigned int len; + + /* Compute how many blocks we'll need. */ + error = xfs_btree_bload_compute_geometry(cnt_cur, + &ra->new_cntbt.bload, ra->nr_real_records); + if (error) + break; + + error = xfs_btree_bload_compute_geometry(bno_cur, + &ra->new_bnobt.bload, ra->nr_real_records); + if (error) + break; + + /* How many btree blocks do we need to store all records? */ + required = ra->new_bnobt.bload.nr_blocks + + ra->new_cntbt.bload.nr_blocks; + ASSERT(required < INT_MAX); + + /* If we've reserved enough blocks, we're done. */ + if (allocated >= required) + break; + + desired = required - allocated; + + /* We need space but there's none left; bye! */ + if (ra->nr_real_records == 0) { + error = -ENOSPC; + break; + } + + /* Grab the first record from the list. */ + error = xfarray_load(ra->free_records, record_nr, &arec); + if (error) + break; + + ASSERT(arec.ar_blockcount <= UINT_MAX); + len = min_t(unsigned int, arec.ar_blockcount, desired); + + trace_xrep_newbt_alloc_ag_blocks(sc->mp, sc->sa.pag->pag_agno, + arec.ar_startblock, len, XFS_RMAP_OWN_AG); + + error = xrep_newbt_add_extent(&ra->new_bnobt, sc->sa.pag, + arec.ar_startblock, len); + if (error) + break; + allocated += len; + ra->nr_blocks -= len; + + if (arec.ar_blockcount > desired) { + /* + * Record has more space than we need. 
The number of + * free records doesn't change, so shrink the free + * record, inform the caller that the records are no + * longer sorted by length, and exit. + */ + arec.ar_startblock += desired; + arec.ar_blockcount -= desired; + error = xfarray_store(ra->free_records, record_nr, + &arec); + if (error) + break; + + *needs_resort = true; + return 0; + } + + /* + * We're going to use up the entire record, so unset it and + * move on to the next one. This changes the number of free + * records (but doesn't break the sorting order), so we must + * go around the loop once more to re-run _bload_init. + */ + error = xfarray_unset(ra->free_records, record_nr); + if (error) + break; + ra->nr_real_records--; + record_nr--; + } while (1); + + return error; +} + +STATIC int +xrep_abt_dispose_one( + struct xrep_abt *ra, + struct xrep_newbt_resv *resv) +{ + struct xfs_scrub *sc = ra->sc; + struct xfs_perag *pag = sc->sa.pag; + xfs_agblock_t free_agbno = resv->agbno + resv->used; + xfs_extlen_t free_aglen = resv->len - resv->used; + int error; + + ASSERT(pag == resv->pag); + + /* Add a deferred rmap for each extent we used. */ + if (resv->used > 0) + xfs_rmap_alloc_extent(sc->tp, pag->pag_agno, resv->agbno, + resv->used, XFS_RMAP_OWN_AG); + + /* + * For each reserved btree block we didn't use, add it to the free + * space btree. We didn't touch fdblocks when we reserved them, so + * we don't touch it now. + */ + if (free_aglen == 0) + return 0; + + trace_xrep_newbt_free_blocks(sc->mp, resv->pag->pag_agno, free_agbno, + free_aglen, ra->new_bnobt.oinfo.oi_owner); + + error = __xfs_free_extent(sc->tp, resv->pag, free_agbno, free_aglen, + &ra->new_bnobt.oinfo, XFS_AG_RESV_IGNORE, true); + if (error) + return error; + + return xrep_defer_finish(sc); +} + +/* + * Deal with all the space we reserved. 
Blocks that were allocated for the + * free space btrees need to have a (deferred) rmap added for the OWN_AG + * allocation, and blocks that didn't get used can be freed via the usual + * (deferred) means. + */ +STATIC void +xrep_abt_dispose_reservations( + struct xrep_abt *ra, + int error) +{ + struct xrep_newbt_resv *resv, *n; + + if (error) + goto junkit; + + for_each_xrep_newbt_reservation(&ra->new_bnobt, resv, n) { + error = xrep_abt_dispose_one(ra, resv); + if (error) + goto junkit; + } + +junkit: + for_each_xrep_newbt_reservation(&ra->new_bnobt, resv, n) { + xfs_perag_put(resv->pag); + list_del(&resv->list); + kfree(resv); + } + + xrep_newbt_cancel(&ra->new_bnobt); + xrep_newbt_cancel(&ra->new_cntbt); +} + +/* Retrieve free space data for bulk load. */ +STATIC int +xrep_abt_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_alloc_rec_incore *arec = &cur->bc_rec.a; + struct xrep_abt *ra = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + error = xfarray_load_next(ra->free_records, &ra->array_cur, + arec); + if (error) + return error; + + ra->longest = max(ra->longest, arec->ar_blockcount); + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Feed one of the new btree blocks to the bulk loader. */ +STATIC int +xrep_abt_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_abt *ra = priv; + + return xrep_newbt_claim_block(cur, &ra->new_bnobt, ptr); +} + +/* + * Reset the AGF counters to reflect the free space btrees that we just + * rebuilt, then reinitialize the per-AG data. 
+ */ +STATIC int +xrep_abt_reset_counters( + struct xrep_abt *ra) +{ + struct xfs_scrub *sc = ra->sc; + struct xfs_perag *pag = sc->sa.pag; + struct xfs_agf *agf = sc->sa.agf_bp->b_addr; + unsigned int freesp_btreeblks = 0; + + /* + * Compute the contribution to agf_btreeblks for the new free space + * btrees. This is the computed btree size minus anything we didn't + * use. + */ + freesp_btreeblks += ra->new_bnobt.bload.nr_blocks - 1; + freesp_btreeblks += ra->new_cntbt.bload.nr_blocks - 1; + + freesp_btreeblks -= xrep_newbt_unused_blocks(&ra->new_bnobt); + freesp_btreeblks -= xrep_newbt_unused_blocks(&ra->new_cntbt); + + /* + * The AGF header contains extra information related to the free space + * btrees, so we must update those fields here. + */ + agf->agf_btreeblks = cpu_to_be32(freesp_btreeblks + + (be32_to_cpu(agf->agf_rmap_blocks) - 1)); + agf->agf_freeblks = cpu_to_be32(ra->nr_blocks); + agf->agf_longest = cpu_to_be32(ra->longest); + xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, XFS_AGF_BTREEBLKS | + XFS_AGF_LONGEST | + XFS_AGF_FREEBLKS); + + /* + * After we commit the new btree to disk, it is possible that the + * process to reap the old btree blocks will race with the AIL trying + * to checkpoint the old btree blocks into the filesystem. If the new + * tree is shorter than the old one, the allocbt write verifier will + * fail and the AIL will shut down the filesystem. + * + * To avoid this, save the old incore btree height values as the alt + * height values before re-initializing the perag info from the updated + * AGF to capture all the new values. + */ + pag->pagf_alt_levels[XFS_BTNUM_BNOi] = pag->pagf_levels[XFS_BTNUM_BNOi]; + pag->pagf_alt_levels[XFS_BTNUM_CNTi] = pag->pagf_levels[XFS_BTNUM_CNTi]; + + /* Reinitialize with the values we just logged. */ + return xrep_reinit_pagf(sc); +} + +/* + * Use the collected free space information to stage new free space btrees. 
+ * If this is successful we'll return with the new btree root + * information logged to the repair transaction but not yet committed. + */ +STATIC int +xrep_abt_build_new_trees( + struct xrep_abt *ra) +{ + struct xfs_scrub *sc = ra->sc; + struct xfs_btree_cur *bno_cur; + struct xfs_btree_cur *cnt_cur; + struct xfs_perag *pag = sc->sa.pag; + bool needs_resort = false; + int error; + + /* + * Sort the free extents by length so that we can set up the free space + * btrees in as few extents as possible. This reduces the amount of + * deferred rmap / free work we have to do at the end. + */ + error = xrep_cntbt_sort_records(ra, false); + if (error) + return error; + + /* + * Prepare to construct the new btree by reserving disk space for the + * new btree and setting up all the accounting information we'll need + * to root the new btree while it's under construction and before we + * attach it to the AG header. + */ + xrep_newbt_init_bare(&ra->new_bnobt, sc); + xrep_newbt_init_bare(&ra->new_cntbt, sc); + + ra->new_bnobt.bload.get_records = xrep_abt_get_records; + ra->new_cntbt.bload.get_records = xrep_abt_get_records; + + ra->new_bnobt.bload.claim_block = xrep_abt_claim_block; + ra->new_cntbt.bload.claim_block = xrep_abt_claim_block; + + /* Allocate cursors for the staged btrees. */ + bno_cur = xfs_allocbt_stage_cursor(sc->mp, &ra->new_bnobt.afake, + pag, XFS_BTNUM_BNO); + cnt_cur = xfs_allocbt_stage_cursor(sc->mp, &ra->new_cntbt.afake, + pag, XFS_BTNUM_CNT); + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + goto err_cur; + + /* Reserve the space we'll need for the new btrees. */ + error = xrep_abt_reserve_space(ra, bno_cur, cnt_cur, &needs_resort); + if (error) + goto err_cur; + + /* + * If we need to re-sort the free extents by length, do so so that we + * can put the records into the cntbt in the correct order. 
+ */ + if (needs_resort) { + error = xrep_cntbt_sort_records(ra, needs_resort); + if (error) + goto err_cur; + } + + /* + * Due to btree slack factors, it's possible for a new btree to be one + * level taller than the old btree. Update the alternate incore btree + * height so that we don't trip the verifiers when writing the new + * btree blocks to disk. + */ + pag->pagf_alt_levels[XFS_BTNUM_BNOi] = + ra->new_bnobt.bload.btree_height; + pag->pagf_alt_levels[XFS_BTNUM_CNTi] = + ra->new_cntbt.bload.btree_height; + + /* Load the free space by length tree. */ + ra->array_cur = XFARRAY_CURSOR_INIT; + ra->longest = 0; + error = xfs_btree_bload(cnt_cur, &ra->new_cntbt.bload, ra); + if (error) + goto err_levels; + + error = xrep_bnobt_sort_records(ra); + if (error) + return error; + + /* Load the free space by block number tree. */ + ra->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(bno_cur, &ra->new_bnobt.bload, ra); + if (error) + goto err_levels; + + /* + * Install the new btrees in the AG header. After this point the old + * btrees are no longer accessible and the new trees are live. + */ + xfs_allocbt_commit_staged_btree(bno_cur, sc->tp, sc->sa.agf_bp); + xfs_btree_del_cursor(bno_cur, 0); + xfs_allocbt_commit_staged_btree(cnt_cur, sc->tp, sc->sa.agf_bp); + xfs_btree_del_cursor(cnt_cur, 0); + + /* Reset the AGF counters now that we've changed the btree shape. */ + error = xrep_abt_reset_counters(ra); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. */ + xrep_abt_dispose_reservations(ra, error); + + return xrep_roll_ag_trans(sc); + +err_levels: + pag->pagf_alt_levels[XFS_BTNUM_BNOi] = 0; + pag->pagf_alt_levels[XFS_BTNUM_CNTi] = 0; +err_cur: + xfs_btree_del_cursor(cnt_cur, error); + xfs_btree_del_cursor(bno_cur, error); +err_newbt: + xrep_abt_dispose_reservations(ra, error); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. 
+ */ +STATIC int +xrep_abt_remove_old_trees( + struct xrep_abt *ra) +{ + struct xfs_perag *pag = ra->sc->sa.pag; + int error; + + /* Free the old btree blocks if they're not in use. */ + error = xrep_reap_agblocks(ra->sc, &ra->old_allocbt_blocks, + &XFS_RMAP_OINFO_AG, XFS_AG_RESV_IGNORE); + if (error) + return error; + + /* + * Now that we've zapped all the old allocbt blocks we can turn off + * the alternate height mechanism. + */ + pag->pagf_alt_levels[XFS_BTNUM_BNOi] = 0; + pag->pagf_alt_levels[XFS_BTNUM_CNTi] = 0; + return 0; +} + +/* Repair the freespace btrees for some AG. */ +int +xrep_allocbt( + struct xfs_scrub *sc) +{ + struct xrep_abt *ra; + struct xfs_mount *mp = sc->mp; + char *descr; + int error; + + /* We require the rmapbt to rebuild anything. */ + if (!xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; + + ra = kzalloc(sizeof(struct xrep_abt), XCHK_GFP_FLAGS); + if (!ra) + return -ENOMEM; + ra->sc = sc; + + /* We rebuild both data structures. */ + sc->sick_mask = XFS_SICK_AG_BNOBT | XFS_SICK_AG_CNTBT; + + /* + * Make sure the busy extent list is clear because we can't put extents + * on there twice. In theory we cleared this before we started, but + * let's not risk the filesystem. + */ + if (!xfs_extent_busy_list_empty(sc->sa.pag)) { + error = -EDEADLOCK; + goto out_ra; + } + + /* Set up enough storage to handle maximally fragmented free space. */ + descr = xchk_xfile_ag_descr(sc, "free space records"); + error = xfarray_create(descr, mp->m_sb.sb_agblocks / 2, + sizeof(struct xfs_alloc_rec_incore), + &ra->free_records); + kfree(descr); + if (error) + goto out_ra; + + /* Collect the free space data and find the old btree blocks. */ + xagb_bitmap_init(&ra->old_allocbt_blocks); + error = xrep_abt_find_freespace(ra); + if (error) + goto out_bitmap; + + /* Rebuild the free space information. */ + error = xrep_abt_build_new_trees(ra); + if (error) + goto out_bitmap; + + /* Kill the old trees. 
*/ + error = xrep_abt_remove_old_trees(ra); + if (error) + goto out_bitmap; + +out_bitmap: + xagb_bitmap_destroy(&ra->old_allocbt_blocks); + xfarray_destroy(ra->free_records); +out_ra: + kfree(ra); + return error; +} + +/* Make sure both btrees are ok after we've rebuilt them. */ +int +xrep_revalidate_allocbt( + struct xfs_scrub *sc) +{ + __u32 old_type = sc->sm->sm_type; + int error; + + /* + * We must update sm_type temporarily so that the tree-to-tree cross + * reference checks will work in the correct direction, and also so + * that tracing will report correctly if there are more errors. + */ + sc->sm->sm_type = XFS_SCRUB_TYPE_BNOBT; + error = xchk_bnobt(sc); + if (error) + goto out; + + sc->sm->sm_type = XFS_SCRUB_TYPE_CNTBT; + error = xchk_cntbt(sc); +out: + sc->sm->sm_type = old_type; + return error; +} diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index a39dbe6be1e59..4b666f254d700 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -194,8 +194,21 @@ static inline bool xchk_needs_repair(const struct xfs_scrub_metadata *sm) XFS_SCRUB_OFLAG_XCORRUPT | XFS_SCRUB_OFLAG_PREEN); } + +/* + * "Should we prepare for a repair?" + * + * Return true if the caller permits us to repair metadata and we're not + * setting up for a post-repair evaluation. + */ +static inline bool xchk_could_repair(const struct xfs_scrub *sc) +{ + return (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) && + !(sc->flags & XREP_ALREADY_FIXED); +} #else # define xchk_needs_repair(sc) (false) +# define xchk_could_repair(sc) (false) #endif /* CONFIG_XFS_ONLINE_REPAIR */ int xchk_metadata_inode_forks(struct xfs_scrub *sc); @@ -207,6 +220,12 @@ int xchk_metadata_inode_forks(struct xfs_scrub *sc); #define xchk_xfile_descr(sc, fmt, ...) \ kasprintf(XCHK_GFP_FLAGS, "XFS (%s): " fmt, \ (sc)->mp->m_super->s_id, ##__VA_ARGS__) +#define xchk_xfile_ag_descr(sc, fmt, ...) \ + kasprintf(XCHK_GFP_FLAGS, "XFS (%s): AG 0x%x " fmt, \ + (sc)->mp->m_super->s_id, \ + (sc)->sa.pag ? 
(sc)->sa.pag->pag_agno : (sc)->sm->sm_agno, \ + ##__VA_ARGS__) + /* * Setting up a hook to wait for intents to drain is costly -- we have to take diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c index 73e21a9e5e929..deb45d7982f7e 100644 --- a/fs/xfs/scrub/newbt.c +++ b/fs/xfs/scrub/newbt.c @@ -153,11 +153,13 @@ xrep_newbt_add_blocks( resv->used = 0; resv->pag = xfs_perag_hold(pag); - ASSERT(xnr->oinfo.oi_offset == 0); + if (args->tp) { + ASSERT(xnr->oinfo.oi_offset == 0); - error = xfs_alloc_schedule_autoreap(args, true, &resv->autoreap); - if (error) - goto out_pag; + error = xfs_alloc_schedule_autoreap(args, true, &resv->autoreap); + if (error) + goto out_pag; + } list_add_tail(&resv->list, &xnr->resv_list); return 0; @@ -167,6 +169,30 @@ xrep_newbt_add_blocks( return error; } +/* + * Add an extent to the new btree reservation pool. Callers are required to + * reap this reservation manually if the repair is cancelled. @pag must be a + * passive reference. + */ +int +xrep_newbt_add_extent( + struct xrep_newbt *xnr, + struct xfs_perag *pag, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + struct xfs_mount *mp = xnr->sc->mp; + struct xfs_alloc_arg args = { + .tp = NULL, /* no autoreap */ + .oinfo = xnr->oinfo, + .fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, agbno), + .len = len, + .resv = xnr->resv, + }; + + return xrep_newbt_add_blocks(xnr, pag, &args); +} + /* Don't let our allocation hint take us beyond this AG */ static inline void xrep_newbt_validate_ag_alloc_hint( @@ -368,6 +394,7 @@ xrep_newbt_free_extent( free_aglen, xnr->oinfo.oi_owner); ASSERT(xnr->resv != XFS_AG_RESV_AGFL); + ASSERT(xnr->resv != XFS_AG_RESV_IGNORE); /* * Use EFIs to free the reservations. This reduces the chance @@ -513,3 +540,16 @@ xrep_newbt_claim_block( /* Relog all the EFIs. */ return xrep_defer_finish(xnr->sc); } + +/* How many reserved blocks are unused? 
*/ +unsigned int +xrep_newbt_unused_blocks( + struct xrep_newbt *xnr) +{ + struct xrep_newbt_resv *resv; + unsigned int unused = 0; + + list_for_each_entry(resv, &xnr->resv_list, list) + unused += resv->len - resv->used; + return unused; +} diff --git a/fs/xfs/scrub/newbt.h b/fs/xfs/scrub/newbt.h index d2baffa17b1ae..c158d92562787 100644 --- a/fs/xfs/scrub/newbt.h +++ b/fs/xfs/scrub/newbt.h @@ -50,6 +50,9 @@ struct xrep_newbt { enum xfs_ag_resv_type resv; }; +#define for_each_xrep_newbt_reservation(xnr, resv, n) \ + list_for_each_entry_safe((resv), (n), &(xnr)->resv_list, list) + void xrep_newbt_init_bare(struct xrep_newbt *xnr, struct xfs_scrub *sc); void xrep_newbt_init_ag(struct xrep_newbt *xnr, struct xfs_scrub *sc, const struct xfs_owner_info *oinfo, xfs_fsblock_t alloc_hint, @@ -57,9 +60,12 @@ void xrep_newbt_init_ag(struct xrep_newbt *xnr, struct xfs_scrub *sc, int xrep_newbt_init_inode(struct xrep_newbt *xnr, struct xfs_scrub *sc, int whichfork, const struct xfs_owner_info *oinfo); int xrep_newbt_alloc_blocks(struct xrep_newbt *xnr, uint64_t nr_blocks); +int xrep_newbt_add_extent(struct xrep_newbt *xnr, struct xfs_perag *pag, + xfs_agblock_t agbno, xfs_extlen_t len); void xrep_newbt_cancel(struct xrep_newbt *xnr); int xrep_newbt_commit(struct xrep_newbt *xnr); int xrep_newbt_claim_block(struct xfs_btree_cur *cur, struct xrep_newbt *xnr, union xfs_btree_ptr *ptr); +unsigned int xrep_newbt_unused_blocks(struct xrep_newbt *xnr); #endif /* __XFS_SCRUB_NEWBT_H__ */ diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 1b8b5439f2d7f..828c0585701a4 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -734,3 +734,72 @@ xrep_ino_dqattach( return error; } + +/* Initialize all the btree cursors for an AG repair. */ +void +xrep_ag_btcur_init( + struct xfs_scrub *sc, + struct xchk_ag *sa) +{ + struct xfs_mount *mp = sc->mp; + + /* Set up a bnobt cursor for cross-referencing. 
*/ + if (sc->sm->sm_type != XFS_SCRUB_TYPE_BNOBT && + sc->sm->sm_type != XFS_SCRUB_TYPE_CNTBT) { + sa->bno_cur = xfs_allocbt_init_cursor(mp, sc->tp, sa->agf_bp, + sc->sa.pag, XFS_BTNUM_BNO); + sa->cnt_cur = xfs_allocbt_init_cursor(mp, sc->tp, sa->agf_bp, + sc->sa.pag, XFS_BTNUM_CNT); + } + + /* Set up a inobt cursor for cross-referencing. */ + if (sc->sm->sm_type != XFS_SCRUB_TYPE_INOBT && + sc->sm->sm_type != XFS_SCRUB_TYPE_FINOBT) { + sa->ino_cur = xfs_inobt_init_cursor(sc->sa.pag, sc->tp, + sa->agi_bp, XFS_BTNUM_INO); + if (xfs_has_finobt(mp)) + sa->fino_cur = xfs_inobt_init_cursor(sc->sa.pag, + sc->tp, sa->agi_bp, XFS_BTNUM_FINO); + } + + /* Set up a rmapbt cursor for cross-referencing. */ + if (sc->sm->sm_type != XFS_SCRUB_TYPE_RMAPBT && + xfs_has_rmapbt(mp)) + sa->rmap_cur = xfs_rmapbt_init_cursor(mp, sc->tp, sa->agf_bp, + sc->sa.pag); + + /* Set up a refcountbt cursor for cross-referencing. */ + if (sc->sm->sm_type != XFS_SCRUB_TYPE_REFCNTBT && + xfs_has_reflink(mp)) + sa->refc_cur = xfs_refcountbt_init_cursor(mp, sc->tp, + sa->agf_bp, sc->sa.pag); +} + +/* + * Reinitialize the in-core AG state after a repair by rereading the AGF + * buffer. We had better get the same AGF buffer as the one that's attached + * to the scrub context. 
+ */ +int +xrep_reinit_pagf( + struct xfs_scrub *sc) +{ + struct xfs_perag *pag = sc->sa.pag; + struct xfs_buf *bp; + int error; + + ASSERT(pag); + ASSERT(xfs_perag_initialised_agf(pag)); + + clear_bit(XFS_AGSTATE_AGF_INIT, &pag->pag_opstate); + error = xfs_alloc_read_agf(pag, sc->tp, 0, &bp); + if (error) + return error; + + if (bp != sc->sa.agf_bp) { + ASSERT(bp == sc->sa.agf_bp); + return -EFSCORRUPTED; + } + + return 0; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 60d2a9ae5f2ec..bc3353ecae8a1 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -60,6 +60,15 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp, void xrep_force_quotacheck(struct xfs_scrub *sc, xfs_dqtype_t type); int xrep_ino_dqattach(struct xfs_scrub *sc); +/* Repair setup functions */ +int xrep_setup_ag_allocbt(struct xfs_scrub *sc); + +void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa); + +/* Metadata revalidators */ + +int xrep_revalidate_allocbt(struct xfs_scrub *sc); + /* Metadata repairers */ int xrep_probe(struct xfs_scrub *sc); @@ -67,6 +76,9 @@ int xrep_superblock(struct xfs_scrub *sc); int xrep_agf(struct xfs_scrub *sc); int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); +int xrep_allocbt(struct xfs_scrub *sc); + +int xrep_reinit_pagf(struct xfs_scrub *sc); #else @@ -87,11 +99,23 @@ xrep_calc_ag_resblks( return 0; } +/* repair setup functions for no-repair */ +static inline int +xrep_setup_nothing( + struct xfs_scrub *sc) +{ + return 0; +} +#define xrep_setup_ag_allocbt xrep_setup_nothing + +#define xrep_revalidate_allocbt (NULL) + #define xrep_probe xrep_notsupported #define xrep_superblock xrep_notsupported #define xrep_agf xrep_notsupported #define xrep_agfl xrep_notsupported #define xrep_agi xrep_notsupported +#define xrep_allocbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 474f4c4a9cd3b..ebfaeb3793154 100644 --- 
a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -241,13 +241,15 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_PERAG, .setup = xchk_setup_ag_allocbt, .scrub = xchk_bnobt, - .repair = xrep_notsupported, + .repair = xrep_allocbt, + .repair_eval = xrep_revalidate_allocbt, }, [XFS_SCRUB_TYPE_CNTBT] = { /* cntbt */ .type = ST_PERAG, .setup = xchk_setup_ag_allocbt, .scrub = xchk_cntbt, - .repair = xrep_notsupported, + .repair = xrep_allocbt, + .repair_eval = xrep_revalidate_allocbt, }, [XFS_SCRUB_TYPE_INOBT] = { /* inobt */ .type = ST_PERAG, @@ -533,7 +535,10 @@ xfs_scrub_metadata( /* Scrub for errors. */ check_start = xchk_stats_now(); - error = sc->ops->scrub(sc); + if ((sc->flags & XREP_ALREADY_FIXED) && sc->ops->repair_eval != NULL) + error = sc->ops->repair_eval(sc); + else + error = sc->ops->scrub(sc); run.scrub_ns += xchk_stats_elapsed_ns(check_start); if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) goto try_harder; @@ -544,8 +549,7 @@ xfs_scrub_metadata( xchk_update_health(sc); - if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) && - !(sc->flags & XREP_ALREADY_FIXED)) { + if (xchk_could_repair(sc)) { bool needs_fix = xchk_needs_repair(sc->sm); /* Userspace asked us to rebuild the structure regardless. */ diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index 1ef9c6b4842a1..cb6294e629836 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -35,6 +35,14 @@ struct xchk_meta_ops { /* Repair or optimize the metadata. */ int (*repair)(struct xfs_scrub *); + /* + * Re-scrub the metadata we repaired, in case there's extra work that + * we need to do to check our repair work. If this is NULL, we'll use + * the ->scrub function pointer, assuming that the regular scrub is + * sufficient. + */ + int (*repair_eval)(struct xfs_scrub *sc); + /* Decide if we even have this piece of metadata. 
*/ bool (*has)(struct xfs_mount *); diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index aa76830753196..ea518712efa81 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1172,11 +1172,33 @@ DEFINE_EVENT(xrep_rmap_class, name, \ xfs_agblock_t agbno, xfs_extlen_t len, \ uint64_t owner, uint64_t offset, unsigned int flags), \ TP_ARGS(mp, agno, agbno, len, owner, offset, flags)) -DEFINE_REPAIR_RMAP_EVENT(xrep_alloc_extent_fn); DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn); DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn); DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn); +TRACE_EVENT(xrep_abt_found, + TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, + const struct xfs_alloc_rec_incore *rec), + TP_ARGS(mp, agno, rec), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agblock_t, startblock) + __field(xfs_extlen_t, blockcount) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->agno = agno; + __entry->startblock = rec->ar_startblock; + __entry->blockcount = rec->ar_blockcount; + ), + TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->startblock, + __entry->blockcount) +) + TRACE_EVENT(xrep_refcount_extent_fn, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, struct xfs_refcount_irec *irec), diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h index 4ecac01363d9f..62b9c506fdd1b 100644 --- a/fs/xfs/scrub/xfarray.h +++ b/fs/xfs/scrub/xfarray.h @@ -54,6 +54,28 @@ static inline int xfarray_append(struct xfarray *array, const void *ptr) uint64_t xfarray_length(struct xfarray *array); int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec); +/* + * Iterate the non-null elements in a sparse xfarray. Callers should + * initialize *idx to XFARRAY_CURSOR_INIT before the first call; on return, it + * will be set to one more than the index of the record that was retrieved. 
+ * Returns 1 if a record was retrieved, 0 if there weren't any more records, or + * a negative errno. + */ +static inline int +xfarray_iter( + struct xfarray *array, + xfarray_idx_t *idx, + void *rec) +{ + int ret = xfarray_load_next(array, idx, rec); + + if (ret == -ENODATA) + return 0; + if (ret == 0) + return 1; + return ret; +} + /* Declarations for xfile array sort functionality. */ typedef cmp_func_t xfarray_cmp_fn; diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index 9ecfdcdc752f7..2ccde32c9a9e9 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -678,3 +678,16 @@ xfs_extent_busy_ag_cmp( diff = b1->bno - b2->bno; return diff; } + +/* Are there any busy extents in this AG? */ +bool +xfs_extent_busy_list_empty( + struct xfs_perag *pag) +{ + bool res; + + spin_lock(&pag->pagb_lock); + res = RB_EMPTY_ROOT(&pag->pagb_tree); + spin_unlock(&pag->pagb_lock); + return res; +} diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h index 0639aab336f3f..470032de31391 100644 --- a/fs/xfs/xfs_extent_busy.h +++ b/fs/xfs/xfs_extent_busy.h @@ -85,4 +85,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list) list_sort(NULL, list, xfs_extent_busy_ag_cmp); } +bool xfs_extent_busy_list_empty(struct xfs_perag *pag); + #endif /* __XFS_EXTENT_BUSY_H__ */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
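The block reservation loop in xrep_abt_reserve_space implements the convergence game described in its comment: estimate the geometry of the new btrees, consume free-space records from the tail of the array, and re-estimate whenever the record count changes. A rough userspace model of that loop follows; blocks_needed() is a toy stand-in for xfs_btree_bload_compute_geometry (assuming 16 records per btree block), and the plain array stands in for the free_records xfarray:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for xfs_btree_bload_compute_geometry: number of btree
 * blocks needed to index nrecs records, at 16 records per block. */
static unsigned int blocks_needed(unsigned long long nrecs)
{
	unsigned int blocks = 0;
	unsigned long long level = nrecs;

	do {
		level = (level + 15) / 16;	/* round up */
		blocks += level;
	} while (level > 1);
	return blocks;
}

struct free_rec { unsigned int start, len; };

/*
 * Reserve enough blocks for the new btree by consuming free-space
 * records from the end of the array, re-running the geometry estimate
 * each time the record count shrinks.  Returns the number of blocks
 * reserved, or 0 if we ran out of records (-ENOSPC in the kernel).
 */
static unsigned int reserve_space(struct free_rec *recs, size_t *nrecs)
{
	unsigned int allocated = 0;

	while (*nrecs > 0) {
		unsigned int required = blocks_needed(*nrecs);
		struct free_rec *rec = &recs[*nrecs - 1];

		if (allocated >= required)
			return allocated;	/* converged */

		if (rec->len > required - allocated) {
			/* Record has more than we need; shrink it. */
			unsigned int want = required - allocated;

			rec->start += want;
			rec->len -= want;
			return allocated + want;
		}

		/* Consume the whole record, then re-estimate. */
		allocated += rec->len;
		(*nrecs)--;
	}
	return 0;
}
```

In the kernel version the shrink case also sets *needs_resort, because trimming a record can break the by-length sort order that the cntbt load depends on.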
* Re: [PATCH 3/5] xfs: repair free space btrees
  2023-11-24 23:50 ` [PATCH 3/5] xfs: repair free space btrees Darrick J. Wong
@ 2023-11-25  6:11 ` Christoph Hellwig
  2023-11-28  1:05   ` Darrick J. Wong
  2023-11-28 15:10   ` Christoph Hellwig
  1 sibling, 1 reply; 156+ messages in thread
From: Christoph Hellwig @ 2023-11-25  6:11 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Fri, Nov 24, 2023 at 03:50:33PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Rebuild the free space btrees from the gaps in the rmap btree.

This commit message feels a bit sparse for the amount of code added,
although I can't really offer a good idea of what to add.

Otherwise just two comments on the interaction with the rest of the
xfs code; I'll try to digest the new repair code a bit more in the
meantime.

> +#ifdef CONFIG_XFS_ONLINE_REPAIR
> +	/*
> +	 * Alternate btree heights so that online repair won't trip the write
> +	 * verifiers while rebuilding the AG btrees.
> +	 */
> +	uint8_t		pagf_alt_levels[XFS_BTNUM_AGF];
> +#endif

"Alternate" and the alt_ prefix don't feel very descriptive.  As far as
I can tell these are about an ongoing repair, so as an at least somewhat
better choice call it "pagf_repair_levels"?

> +xfs_failaddr_t
> +xfs_alloc_check_irec(
> +	struct xfs_btree_cur		*cur,
> +	const struct xfs_alloc_rec_incore *irec)
> +{
> +	return xfs_alloc_check_perag_irec(cur->bc_ag.pag, irec);
> +}

Is there much of a point in even keeping this wrapper vs just
switching xfs_alloc_check_irec to pass the pag instead of the
cursor?

^ permalink raw reply	[flat|nested] 156+ messages in thread
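The pagf_alt_levels (or pagf_repair_levels, per the suggestion above) mechanism exists so that a block write verifier tolerates both the live btree height and the height of the tree being staged, avoiding a shutdown when the AIL writes old btree blocks while a shorter or taller replacement is under construction. A hedged sketch of the idea — the structure and function below are illustrative, not the kernel's actual verifier:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-AG state: the live btree height, plus the override
 * that online repair installs while a replacement tree is staged. */
struct pag_info {
	unsigned int live_levels;	/* height of the btree on disk */
	unsigned int repair_levels;	/* nonzero only during a repair */
};

/*
 * Write-verifier check on a block's recorded level: a level is valid
 * if it fits under either the live height or the in-progress repair
 * height, so blocks from both trees pass while the repair runs.
 */
static bool level_ok(const struct pag_info *pag, unsigned int level)
{
	if (level < pag->live_levels)
		return true;
	if (pag->repair_levels && level < pag->repair_levels)
		return true;
	return false;
}
```

Once the repair commits or is cancelled, the override goes back to zero and the verifier falls back to checking only the live height, matching what xrep_abt_remove_old_trees does in the patch.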
* Re: [PATCH 3/5] xfs: repair free space btrees
  2023-11-25  6:11 ` Christoph Hellwig
@ 2023-11-28  1:05   ` Darrick J. Wong
  0 siblings, 0 replies; 156+ messages in thread
From: Darrick J. Wong @ 2023-11-28  1:05 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Fri, Nov 24, 2023 at 10:11:18PM -0800, Christoph Hellwig wrote:
> On Fri, Nov 24, 2023 at 03:50:33PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Rebuild the free space btrees from the gaps in the rmap btree.
>
> This commit message feels a bit sparse for the amount of code added,
> although I can't really offer a good idea of what to add.

"Refer to the design documentation for more details:

Link: https://docs.kernel.org/filesystems/xfs-online-fsck-design.html?highlight=xfs#case-study-rebuilding-the-free-space-indices"

?

> Otherwise just two comments on the interaction with the rest of the
> xfs code; I'll try to digest the new repair code a bit more in the
> meantime.
>
> > +#ifdef CONFIG_XFS_ONLINE_REPAIR
> > +	/*
> > +	 * Alternate btree heights so that online repair won't trip the write
> > +	 * verifiers while rebuilding the AG btrees.
> > +	 */
> > +	uint8_t		pagf_alt_levels[XFS_BTNUM_AGF];
> > +#endif
>
> "Alternate" and the alt_ prefix don't feel very descriptive.  As far as
> I can tell these are about an ongoing repair, so as an at least somewhat
> better choice call it "pagf_repair_levels"?

Done.

> > +xfs_failaddr_t
> > +xfs_alloc_check_irec(
> > +	struct xfs_btree_cur		*cur,
> > +	const struct xfs_alloc_rec_incore *irec)
> > +{
> > +	return xfs_alloc_check_perag_irec(cur->bc_ag.pag, irec);
> > +}
>
> Is there much of a point in even keeping this wrapper vs just
> switching xfs_alloc_check_irec to pass the pag instead of the
> cursor?

Not really.  I'll remove this from the next spin.

--D

^ permalink raw reply	[flat|nested] 156+ messages in thread
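The wrapper removal agreed to above amounts to narrowing the record checker's parameter from a btree cursor to the perag it actually consults. A simplified stand-in of the post-refactor shape — the types and the bounds check below are illustrative, not the real xfs_alloc_check_perag_irec:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the xfs structures involved. */
struct toy_perag { unsigned int agblocks; };
struct toy_rec { unsigned int startblock, blockcount; };

/*
 * After the refactor: take the perag directly, since that is the only
 * piece of the cursor the check ever used.  Returns NULL when the
 * record verifies, or a string describing the failure otherwise
 * (mimicking the xfs_failaddr_t NULL-means-ok convention).
 */
static const char *check_irec(const struct toy_perag *pag,
			      const struct toy_rec *irec)
{
	if (irec->blockcount == 0)
		return "zero-length record";
	if (irec->startblock + irec->blockcount > pag->agblocks)
		return "record past end of AG";
	return NULL;
}
```

One motivation for narrowing a parameter like this is that callers which never had a cursor in hand — such as repair code validating in-memory records — can then reuse the checker directly.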
* Re: [PATCH 3/5] xfs: repair free space btrees
  2023-11-24 23:50 ` [PATCH 3/5] xfs: repair free space btrees Darrick J. Wong
  2023-11-25  6:11 ` Christoph Hellwig
@ 2023-11-28 15:10   ` Christoph Hellwig
  2023-11-28 21:13   ` Darrick J. Wong
  1 sibling, 1 reply; 156+ messages in thread
From: Christoph Hellwig @ 2023-11-28 15:10 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

A high-level question on how blocks are (re)used here:

 - xrep_abt_find_freespace accounts the old allocbt blocks as freespace
   per the comment, although as far as I understand the code in
   xrep_abt_walk_rmap the allocbt blocks aren't actually added to the
   free_records xfarray, but recorded into the old_allocbt_blocks
   bitmap (btw, why are we using different data structures for them?)

 - xrep_abt_reserve_space seems to allocate space for the new alloc
   btrees by popping the first entry of ->free_records until it has
   enough space.

 - what happens if the AG is so full and fragmented that we do not
   have space to create a second set of allocbts without tearing down
   the old ones first?

I've also got a whole bunch of nitpicks that mostly don't require
immediate action and/or are about existing code that just grows more
users here:

> +int
> +xrep_allocbt(
> +	struct xfs_scrub	*sc)
> +{
> +	struct xrep_abt		*ra;
> +	struct xfs_mount	*mp = sc->mp;
> +	char			*descr;
> +	int			error;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_has_rmapbt(mp))
> +		return -EOPNOTSUPP;

Shouldn't this be checked by the .has callback in struct xchk_meta_ops?

> +	/* Set up enough storage to handle maximally fragmented free space. */
> +	descr = xchk_xfile_ag_descr(sc, "free space records");
> +	error = xfarray_create(descr, mp->m_sb.sb_agblocks / 2,
> +			sizeof(struct xfs_alloc_rec_incore),
> +			&ra->free_records);
> +	kfree(descr);

Commenting on a new user of old code here, but why doesn't
xfarray_create simply take a format string so that we don't need the
separate allocation and kasprintf here?

> +	/*
> +	 * We must update sm_type temporarily so that the tree-to-tree cross
> +	 * reference checks will work in the correct direction, and also so
> +	 * that tracing will report correctly if there are more errors.
> +	 */
> +	sc->sm->sm_type = XFS_SCRUB_TYPE_BNOBT;
> +	error = xchk_bnobt(sc);

So xchk_bnobt is a tiny wrapper around xchk_allocbt, which is a small
wrapper around xchk_btree that basically de-multiplexes the arguments
passed in by xchk_bnobt again.  This is existing code, not newly added,
but the call chain looks a bit odd to me.

> +/*
> + * Add an extent to the new btree reservation pool.  Callers are required to
> + * reap this reservation manually if the repair is cancelled.  @pag must be a
> + * passive reference.
> + */
> +int
> +xrep_newbt_add_extent(
> +	struct xrep_newbt	*xnr,
> +	struct xfs_perag	*pag,
> +	xfs_agblock_t		agbno,
> +	xfs_extlen_t		len)
> +{
> +	struct xfs_mount	*mp = xnr->sc->mp;
> +	struct xfs_alloc_arg	args = {
> +		.tp		= NULL, /* no autoreap */
> +		.oinfo		= xnr->oinfo,
> +		.fsbno		= XFS_AGB_TO_FSB(mp, pag->pag_agno, agbno),
> +		.len		= len,
> +		.resv		= xnr->resv,
> +	};
> +
> +	return xrep_newbt_add_blocks(xnr, pag, &args);
> +}

I don't quite understand what this helper adds, and the _blocks vs
_extent naming is a bit confusing.

> +#define for_each_xrep_newbt_reservation(xnr, resv, n)	\
> +	list_for_each_entry_safe((resv), (n), &(xnr)->resv_list, list)

I have to admit that I find the open-coded list_for_each_entry_safe
easier to follow than such wrappers.

> +/* Initialize all the btree cursors for an AG repair. */
> +void
> +xrep_ag_btcur_init(
> +	struct xfs_scrub	*sc,
> +	struct xchk_ag		*sa)
> +{

As far as I can tell this basically sets up cursors for all the btrees
except the one that we want to repair, and the one that goes along with
it for the alloc and ialloc pairs?  Maybe just spell that out for
clarity.

^ permalink raw reply	[flat|nested] 156+ messages in thread
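For reference, the point of the _safe iterator behind for_each_xrep_newbt_reservation is that it caches the next pointer before the loop body runs, so the body may unlink and free the current entry — exactly what xrep_abt_dispose_reservations does. A minimal userspace rendition of the pattern, using a plain singly linked list rather than the kernel's list.h:

```c
#include <assert.h>
#include <stdlib.h>

struct resv {
	unsigned int len;
	struct resv *next;
};

/*
 * Walk the list while freeing each node: grab ->next before freeing
 * the current node, which is what the _safe iterators exist for.
 * Returns the total length disposed of, for illustration.
 */
static unsigned int dispose_all(struct resv **head)
{
	unsigned int total = 0;
	struct resv *resv, *n;

	for (resv = *head; resv != NULL; resv = n) {
		n = resv->next;		/* cache before freeing */
		total += resv->len;
		free(resv);
	}
	*head = NULL;
	return total;
}

/* Push a new reservation onto the front of the list. */
static struct resv *push(struct resv *head, unsigned int len)
{
	struct resv *r = malloc(sizeof(*r));

	r->len = len;
	r->next = head;
	return r;
}
```

Whether to hide the iterator behind a macro is taste, as the review says; the caching of the next pointer is the part that matters for correctness.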
* Re: [PATCH 3/5] xfs: repair free space btrees 2023-11-28 15:10 ` Christoph Hellwig @ 2023-11-28 21:13 ` Darrick J. Wong 2023-11-29 5:56 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 21:13 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Tue, Nov 28, 2023 at 07:10:57AM -0800, Christoph Hellwig wrote: > A highlevel question on how blocks are (re)used here. > > - xrep_abt_find_freespace accounts the old allocbt blocks as freespace > per the comment, although as far as I understand the code in > xrep_abt_walk_rmap the allocbt blocks aren't actually added to the > free_records xfarray, but recorded into the old_allocbt_blocks > bitmap (btw, why are we using different data structures for them?) The old free space btree blocks are tracked via old_allocbt_blocks so that we can reap the space after committing the new btree roots. Reaping cross-references the set regions in the bitmap against the rmapbt records so that we don't free crosslinked blocks. This is a big difference from xfs_repair, which constructs a free space map by walking the directory / bmbt trees and builds completely new indexes in the gaps. It gets away with that because it's building all the space metadata in the AG, not just the free space btrees. The next question you might have is why there's old_allocbt_blocks and not_allocbt_blocks -- this is due to us using the AGFL to supply the bnobt, cntbt, and rmapbt's alloc_block routines. Hence the blocks tracked by all four data structures are all RMAP_OWN_AG, and we have to do a bit of bitmap work to subtract the rmapbt and AGFL blocks from all the OWN_AG records to end up with the blocks that we think are owned by the free space btrees. > - xrep_abt_reserve_space seems to allocate space for the new alloc > btrees by popping the first entry of ->free_records until it has > enough space. Correct. 
> - what happens if the AG is so full and fragmented that we do not > have space to create a second set of allocbts without tearing down > the old ones first? xrep_abt_reserve_space returns -ENOSPC, so we throw away all the incore records and throw the errno out to userspace. Across all the btree rebuilding code, the block reservation step happens as the very last thing before we start writing btree blocks, so it's still possible to back out cleanly. > I've also got a whole bunch of nitpicks that mostly don't require an > immediate action and/or about existing code that just grows more users > here: Heh. > > +int > > +xrep_allocbt( > > + struct xfs_scrub *sc) > > +{ > > + struct xrep_abt *ra; > > + struct xfs_mount *mp = sc->mp; > > + char *descr; > > + int error; > > + > > + /* We require the rmapbt to rebuild anything. */ > > + if (!xfs_has_rmapbt(mp)) > > + return -EOPNOTSUPP; > > Shoudn't this be checked by the .has callback in struct xchk_meta_ops? No. Checking doesn't require the rmapbt because all we do is walk the bnobt/cntbt records and cross-reference them with whatever metadata we can find. A stronger check would scan the AG to build a second recordset of the free space and compare that against what's on disk. However, that would be much slower, and Dave wanted scans to be fast because corruptions are supposed to be the edge case. :) The weaker checking also means we can scrub old filesystems, even if we still require xfs_repair to fix them. > > + /* Set up enough storage to handle maximally fragmented free space. */ > > + descr = xchk_xfile_ag_descr(sc, "free space records"); > > + error = xfarray_create(descr, mp->m_sb.sb_agblocks / 2, > > + sizeof(struct xfs_alloc_rec_incore), > > + &ra->free_records); > > + kfree(descr); > > Commenting on a new user of old code here, but why doesn't > xfarray_create simply take a format string so that we don't need the > separate allocatiom and kasprintf here? 
I didn't want to spend the brainpower to figure out how to make the macro and va_args crud work to support pushing both from xrep_allocbt -> xfarray_create -> xfile_create. I don't know how to make that stuff nest short of adding a kas- variant of xfile_create. Seeing as we don't install xfiles into the file descriptor table anyway, the labels are only visible via ftrace, and not procfs. I decided that cleanliness here wasn't a high enough priority. > > + /* > > + * We must update sm_type temporarily so that the tree-to-tree cross > > + * reference checks will work in the correct direction, and also so > > + * that tracing will report correctly if there are more errors. > > + */ > > + sc->sm->sm_type = XFS_SCRUB_TYPE_BNOBT; > > + error = xchk_bnobt(sc); > > So xchk_bnobt is a tiny wrapper around xchk_allocbt, which is a small > wrapper around xchk_btree that basіally de-multiplex the argument > pass in by xchk_bnobt again. This is existing code not newly added, > but the call chain looks a bit odd to me. Yeah. I suppose one way to clean that up would be to export xchk_allocbt to the dispatch table in scrub.c instead of xchk_{bno,cnt}bt and figure out which btree we want from sm_type. Way back when I was designing scrub I thought that repair would be separate for each btree type, but that turned out not to be the case. Hence the awkwardness in the call chains. > > +/* > > + * Add an extent to the new btree reservation pool. Callers are required to > > + * reap this reservation manually if the repair is cancelled. @pag must be a > > + * passive reference. 
> > + */ > > +int > > +xrep_newbt_add_extent( > > + struct xrep_newbt *xnr, > > + struct xfs_perag *pag, > > + xfs_agblock_t agbno, > > + xfs_extlen_t len) > > +{ > > + struct xfs_mount *mp = xnr->sc->mp; > > + struct xfs_alloc_arg args = { > > + .tp = NULL, /* no autoreap */ > > + .oinfo = xnr->oinfo, > > + .fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, agbno), > > + .len = len, > > + .resv = xnr->resv, > > + }; > > + > > + return xrep_newbt_add_blocks(xnr, pag, &args); > > +} > > I don't quite understand what this helper adds, and the _blocks vs > _extent naming is a bit confusing. This wrapper simplifies the interface to xrep_newbt_add_blocks so that external callers don't have to know which magic values of xfs_alloc_arg are actually used by xrep_newbt_add_blocks and therefore need to be set. All the other repair functions have to allocate space from the free space btree, so xrep_newbt_add_blocks is passed the full _alloc_args as returned by the allocator to xrep_newbt_alloc_ag_blocks. > > +#define for_each_xrep_newbt_reservation(xnr, resv, n) \ > > + list_for_each_entry_safe((resv), (n), &(xnr)->resv_list, list) > > I have to admit that I find the open code list_for_each_entry_safe > easier to follow than such wrappers. The funny part is that I don't even use it in newbt.c. Maybe it's time to get rid of it. $ git grep for_each_xrep_newbt_reservation fs fs/xfs/scrub/alloc_repair.c:568: for_each_xrep_newbt_reservation(&ra->new_bnobt, resv, n) { fs/xfs/scrub/alloc_repair.c:575: for_each_xrep_newbt_reservation(&ra->new_bnobt, resv, n) { fs/xfs/scrub/newbt.h:60:#define for_each_xrep_newbt_reservation(xnr, resv, n) \ fs/xfs/scrub/rmap_repair.c:1126: for_each_xrep_newbt_reservation(&rr->new_btree, resv, n) { > > +/* Initialize all the btree cursors for an AG repair. 
*/ > > +void > > +xrep_ag_btcur_init( > > + struct xfs_scrub *sc, > > + struct xchk_ag *sa) > > +{ > > As far as I can tell this basically sets up cursors for all the > btrees except the one that we want to repair, and the one that > goes along with for the alloc and ialloc pairs? Maybe just spell > that out for clarity. Done. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: repair free space btrees 2023-11-28 21:13 ` Darrick J. Wong @ 2023-11-29 5:56 ` Christoph Hellwig 2023-11-29 6:18 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 5:56 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs On Tue, Nov 28, 2023 at 01:13:58PM -0800, Darrick J. Wong wrote: > > - xrep_abt_find_freespace accounts the old allocbt blocks as freespace > > per the comment, although as far as I understand the code in > > xrep_abt_walk_rmap the allocbt blocks aren't actually added to the > > free_records xfarray, but recorded into the old_allocbt_blocks > > bitmap (btw, why are we using different data structures for them?) > > The old free space btree blocks are tracked via old_allocbt_blocks so > that we can reap the space after committing the new btree roots. Yes, I got that from the code. But I'm a bit confused about the comment treating the old allocbt blocks as free space, while the code doesn't. Or I guess the confusion is that we really have two slightly different notions of "free space": 1) the space we try to build the trees for 2) the space used as free space to us for the trees The old allocbt blocks are part of 1 but not of 2. > > - what happens if the AG is so full and fragmented that we do not > > have space to create a second set of allocbts without tearing down > > the old ones first? > > xrep_abt_reserve_space returns -ENOSPC, so we throw away all the incore > records and throw the errno out to userspace. Across all the btree > rebuilding code, the block reservation step happens as the very last > thing before we start writing btree blocks, so it's still possible to > back out cleanly. But that means we can't repair the allocation btrees for this case. > > > + /* We require the rmapbt to rebuild anything. 
*/ > > > + if (!xfs_has_rmapbt(mp)) > > > + return -EOPNOTSUPP; > > > > Shoudn't this be checked by the .has callback in struct xchk_meta_ops? > > No. Checking doesn't require the rmapbt because all we do is walk the > bnobt/cntbt records and cross-reference them with whatever metadata we > can find. Oh, and .has applies to both checking and repairing. Got it. > This wrapper simplifes the interface to xrep_newbt_add_blocks so that > external callers don't have to know which magic values of xfs_alloc_arg > are actually used by xrep_newbt_add_blocks and therefore need to be set. > > For all the other repair functions, they have to allocate space from the > free space btree, so xrep_newbt_add_blocks is passed the full > _alloc_args as returned by the allocator to xrep_newbt_alloc_ag_blocks. Ok. Maybe throw in a comment that it is just a convenience wrapper? ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: repair free space btrees 2023-11-29 5:56 ` Christoph Hellwig @ 2023-11-29 6:18 ` Darrick J. Wong 2023-11-29 6:24 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-29 6:18 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Tue, Nov 28, 2023 at 09:56:59PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 01:13:58PM -0800, Darrick J. Wong wrote: > > > - xrep_abt_find_freespace accounts the old allocbt blocks as freespace > > > per the comment, although as far as I understand the code in > > > xrep_abt_walk_rmap the allocbt blocks aren't actually added to the > > > free_records xfarray, but recorded into the old_allocbt_blocks > > > bitmap (btw, why are we using different data structures for them?) > > > > The old free space btree blocks are tracked via old_allocbt_blocks so > > that we can reap the space after committing the new btree roots. > > Yes, I got that from the code. But I'm a bit confused about the > comment treating the old allocbt blocks as free space, while the code > doesn't. Or I guess the confusion is that we really have two slightly > different notions of "free space": > > 1) the space we try to build the trees for > 2) the space used as free space to us for the trees > > The old allocbt blocks are part of 1 but not of 2. Yeah. The blocks for #2 only come from the gaps in the rmapbt records that we find during the scan. Part of the confusion here might be the naming -- at the end of the rmapbt scan, old_allocbt_blocks is set to all the blocks that the rmapbt says are OWN_AG. Each rmapbt block encountered during the scan is added to not_allocbt_blocks, and afterwards each block on the AGFL is also added to not_allocbt_blocks. After that point, the definition of old_allocbt_blocks shifts. 
This expression will help us identify possible former bnobt/cntbt blocks: (OWN_AG blocks) & ~(rmapbt blocks | agfl blocks); Substituting from above definitions, that becomes: old_allocbt_blocks & ~not_allocbt_blocks The OWN_AG bitmap itself isn't needed after this point, so what we really do instead is: old_allocbt_blocks &= ~not_allocbt_blocks; IOWs, after this point, "old_allocbt_blocks" is a bitmap of alleged former bnobt/cntbt blocks. The xagb_bitmap_disunion operation modifies its first parameter in place to avoid copying records around. The gaps (aka the new freespace records) are stashed in the free_records xfarray. Next, some of the @free_records are diverted to the newbt reservation and used to format new btree blocks. I hope that makes things clearer? > > > - what happens if the AG is so full and fragmented that we do not > > > have space to create a second set of allocbts without tearing down > > > the old ones first? > > > > xrep_abt_reserve_space returns -ENOSPC, so we throw away all the incore > > records and throw the errno out to userspace. Across all the btree > > rebuilding code, the block reservation step happens as the very last > > thing before we start writing btree blocks, so it's still possible to > > back out cleanly. > > But that means we can't repair the allocation btrees for this case. Yep. At least we revert cleanly here -- when that happens to xfs_repair it just blows up and leaves a dead fs. :/ Practically speaking, the rmapbt reservations are large enough that we can rebuild the trees if we have to, even if that just means stealing from the per-AG space reservations later. Though you could create the fs from hell by using reflink and cow to fragment the rmapbt until it swells up and consumes the entire AG. That would require billions of fragments; even with my bmap inflation xfs_db commands it still takes about 2 days to get the rmapbt to the point of overflowing the regular reservation. 
I tried pushing it to eat an entire AG but then the ILOM exploded and no idea what happened to that machine and its 512G of RAM. The lab people kind of hate me now. > > > > + /* We require the rmapbt to rebuild anything. */ > > > > + if (!xfs_has_rmapbt(mp)) > > > > + return -EOPNOTSUPP; > > > > > > Shoudn't this be checked by the .has callback in struct xchk_meta_ops? > > > > No. Checking doesn't require the rmapbt because all we do is walk the > > bnobt/cntbt records and cross-reference them with whatever metadata we > > can find. > > Oh, and .has applies to both checking and repairing. Gt it. Correct. > > This wrapper simplifes the interface to xrep_newbt_add_blocks so that > > external callers don't have to know which magic values of xfs_alloc_arg > > are actually used by xrep_newbt_add_blocks and therefore need to be set. > > > > For all the other repair functions, they have to allocate space from the > > free space btree, so xrep_newbt_add_blocks is passed the full > > _alloc_args as returned by the allocator to xrep_newbt_alloc_ag_blocks. > > Ok. Maybe throw in a comment that it just is a convenience wrapper? Will do. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: repair free space btrees 2023-11-29 6:18 ` Darrick J. Wong @ 2023-11-29 6:24 ` Christoph Hellwig 2023-11-29 6:26 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 6:24 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs On Tue, Nov 28, 2023 at 10:18:19PM -0800, Darrick J. Wong wrote: > The gaps (aka the new freespace records) are stashed in the free_records > xfarray. Next, some of the @free_records are diverted to the newbt > reservation and used to format new btree blocks. > > I hope that makes things clearer? Yes. Can you add this to the code as a comment? ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: repair free space btrees 2023-11-29 6:24 ` Christoph Hellwig @ 2023-11-29 6:26 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-29 6:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Tue, Nov 28, 2023 at 10:24:58PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 10:18:19PM -0800, Darrick J. Wong wrote: > > The gaps (aka the new freespace records) are stashed in the free_records > > xfarray. Next, some of the @free_records are diverted to the newbt > > reservation and used to format new btree blocks. > > > > I hope that makes things clearer? > > Yes. Can you add this to the code as a comment? Will do. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 4/5] xfs: repair inode btrees 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:50 ` [PATCH 3/5] xfs: repair free space btrees Darrick J. Wong @ 2023-11-24 23:50 ` Darrick J. Wong 2023-11-25 6:12 ` Christoph Hellwig 2023-11-28 15:57 ` Christoph Hellwig 2023-11-24 23:51 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 4 siblings, 2 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:50 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Use the rmapbt to find inode chunks, query the chunks to compute hole and free masks, and with that information rebuild the inobt and finobt. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_ialloc.c | 41 +- fs/xfs/libxfs/xfs_ialloc.h | 3 fs/xfs/scrub/common.c | 1 fs/xfs/scrub/ialloc_repair.c | 874 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.c | 59 +++ fs/xfs/scrub/repair.h | 17 + fs/xfs/scrub/scrub.c | 6 fs/xfs/scrub/scrub.h | 1 fs/xfs/scrub/trace.h | 68 ++- 10 files changed, 1022 insertions(+), 49 deletions(-) create mode 100644 fs/xfs/scrub/ialloc_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 026591681937d..7fed0e706cfa0 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -182,6 +182,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) xfs-y += $(addprefix scrub/, \ agheader_repair.o \ alloc_repair.o \ + ialloc_repair.o \ newbt.o \ reap.o \ repair.o \ diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index d61d03e5b853b..edb03762dd7ae 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -95,18 +95,29 @@ xfs_inobt_btrec_to_irec( irec->ir_free = be64_to_cpu(rec->inobt.ir_free); } -/* Simple checks for inode records. 
*/ -xfs_failaddr_t -xfs_inobt_check_irec( - struct xfs_btree_cur *cur, +/* Compute the freecount of an incore inode record. */ +uint8_t +xfs_inobt_rec_freecount( const struct xfs_inobt_rec_incore *irec) { - uint64_t realfree; + uint64_t realfree; + if (!xfs_inobt_issparse(irec->ir_holemask)) + realfree = irec->ir_free; + else + realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec); + return hweight64(realfree); +} + +inline xfs_failaddr_t +xfs_inobt_check_perag_irec( + struct xfs_perag *pag, + const struct xfs_inobt_rec_incore *irec) +{ /* Record has to be properly aligned within the AG. */ - if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino)) + if (!xfs_verify_agino(pag, irec->ir_startino)) return __this_address; - if (!xfs_verify_agino(cur->bc_ag.pag, + if (!xfs_verify_agino(pag, irec->ir_startino + XFS_INODES_PER_CHUNK - 1)) return __this_address; if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT || @@ -115,17 +126,21 @@ xfs_inobt_check_irec( if (irec->ir_freecount > XFS_INODES_PER_CHUNK) return __this_address; - /* if there are no holes, return the first available offset */ - if (!xfs_inobt_issparse(irec->ir_holemask)) - realfree = irec->ir_free; - else - realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec); - if (hweight64(realfree) != irec->ir_freecount) + if (xfs_inobt_rec_freecount(irec) != irec->ir_freecount) return __this_address; return NULL; } +/* Simple checks for inode records. 
*/ +xfs_failaddr_t +xfs_inobt_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_inobt_rec_incore *irec) +{ + return xfs_inobt_check_perag_irec(cur->bc_ag.pag, irec); +} + static inline int xfs_inobt_complain_bad_rec( struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h index fe824bb04a091..012aac5671bad 100644 --- a/fs/xfs/libxfs/xfs_ialloc.h +++ b/fs/xfs/libxfs/xfs_ialloc.h @@ -79,6 +79,7 @@ int xfs_inobt_lookup(struct xfs_btree_cur *cur, xfs_agino_t ino, */ int xfs_inobt_get_rec(struct xfs_btree_cur *cur, xfs_inobt_rec_incore_t *rec, int *stat); +uint8_t xfs_inobt_rec_freecount(const struct xfs_inobt_rec_incore *irec); /* * Inode chunk initialisation routine @@ -93,6 +94,8 @@ union xfs_btree_rec; void xfs_inobt_btrec_to_irec(struct xfs_mount *mp, const union xfs_btree_rec *rec, struct xfs_inobt_rec_incore *irec); +xfs_failaddr_t xfs_inobt_check_perag_irec(struct xfs_perag *pag, + const struct xfs_inobt_rec_incore *irec); xfs_failaddr_t xfs_inobt_check_irec(struct xfs_btree_cur *cur, const struct xfs_inobt_rec_incore *irec); int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur, diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 4bba3c49f8c59..b6725b05fb417 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -605,6 +605,7 @@ xchk_ag_free( struct xchk_ag *sa) { xchk_ag_btcur_free(sa); + xrep_reset_perag_resv(sc); if (sa->agf_bp) { xfs_trans_brelse(sc->tp, sa->agf_bp); sa->agf_bp = NULL; diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c new file mode 100644 index 0000000000000..eb5fe9d9ba194 --- /dev/null +++ b/fs/xfs/scrub/ialloc_repair.c @@ -0,0 +1,874 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_alloc.h" +#include "xfs_ialloc.h" +#include "xfs_ialloc_btree.h" +#include "xfs_icache.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_log.h" +#include "xfs_trans_priv.h" +#include "xfs_error.h" +#include "xfs_health.h" +#include "xfs_ag.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/newbt.h" +#include "scrub/reap.h" + +/* + * Inode Btree Repair + * ================== + * + * A quick refresher of inode btrees on a v5 filesystem: + * + * - Inode records are read into memory in units of 'inode clusters'. However + * many inodes fit in a cluster buffer is the smallest number of inodes that + * can be allocated or freed. Clusters are never smaller than one fs block + * though they can span multiple blocks. The size (in fs blocks) is + * computed with xfs_icluster_size_fsb(). The fs block alignment of a + * cluster is computed with xfs_ialloc_cluster_alignment(). + * + * - Each inode btree record can describe a single 'inode chunk'. The chunk + * size is defined to be 64 inodes. If sparse inodes are enabled, every + * inobt record must be aligned to the chunk size; if not, every record must + * be aligned to the start of a cluster. It is possible to construct an XFS + * geometry where one inobt record maps to multiple inode clusters; it is + * also possible to construct a geometry where multiple inobt records map to + * different parts of one inode cluster. 
+ * + * - If sparse inodes are not enabled, the smallest unit of allocation for + * inode records is enough to contain one inode chunk's worth of inodes. + * + * - If sparse inodes are enabled, the holemask field will be active. Each + * bit of the holemask represents 4 potential inodes; if set, the + * corresponding space does *not* contain inodes and must be left alone. + * Clusters cannot be smaller than 4 inodes. The smallest unit of allocation + * of inode records is one inode cluster. + * + * So what's the rebuild algorithm? + * + * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT + * records. The OWN_INOBT records are the old inode btree blocks and will be + * cleared out after we've rebuilt the tree. Each possible inode cluster + * within an OWN_INODES record will be read in; for each possible inobt record + * associated with that cluster, compute the freemask calculated from the + * i_mode data in the inode chunk. For sparse inodes the holemask will be + * calculated by creating the properly aligned inobt record and punching out + * any chunk that's missing. Inode allocations and frees grab the AGI first, + * so repair protects itself from concurrent access by locking the AGI. + * + * Once we've reconstructed all the inode records, we can create new inode + * btree roots and reload the btrees. We rebuild both inode trees at the same + * time because they have the same rmap owner and it would be more complex to + * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT + * blocks it owns. We have all the data we need to build both, so dump + * everything and start over. + * + * We use the prefix 'xrep_ibt' because we rebuild both inode btrees at once. + */ + +struct xrep_ibt { + /* Record under construction. 
*/ + struct xfs_inobt_rec_incore rie; + + /* new inobt information */ + struct xrep_newbt new_inobt; + + /* new finobt information */ + struct xrep_newbt new_finobt; + + /* Old inode btree blocks we found in the rmap. */ + struct xagb_bitmap old_iallocbt_blocks; + + /* Reconstructed inode records. */ + struct xfarray *inode_records; + + struct xfs_scrub *sc; + + /* Number of inodes assigned disk space. */ + unsigned int icount; + + /* Number of inodes in use. */ + unsigned int iused; + + /* Number of finobt records needed. */ + unsigned int finobt_recs; + + /* get_records()'s position in the inode record array. */ + xfarray_idx_t array_cur; +}; + +/* + * Is this inode in use? If the inode is in memory we can tell from i_mode, + * otherwise we have to check di_mode in the on-disk buffer. We only care + * that the high (i.e. non-permission) bits of _mode are zero. This should be + * safe because repair keeps all AG headers locked until the end, and process + * trying to perform an inode allocation/free must lock the AGI. + * + * @cluster_ag_base is the inode offset of the cluster within the AG. + * @cluster_bp is the cluster buffer. + * @cluster_index is the inode offset within the inode cluster. 
+ */ +STATIC int +xrep_ibt_check_ifree( + struct xrep_ibt *ri, + xfs_agino_t cluster_ag_base, + struct xfs_buf *cluster_bp, + unsigned int cluster_index, + bool *inuse) +{ + struct xfs_scrub *sc = ri->sc; + struct xfs_mount *mp = sc->mp; + struct xfs_dinode *dip; + xfs_ino_t fsino; + xfs_agino_t agino; + xfs_agnumber_t agno = ri->sc->sa.pag->pag_agno; + unsigned int cluster_buf_base; + unsigned int offset; + int error; + + agino = cluster_ag_base + cluster_index; + fsino = XFS_AGINO_TO_INO(mp, agno, agino); + + /* Inode uncached or half assembled, read disk buffer */ + cluster_buf_base = XFS_INO_TO_OFFSET(mp, cluster_ag_base); + offset = (cluster_buf_base + cluster_index) * mp->m_sb.sb_inodesize; + if (offset >= BBTOB(cluster_bp->b_length)) + return -EFSCORRUPTED; + dip = xfs_buf_offset(cluster_bp, offset); + if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC) + return -EFSCORRUPTED; + + if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino) + return -EFSCORRUPTED; + + /* Will the in-core inode tell us if it's in use? */ + error = xchk_inode_is_allocated(sc, agino, inuse); + if (!error) + return 0; + + *inuse = dip->di_mode != 0; + return 0; +} + +/* Stash the accumulated inobt record for rebuilding. 
*/ +STATIC int +xrep_ibt_stash( + struct xrep_ibt *ri) +{ + int error = 0; + + if (xchk_should_terminate(ri->sc, &error)) + return error; + + ri->rie.ir_freecount = xfs_inobt_rec_freecount(&ri->rie); + if (xfs_inobt_check_perag_irec(ri->sc->sa.pag, &ri->rie) != NULL) + return -EFSCORRUPTED; + + if (ri->rie.ir_freecount > 0) + ri->finobt_recs++; + + trace_xrep_ibt_found(ri->sc->mp, ri->sc->sa.pag->pag_agno, &ri->rie); + + error = xfarray_append(ri->inode_records, &ri->rie); + if (error) + return error; + + ri->rie.ir_startino = NULLAGINO; + return 0; +} + +/* + * Given an extent of inodes and an inode cluster buffer, calculate the + * location of the corresponding inobt record (creating it if necessary), + * then update the parts of the holemask and freemask of that record that + * correspond to the inode extent we were given. + * + * @cluster_ir_startino is the AG inode number of an inobt record that we're + * proposing to create for this inode cluster. If sparse inodes are enabled, + * we must round down to a chunk boundary to find the actual sparse record. + * @cluster_bp is the buffer of the inode cluster. + * @nr_inodes is the number of inodes to check from the cluster. + */ +STATIC int +xrep_ibt_cluster_record( + struct xrep_ibt *ri, + xfs_agino_t cluster_ir_startino, + struct xfs_buf *cluster_bp, + unsigned int nr_inodes) +{ + struct xfs_scrub *sc = ri->sc; + struct xfs_mount *mp = sc->mp; + xfs_agino_t ir_startino; + unsigned int cluster_base; + unsigned int cluster_index; + int error = 0; + + ir_startino = cluster_ir_startino; + if (xfs_has_sparseinodes(mp)) + ir_startino = rounddown(ir_startino, XFS_INODES_PER_CHUNK); + cluster_base = cluster_ir_startino - ir_startino; + + /* + * If the accumulated inobt record doesn't map this cluster, add it to + * the list and reset it. 
+ */ + if (ri->rie.ir_startino != NULLAGINO && + ri->rie.ir_startino + XFS_INODES_PER_CHUNK <= ir_startino) { + error = xrep_ibt_stash(ri); + if (error) + return error; + } + + if (ri->rie.ir_startino == NULLAGINO) { + ri->rie.ir_startino = ir_startino; + ri->rie.ir_free = XFS_INOBT_ALL_FREE; + ri->rie.ir_holemask = 0xFFFF; + ri->rie.ir_count = 0; + } + + /* Record the whole cluster. */ + ri->icount += nr_inodes; + ri->rie.ir_count += nr_inodes; + ri->rie.ir_holemask &= ~xfs_inobt_maskn( + cluster_base / XFS_INODES_PER_HOLEMASK_BIT, + nr_inodes / XFS_INODES_PER_HOLEMASK_BIT); + + /* Which inodes within this cluster are free? */ + for (cluster_index = 0; cluster_index < nr_inodes; cluster_index++) { + bool inuse = false; + + error = xrep_ibt_check_ifree(ri, cluster_ir_startino, + cluster_bp, cluster_index, &inuse); + if (error) + return error; + if (!inuse) + continue; + ri->iused++; + ri->rie.ir_free &= ~XFS_INOBT_MASK(cluster_base + + cluster_index); + } + return 0; +} + +/* + * For each inode cluster covering the physical extent recorded by the rmapbt, + * we must calculate the properly aligned startino of that cluster, then + * iterate each cluster to fill in used and filled masks appropriately. We + * then use the (startino, used, filled) information to construct the + * appropriate inode records. + */ +STATIC int +xrep_ibt_process_cluster( + struct xrep_ibt *ri, + xfs_agblock_t cluster_bno) +{ + struct xfs_imap imap; + struct xfs_buf *cluster_bp; + struct xfs_scrub *sc = ri->sc; + struct xfs_mount *mp = sc->mp; + struct xfs_ino_geometry *igeo = M_IGEO(mp); + xfs_agino_t cluster_ag_base; + xfs_agino_t irec_index; + unsigned int nr_inodes; + int error; + + nr_inodes = min_t(unsigned int, igeo->inodes_per_cluster, + XFS_INODES_PER_CHUNK); + + /* + * Grab the inode cluster buffer. This is safe to do with a broken + * inobt because imap_to_bp directly maps the buffer without touching + * either inode btree. 
+ */ + imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.pag->pag_agno, cluster_bno); + imap.im_len = XFS_FSB_TO_BB(mp, igeo->blocks_per_cluster); + imap.im_boffset = 0; + error = xfs_imap_to_bp(mp, sc->tp, &imap, &cluster_bp); + if (error) + return error; + + /* + * Record the contents of each possible inobt record mapping this + * cluster. + */ + cluster_ag_base = XFS_AGB_TO_AGINO(mp, cluster_bno); + for (irec_index = 0; + irec_index < igeo->inodes_per_cluster; + irec_index += XFS_INODES_PER_CHUNK) { + error = xrep_ibt_cluster_record(ri, + cluster_ag_base + irec_index, cluster_bp, + nr_inodes); + if (error) + break; + + } + + xfs_trans_brelse(sc->tp, cluster_bp); + return error; +} + +/* Check for any obvious conflicts in the inode chunk extent. */ +STATIC int +xrep_ibt_check_inode_ext( + struct xfs_scrub *sc, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_ino_geometry *igeo = M_IGEO(mp); + xfs_agino_t agino; + enum xbtree_recpacking outcome; + int error; + + /* Inode records must be within the AG. */ + if (!xfs_verify_agbext(sc->sa.pag, agbno, len)) + return -EFSCORRUPTED; + + /* The entire record must align to the inode cluster size. */ + if (!IS_ALIGNED(agbno, igeo->blocks_per_cluster) || + !IS_ALIGNED(agbno + len, igeo->blocks_per_cluster)) + return -EFSCORRUPTED; + + /* + * The entire record must also adhere to the inode cluster alignment + * size if sparse inodes are not enabled. + */ + if (!xfs_has_sparseinodes(mp) && + (!IS_ALIGNED(agbno, igeo->cluster_align) || + !IS_ALIGNED(agbno + len, igeo->cluster_align))) + return -EFSCORRUPTED; + + /* + * On a sparse inode fs, this cluster could be part of a sparse chunk. + * Sparse clusters must be aligned to sparse chunk alignment. + */ + if (xfs_has_sparseinodes(mp) && + (!IS_ALIGNED(agbno, mp->m_sb.sb_spino_align) || + !IS_ALIGNED(agbno + len, mp->m_sb.sb_spino_align))) + return -EFSCORRUPTED; + + /* Make sure the entire range of blocks are valid AG inodes. 
*/ + agino = XFS_AGB_TO_AGINO(mp, agbno); + if (!xfs_verify_agino(sc->sa.pag, agino)) + return -EFSCORRUPTED; + + agino = XFS_AGB_TO_AGINO(mp, agbno + len) - 1; + if (!xfs_verify_agino(sc->sa.pag, agino)) + return -EFSCORRUPTED; + + /* Make sure this isn't free space. */ + error = xfs_alloc_has_records(sc->sa.bno_cur, agbno, len, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + return 0; +} + +/* Found a fragment of the old inode btrees; dispose of them later. */ +STATIC int +xrep_ibt_record_old_btree_blocks( + struct xrep_ibt *ri, + const struct xfs_rmap_irec *rec) +{ + if (!xfs_verify_agbext(ri->sc->sa.pag, rec->rm_startblock, + rec->rm_blockcount)) + return -EFSCORRUPTED; + + return xagb_bitmap_set(&ri->old_iallocbt_blocks, rec->rm_startblock, + rec->rm_blockcount); +} + +/* Record extents that belong to inode btrees. */ +STATIC int +xrep_ibt_walk_rmap( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + struct xrep_ibt *ri = priv; + struct xfs_mount *mp = cur->bc_mp; + struct xfs_ino_geometry *igeo = M_IGEO(mp); + xfs_agblock_t cluster_base; + int error = 0; + + if (xchk_should_terminate(ri->sc, &error)) + return error; + + if (rec->rm_owner == XFS_RMAP_OWN_INOBT) + return xrep_ibt_record_old_btree_blocks(ri, rec); + + /* Skip extents which are not owned by this inode and fork. */ + if (rec->rm_owner != XFS_RMAP_OWN_INODES) + return 0; + + error = xrep_ibt_check_inode_ext(ri->sc, rec->rm_startblock, + rec->rm_blockcount); + if (error) + return error; + + trace_xrep_ibt_walk_rmap(mp, ri->sc->sa.pag->pag_agno, + rec->rm_startblock, rec->rm_blockcount, rec->rm_owner, + rec->rm_offset, rec->rm_flags); + + /* + * Record the free/hole masks for each inode cluster that could be + * mapped by this rmap record. 
+ */ + for (cluster_base = 0; + cluster_base < rec->rm_blockcount; + cluster_base += igeo->blocks_per_cluster) { + error = xrep_ibt_process_cluster(ri, + rec->rm_startblock + cluster_base); + if (error) + return error; + } + + return 0; +} + +/* + * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode + * btrees (OWN_INOBT). Figure out if we have enough free space to reconstruct + * the inode btrees. The caller must clean up the lists if anything goes + * wrong. + */ +STATIC int +xrep_ibt_find_inodes( + struct xrep_ibt *ri) +{ + struct xfs_scrub *sc = ri->sc; + int error; + + ri->rie.ir_startino = NULLAGINO; + + /* Collect all reverse mappings for inode blocks. */ + xrep_ag_btcur_init(sc, &sc->sa); + error = xfs_rmap_query_all(sc->sa.rmap_cur, xrep_ibt_walk_rmap, ri); + xchk_ag_btcur_free(&sc->sa); + if (error) + return error; + + /* If we have a record ready to go, add it to the array. */ + if (ri->rie.ir_startino == NULLAGINO) + return 0; + + return xrep_ibt_stash(ri); +} + +/* Update the AGI counters. */ +STATIC int +xrep_ibt_reset_counters( + struct xrep_ibt *ri) +{ + struct xfs_scrub *sc = ri->sc; + struct xfs_agi *agi = sc->sa.agi_bp->b_addr; + unsigned int freecount = ri->icount - ri->iused; + + /* Trigger inode count recalculation */ + xfs_force_summary_recalc(sc->mp); + + /* + * The AGI header contains extra information related to the inode + * btrees, so we must update those fields here. + */ + agi->agi_count = cpu_to_be32(ri->icount); + agi->agi_freecount = cpu_to_be32(freecount); + xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, + XFS_AGI_COUNT | XFS_AGI_FREECOUNT); + + /* Reinitialize with the values we just logged. */ + return xrep_reinit_pagi(sc); +} + +/* Retrieve finobt data for bulk load. 
*/ +STATIC int +xrep_fibt_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_inobt_rec_incore *irec = &cur->bc_rec.i; + struct xrep_ibt *ri = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + do { + error = xfarray_load(ri->inode_records, + ri->array_cur++, irec); + } while (error == 0 && xfs_inobt_rec_freecount(irec) == 0); + if (error) + return error; + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Retrieve inobt data for bulk load. */ +STATIC int +xrep_ibt_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_inobt_rec_incore *irec = &cur->bc_rec.i; + struct xrep_ibt *ri = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + error = xfarray_load(ri->inode_records, ri->array_cur++, irec); + if (error) + return error; + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Feed one of the new inobt blocks to the bulk loader. */ +STATIC int +xrep_ibt_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_ibt *ri = priv; + + return xrep_newbt_claim_block(cur, &ri->new_inobt, ptr); +} + +/* Feed one of the new finobt blocks to the bulk loader. */ +STATIC int +xrep_fibt_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_ibt *ri = priv; + + return xrep_newbt_claim_block(cur, &ri->new_finobt, ptr); +} + +/* Make sure the records do not overlap in inumber address space. 
*/ +STATIC int +xrep_ibt_check_startino( + struct xrep_ibt *ri) +{ + struct xfs_inobt_rec_incore irec; + xfarray_idx_t cur; + xfs_agino_t next_agino = 0; + int error = 0; + + foreach_xfarray_idx(ri->inode_records, cur) { + if (xchk_should_terminate(ri->sc, &error)) + return error; + + error = xfarray_load(ri->inode_records, cur, &irec); + if (error) + return error; + + if (irec.ir_startino < next_agino) + return -EFSCORRUPTED; + + next_agino = irec.ir_startino + XFS_INODES_PER_CHUNK; + } + + return error; +} + +/* Build new inode btrees and dispose of the old one. */ +STATIC int +xrep_ibt_build_new_trees( + struct xrep_ibt *ri) +{ + struct xfs_scrub *sc = ri->sc; + struct xfs_btree_cur *ino_cur; + struct xfs_btree_cur *fino_cur = NULL; + xfs_fsblock_t fsbno; + bool need_finobt; + int error; + + need_finobt = xfs_has_finobt(sc->mp); + + /* + * Create new btrees for staging all the inobt records we collected + * earlier. The records were collected in order of increasing agino, + * so we do not have to sort them. Ensure there are no overlapping + * records. + */ + error = xrep_ibt_check_startino(ri); + if (error) + return error; + + /* + * The new inode btrees will not be rooted in the AGI until we've + * successfully rebuilt the tree. + * + * Start by setting up the inobt staging cursor. + */ + fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno, + XFS_IBT_BLOCK(sc->mp)), + xrep_newbt_init_ag(&ri->new_inobt, sc, &XFS_RMAP_OINFO_INOBT, fsbno, + XFS_AG_RESV_NONE); + ri->new_inobt.bload.claim_block = xrep_ibt_claim_block; + ri->new_inobt.bload.get_records = xrep_ibt_get_records; + + ino_cur = xfs_inobt_stage_cursor(sc->sa.pag, &ri->new_inobt.afake, + XFS_BTNUM_INO); + error = xfs_btree_bload_compute_geometry(ino_cur, &ri->new_inobt.bload, + xfarray_length(ri->inode_records)); + if (error) + goto err_inocur; + + /* Set up finobt staging cursor. 
*/ + if (need_finobt) { + enum xfs_ag_resv_type resv = XFS_AG_RESV_METADATA; + + if (sc->mp->m_finobt_nores) + resv = XFS_AG_RESV_NONE; + + fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno, + XFS_FIBT_BLOCK(sc->mp)), + xrep_newbt_init_ag(&ri->new_finobt, sc, &XFS_RMAP_OINFO_INOBT, + fsbno, resv); + ri->new_finobt.bload.claim_block = xrep_fibt_claim_block; + ri->new_finobt.bload.get_records = xrep_fibt_get_records; + + fino_cur = xfs_inobt_stage_cursor(sc->sa.pag, + &ri->new_finobt.afake, XFS_BTNUM_FINO); + error = xfs_btree_bload_compute_geometry(fino_cur, + &ri->new_finobt.bload, ri->finobt_recs); + if (error) + goto err_finocur; + } + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + goto err_finocur; + + /* Reserve all the space we need to build the new btrees. */ + error = xrep_newbt_alloc_blocks(&ri->new_inobt, + ri->new_inobt.bload.nr_blocks); + if (error) + goto err_finocur; + + if (need_finobt) { + error = xrep_newbt_alloc_blocks(&ri->new_finobt, + ri->new_finobt.bload.nr_blocks); + if (error) + goto err_finocur; + } + + /* Add all inobt records. */ + ri->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(ino_cur, &ri->new_inobt.bload, ri); + if (error) + goto err_finocur; + + /* Add all finobt records. */ + if (need_finobt) { + ri->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(fino_cur, &ri->new_finobt.bload, ri); + if (error) + goto err_finocur; + } + + /* + * Install the new btrees in the AG header. After this point the old + * btrees are no longer accessible and the new trees are live. + */ + xfs_inobt_commit_staged_btree(ino_cur, sc->tp, sc->sa.agi_bp); + xfs_btree_del_cursor(ino_cur, 0); + + if (fino_cur) { + xfs_inobt_commit_staged_btree(fino_cur, sc->tp, sc->sa.agi_bp); + xfs_btree_del_cursor(fino_cur, 0); + } + + /* Reset the AGI counters now that we've changed the inode roots. 
*/ + error = xrep_ibt_reset_counters(ri); + if (error) + goto err_finobt; + + /* Free unused blocks and bitmap. */ + if (need_finobt) { + error = xrep_newbt_commit(&ri->new_finobt); + if (error) + goto err_inobt; + } + error = xrep_newbt_commit(&ri->new_inobt); + if (error) + return error; + + return xrep_roll_ag_trans(sc); + +err_finocur: + if (need_finobt) + xfs_btree_del_cursor(fino_cur, error); +err_inocur: + xfs_btree_del_cursor(ino_cur, error); +err_finobt: + if (need_finobt) + xrep_newbt_cancel(&ri->new_finobt); +err_inobt: + xrep_newbt_cancel(&ri->new_inobt); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. + */ +STATIC int +xrep_ibt_remove_old_trees( + struct xrep_ibt *ri) +{ + struct xfs_scrub *sc = ri->sc; + int error; + + /* + * Free the old inode btree blocks if they're not in use. It's ok to + * reap with XFS_AG_RESV_NONE even if the finobt had a per-AG + * reservation because we reset the reservation before releasing the + * AGI and AGF header buffer locks. + */ + error = xrep_reap_agblocks(sc, &ri->old_iallocbt_blocks, + &XFS_RMAP_OINFO_INOBT, XFS_AG_RESV_NONE); + if (error) + return error; + + /* + * If the finobt is enabled and has a per-AG reservation, make sure we + * reinitialize the per-AG reservations. + */ + if (xfs_has_finobt(sc->mp) && !sc->mp->m_finobt_nores) + sc->flags |= XREP_RESET_PERAG_RESV; + + return 0; +} + +/* Repair both inode btrees. */ +int +xrep_iallocbt( + struct xfs_scrub *sc) +{ + struct xrep_ibt *ri; + struct xfs_mount *mp = sc->mp; + char *descr; + xfs_agino_t first_agino, last_agino; + int error = 0; + + /* We require the rmapbt to rebuild anything. */ + if (!xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; + + ri = kzalloc(sizeof(struct xrep_ibt), XCHK_GFP_FLAGS); + if (!ri) + return -ENOMEM; + ri->sc = sc; + + /* We rebuild both inode btrees. 
*/ + sc->sick_mask = XFS_SICK_AG_INOBT | XFS_SICK_AG_FINOBT; + + /* Set up enough storage to handle an AG with nothing but inodes. */ + xfs_agino_range(mp, sc->sa.pag->pag_agno, &first_agino, &last_agino); + last_agino /= XFS_INODES_PER_CHUNK; + descr = xchk_xfile_ag_descr(sc, "inode index records"); + error = xfarray_create(descr, last_agino, + sizeof(struct xfs_inobt_rec_incore), + &ri->inode_records); + kfree(descr); + if (error) + goto out_ri; + + /* Collect the inode data and find the old btree blocks. */ + xagb_bitmap_init(&ri->old_iallocbt_blocks); + error = xrep_ibt_find_inodes(ri); + if (error) + goto out_bitmap; + + /* Rebuild the inode indexes. */ + error = xrep_ibt_build_new_trees(ri); + if (error) + goto out_bitmap; + + /* Kill the old tree. */ + error = xrep_ibt_remove_old_trees(ri); + if (error) + goto out_bitmap; + +out_bitmap: + xagb_bitmap_destroy(&ri->old_iallocbt_blocks); + xfarray_destroy(ri->inode_records); +out_ri: + kfree(ri); + return error; +} + +/* Make sure both btrees are ok after we've rebuilt them. */ +int +xrep_revalidate_iallocbt( + struct xfs_scrub *sc) +{ + __u32 old_type = sc->sm->sm_type; + int error; + + /* + * We must update sm_type temporarily so that the tree-to-tree cross + * reference checks will work in the correct direction, and also so + * that tracing will report correctly if there are more errors. + */ + sc->sm->sm_type = XFS_SCRUB_TYPE_INOBT; + error = xchk_inobt(sc); + if (error) + goto out; + + if (xfs_has_finobt(sc->mp)) { + sc->sm->sm_type = XFS_SCRUB_TYPE_FINOBT; + error = xchk_finobt(sc); + } + +out: + sc->sm->sm_type = old_type; + return error; +} diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 828c0585701a4..ad1df212ec4c1 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -803,3 +803,62 @@ xrep_reinit_pagf( return 0; } + +/* + * Reinitialize the in-core AG state after a repair by rereading the AGI + * buffer. 
We had better get the same AGI buffer as the one that's attached + * to the scrub context. + */ +int +xrep_reinit_pagi( + struct xfs_scrub *sc) +{ + struct xfs_perag *pag = sc->sa.pag; + struct xfs_buf *bp; + int error; + + ASSERT(pag); + ASSERT(xfs_perag_initialised_agi(pag)); + + clear_bit(XFS_AGSTATE_AGI_INIT, &pag->pag_opstate); + error = xfs_ialloc_read_agi(pag, sc->tp, &bp); + if (error) + return error; + + if (bp != sc->sa.agi_bp) { + ASSERT(bp == sc->sa.agi_bp); + return -EFSCORRUPTED; + } + + return 0; +} + +/* Reinitialize the per-AG block reservation for the AG we just fixed. */ +int +xrep_reset_perag_resv( + struct xfs_scrub *sc) +{ + int error; + + if (!(sc->flags & XREP_RESET_PERAG_RESV)) + return 0; + + ASSERT(sc->sa.pag != NULL); + ASSERT(sc->ops->type == ST_PERAG); + ASSERT(sc->tp); + + sc->flags &= ~XREP_RESET_PERAG_RESV; + error = xfs_ag_resv_free(sc->sa.pag); + if (error) + goto out; + error = xfs_ag_resv_init(sc->sa.pag, sc->tp); + if (error == -ENOSPC) { + xfs_err(sc->mp, +"Insufficient free space to reset per-AG reservation for AG %u after repair.", + sc->sa.pag->pag_agno); + error = 0; + } + +out: + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index bc3353ecae8a1..05bd55430e6eb 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -59,6 +59,7 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp, struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp); void xrep_force_quotacheck(struct xfs_scrub *sc, xfs_dqtype_t type); int xrep_ino_dqattach(struct xfs_scrub *sc); +int xrep_reset_perag_resv(struct xfs_scrub *sc); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); @@ -68,6 +69,7 @@ void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa); /* Metadata revalidators */ int xrep_revalidate_allocbt(struct xfs_scrub *sc); +int xrep_revalidate_iallocbt(struct xfs_scrub *sc); /* Metadata repairers */ @@ -77,8 +79,10 @@ int xrep_agf(struct 
xfs_scrub *sc); int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); +int xrep_iallocbt(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); +int xrep_reinit_pagi(struct xfs_scrub *sc); #else @@ -99,6 +103,17 @@ xrep_calc_ag_resblks( return 0; } +static inline int +xrep_reset_perag_resv( + struct xfs_scrub *sc) +{ + if (!(sc->flags & XREP_RESET_PERAG_RESV)) + return 0; + + ASSERT(0); + return -EOPNOTSUPP; +} + /* repair setup functions for no-repair */ static inline int xrep_setup_nothing( @@ -109,6 +124,7 @@ xrep_setup_nothing( #define xrep_setup_ag_allocbt xrep_setup_nothing #define xrep_revalidate_allocbt (NULL) +#define xrep_revalidate_iallocbt (NULL) #define xrep_probe xrep_notsupported #define xrep_superblock xrep_notsupported @@ -116,6 +132,7 @@ xrep_setup_nothing( #define xrep_agfl xrep_notsupported #define xrep_agi xrep_notsupported #define xrep_allocbt xrep_notsupported +#define xrep_iallocbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index ebfaeb3793154..dfa949666e291 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -255,14 +255,16 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_PERAG, .setup = xchk_setup_ag_iallocbt, .scrub = xchk_inobt, - .repair = xrep_notsupported, + .repair = xrep_iallocbt, + .repair_eval = xrep_revalidate_iallocbt, }, [XFS_SCRUB_TYPE_FINOBT] = { /* finobt */ .type = ST_PERAG, .setup = xchk_setup_ag_iallocbt, .scrub = xchk_finobt, .has = xfs_has_finobt, - .repair = xrep_notsupported, + .repair = xrep_iallocbt, + .repair_eval = xrep_revalidate_iallocbt, }, [XFS_SCRUB_TYPE_RMAPBT] = { /* rmapbt */ .type = ST_PERAG, diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index cb6294e629836..4e6edff57b857 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -121,6 +121,7 @@ struct xfs_scrub { #define XCHK_HAVE_FREEZE_PROT (1U << 1) /* do we have 
freeze protection? */ #define XCHK_FSGATES_DRAIN (1U << 2) /* defer ops draining enabled */ #define XCHK_NEED_DRAIN (1U << 3) /* scrub needs to drain defer ops */ +#define XREP_RESET_PERAG_RESV (1U << 30) /* must reset AG space reservation */ #define XREP_ALREADY_FIXED (1U << 31) /* checking our repair work */ /* diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index ea518712efa81..c60f76231f0c7 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -106,6 +106,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS); { XCHK_HAVE_FREEZE_PROT, "nofreeze" }, \ { XCHK_FSGATES_DRAIN, "fsgates_drain" }, \ { XCHK_NEED_DRAIN, "need_drain" }, \ + { XREP_RESET_PERAG_RESV, "reset_perag_resv" }, \ { XREP_ALREADY_FIXED, "already_fixed" } DECLARE_EVENT_CLASS(xchk_class, @@ -1172,7 +1173,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \ xfs_agblock_t agbno, xfs_extlen_t len, \ uint64_t owner, uint64_t offset, unsigned int flags), \ TP_ARGS(mp, agno, agbno, len, owner, offset, flags)) -DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn); +DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap); DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn); DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn); @@ -1199,6 +1200,38 @@ TRACE_EVENT(xrep_abt_found, __entry->blockcount) ) +TRACE_EVENT(xrep_ibt_found, + TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, + const struct xfs_inobt_rec_incore *rec), + TP_ARGS(mp, agno, rec), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agino_t, startino) + __field(uint16_t, holemask) + __field(uint8_t, count) + __field(uint8_t, freecount) + __field(uint64_t, freemask) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->agno = agno; + __entry->startino = rec->ir_startino; + __entry->holemask = rec->ir_holemask; + __entry->count = rec->ir_count; + __entry->freecount = rec->ir_freecount; + __entry->freemask = rec->ir_free; + ), + TP_printk("dev %d:%d agno 0x%x agino 0x%x holemask 0x%x count 0x%x freecount 0x%x 
freemask 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->startino, + __entry->holemask, + __entry->count, + __entry->freecount, + __entry->freemask) +) + TRACE_EVENT(xrep_refcount_extent_fn, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, struct xfs_refcount_irec *irec), @@ -1321,39 +1354,6 @@ TRACE_EVENT(xrep_reset_counters, MAJOR(__entry->dev), MINOR(__entry->dev)) ) -TRACE_EVENT(xrep_ialloc_insert, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - xfs_agino_t startino, uint16_t holemask, uint8_t count, - uint8_t freecount, uint64_t freemask), - TP_ARGS(mp, agno, startino, holemask, count, freecount, freemask), - TP_STRUCT__entry( - __field(dev_t, dev) - __field(xfs_agnumber_t, agno) - __field(xfs_agino_t, startino) - __field(uint16_t, holemask) - __field(uint8_t, count) - __field(uint8_t, freecount) - __field(uint64_t, freemask) - ), - TP_fast_assign( - __entry->dev = mp->m_super->s_dev; - __entry->agno = agno; - __entry->startino = startino; - __entry->holemask = holemask; - __entry->count = count; - __entry->freecount = freecount; - __entry->freemask = freemask; - ), - TP_printk("dev %d:%d agno 0x%x startino 0x%x holemask 0x%x count %u freecount %u freemask 0x%llx", - MAJOR(__entry->dev), MINOR(__entry->dev), - __entry->agno, - __entry->startino, - __entry->holemask, - __entry->count, - __entry->freecount, - __entry->freemask) -) - DECLARE_EVENT_CLASS(xrep_newbt_extent_class, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, xfs_extlen_t len, ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 4/5] xfs: repair inode btrees 2023-11-24 23:50 ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong @ 2023-11-25 6:12 ` Christoph Hellwig 2023-11-28 1:09 ` Darrick J. Wong 2023-11-28 15:57 ` Christoph Hellwig 1 sibling, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 6:12 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs > +/* Simple checks for inode records. */ > +xfs_failaddr_t > +xfs_inobt_check_irec( > + struct xfs_btree_cur *cur, > + const struct xfs_inobt_rec_incore *irec) > +{ > + return xfs_inobt_check_perag_irec(cur->bc_ag.pag, irec); > +} Same comment about just dropping the wrapper. Otherwise I'll need more digestion time for the new code. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/5] xfs: repair inode btrees 2023-11-25 6:12 ` Christoph Hellwig @ 2023-11-28 1:09 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 1:09 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Fri, Nov 24, 2023 at 10:12:45PM -0800, Christoph Hellwig wrote: > > +/* Simple checks for inode records. */ > > +xfs_failaddr_t > > +xfs_inobt_check_irec( > > + struct xfs_btree_cur *cur, > > + const struct xfs_inobt_rec_incore *irec) > > +{ > > + return xfs_inobt_check_perag_irec(cur->bc_ag.pag, irec); > > +} > > Same comment about just dropping the wrapper. Otherwise I'll > need more digestion time for the new code. Done. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/5] xfs: repair inode btrees 2023-11-24 23:50 ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong 2023-11-25 6:12 ` Christoph Hellwig @ 2023-11-28 15:57 ` Christoph Hellwig 2023-11-28 21:37 ` Darrick J. Wong 1 sibling, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 15:57 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs This generally looks good to me. A bunch of my superficial comments on the previous patch apply here as well; I'm not going to repeat them, but I have a bunch of new, just-as-nitpicky ones: > + uint64_t realfree; > > + if (!xfs_inobt_issparse(irec->ir_holemask)) > + realfree = irec->ir_free; > + else > + realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec); Nit: I'd write this as: uint64_t realfree = irec->ir_free; if (xfs_inobt_issparse(irec->ir_holemask)) realfree &= xfs_inobt_irec_to_allocmask(irec); return hweight64(realfree); to simplify the logic a bit (and yes, I see the snippet was just copied out of an existing function). > +/* Record extents that belong to inode btrees. */ > +STATIC int > +xrep_ibt_walk_rmap( > + struct xfs_btree_cur *cur, > + const struct xfs_rmap_irec *rec, > + void *priv) > +{ > + struct xrep_ibt *ri = priv; > + struct xfs_mount *mp = cur->bc_mp; > + struct xfs_ino_geometry *igeo = M_IGEO(mp); > + xfs_agblock_t cluster_base; > + int error = 0; > + > + if (xchk_should_terminate(ri->sc, &error)) > + return error; > + > + if (rec->rm_owner == XFS_RMAP_OWN_INOBT) > + return xrep_ibt_record_old_btree_blocks(ri, rec); > + > + /* Skip extents which are not owned by this inode and fork. */ > + if (rec->rm_owner != XFS_RMAP_OWN_INODES) > + return 0; The "Skip extents.." comment is clearly wrong and looks like it got here by accident.
And my ocaml-trained mind screams for a switch statement and another helper for the rest of the function body here: switch (rec->rm_owner) { case XFS_RMAP_OWN_INOBT: return xrep_ibt_record_old_btree_blocks(ri, rec); case XFS_RMAP_OWN_INODES: return xrep_ibt_record_inode_blocks(mp, ri, rec); default: return 0; > + /* If we have a record ready to go, add it to the array. */ > + if (ri->rie.ir_startino == NULLAGINO) > + return 0; > + > + return xrep_ibt_stash(ri); > +} Superficial, but having the logic inverted from the comment makes my brain a little dizzy. Anything against: if (ri->rie.ir_startino != NULLAGINO) error = xrep_ibt_stash(ri); return error; ? > +/* Make sure the records do not overlap in inumber address space. */ > +STATIC int > +xrep_ibt_check_startino( Would xrep_ibt_check_overlap be a better name here? ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/5] xfs: repair inode btrees 2023-11-28 15:57 ` Christoph Hellwig @ 2023-11-28 21:37 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 21:37 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs On Tue, Nov 28, 2023 at 07:57:20AM -0800, Christoph Hellwig wrote: > This generally looks good to me. > > A bunch of my superficial comments to the previous patch apply > here as well, but I'm not going to repeat them, but I have a bunch of > new just as nitpicky ones: I already fixed the nitpicks from yesterday. :) > > + uint64_t realfree; > > > > + if (!xfs_inobt_issparse(irec->ir_holemask)) > > + realfree = irec->ir_free; > > + else > > + realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec); > > Nit: > > I'd write this as: > > > uint64_t realfree = irec->ir_free; > > if (xfs_inobt_issparse(irec->ir_holemask)) > realfree &= xfs_inobt_irec_to_allocmask(irec); > return hweight64(realfree); > > to simplify the logic a bit (and yes, I see the snippet was just copied > out of an existing function). Ok. > > +/* Record extents that belong to inode btrees. */ > > +STATIC int > > +xrep_ibt_walk_rmap( > > + struct xfs_btree_cur *cur, > > + const struct xfs_rmap_irec *rec, > > + void *priv) > > +{ > > + struct xrep_ibt *ri = priv; > > + struct xfs_mount *mp = cur->bc_mp; > > + struct xfs_ino_geometry *igeo = M_IGEO(mp); > > + xfs_agblock_t cluster_base; > > + int error = 0; > > + > > + if (xchk_should_terminate(ri->sc, &error)) > > + return error; > > + > > + if (rec->rm_owner == XFS_RMAP_OWN_INOBT) > > + return xrep_ibt_record_old_btree_blocks(ri, rec); > > + > > + /* Skip extents which are not owned by this inode and fork. */ > > + if (rec->rm_owner != XFS_RMAP_OWN_INODES) > > + return 0; > > The "Skip extents.." comment is clearly wrong and looks like it got > here by accident. Yep. "Skip mappings that are not inode records.", sorry about that.
> And my ocaml-trained mind screams for a switch > statement and another helper for the rest of the function body here: > > switch (rec->rm_owner) { > case XFS_RMAP_OWN_INOBT: > return xrep_ibt_record_old_btree_blocks(ri, rec); > case XFS_RMAP_OWN_INODES: > return xrep_ibt_record_inode_blocks(mp, ri, rec); > default: > return 0; Sounds good to me. > > + /* If we have a record ready to go, add it to the array. */ > > + if (ri->rie.ir_startino == NULLAGINO) > > + return 0; > > + > > + return xrep_ibt_stash(ri); > > +} > > Superficial, but having the logic inverted from the comment makes > my brain a little dizzy. Anything against: > > if (ri->rie.ir_startino != NULLAGINO) > error = xrep_ibt_stash(ri); > > return error; > > ? Done. > > +/* Make sure the records do not overlap in inumber address space. */ > > +STATIC int > > +xrep_ibt_check_startino( > > Would xrep_ibt_check_overlap be a better name here? Yes! Thank you for the suggestion. --D > > ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/5] xfs: repair refcount btrees 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong ` (3 preceding siblings ...) 2023-11-24 23:50 ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong @ 2023-11-24 23:51 ` Darrick J. Wong 2023-11-28 16:07 ` Christoph Hellwig 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:51 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Reconstruct the refcount data from the rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_ag.h | 1 fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_refcount.c | 18 + fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 + fs/xfs/scrub/refcount_repair.c | 793 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 22 + 11 files changed, 864 insertions(+), 18 deletions(-) create mode 100644 fs/xfs/scrub/refcount_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 7fed0e706cfa0..a6f708dc56cc2 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -185,6 +185,7 @@ xfs-y += $(addprefix scrub/, \ ialloc_repair.o \ newbt.o \ reap.o \ + refcount_repair.o \ repair.o \ ) endif diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 686f4eadd5743..616812911a23f 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -87,6 +87,7 @@ struct xfs_perag { * verifiers while rebuilding the AG btrees. 
*/ uint8_t pagf_alt_levels[XFS_BTNUM_AGF]; + uint8_t pagf_alt_refcount_level; #endif spinlock_t pag_state_lock; diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index c100e92140be1..ea8d3659df208 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -5212,3 +5212,29 @@ xfs_btree_destroy_cur_caches(void) xfs_rmapbt_destroy_cur_cache(); xfs_refcountbt_destroy_cur_cache(); } + +/* Move the btree cursor before the first record. */ +int +xfs_btree_goto_left_edge( + struct xfs_btree_cur *cur) +{ + int stat = 0; + int error; + + memset(&cur->bc_rec, 0, sizeof(cur->bc_rec)); + error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat); + if (error) + return error; + if (!stat) + return 0; + + error = xfs_btree_decrement(cur, 0, &stat); + if (error) + return error; + if (stat != 0) { + ASSERT(0); + return -EFSCORRUPTED; + } + + return 0; +} diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index e0875cec49392..d906324e25c86 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -738,4 +738,6 @@ xfs_btree_alloc_cursor( int __init xfs_btree_init_cur_caches(void); void xfs_btree_destroy_cur_caches(void); +int xfs_btree_goto_left_edge(struct xfs_btree_cur *cur); + #endif /* __XFS_BTREE_H__ */ diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index 3702b4a071100..5fa1a6f32c17d 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -120,14 +120,11 @@ xfs_refcount_btrec_to_irec( irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount); } -/* Simple checks for refcount records. 
*/ -xfs_failaddr_t -xfs_refcount_check_irec( - struct xfs_btree_cur *cur, +inline xfs_failaddr_t +xfs_refcount_check_perag_irec( + struct xfs_perag *pag, const struct xfs_refcount_irec *irec) { - struct xfs_perag *pag = cur->bc_ag.pag; - if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) return __this_address; @@ -144,6 +141,15 @@ xfs_refcount_check_irec( return NULL; } +/* Simple checks for refcount records. */ +xfs_failaddr_t +xfs_refcount_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *irec) +{ + return xfs_refcount_check_perag_irec(cur->bc_ag.pag, irec); +} + static inline int xfs_refcount_complain_bad_rec( struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index 783cd89ca1951..2d6fecb258bb1 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -117,6 +117,8 @@ extern int xfs_refcount_has_records(struct xfs_btree_cur *cur, union xfs_btree_rec; extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_refcount_irec *irec); +xfs_failaddr_t xfs_refcount_check_perag_irec(struct xfs_perag *pag, + const struct xfs_refcount_irec *irec); xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur, const struct xfs_refcount_irec *irec); extern int xfs_refcount_insert(struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index 3fa795e2488dd..dc5d6d4d6316c 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -226,7 +226,18 @@ xfs_refcountbt_verify( level = be16_to_cpu(block->bb_level); if (pag && xfs_perag_initialised_agf(pag)) { - if (level >= pag->pagf_refcount_level) + unsigned int maxlevel = pag->pagf_refcount_level; + +#ifdef CONFIG_XFS_ONLINE_REPAIR + /* + * Online repair could be rewriting the refcount btree, so + * we'll validate against the larger of either tree while this + * is going on. 
+ */ + maxlevel = max_t(unsigned int, maxlevel, + pag->pagf_alt_refcount_level); +#endif + if (level >= maxlevel) return __this_address; } else if (level >= mp->m_refc_maxlevels) return __this_address; diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c new file mode 100644 index 0000000000000..9fb050e23c0bd --- /dev/null +++ b/fs/xfs/scrub/refcount_repair.c @@ -0,0 +1,793 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_inode.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_alloc.h" +#include "xfs_ialloc.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_refcount.h" +#include "xfs_refcount_btree.h" +#include "xfs_error.h" +#include "xfs_ag.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/newbt.h" +#include "scrub/reap.h" + +/* + * Rebuilding the Reference Count Btree + * ==================================== + * + * This algorithm is "borrowed" from xfs_repair. Imagine the rmap + * entries as rectangles representing extents of physical blocks, and + * that the rectangles can be laid down to allow them to overlap each + * other; then we know that we must emit a refcnt btree entry wherever + * the amount of overlap changes, i.e. 
the emission stimulus is + * level-triggered: + * + * - --- + * -- ----- ---- --- ------ + * -- ---- ----------- ---- --------- + * -------------------------------- ----------- + * ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + * 2 1 23 21 3 43 234 2123 1 01 2 3 0 + * + * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner). + * + * Note that in the actual refcnt btree we don't store the refcount < 2 + * cases because the bnobt tells us which blocks are free; single-use + * blocks aren't recorded in the bnobt or the refcntbt. If the rmapbt + * supports storing multiple entries covering a given block we could + * theoretically dispense with the refcntbt and simply count rmaps, but + * that's inefficient in the (hot) write path, so we'll take the cost of + * the extra tree to save time. Also there's no guarantee that rmap + * will be enabled. + * + * Given an array of rmaps sorted by physical block number, a starting + * physical block (sp), a bag to hold rmaps that cover sp, and the next + * physical block where the level changes (np), we can reconstruct the + * refcount btree as follows: + * + * While there are still unprocessed rmaps in the array, + * - Set sp to the physical block (pblk) of the next unprocessed rmap. + * - Add to the bag all rmaps in the array where startblock == sp. + * - Set np to the physical block where the bag size will change. This + * is the minimum of (the pblk of the next unprocessed rmap) and + * (startblock + len of each rmap in the bag). + * - Record the bag size as old_bag_size. + * + * - While the bag isn't empty, + * - Remove from the bag all rmaps where startblock + len == np. + * - Add to the bag all rmaps in the array where startblock == np. + * - If the bag size isn't old_bag_size, store the refcount entry + * (sp, np - sp, bag_size) in the refcnt btree. + * - If the bag is empty, break out of the inner loop. + * - Set old_bag_size to the bag size + * - Set sp = np. 
+ * - Set np to the physical block where the bag size will change. + * This is the minimum of (the pblk of the next unprocessed rmap) + * and (startblock + len of each rmap in the bag). + * + * Like all the other repairers, we make a list of all the refcount + * records we need, then reinitialize the refcount btree root and + * insert all the records. + */ + +/* The only parts of the rmap that we care about for computing refcounts. */ +struct xrep_refc_rmap { + xfs_agblock_t startblock; + xfs_extlen_t blockcount; +} __packed; + +struct xrep_refc { + /* refcount extents */ + struct xfarray *refcount_records; + + /* new refcountbt information */ + struct xrep_newbt new_btree; + + /* old refcountbt blocks */ + struct xagb_bitmap old_refcountbt_blocks; + + struct xfs_scrub *sc; + + /* get_records()'s position in the refcount record array. */ + xfarray_idx_t array_cur; + + /* # of refcountbt blocks */ + xfs_extlen_t btblocks; +}; + +/* Check for any obvious conflicts with this shared/CoW staging extent. */ +STATIC int +xrep_refc_check_ext( + struct xfs_scrub *sc, + const struct xfs_refcount_irec *rec) +{ + enum xbtree_recpacking outcome; + int error; + + if (xfs_refcount_check_perag_irec(sc->sa.pag, rec) != NULL) + return -EFSCORRUPTED; + + /* Make sure this isn't free space. */ + error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rc_startblock, + rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + /* Must not be an inode chunk. */ + error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur, + rec->rc_startblock, rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + return 0; +} + +/* Record a reference count extent. 
*/ +STATIC int +xrep_refc_stash( + struct xrep_refc *rr, + enum xfs_refc_domain domain, + xfs_agblock_t agbno, + xfs_extlen_t len, + uint64_t refcount) +{ + struct xfs_refcount_irec irec = { + .rc_startblock = agbno, + .rc_blockcount = len, + .rc_domain = domain, + }; + struct xfs_scrub *sc = rr->sc; + int error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + irec.rc_refcount = min_t(uint64_t, MAXREFCOUNT, refcount); + + error = xrep_refc_check_ext(rr->sc, &irec); + if (error) + return error; + + trace_xrep_refc_found(sc->sa.pag, &irec); + + return xfarray_append(rr->refcount_records, &irec); +} + +/* Record a CoW staging extent. */ +STATIC int +xrep_refc_stash_cow( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + return xrep_refc_stash(rr, XFS_REFC_DOMAIN_COW, agbno, len, 1); +} + +/* Decide if an rmap could describe a shared extent. */ +static inline bool +xrep_refc_rmap_shareable( + struct xfs_mount *mp, + const struct xfs_rmap_irec *rmap) +{ + /* AG metadata are never sharable */ + if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner)) + return false; + + /* Metadata in files are never shareable */ + if (xfs_internal_inum(mp, rmap->rm_owner)) + return false; + + /* Metadata and unwritten file blocks are not shareable. */ + if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK | + XFS_RMAP_UNWRITTEN)) + return false; + + return true; +} + +/* + * Walk along the reverse mapping records until we find one that could describe + * a shared extent. + */ +STATIC int +xrep_refc_walk_rmaps( + struct xrep_refc *rr, + struct xrep_refc_rmap *rrm, + bool *have_rec) +{ + struct xfs_rmap_irec rmap; + struct xfs_btree_cur *cur = rr->sc->sa.rmap_cur; + struct xfs_mount *mp = cur->bc_mp; + int have_gt; + int error = 0; + + *have_rec = false; + + /* + * Loop through the remaining rmaps. Remember CoW staging + * extents and the refcountbt blocks from the old tree for later + * disposal. 
We can only share written data fork extents, so + * keep looping until we find an rmap for one. + */ + do { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfs_btree_increment(cur, 0, &have_gt); + if (error) + return error; + if (!have_gt) + return 0; + + error = xfs_rmap_get_rec(cur, &rmap, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(mp, !have_gt)) + return -EFSCORRUPTED; + + if (rmap.rm_owner == XFS_RMAP_OWN_COW) { + error = xrep_refc_stash_cow(rr, rmap.rm_startblock, + rmap.rm_blockcount); + if (error) + return error; + } else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) { + /* refcountbt block, dump it when we're done. */ + rr->btblocks += rmap.rm_blockcount; + error = xagb_bitmap_set(&rr->old_refcountbt_blocks, + rmap.rm_startblock, rmap.rm_blockcount); + if (error) + return error; + } + } while (!xrep_refc_rmap_shareable(mp, &rmap)); + + rrm->startblock = rmap.rm_startblock; + rrm->blockcount = rmap.rm_blockcount; + *have_rec = true; + return 0; +} + +static inline uint32_t +xrep_refc_encode_startblock( + const struct xfs_refcount_irec *irec) +{ + uint32_t start; + + start = irec->rc_startblock & ~XFS_REFC_COWFLAG; + if (irec->rc_domain == XFS_REFC_DOMAIN_COW) + start |= XFS_REFC_COWFLAG; + + return start; +} + +/* Sort in the same order as the ondisk records. */ +static int +xrep_refc_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_refcount_irec *ap = a; + const struct xfs_refcount_irec *bp = b; + uint32_t sa, sb; + + sa = xrep_refc_encode_startblock(ap); + sb = xrep_refc_encode_startblock(bp); + + if (sa > sb) + return 1; + if (sa < sb) + return -1; + return 0; +} + +/* + * Sort the refcount extents by startblock or else the btree records will be in + * the wrong order. Make sure the records do not overlap in physical space. 
+ */ +STATIC int +xrep_refc_sort_records( + struct xrep_refc *rr) +{ + struct xfs_refcount_irec irec; + xfarray_idx_t cur; + enum xfs_refc_domain dom = XFS_REFC_DOMAIN_SHARED; + xfs_agblock_t next_agbno = 0; + int error; + + error = xfarray_sort(rr->refcount_records, xrep_refc_extent_cmp, + XFARRAY_SORT_KILLABLE); + if (error) + return error; + + foreach_xfarray_idx(rr->refcount_records, cur) { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfarray_load(rr->refcount_records, cur, &irec); + if (error) + return error; + + if (dom == XFS_REFC_DOMAIN_SHARED && + irec.rc_domain == XFS_REFC_DOMAIN_COW) { + dom = irec.rc_domain; + next_agbno = 0; + } + + if (dom != irec.rc_domain) + return -EFSCORRUPTED; + if (irec.rc_startblock < next_agbno) + return -EFSCORRUPTED; + + next_agbno = irec.rc_startblock + irec.rc_blockcount; + } + + return error; +} + +#define RRM_NEXT(r) ((r).startblock + (r).blockcount) +/* + * Find the next block where the refcount changes, given the next rmap we + * looked at and the ones we're already tracking. + */ +static inline int +xrep_refc_next_edge( + struct xfarray *rmap_bag, + struct xrep_refc_rmap *next_rrm, + bool next_valid, + xfs_agblock_t *nbnop) +{ + struct xrep_refc_rmap rrm; + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + xfs_agblock_t nbno = NULLAGBLOCK; + int error; + + if (next_valid) + nbno = next_rrm->startblock; + + while ((error = xfarray_iter(rmap_bag, &array_cur, &rrm)) == 1) + nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm)); + + if (error) + return error; + + /* + * We should have found /something/ because either next_rrm is the next + * interesting rmap to look at after emitting this refcount extent, or + * there are other rmaps in rmap_bag contributing to the current + * sharing count. But if something is seriously wrong, bail out. 
+ */ + if (nbno == NULLAGBLOCK) + return -EFSCORRUPTED; + + *nbnop = nbno; + return 0; +} + +/* + * Walk forward through the rmap btree to collect all rmaps starting at + * @bno in @rmap_bag. These represent the file(s) that share ownership of + * the current block. Upon return, the rmap cursor points to the last record + * satisfying the startblock constraint. + */ +static int +xrep_refc_push_rmaps_at( + struct xrep_refc *rr, + struct xfarray *rmap_bag, + xfs_agblock_t bno, + struct xrep_refc_rmap *rrm, + bool *have, + uint64_t *stack_sz) +{ + struct xfs_scrub *sc = rr->sc; + int have_gt; + int error; + + while (*have && rrm->startblock == bno) { + error = xfarray_store_anywhere(rmap_bag, rrm); + if (error) + return error; + (*stack_sz)++; + error = xrep_refc_walk_rmaps(rr, rrm, have); + if (error) + return error; + } + + error = xfs_btree_decrement(sc->sa.rmap_cur, 0, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(sc->mp, !have_gt)) + return -EFSCORRUPTED; + + return 0; +} + +/* Iterate all the rmap records to generate reference count data. */ +STATIC int +xrep_refc_find_refcounts( + struct xrep_refc *rr) +{ + struct xrep_refc_rmap rrm; + struct xfs_scrub *sc = rr->sc; + struct xfarray *rmap_bag; + char *descr; + uint64_t old_stack_sz; + uint64_t stack_sz = 0; + xfs_agblock_t sbno; + xfs_agblock_t cbno; + xfs_agblock_t nbno; + bool have; + int error; + + xrep_ag_btcur_init(sc, &sc->sa); + + /* + * Set up a sparse array to store all the rmap records that we're + * tracking to generate a reference count record. If this exceeds + * MAXREFCOUNT, we clamp rc_refcount. + */ + descr = xchk_xfile_ag_descr(sc, "rmap record bag"); + error = xfarray_create(descr, 0, sizeof(struct xrep_refc_rmap), + &rmap_bag); + kfree(descr); + if (error) + goto out_cur; + + /* Start the rmapbt cursor to the left of all records. */ + error = xfs_btree_goto_left_edge(sc->sa.rmap_cur); + if (error) + goto out_bag; + + /* Process reverse mappings into refcount data. 
*/ + while (xfs_btree_has_more_records(sc->sa.rmap_cur)) { + /* Push all rmaps with pblk == sbno onto the stack */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (!have) + break; + sbno = cbno = rrm.startblock; + error = xrep_refc_push_rmaps_at(rr, rmap_bag, sbno, + &rrm, &have, &stack_sz); + if (error) + goto out_bag; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + old_stack_sz = stack_sz; + + /* While stack isn't empty... */ + while (stack_sz) { + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + + /* Pop all rmaps that end at nbno */ + while ((error = xfarray_iter(rmap_bag, &array_cur, + &rrm)) == 1) { + if (RRM_NEXT(rrm) != nbno) + continue; + error = xfarray_unset(rmap_bag, array_cur - 1); + if (error) + goto out_bag; + stack_sz--; + } + if (error) + goto out_bag; + + /* Push array items that start at nbno */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (have) { + error = xrep_refc_push_rmaps_at(rr, rmap_bag, + nbno, &rrm, &have, &stack_sz); + if (error) + goto out_bag; + } + + /* Emit refcount if necessary */ + ASSERT(nbno > cbno); + if (stack_sz != old_stack_sz) { + if (old_stack_sz > 1) { + error = xrep_refc_stash(rr, + XFS_REFC_DOMAIN_SHARED, + cbno, nbno - cbno, + old_stack_sz); + if (error) + goto out_bag; + } + cbno = nbno; + } + + /* Stack empty, go find the next rmap */ + if (stack_sz == 0) + break; + old_stack_sz = stack_sz; + sbno = nbno; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, + &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + } + } + + ASSERT(stack_sz == 0); +out_bag: + xfarray_destroy(rmap_bag); +out_cur: + xchk_ag_btcur_free(&sc->sa); + return error; +} +#undef RRM_NEXT + +/* Retrieve refcountbt data for bulk load. 
*/ +STATIC int +xrep_refc_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_refcount_irec *irec = &cur->bc_rec.rc; + struct xrep_refc *rr = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + error = xfarray_load(rr->refcount_records, rr->array_cur++, + irec); + if (error) + return error; + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Feed one of the new btree blocks to the bulk loader. */ +STATIC int +xrep_refc_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_refc *rr = priv; + + return xrep_newbt_claim_block(cur, &rr->new_btree, ptr); +} + +/* Update the AGF counters. */ +STATIC int +xrep_refc_reset_counters( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + + /* + * After we commit the new btree to disk, it is possible that the + * process to reap the old btree blocks will race with the AIL trying + * to checkpoint the old btree blocks into the filesystem. If the new + * tree is shorter than the old one, the refcountbt write verifier will + * fail and the AIL will shut down the filesystem. + * + * To avoid this, save the old incore btree height values as the alt + * height values before re-initializing the perag info from the updated + * AGF to capture all the new values. + */ + pag->pagf_alt_refcount_level = pag->pagf_refcount_level; + + /* Reinitialize with the values we just logged. */ + return xrep_reinit_pagf(sc); +} + +/* + * Use the collected refcount information to stage a new refcount btree. If + * this is successful we'll return with the new btree root information logged + * to the repair transaction but not yet committed. 
+ */ +STATIC int +xrep_refc_build_new_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_btree_cur *refc_cur; + struct xfs_perag *pag = sc->sa.pag; + xfs_fsblock_t fsbno; + int error; + + error = xrep_refc_sort_records(rr); + if (error) + return error; + + /* + * Prepare to construct the new btree by reserving disk space for the + * new btree and setting up all the accounting information we'll need + * to root the new btree while it's under construction and before we + * attach it to the AG header. + */ + fsbno = XFS_AGB_TO_FSB(sc->mp, pag->pag_agno, xfs_refc_block(sc->mp)); + xrep_newbt_init_ag(&rr->new_btree, sc, &XFS_RMAP_OINFO_REFC, fsbno, + XFS_AG_RESV_METADATA); + rr->new_btree.bload.get_records = xrep_refc_get_records; + rr->new_btree.bload.claim_block = xrep_refc_claim_block; + + /* Compute how many blocks we'll need. */ + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, &rr->new_btree.afake, + pag); + error = xfs_btree_bload_compute_geometry(refc_cur, + &rr->new_btree.bload, + xfarray_length(rr->refcount_records)); + if (error) + goto err_cur; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + goto err_cur; + + /* Reserve the space we'll need for the new btree. */ + error = xrep_newbt_alloc_blocks(&rr->new_btree, + rr->new_btree.bload.nr_blocks); + if (error) + goto err_cur; + + /* + * Due to btree slack factors, it's possible for a new btree to be one + * level taller than the old btree. Update the incore btree height so + * that we don't trip the verifiers when writing the new btree blocks + * to disk. + */ + pag->pagf_alt_refcount_level = rr->new_btree.bload.btree_height; + + /* Add all observed refcount records. */ + rr->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(refc_cur, &rr->new_btree.bload, rr); + if (error) + goto err_level; + + /* + * Install the new btree in the AG header. 
After this point the old + * btree is no longer accessible and the new tree is live. + */ + xfs_refcountbt_commit_staged_btree(refc_cur, sc->tp, sc->sa.agf_bp); + xfs_btree_del_cursor(refc_cur, 0); + + /* Reset the AGF counters now that we've changed the btree shape. */ + error = xrep_refc_reset_counters(rr); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. */ + error = xrep_newbt_commit(&rr->new_btree); + if (error) + return error; + + return xrep_roll_ag_trans(sc); + +err_level: + pag->pagf_alt_refcount_level = 0; +err_cur: + xfs_btree_del_cursor(refc_cur, error); +err_newbt: + xrep_newbt_cancel(&rr->new_btree); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. + */ +STATIC int +xrep_refc_remove_old_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + int error; + + /* Free the old refcountbt blocks if they're not in use. */ + error = xrep_reap_agblocks(sc, &rr->old_refcountbt_blocks, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); + if (error) + return error; + + /* + * Now that we've zapped all the old refcountbt blocks we can turn off + * the alternate height mechanism and reset the per-AG space + * reservations. + */ + pag->pagf_alt_refcount_level = 0; + sc->flags |= XREP_RESET_PERAG_RESV; + return 0; +} + +/* Rebuild the refcount btree. */ +int +xrep_refcountbt( + struct xfs_scrub *sc) +{ + struct xrep_refc *rr; + struct xfs_mount *mp = sc->mp; + char *descr; + int error; + + /* We require the rmapbt to rebuild anything. */ + if (!xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; + + rr = kzalloc(sizeof(struct xrep_refc), XCHK_GFP_FLAGS); + if (!rr) + return -ENOMEM; + rr->sc = sc; + + /* Set up enough storage to handle one refcount record per block. 
*/ + descr = xchk_xfile_ag_descr(sc, "reference count records"); + error = xfarray_create(descr, mp->m_sb.sb_agblocks, + sizeof(struct xfs_refcount_irec), + &rr->refcount_records); + kfree(descr); + if (error) + goto out_rr; + + /* Collect all reference counts. */ + xagb_bitmap_init(&rr->old_refcountbt_blocks); + error = xrep_refc_find_refcounts(rr); + if (error) + goto out_bitmap; + + /* Rebuild the refcount information. */ + error = xrep_refc_build_new_tree(rr); + if (error) + goto out_bitmap; + + /* Kill the old tree. */ + error = xrep_refc_remove_old_tree(rr); + if (error) + goto out_bitmap; + +out_bitmap: + xagb_bitmap_destroy(&rr->old_refcountbt_blocks); + xfarray_destroy(rr->refcount_records); +out_rr: + kfree(rr); + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 05bd55430e6eb..cc7ea39427296 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -80,6 +80,7 @@ int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); +int xrep_refcountbt(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -133,6 +134,7 @@ xrep_setup_nothing( #define xrep_agi xrep_notsupported #define xrep_allocbt xrep_notsupported #define xrep_iallocbt xrep_notsupported +#define xrep_refcountbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index dfa949666e291..d0d6b2b41219e 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -278,7 +278,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .setup = xchk_setup_ag_refcountbt, .scrub = xchk_refcountbt, .has = xfs_has_reflink, - .repair = xrep_notsupported, + .repair = xrep_refcountbt, }, [XFS_SCRUB_TYPE_INODE] = { /* inode record */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index c60f76231f0c7..3f7af44309515 100644 --- 
a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1232,27 +1232,29 @@ TRACE_EVENT(xrep_ibt_found, __entry->freemask) ) -TRACE_EVENT(xrep_refcount_extent_fn, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - struct xfs_refcount_irec *irec), - TP_ARGS(mp, agno, irec), +TRACE_EVENT(xrep_refc_found, + TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *rec), + TP_ARGS(pag, rec), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) + __field(enum xfs_refc_domain, domain) __field(xfs_agblock_t, startblock) __field(xfs_extlen_t, blockcount) __field(xfs_nlink_t, refcount) ), TP_fast_assign( - __entry->dev = mp->m_super->s_dev; - __entry->agno = agno; - __entry->startblock = irec->rc_startblock; - __entry->blockcount = irec->rc_blockcount; - __entry->refcount = irec->rc_refcount; + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->domain = rec->rc_domain; + __entry->startblock = rec->rc_startblock; + __entry->blockcount = rec->rc_blockcount; + __entry->refcount = rec->rc_refcount; ), - TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u", + TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, + __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS), __entry->startblock, __entry->blockcount, __entry->refcount) ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair refcount btrees 2023-11-24 23:51 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong @ 2023-11-28 16:07 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 16:07 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs Besides all the nitpicks that are the same as for the previous two patches, this looks good to me: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks 2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong ` (3 preceding siblings ...) 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong @ 2023-11-24 23:45 ` Darrick J. Wong 2023-11-24 23:51 ` [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong ` (6 more replies) 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong ` (2 subsequent siblings) 7 siblings, 7 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:45 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, In this series, online repair gains the ability to repair inode records. To do this, we must repair the ondisk inode and fork information enough to pass the iget verifiers and hence make the inode igettable again. Once that's done, we can perform higher level repairs on the incore inode. The fstests counterpart of this patchset implements stress testing of repair. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-inodes fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-inodes --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_attr_leaf.c | 32 - fs/xfs/libxfs/xfs_attr_leaf.h | 2 fs/xfs/libxfs/xfs_bmap.c | 22 fs/xfs/libxfs/xfs_bmap.h | 2 fs/xfs/libxfs/xfs_dir2_priv.h | 2 fs/xfs/libxfs/xfs_dir2_sf.c | 29 - fs/xfs/libxfs/xfs_format.h | 3 fs/xfs/libxfs/xfs_shared.h | 1 fs/xfs/libxfs/xfs_symlink_remote.c | 21 fs/xfs/scrub/bmap.c | 48 + fs/xfs/scrub/common.c | 26 + fs/xfs/scrub/common.h | 8 fs/xfs/scrub/dir.c | 21 fs/xfs/scrub/inode.c | 14 fs/xfs/scrub/inode_repair.c | 1659 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/parent.c | 17 fs/xfs/scrub/repair.c | 57 + fs/xfs/scrub/repair.h | 29 + fs/xfs/scrub/rtbitmap.c | 4 fs/xfs/scrub/rtsummary.c | 4 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 174 ++++ 23 files changed, 2128 insertions(+), 50 deletions(-) create mode 100644 fs/xfs/scrub/inode_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong @ 2023-11-24 23:51 ` Darrick J. Wong 2023-11-25 6:13 ` Christoph Hellwig 2023-11-24 23:51 ` [PATCH 2/7] xfs: try to attach dquots to files before repairing them Darrick J. Wong ` (5 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:51 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Don't compile the quota helper functions if quota isn't being built into the XFS module. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/repair.c | 2 ++ fs/xfs/scrub/repair.h | 9 +++++++++ 2 files changed, 11 insertions(+) diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index ad1df212ec4c1..18f8d54948f26 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -673,6 +673,7 @@ xrep_find_ag_btree_roots( return error; } +#ifdef CONFIG_XFS_QUOTA /* Force a quotacheck the next time we mount. */ void xrep_force_quotacheck( @@ -734,6 +735,7 @@ xrep_ino_dqattach( return error; } +#endif /* CONFIG_XFS_QUOTA */ /* Initialize all the btree cursors for an AG repair. 
*/ void diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index cc7ea39427296..93814acc678a8 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -57,8 +57,15 @@ struct xrep_find_ag_btree { int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp, struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp); + +#ifdef CONFIG_XFS_QUOTA void xrep_force_quotacheck(struct xfs_scrub *sc, xfs_dqtype_t type); int xrep_ino_dqattach(struct xfs_scrub *sc); +#else +# define xrep_force_quotacheck(sc, type) ((void)0) +# define xrep_ino_dqattach(sc) (0) +#endif /* CONFIG_XFS_QUOTA */ + int xrep_reset_perag_resv(struct xfs_scrub *sc); /* Repair setup functions */ @@ -87,6 +94,8 @@ int xrep_reinit_pagi(struct xfs_scrub *sc); #else +#define xrep_ino_dqattach(sc) (0) + static inline int xrep_attempt( struct xfs_scrub *sc, ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled 2023-11-24 23:51 ` [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong @ 2023-11-25 6:13 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 6:13 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 2/7] xfs: try to attach dquots to files before repairing them 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong 2023-11-24 23:51 ` [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong @ 2023-11-24 23:51 ` Darrick J. Wong 2023-11-25 6:14 ` Christoph Hellwig 2023-11-24 23:51 ` [PATCH 3/7] xfs: repair inode records Darrick J. Wong ` (4 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:51 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Inode resource usage is tracked in the quota metadata. Repairing a file might change the resources used by that file, which means that we need to attach dquots to the file that we're examining before accessing anything in the file protected by the ILOCK. However, there's a twist: a dquot cache miss requires the dquot to be read in from the quota file, during which we drop the ILOCK on the file being examined. This means that we *must* try to attach the dquots before taking the ILOCK. Therefore, dquots must be attached to files in the scrub setup function. If doing so yields corruption errors (or unknown dquot errors), we instead clear the quotachecked status, which will cause a quotacheck on next mount. A future series will make this trigger live quotacheck. While we're here, change the xrep_ino_dqattach function to use the unlocked dqattach functions so that we avoid cycling the ILOCK if the inode already has dquots attached. This makes the naming and locking requirements consistent with the rest of the filesystem. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> --- fs/xfs/scrub/bmap.c | 4 ++++ fs/xfs/scrub/common.c | 25 +++++++++++++++++++++++++ fs/xfs/scrub/common.h | 6 ++++++ fs/xfs/scrub/inode.c | 4 ++++ fs/xfs/scrub/repair.c | 13 ++++++++----- fs/xfs/scrub/rtbitmap.c | 4 ++++ fs/xfs/scrub/rtsummary.c | 4 ++++ 7 files changed, 55 insertions(+), 5 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 06d8c1996a338..f74bd2a97c7f7 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -78,6 +78,10 @@ xchk_setup_inode_bmap( if (error) goto out; + error = xchk_ino_dqattach(sc); + if (error) + goto out; + xchk_ilock(sc, XFS_ILOCK_EXCL); out: /* scrub teardown will unlock and release the inode */ diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index b6725b05fb417..9b7d7010495b9 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -820,6 +820,26 @@ xchk_iget_agi( return 0; } +#ifdef CONFIG_XFS_QUOTA +/* + * Try to attach dquots to this inode if we think we might want to repair it. + * Callers must not hold any ILOCKs. If the dquots are broken and cannot be + * attached, a quotacheck will be scheduled. + */ +int +xchk_ino_dqattach( + struct xfs_scrub *sc) +{ + ASSERT(sc->tp != NULL); + ASSERT(sc->ip != NULL); + + if (!xchk_could_repair(sc)) + return 0; + + return xrep_ino_dqattach(sc); +} +#endif + /* Install an inode that we opened by handle for scrubbing. 
*/ int xchk_install_handle_inode( @@ -1031,6 +1051,11 @@ xchk_setup_inode_contents( error = xchk_trans_alloc(sc, resblks); if (error) goto out; + + error = xchk_ino_dqattach(sc); + if (error) + goto out; + xchk_ilock(sc, XFS_ILOCK_EXCL); out: /* scrub teardown will unlock and release the inode for us */ diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 4b666f254d700..895918565df26 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -103,9 +103,15 @@ xchk_setup_rtsummary(struct xfs_scrub *sc) } #endif #ifdef CONFIG_XFS_QUOTA +int xchk_ino_dqattach(struct xfs_scrub *sc); int xchk_setup_quota(struct xfs_scrub *sc); #else static inline int +xchk_ino_dqattach(struct xfs_scrub *sc) +{ + return 0; +} +static inline int xchk_setup_quota(struct xfs_scrub *sc) { return -ENOENT; diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index b7a93380a1ab0..7e97db8255c63 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -39,6 +39,10 @@ xchk_prepare_iscrub( if (error) return error; + error = xchk_ino_dqattach(sc); + if (error) + return error; + xchk_ilock(sc, XFS_ILOCK_EXCL); return 0; } diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 18f8d54948f26..2e82dace10cc2 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -700,10 +700,10 @@ xrep_force_quotacheck( * * This function ensures that the appropriate dquots are attached to an inode. * We cannot allow the dquot code to allocate an on-disk dquot block here - * because we're already in transaction context with the inode locked. The - * on-disk dquot should already exist anyway. If the quota code signals - * corruption or missing quota information, schedule quotacheck, which will - * repair corruptions in the quota metadata. + * because we're already in transaction context. The on-disk dquot should + * already exist anyway. 
If the quota code signals corruption or missing quota + * information, schedule quotacheck, which will repair corruptions in the quota + * metadata. */ int xrep_ino_dqattach( @@ -711,7 +711,10 @@ xrep_ino_dqattach( { int error; - error = xfs_qm_dqattach_locked(sc->ip, false); + ASSERT(sc->tp != NULL); + ASSERT(sc->ip != NULL); + + error = xfs_qm_dqattach(sc->ip); switch (error) { case -EFSBADCRC: case -EFSCORRUPTED: diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c index 41a1d89ae8e6c..d509a08d3fc3e 100644 --- a/fs/xfs/scrub/rtbitmap.c +++ b/fs/xfs/scrub/rtbitmap.c @@ -32,6 +32,10 @@ xchk_setup_rtbitmap( if (error) return error; + error = xchk_ino_dqattach(sc); + if (error) + return error; + xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP); return 0; } diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c index 8b15c47408d03..f94800a029f35 100644 --- a/fs/xfs/scrub/rtsummary.c +++ b/fs/xfs/scrub/rtsummary.c @@ -63,6 +63,10 @@ xchk_setup_rtsummary( if (error) return error; + error = xchk_ino_dqattach(sc); + if (error) + return error; + /* * Locking order requires us to take the rtbitmap first. We must be * careful to unlock it ourselves when we are done with the rtbitmap ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 2/7] xfs: try to attach dquots to files before repairing them 2023-11-24 23:51 ` [PATCH 2/7] xfs: try to attach dquots to files before repairing them Darrick J. Wong @ 2023-11-25 6:14 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 6:14 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 3/7] xfs: repair inode records 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong 2023-11-24 23:51 ` [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong 2023-11-24 23:51 ` [PATCH 2/7] xfs: try to attach dquots to files before repairing them Darrick J. Wong @ 2023-11-24 23:51 ` Darrick J. Wong 2023-11-28 17:08 ` Christoph Hellwig 2023-11-24 23:52 ` [PATCH 4/7] xfs: zap broken inode forks Darrick J. Wong ` (3 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:51 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> If an inode is so badly damaged that it cannot be loaded into the cache, fix the ondisk metadata and try again. If there /is/ a cached inode, fix any problems and apply any optimizations that can be solved incore. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_format.h | 3 fs/xfs/scrub/inode.c | 10 - fs/xfs/scrub/inode_repair.c | 804 +++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.c | 42 ++ fs/xfs/scrub/repair.h | 20 + fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 129 +++++++ 8 files changed, 1008 insertions(+), 3 deletions(-) create mode 100644 fs/xfs/scrub/inode_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index a6f708dc56cc2..0d86d75422f60 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -183,6 +183,7 @@ xfs-y += $(addprefix scrub/, \ agheader_repair.o \ alloc_repair.o \ ialloc_repair.o \ + inode_repair.o \ newbt.o \ reap.o \ refcount_repair.o \ diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h index 9a88aba1589f8..9dd3b21434314 100644 --- a/fs/xfs/libxfs/xfs_format.h +++ b/fs/xfs/libxfs/xfs_format.h @@ -1012,7 +1012,8 @@ enum xfs_dinode_fmt { #define XFS_DFORK_APTR(dip) \ (XFS_DFORK_DPTR(dip) + XFS_DFORK_BOFF(dip)) #define XFS_DFORK_PTR(dip,w) \ - ((w) == XFS_DATA_FORK ? 
XFS_DFORK_DPTR(dip) : XFS_DFORK_APTR(dip)) + ((void *)((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : \ + XFS_DFORK_APTR(dip))) #define XFS_DFORK_FORMAT(dip,w) \ ((w) == XFS_DATA_FORK ? \ diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 7e97db8255c63..8656dd0d95560 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -25,6 +25,7 @@ #include "scrub/common.h" #include "scrub/btree.h" #include "scrub/trace.h" +#include "scrub/repair.h" /* Prepare the attached inode for scrubbing. */ static inline int @@ -185,8 +186,11 @@ xchk_setup_inode( * saying the inode is allocated and the icache being unable to load * the inode until we can flag the corruption in xchk_inode. The * scrub function has to note the corruption, since we're not really - * supposed to do that from the setup function. + * supposed to do that from the setup function. Save the mapping to + * make repairs to the ondisk inode buffer. */ + if (xchk_could_repair(sc)) + xrep_setup_inode(sc, &imap); return 0; out_cancel: @@ -342,6 +346,10 @@ xchk_inode_flags2( if (xfs_dinode_has_bigtime(dip) && !xfs_has_bigtime(mp)) goto bad; + /* no large extent counts without the filesystem feature */ + if ((flags2 & XFS_DIFLAG2_NREXT64) && !xfs_has_large_extent_counts(mp)) + goto bad; + return; bad: xchk_ino_set_corrupt(sc, ino); diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c new file mode 100644 index 0000000000000..3967fe737fa9c --- /dev/null +++ b/fs/xfs/scrub/inode_repair.c @@ -0,0 +1,804 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_icache.h" +#include "xfs_inode_buf.h" +#include "xfs_inode_fork.h" +#include "xfs_ialloc.h" +#include "xfs_da_format.h" +#include "xfs_reflink.h" +#include "xfs_rmap.h" +#include "xfs_bmap.h" +#include "xfs_bmap_util.h" +#include "xfs_dir2.h" +#include "xfs_dir2_priv.h" +#include "xfs_quota_defs.h" +#include "xfs_quota.h" +#include "xfs_ag.h" +#include "xfs_rtbitmap.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" + +/* + * Inode Record Repair + * =================== + * + * Roughly speaking, inode problems can be classified based on whether or not + * they trip the dinode verifiers. If those trip, then we won't be able to + * xfs_iget ourselves the inode. + * + * Therefore, the xrep_dinode_* functions fix anything that would trip the + * inode buffer verifier or the dinode verifier. The xrep_inode_* functions + * fix things on live incore inodes. The inode repair functions make decisions + * with security and usability implications when reviving a file: + * + * - Files with zero di_mode or a garbage di_mode are converted to a regular + * file that only root can read. This file may not actually contain user data + * if the file was not previously a regular file. Setuid and setgid bits + * are cleared. + * + * - Zero-size directories can be truncated to look empty. It is necessary to + * run the bmapbtd and directory repair functions to fully rebuild the + * directory. + * + * - Zero-size symbolic link targets can be truncated to '.'.
It is necessary + * to run the bmapbtd and symlink repair functions to salvage the symlink. + * + * - Invalid extent size hints will be removed. + * + * - Quotacheck will be scheduled if we repaired an inode that was so badly + * damaged that the ondisk inode had to be rebuilt. + * + * - Invalid user, group, or project IDs (aka -1U) will be reset to zero. + * Setuid and setgid bits are cleared. + */ + +/* + * All the information we need to repair the ondisk inode if we can't iget the + * incore inode. We don't allocate this buffer unless we're going to perform + * a repair to the ondisk inode cluster buffer. + */ +struct xrep_inode { + /* Inode mapping that we saved from the initial lookup attempt. */ + struct xfs_imap imap; + + struct xfs_scrub *sc; +}; + +/* Setup function for inode repair. */ +int +xrep_setup_inode( + struct xfs_scrub *sc, + struct xfs_imap *imap) +{ + struct xrep_inode *ri; + + /* + * The only information that needs to be passed between inode scrub and + * repair is the location of the ondisk metadata if iget fails. The + * rest of struct xrep_inode is context data that we need to massage + * the ondisk inode to the point that iget will work, which means that + * we don't allocate anything at all if the incore inode is loaded. + */ + if (!imap) + return 0; + + sc->buf = kzalloc(sizeof(struct xrep_inode), XCHK_GFP_FLAGS); + if (!sc->buf) + return -ENOMEM; + + ri = sc->buf; + memcpy(&ri->imap, imap, sizeof(struct xfs_imap)); + ri->sc = sc; + return 0; +} + +/* Make sure this inode cluster buffer can pass the inode buffer verifier. 
*/ +STATIC void +xrep_dinode_buf( + struct xfs_scrub *sc, + struct xfs_buf *bp) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_trans *tp = sc->tp; + struct xfs_perag *pag; + struct xfs_dinode *dip; + xfs_agnumber_t agno; + xfs_agino_t agino; + int ioff; + int i; + int ni; + bool crc_ok; + bool magic_ok; + bool unlinked_ok; + + ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock; + agno = xfs_daddr_to_agno(mp, xfs_buf_daddr(bp)); + pag = xfs_perag_get(mp, agno); + for (i = 0; i < ni; i++) { + ioff = i << mp->m_sb.sb_inodelog; + dip = xfs_buf_offset(bp, ioff); + agino = be32_to_cpu(dip->di_next_unlinked); + + unlinked_ok = magic_ok = crc_ok = false; + + if (xfs_verify_agino_or_null(pag, agino)) + unlinked_ok = true; + + if (dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) && + xfs_dinode_good_version(mp, dip->di_version)) + magic_ok = true; + + if (xfs_verify_cksum((char *)dip, mp->m_sb.sb_inodesize, + XFS_DINODE_CRC_OFF)) + crc_ok = true; + + if (magic_ok && unlinked_ok && crc_ok) + continue; + + if (!magic_ok) { + dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC); + dip->di_version = 3; + } + if (!unlinked_ok) + dip->di_next_unlinked = cpu_to_be32(NULLAGINO); + xfs_dinode_calc_crc(mp, dip); + xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF); + xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1); + } + xfs_perag_put(pag); +} + +/* Reinitialize things that never change in an inode. */ +STATIC void +xrep_dinode_header( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + trace_xrep_dinode_header(sc, dip); + + dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC); + if (!xfs_dinode_good_version(sc->mp, dip->di_version)) + dip->di_version = 3; + dip->di_ino = cpu_to_be64(sc->sm->sm_ino); + uuid_copy(&dip->di_uuid, &sc->mp->m_sb.sb_meta_uuid); + dip->di_gen = cpu_to_be32(sc->sm->sm_gen); +} + +/* Turn di_mode into /something/ recognizable. 
*/ +STATIC void +xrep_dinode_mode( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + uint16_t mode; + + trace_xrep_dinode_mode(sc, dip); + + mode = be16_to_cpu(dip->di_mode); + if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN) + return; + + /* bad mode, so we set it to a file that only root can read */ + mode = S_IFREG; + dip->di_mode = cpu_to_be16(mode); + dip->di_uid = 0; + dip->di_gid = 0; +} + +/* Fix any conflicting flags that the verifiers complain about. */ +STATIC void +xrep_dinode_flags( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + struct xfs_mount *mp = sc->mp; + uint64_t flags2; + uint16_t mode; + uint16_t flags; + + trace_xrep_dinode_flags(sc, dip); + + mode = be16_to_cpu(dip->di_mode); + flags = be16_to_cpu(dip->di_flags); + flags2 = be64_to_cpu(dip->di_flags2); + + if (xfs_has_reflink(mp) && S_ISREG(mode)) + flags2 |= XFS_DIFLAG2_REFLINK; + else + flags2 &= ~(XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE); + if (flags & XFS_DIFLAG_REALTIME) + flags2 &= ~XFS_DIFLAG2_REFLINK; + if (!xfs_has_bigtime(mp)) + flags2 &= ~XFS_DIFLAG2_BIGTIME; + if (!xfs_has_large_extent_counts(mp)) + flags2 &= ~XFS_DIFLAG2_NREXT64; + if (flags2 & XFS_DIFLAG2_NREXT64) + dip->di_nrext64_pad = 0; + else if (dip->di_version >= 3) + dip->di_v3_pad = 0; + dip->di_flags = cpu_to_be16(flags); + dip->di_flags2 = cpu_to_be64(flags2); +} + +/* + * Blow out symlink; now it points to the current dir. We don't have to worry + * about incore state because this inode is failing the verifiers. + */ +STATIC void +xrep_dinode_zap_symlink( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + char *p; + + trace_xrep_dinode_zap_symlink(sc, dip); + + dip->di_format = XFS_DINODE_FMT_LOCAL; + dip->di_size = cpu_to_be64(1); + p = XFS_DFORK_PTR(dip, XFS_DATA_FORK); + *p = '.'; +} + +/* + * Blow out dir, make it point to the root. In the future repair will + * reconstruct this directory for us. 
Note that there's no in-core directory + * inode because the sf verifier tripped, so we don't have to worry about the + * dentry cache. + */ +STATIC void +xrep_dinode_zap_dir( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_dir2_sf_hdr *sfp; + int i8count; + + trace_xrep_dinode_zap_dir(sc, dip); + + dip->di_format = XFS_DINODE_FMT_LOCAL; + i8count = mp->m_sb.sb_rootino > XFS_DIR2_MAX_SHORT_INUM; + sfp = XFS_DFORK_PTR(dip, XFS_DATA_FORK); + sfp->count = 0; + sfp->i8count = i8count; + xfs_dir2_sf_put_parent_ino(sfp, mp->m_sb.sb_rootino); + dip->di_size = cpu_to_be64(xfs_dir2_sf_hdr_size(i8count)); +} + +/* Make sure we don't have a garbage file size. */ +STATIC void +xrep_dinode_size( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + uint64_t size; + uint16_t mode; + + trace_xrep_dinode_size(sc, dip); + + mode = be16_to_cpu(dip->di_mode); + size = be64_to_cpu(dip->di_size); + switch (mode & S_IFMT) { + case S_IFIFO: + case S_IFCHR: + case S_IFBLK: + case S_IFSOCK: + /* di_size can't be nonzero for special files */ + dip->di_size = 0; + break; + case S_IFREG: + /* Regular files can't be larger than 2^63-1 bytes. */ + dip->di_size = cpu_to_be64(size & ~(1ULL << 63)); + break; + case S_IFLNK: + /* + * Truncate ridiculously oversized symlinks. If the size is + * zero, reset it to point to the current directory. Both of + * these conditions trigger dinode verifier errors, so there + * is no in-core state to reset. + */ + if (size > XFS_SYMLINK_MAXLEN) + dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN); + else if (size == 0) + xrep_dinode_zap_symlink(sc, dip); + break; + case S_IFDIR: + /* + * Directories can't have a size larger than 32G. If the size + * is zero, reset it to an empty directory. Both of these + * conditions trigger dinode verifier errors, so there is no + * in-core state to reset. 
+ */ + if (size > XFS_DIR2_SPACE_SIZE) + dip->di_size = cpu_to_be64(XFS_DIR2_SPACE_SIZE); + else if (size == 0) + xrep_dinode_zap_dir(sc, dip); + break; + } +} + +/* Fix extent size hints. */ +STATIC void +xrep_dinode_extsize_hints( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + struct xfs_mount *mp = sc->mp; + uint64_t flags2; + uint16_t flags; + uint16_t mode; + xfs_failaddr_t fa; + + trace_xrep_dinode_extsize_hints(sc, dip); + + mode = be16_to_cpu(dip->di_mode); + flags = be16_to_cpu(dip->di_flags); + flags2 = be64_to_cpu(dip->di_flags2); + + fa = xfs_inode_validate_extsize(mp, be32_to_cpu(dip->di_extsize), + mode, flags); + if (fa) { + dip->di_extsize = 0; + dip->di_flags &= ~cpu_to_be16(XFS_DIFLAG_EXTSIZE | + XFS_DIFLAG_EXTSZINHERIT); + } + + if (dip->di_version < 3) + return; + + fa = xfs_inode_validate_cowextsize(mp, be32_to_cpu(dip->di_cowextsize), + mode, flags, flags2); + if (fa) { + dip->di_cowextsize = 0; + dip->di_flags2 &= ~cpu_to_be64(XFS_DIFLAG2_COWEXTSIZE); + } +} + +/* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */ +STATIC int +xrep_dinode_core( + struct xrep_inode *ri) +{ + struct xfs_scrub *sc = ri->sc; + struct xfs_buf *bp; + struct xfs_dinode *dip; + xfs_ino_t ino = sc->sm->sm_ino; + int error; + + /* Read the inode cluster buffer. */ + error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp, + ri->imap.im_blkno, ri->imap.im_len, XBF_UNMAPPED, &bp, + NULL); + if (error) + return error; + + /* Make sure we can pass the inode buffer verifier. */ + xrep_dinode_buf(sc, bp); + bp->b_ops = &xfs_inode_buf_ops; + + /* Fix everything the verifier will complain about. */ + dip = xfs_buf_offset(bp, ri->imap.im_boffset); + xrep_dinode_header(sc, dip); + xrep_dinode_mode(sc, dip); + xrep_dinode_flags(sc, dip); + xrep_dinode_size(sc, dip); + xrep_dinode_extsize_hints(sc, dip); + + /* Write out the inode. 
*/ + trace_xrep_dinode_fixed(sc, dip); + xfs_dinode_calc_crc(sc->mp, dip); + xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_DINO_BUF); + xfs_trans_log_buf(sc->tp, bp, ri->imap.im_boffset, + ri->imap.im_boffset + sc->mp->m_sb.sb_inodesize - 1); + + /* + * Now that we've finished rewriting anything in the ondisk metadata + * that would prevent iget from giving us an incore inode, commit the + * inode cluster buffer updates and drop the AGI buffer that we've been + * holding since scrub setup. + */ + error = xrep_trans_commit(sc); + if (error) + return error; + + /* Try again to load the inode. */ + error = xchk_iget_safe(sc, ino, &sc->ip); + if (error) + return error; + + xchk_ilock(sc, XFS_IOLOCK_EXCL); + error = xchk_trans_alloc(sc, 0); + if (error) + return error; + + error = xrep_ino_dqattach(sc); + if (error) + return error; + + xchk_ilock(sc, XFS_ILOCK_EXCL); + return 0; +} + +/* Fix everything xfs_dinode_verify cares about. */ +STATIC int +xrep_dinode_problems( + struct xrep_inode *ri) +{ + struct xfs_scrub *sc = ri->sc; + int error; + + error = xrep_dinode_core(ri); + if (error) + return error; + + /* We had to fix a totally busted inode, schedule quotacheck. */ + if (XFS_IS_UQUOTA_ON(sc->mp)) + xrep_force_quotacheck(sc, XFS_DQTYPE_USER); + if (XFS_IS_GQUOTA_ON(sc->mp)) + xrep_force_quotacheck(sc, XFS_DQTYPE_GROUP); + if (XFS_IS_PQUOTA_ON(sc->mp)) + xrep_force_quotacheck(sc, XFS_DQTYPE_PROJ); + + return 0; +} + +/* + * Fix problems that the verifiers don't care about. In general these are + * errors that don't cause problems elsewhere in the kernel that we can easily + * detect, so we don't check them all that rigorously. + */ + +/* Make sure block and extent counts are ok. */ +STATIC int +xrep_inode_blockcounts( + struct xfs_scrub *sc) +{ + struct xfs_ifork *ifp; + xfs_filblks_t count; + xfs_filblks_t acount; + xfs_extnum_t nextents; + int error; + + trace_xrep_inode_blockcounts(sc); + + /* Set data fork counters from the data fork mappings. 
*/ + error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_DATA_FORK, + &nextents, &count); + if (error) + return error; + if (xfs_has_reflink(sc->mp)) { + ; /* data fork blockcount can exceed physical storage */ + } else if (XFS_IS_REALTIME_INODE(sc->ip)) { + if (count >= sc->mp->m_sb.sb_rblocks) + return -EFSCORRUPTED; + } else { + if (count >= sc->mp->m_sb.sb_dblocks) + return -EFSCORRUPTED; + } + error = xrep_ino_ensure_extent_count(sc, XFS_DATA_FORK, nextents); + if (error) + return error; + sc->ip->i_df.if_nextents = nextents; + + /* Set attr fork counters from the attr fork mappings. */ + ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK); + if (ifp) { + error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK, + &nextents, &acount); + if (error) + return error; + if (acount >= sc->mp->m_sb.sb_dblocks) + return -EFSCORRUPTED; + error = xrep_ino_ensure_extent_count(sc, XFS_ATTR_FORK, + nextents); + if (error) + return error; + ifp->if_nextents = nextents; + } else { + acount = 0; + } + + sc->ip->i_nblocks = count + acount; + return 0; +} + +/* Check for invalid uid/gid/prid.
*/ +STATIC void +xrep_inode_ids( + struct xfs_scrub *sc) +{ + bool dirty = false; + + trace_xrep_inode_ids(sc); + + if (i_uid_read(VFS_I(sc->ip)) == -1U) { + i_uid_write(VFS_I(sc->ip), 0); + dirty = true; + if (XFS_IS_UQUOTA_ON(sc->mp)) + xrep_force_quotacheck(sc, XFS_DQTYPE_USER); + } + + if (i_gid_read(VFS_I(sc->ip)) == -1U) { + i_gid_write(VFS_I(sc->ip), 0); + dirty = true; + if (XFS_IS_GQUOTA_ON(sc->mp)) + xrep_force_quotacheck(sc, XFS_DQTYPE_GROUP); + } + + if (sc->ip->i_projid == -1U) { + sc->ip->i_projid = 0; + dirty = true; + if (XFS_IS_PQUOTA_ON(sc->mp)) + xrep_force_quotacheck(sc, XFS_DQTYPE_PROJ); + } + + /* strip setuid/setgid if we touched any of the ids */ + if (dirty) + VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID); +} + +static inline void +xrep_clamp_timestamp( + struct xfs_inode *ip, + struct timespec64 *ts) +{ + ts->tv_nsec = clamp_t(long, ts->tv_nsec, 0, NSEC_PER_SEC); + *ts = timestamp_truncate(*ts, VFS_I(ip)); +} + +/* Nanosecond counters can't have more than 1 billion. */ +STATIC void +xrep_inode_timestamps( + struct xfs_inode *ip) +{ + struct timespec64 tstamp; + struct inode *inode = VFS_I(ip); + + tstamp = inode_get_atime(inode); + xrep_clamp_timestamp(ip, &tstamp); + inode_set_atime_to_ts(inode, tstamp); + + tstamp = inode_get_mtime(inode); + xrep_clamp_timestamp(ip, &tstamp); + inode_set_mtime_to_ts(inode, tstamp); + + tstamp = inode_get_ctime(inode); + xrep_clamp_timestamp(ip, &tstamp); + inode_set_ctime_to_ts(inode, tstamp); + + xrep_clamp_timestamp(ip, &ip->i_crtime); +} + +/* Fix inode flags that don't make sense together. 
*/ +STATIC void +xrep_inode_flags( + struct xfs_scrub *sc) +{ + uint16_t mode; + + trace_xrep_inode_flags(sc); + + mode = VFS_I(sc->ip)->i_mode; + + /* Clear junk flags */ + if (sc->ip->i_diflags & ~XFS_DIFLAG_ANY) + sc->ip->i_diflags &= ~XFS_DIFLAG_ANY; + + /* NEWRTBM only applies to realtime bitmaps */ + if (sc->ip->i_ino == sc->mp->m_sb.sb_rbmino) + sc->ip->i_diflags |= XFS_DIFLAG_NEWRTBM; + else + sc->ip->i_diflags &= ~XFS_DIFLAG_NEWRTBM; + + /* These only make sense for directories. */ + if (!S_ISDIR(mode)) + sc->ip->i_diflags &= ~(XFS_DIFLAG_RTINHERIT | + XFS_DIFLAG_EXTSZINHERIT | + XFS_DIFLAG_PROJINHERIT | + XFS_DIFLAG_NOSYMLINKS); + + /* These only make sense for files. */ + if (!S_ISREG(mode)) + sc->ip->i_diflags &= ~(XFS_DIFLAG_REALTIME | + XFS_DIFLAG_EXTSIZE); + + /* These only make sense for non-rt files. */ + if (sc->ip->i_diflags & XFS_DIFLAG_REALTIME) + sc->ip->i_diflags &= ~XFS_DIFLAG_FILESTREAM; + + /* Immutable and append only? Drop the append. */ + if ((sc->ip->i_diflags & XFS_DIFLAG_IMMUTABLE) && + (sc->ip->i_diflags & XFS_DIFLAG_APPEND)) + sc->ip->i_diflags &= ~XFS_DIFLAG_APPEND; + + /* Clear junk flags. */ + if (sc->ip->i_diflags2 & ~XFS_DIFLAG2_ANY) + sc->ip->i_diflags2 &= ~XFS_DIFLAG2_ANY; + + /* No reflink flag unless we support it and it's a file. */ + if (!xfs_has_reflink(sc->mp) || !S_ISREG(mode)) + sc->ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK; + + /* DAX only applies to files and dirs. */ + if (!(S_ISREG(mode) || S_ISDIR(mode))) + sc->ip->i_diflags2 &= ~XFS_DIFLAG2_DAX; + + /* No reflink files on the realtime device. */ + if (sc->ip->i_diflags & XFS_DIFLAG_REALTIME) + sc->ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK; +} + +/* + * Fix size problems with block/node format directories. If we fail to find + * the extent list, just bail out and let the bmapbtd repair functions clean + * up that mess. 
+ */ +STATIC void +xrep_inode_blockdir_size( + struct xfs_scrub *sc) +{ + struct xfs_iext_cursor icur; + struct xfs_bmbt_irec got; + struct xfs_ifork *ifp; + xfs_fileoff_t off; + int error; + + trace_xrep_inode_blockdir_size(sc); + + /* Find the last block before 32G; this is the dir size. */ + error = xfs_iread_extents(sc->tp, sc->ip, XFS_DATA_FORK); + if (error) + return; + + ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK); + off = XFS_B_TO_FSB(sc->mp, XFS_DIR2_SPACE_SIZE); + if (!xfs_iext_lookup_extent_before(sc->ip, ifp, &off, &icur, &got)) { + /* zero-extents directory? */ + return; + } + + off = got.br_startoff + got.br_blockcount; + sc->ip->i_disk_size = min_t(loff_t, XFS_DIR2_SPACE_SIZE, + XFS_FSB_TO_B(sc->mp, off)); +} + +/* Fix size problems with short format directories. */ +STATIC void +xrep_inode_sfdir_size( + struct xfs_scrub *sc) +{ + struct xfs_ifork *ifp; + + trace_xrep_inode_sfdir_size(sc); + + ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK); + sc->ip->i_disk_size = ifp->if_bytes; +} + +/* + * Fix any irregularities in an inode's size now that we can iterate extent + * maps and access other regular inode data. + */ +STATIC void +xrep_inode_size( + struct xfs_scrub *sc) +{ + trace_xrep_inode_size(sc); + + /* + * Currently we only support fixing size on extents or btree format + * directories. Files can be any size and sizes for the other inode + * special types are fixed by xrep_dinode_size. + */ + if (!S_ISDIR(VFS_I(sc->ip)->i_mode)) + return; + switch (sc->ip->i_df.if_format) { + case XFS_DINODE_FMT_EXTENTS: + case XFS_DINODE_FMT_BTREE: + xrep_inode_blockdir_size(sc); + break; + case XFS_DINODE_FMT_LOCAL: + xrep_inode_sfdir_size(sc); + break; + } +} + +/* Fix extent size hint problems. */ +STATIC void +xrep_inode_extsize( + struct xfs_scrub *sc) +{ + /* Fix misaligned extent size hints on a directory. 
*/ + if ((sc->ip->i_diflags & XFS_DIFLAG_RTINHERIT) && + (sc->ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) && + xfs_extlen_to_rtxmod(sc->mp, sc->ip->i_extsize) > 0) { + sc->ip->i_extsize = 0; + sc->ip->i_diflags &= ~XFS_DIFLAG_EXTSZINHERIT; + } +} + +/* Fix any irregularities in an inode that the verifiers don't catch. */ +STATIC int +xrep_inode_problems( + struct xfs_scrub *sc) +{ + int error; + + error = xrep_inode_blockcounts(sc); + if (error) + return error; + xrep_inode_timestamps(sc->ip); + xrep_inode_flags(sc); + xrep_inode_ids(sc); + xrep_inode_size(sc); + xrep_inode_extsize(sc); + + trace_xrep_inode_fixed(sc); + xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE); + return xrep_roll_trans(sc); +} + +/* Repair an inode's fields. */ +int +xrep_inode( + struct xfs_scrub *sc) +{ + int error = 0; + + /* + * No inode? That means we failed the _iget verifiers. Repair all + * the things that the inode verifiers care about, then retry _iget. + */ + if (!sc->ip) { + struct xrep_inode *ri = sc->buf; + + ASSERT(ri != NULL); + + error = xrep_dinode_problems(ri); + if (error) + return error; + + /* By this point we had better have a working incore inode. */ + if (!sc->ip) + return -EFSCORRUPTED; + } + + xfs_trans_ijoin(sc->tp, sc->ip, 0); + + /* If we found corruption of any kind, try to fix it. */ + if ((sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) || + (sc->sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)) { + error = xrep_inode_problems(sc); + if (error) + return error; + } + + /* See if we can clear the reflink flag. */ + if (xfs_is_reflink_inode(sc->ip)) { + error = xfs_reflink_clear_inode_flag(sc->ip, &sc->tp); + if (error) + return error; + } + + return xrep_defer_finish(sc); +} diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 2e82dace10cc2..82c9760776248 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -176,6 +176,16 @@ xrep_roll_ag_trans( return 0; } +/* Roll the scrub transaction, holding the primary metadata locked. 
*/ +int +xrep_roll_trans( + struct xfs_scrub *sc) +{ + if (!sc->ip) + return xrep_roll_ag_trans(sc); + return xfs_trans_roll_inode(&sc->tp, sc->ip); +} + /* Finish all deferred work attached to the repair transaction. */ int xrep_defer_finish( @@ -740,6 +750,38 @@ xrep_ino_dqattach( } #endif /* CONFIG_XFS_QUOTA */ +/* + * Ensure that the inode being repaired is ready to handle a certain number of + * extents, or return EFSCORRUPTED. Caller must hold the ILOCK of the inode + * being repaired and have joined it to the scrub transaction. + */ +int +xrep_ino_ensure_extent_count( + struct xfs_scrub *sc, + int whichfork, + xfs_extnum_t nextents) +{ + xfs_extnum_t max_extents; + bool inode_has_nrext64; + + inode_has_nrext64 = xfs_inode_has_large_extent_counts(sc->ip); + max_extents = xfs_iext_max_nextents(inode_has_nrext64, whichfork); + if (nextents <= max_extents) + return 0; + if (inode_has_nrext64) + return -EFSCORRUPTED; + if (!xfs_has_large_extent_counts(sc->mp)) + return -EFSCORRUPTED; + + max_extents = xfs_iext_max_nextents(true, whichfork); + if (nextents > max_extents) + return -EFSCORRUPTED; + + sc->ip->i_diflags2 |= XFS_DIFLAG2_NREXT64; + xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE); + return 0; +} + /* Initialize all the btree cursors for an AG repair. 
*/ void xrep_ag_btcur_init( diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 93814acc678a8..70a6b18e5ad3c 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -30,11 +30,22 @@ static inline int xrep_notsupported(struct xfs_scrub *sc) int xrep_attempt(struct xfs_scrub *sc, struct xchk_stats_run *run); void xrep_failure(struct xfs_mount *mp); int xrep_roll_ag_trans(struct xfs_scrub *sc); +int xrep_roll_trans(struct xfs_scrub *sc); int xrep_defer_finish(struct xfs_scrub *sc); bool xrep_ag_has_space(struct xfs_perag *pag, xfs_extlen_t nr_blocks, enum xfs_ag_resv_type type); xfs_extlen_t xrep_calc_ag_resblks(struct xfs_scrub *sc); +static inline int +xrep_trans_commit( + struct xfs_scrub *sc) +{ + int error = xfs_trans_commit(sc->tp); + + sc->tp = NULL; + return error; +} + struct xbitmap; struct xagb_bitmap; @@ -66,11 +77,16 @@ int xrep_ino_dqattach(struct xfs_scrub *sc); # define xrep_ino_dqattach(sc) (0) #endif /* CONFIG_XFS_QUOTA */ +int xrep_ino_ensure_extent_count(struct xfs_scrub *sc, int whichfork, + xfs_extnum_t nextents); int xrep_reset_perag_resv(struct xfs_scrub *sc); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); +struct xfs_imap; +int xrep_setup_inode(struct xfs_scrub *sc, struct xfs_imap *imap); + void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa); /* Metadata revalidators */ @@ -88,6 +104,7 @@ int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); int xrep_refcountbt(struct xfs_scrub *sc); +int xrep_inode(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -133,6 +150,8 @@ xrep_setup_nothing( } #define xrep_setup_ag_allocbt xrep_setup_nothing +#define xrep_setup_inode(sc, imap) ((void)0) + #define xrep_revalidate_allocbt (NULL) #define xrep_revalidate_iallocbt (NULL) @@ -144,6 +163,7 @@ xrep_setup_nothing( #define xrep_allocbt xrep_notsupported #define 
xrep_iallocbt xrep_notsupported #define xrep_refcountbt xrep_notsupported +#define xrep_inode xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index d0d6b2b41219e..b9edda17ab64b 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -284,7 +284,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_INODE, .setup = xchk_setup_inode, .scrub = xchk_inode, - .repair = xrep_notsupported, + .repair = xrep_inode, }, [XFS_SCRUB_TYPE_BMBTD] = { /* inode data fork */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 3f7af44309515..4ab1e6c3e36bc 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1393,6 +1393,135 @@ DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_file_blocks); DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_free_blocks); DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_claim_block); +DECLARE_EVENT_CLASS(xrep_dinode_class, + TP_PROTO(struct xfs_scrub *sc, struct xfs_dinode *dip), + TP_ARGS(sc, dip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(uint16_t, mode) + __field(uint8_t, version) + __field(uint8_t, format) + __field(uint32_t, uid) + __field(uint32_t, gid) + __field(uint64_t, size) + __field(uint64_t, nblocks) + __field(uint32_t, extsize) + __field(uint32_t, nextents) + __field(uint16_t, anextents) + __field(uint8_t, forkoff) + __field(uint8_t, aformat) + __field(uint16_t, flags) + __field(uint32_t, gen) + __field(uint64_t, flags2) + __field(uint32_t, cowextsize) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->ino = sc->sm->sm_ino; + __entry->mode = be16_to_cpu(dip->di_mode); + __entry->version = dip->di_version; + __entry->format = dip->di_format; + __entry->uid = be32_to_cpu(dip->di_uid); + __entry->gid = be32_to_cpu(dip->di_gid); + __entry->size = be64_to_cpu(dip->di_size); + __entry->nblocks = be64_to_cpu(dip->di_nblocks); + __entry->extsize = be32_to_cpu(dip->di_extsize); + 
__entry->nextents = be32_to_cpu(dip->di_nextents); + __entry->anextents = be16_to_cpu(dip->di_anextents); + __entry->forkoff = dip->di_forkoff; + __entry->aformat = dip->di_aformat; + __entry->flags = be16_to_cpu(dip->di_flags); + __entry->gen = be32_to_cpu(dip->di_gen); + __entry->flags2 = be64_to_cpu(dip->di_flags2); + __entry->cowextsize = be32_to_cpu(dip->di_cowextsize); + ), + TP_printk("dev %d:%d ino 0x%llx mode 0x%x version %u format %u uid %u gid %u disize 0x%llx nblocks 0x%llx extsize %u nextents %u anextents %u forkoff 0x%x aformat %u flags 0x%x gen 0x%x flags2 0x%llx cowextsize %u", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->mode, + __entry->version, + __entry->format, + __entry->uid, + __entry->gid, + __entry->size, + __entry->nblocks, + __entry->extsize, + __entry->nextents, + __entry->anextents, + __entry->forkoff, + __entry->aformat, + __entry->flags, + __entry->gen, + __entry->flags2, + __entry->cowextsize) +) + +#define DEFINE_REPAIR_DINODE_EVENT(name) \ +DEFINE_EVENT(xrep_dinode_class, name, \ + TP_PROTO(struct xfs_scrub *sc, struct xfs_dinode *dip), \ + TP_ARGS(sc, dip)) +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_header); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_mode); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_flags); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_size); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_extsize_hints); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_symlink); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_dir); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_fixed); + +DECLARE_EVENT_CLASS(xrep_inode_class, + TP_PROTO(struct xfs_scrub *sc), + TP_ARGS(sc), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_fsize_t, size) + __field(xfs_rfsblock_t, nblocks) + __field(uint16_t, flags) + __field(uint64_t, flags2) + __field(uint32_t, nextents) + __field(uint8_t, format) + __field(uint32_t, anextents) + __field(uint8_t, aformat) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + 
__entry->ino = sc->sm->sm_ino; + __entry->size = sc->ip->i_disk_size; + __entry->nblocks = sc->ip->i_nblocks; + __entry->flags = sc->ip->i_diflags; + __entry->flags2 = sc->ip->i_diflags2; + __entry->nextents = sc->ip->i_df.if_nextents; + __entry->format = sc->ip->i_df.if_format; + __entry->anextents = sc->ip->i_af.if_nextents; + __entry->aformat = sc->ip->i_af.if_format; + ), + TP_printk("dev %d:%d ino 0x%llx disize 0x%llx nblocks 0x%llx flags 0x%x flags2 0x%llx nextents %u format %u anextents %u aformat %u", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->size, + __entry->nblocks, + __entry->flags, + __entry->flags2, + __entry->nextents, + __entry->format, + __entry->anextents, + __entry->aformat) +) + +#define DEFINE_REPAIR_INODE_EVENT(name) \ +DEFINE_EVENT(xrep_inode_class, name, \ + TP_PROTO(struct xfs_scrub *sc), \ + TP_ARGS(sc)) +DEFINE_REPAIR_INODE_EVENT(xrep_inode_blockcounts); +DEFINE_REPAIR_INODE_EVENT(xrep_inode_ids); +DEFINE_REPAIR_INODE_EVENT(xrep_inode_flags); +DEFINE_REPAIR_INODE_EVENT(xrep_inode_blockdir_size); +DEFINE_REPAIR_INODE_EVENT(xrep_inode_sfdir_size); +DEFINE_REPAIR_INODE_EVENT(xrep_inode_size); +DEFINE_REPAIR_INODE_EVENT(xrep_inode_fixed); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
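[Editor's note: the xrep_ino_ensure_extent_count hunk in the patch above checks whether a repaired fork's extent count fits the inode's counter width and, if not, upgrades the inode to 64-bit extent counters when the filesystem feature allows it. The decision logic can be sketched as standalone userspace C; the limits and helper names here are illustrative stand-ins, not the real xfs_format.h constants.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFSCORRUPTED 117	/* illustrative errno value */

/*
 * Illustrative per-fork extent count limits; the real constants live in
 * xfs_format.h (XFS_MAX_EXTCNT_*).
 */
static uint64_t max_nextents(bool nrext64, bool datafork)
{
	if (datafork)
		return nrext64 ? (1ULL << 48) - 1 : (1ULL << 31) - 1;
	return nrext64 ? (1ULL << 32) - 1 : (1ULL << 15) - 1;
}

/*
 * Decision logic of xrep_ino_ensure_extent_count: succeed if the count
 * already fits; upgrade the inode to 64-bit extent counters when the
 * filesystem feature permits; otherwise report corruption.
 */
static int ensure_extent_count(bool inode_has_nrext64, bool fs_has_nrext64,
			       bool datafork, uint64_t nextents, bool *upgraded)
{
	*upgraded = false;
	if (nextents <= max_nextents(inode_has_nrext64, datafork))
		return 0;
	if (inode_has_nrext64 || !fs_has_nrext64)
		return -EFSCORRUPTED;
	if (nextents > max_nextents(true, datafork))
		return -EFSCORRUPTED;
	*upgraded = true;	/* would set XFS_DIFLAG2_NREXT64 and log the inode */
	return 0;
}
```

Note that the upgrade is one-way: once the inode already carries the large counters and the count still overflows, the only remaining verdict is corruption.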
* Re: [PATCH 3/7] xfs: repair inode records 2023-11-24 23:51 ` [PATCH 3/7] xfs: repair inode records Darrick J. Wong @ 2023-11-28 17:08 ` Christoph Hellwig 2023-11-28 23:08 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 17:08 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs > @@ -1012,7 +1012,8 @@ enum xfs_dinode_fmt { > #define XFS_DFORK_APTR(dip) \ > (XFS_DFORK_DPTR(dip) + XFS_DFORK_BOFF(dip)) > #define XFS_DFORK_PTR(dip,w) \ > - ((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : XFS_DFORK_APTR(dip)) > + ((void *)((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : \ > + XFS_DFORK_APTR(dip))) Not requiring a cast when using XFS_DFORK_PTR is a good thing, but I think this is the wrong way to do it. Instead of adding another cast here we can just change the char * cast in XFS_DFORK_DPTR to a void * one and rely on the widely used void pointer arithmetic extension in gcc (and clang). That'll also need a fixup to use a void * instead of a char * cast in xchk_dinode. And in the long run many of these helpers really should become inline functions. > + /* no large extent counts without the filesystem feature */ > + if ((flags2 & XFS_DIFLAG2_NREXT64) && !xfs_has_large_extent_counts(mp)) > + goto bad; This is just a missing check and not really related to repair, is it? > + /* > + * The only information that needs to be passed between inode scrub and > + * repair is the location of the ondisk metadata if iget fails. The > + * rest of struct xrep_inode is context data that we need to massage > + * the ondisk inode to the point that iget will work, which means that > + * we don't allocate anything at all if the incore inode is loaded. > + */ > + if (!imap) > + return 0; I don't really understand why this comment is here, and how it relates to the imap NULL check. But as the only caller passes the address of an on-stack imap I also don't understand why the check is here to start with.
> + for (i = 0; i < ni; i++) { > + ioff = i << mp->m_sb.sb_inodelog; > + dip = xfs_buf_offset(bp, ioff); > + agino = be32_to_cpu(dip->di_next_unlinked); > + > + unlinked_ok = magic_ok = crc_ok = false; I'd split the body of this loop into a separate helper and keep a lot of the variables local to it. > +/* Reinitialize things that never change in an inode. */ > +STATIC void > +xrep_dinode_header( > + struct xfs_scrub *sc, > + struct xfs_dinode *dip) > +{ > + trace_xrep_dinode_header(sc, dip); > + > + dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC); > + if (!xfs_dinode_good_version(sc->mp, dip->di_version)) > + dip->di_version = 3; Can we ever end up here for v4 file systems? Because in that case the sane default inode version would be 2. > + > +/* Turn di_mode into /something/ recognizable. */ > +STATIC void > +xrep_dinode_mode( > + struct xfs_scrub *sc, > + struct xfs_dinode *dip) > +{ > + uint16_t mode; > + > + trace_xrep_dinode_mode(sc, dip); > + > + mode = be16_to_cpu(dip->di_mode); > + if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN) This is a somewhat odd way to check for a valid mode, but it works, so.. > + if (xfs_has_reflink(mp) && S_ISREG(mode)) > + flags2 |= XFS_DIFLAG2_REFLINK; We set the reflink flag by default, because a later stage will clear it if there aren't any shared blocks, right? Maybe add a comment to avoid any future confusion. > +STATIC void > +xrep_dinode_zap_symlink( > + struct xfs_scrub *sc, > + struct xfs_dinode *dip) > +{ > + char *p; > + > + trace_xrep_dinode_zap_symlink(sc, dip); > + > + dip->di_format = XFS_DINODE_FMT_LOCAL; > + dip->di_size = cpu_to_be64(1); > + p = XFS_DFORK_PTR(dip, XFS_DATA_FORK); > + *p = '.'; Hmm, changing a symlink to actually point somewhere seems very surprising, but making it point to the current directory almost begs for userspace code to run in loops. > +} > + > +/* > + * Blow out dir, make it point to the root. In the future repair will > + * reconstruct this directory for us. 
Note that there's no in-core directory > + * inode because the sf verifier tripped, so we don't have to worry about the > + * dentry cache. > + */ "make it point to root" isn't what I read in the code below. It reparents it into the root, I think. > +/* Make sure we don't have a garbage file size. */ > +STATIC void > +xrep_dinode_size( > + struct xfs_scrub *sc, > + struct xfs_dinode *dip) > +{ > + uint64_t size; > + uint16_t mode; > + > + trace_xrep_dinode_size(sc, dip); > + > + mode = be16_to_cpu(dip->di_mode); > + size = be64_to_cpu(dip->di_size); Any reason not to simply initialize the variables at declaration time? (Same for a whole bunch of other functions / variables) > + if (xfs_has_reflink(sc->mp)) { > + ; /* data fork blockcount can exceed physical storage */ ... because we would be reflinking the same blocks into the same inode at different offsets over and over again ... ? Still, shouldn't we limit the condition to xfs_is_reflink_inode? > +/* Check for invalid uid/gid/prid. */ > +STATIC void > +xrep_inode_ids( > + struct xfs_scrub *sc) > +{ > + bool dirty = false; > + > + trace_xrep_inode_ids(sc); > + > + if (i_uid_read(VFS_I(sc->ip)) == -1U) { What is invalid about all-F uid/gid/projid? > + tstamp = inode_get_atime(inode); > + xrep_clamp_timestamp(ip, &tstamp); > + inode_set_atime_to_ts(inode, tstamp); Meh, I hate these new VFS timestamp access helpers. > + /* Find the last block before 32G; this is the dir size. */ > + error = xfs_iread_extents(sc->tp, sc->ip, XFS_DATA_FORK); I think that comment needs to go down to the off assignment and xfs_iext_lookup_extent_before call. > +/* > + * Fix any irregularities in an inode's size now that we can iterate extent > + * maps and access other regular inode data. > + */ > +STATIC void > +xrep_inode_size( > + struct xfs_scrub *sc) > +{ > + trace_xrep_inode_size(sc); > + > + /* > + * Currently we only support fixing size on extents or btree format > + * directories.
Files can be any size and sizes for the other inode > + * special types are fixed by xrep_dinode_size. > + */ > + if (!S_ISDIR(VFS_I(sc->ip)->i_mode)) > + return; I think moving this check to the caller and renaming the function would be a bit nicer, especially if we grow more file type specific checks in the future. Otherwise this looks reasonable to me. ^ permalink raw reply [flat|nested] 156+ messages in thread
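[Editor's note: Christoph's first comment in the review above — put a single void * cast in the DPTR macro and rely on gcc/clang's void-pointer arithmetic extension, so XFS_DFORK_PTR callers need no cast — can be sketched on a hypothetical stand-in layout. All names and offsets below are made up for illustration; the real macros operate on struct xfs_dinode.]

```c
#include <assert.h>

/* Hypothetical stand-in for the ondisk inode layout. */
struct fake_dinode {
	char	core[176];
	char	literal[80];	/* data fork bytes, then attr fork bytes */
};

#define DFORK_BOFF(dip)		40	/* pretend the attr fork starts at byte 40 */
#define DFORK_DPTR(dip)		((void *)&(dip)->literal[0])
/* gcc/clang treat void * arithmetic as byte arithmetic, so no char * cast: */
#define DFORK_APTR(dip)		(DFORK_DPTR(dip) + DFORK_BOFF(dip))
/* ...and because both arms are void *, callers need no cast either. */
#define DFORK_PTR(dip, w)	((w) == 0 ? DFORK_DPTR(dip) : DFORK_APTR(dip))
```

With this shape, a caller can write `struct foo *p = DFORK_PTR(dip, w);` directly, which is exactly the ergonomic win being discussed — at the cost of leaning on a GNU extension (void-pointer arithmetic is not ISO C).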
* Re: [PATCH 3/7] xfs: repair inode records 2023-11-28 17:08 ` Christoph Hellwig @ 2023-11-28 23:08 ` Darrick J. Wong 2023-11-29 6:02 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 23:08 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 09:08:35AM -0800, Christoph Hellwig wrote: > > @@ -1012,7 +1012,8 @@ enum xfs_dinode_fmt { > > #define XFS_DFORK_APTR(dip) \ > > (XFS_DFORK_DPTR(dip) + XFS_DFORK_BOFF(dip)) > > #define XFS_DFORK_PTR(dip,w) \ > > - ((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : XFS_DFORK_APTR(dip)) > > + ((void *)((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : \ > > + XFS_DFORK_APTR(dip))) > > Not requiring a cast when using XFS_DFORK_PTR is a good thing, but I > think this is the wrong way to do it. Instead of adding another cast > here we can just change the char * cast in XFS_DFORK_DPTR to a void * > one and rely on the widely used void pointer arithmetics extension in > gcc (and clang). Ok. > That'll also need a fixup to use a void instead of > char * cast in xchk_dinode. I'll change the conditional to: if (XFS_DFORK_BOFF(dip) >= mp->m_sb.sb_inodesize) > And in the long run many of these helpers relly should become inline > functions.. > > > + /* no large extent counts without the filesystem feature */ > > + if ((flags2 & XFS_DIFLAG2_NREXT64) && !xfs_has_large_extent_counts(mp)) > > + goto bad; > > This is just a missing check and not really related to repair, is it? Yep. I guess I'll pull that out into a separate patch. > > + /* > > + * The only information that needs to be passed between inode scrub and > > + * repair is the location of the ondisk metadata if iget fails. The > > + * rest of struct xrep_inode is context data that we need to massage > > + * the ondisk inode to the point that iget will work, which means that > > + * we don't allocate anything at all if the incore inode is loaded. 
> > + */ > > + if (!imap) > > + return 0; > > I don't really understand why this comment is here, and how it relates > to the imap NULL check. But as the only caller passes the address of an > on-stack imap I also don't understand why the check is here to start > with. Hmm. I think I've been through too many iterations of this code -- at one point I remember the null check was actually useful for something. But now it's not, so it can go. > > > + for (i = 0; i < ni; i++) { > > + ioff = i << mp->m_sb.sb_inodelog; > > + dip = xfs_buf_offset(bp, ioff); > > + agino = be32_to_cpu(dip->di_next_unlinked); > > + > > + unlinked_ok = magic_ok = crc_ok = false; > > I'd split the body of this loop into a separate helper and keep a lot of > the variables local to it. Ok. > > +/* Reinitialize things that never change in an inode. */ > > +STATIC void > > +xrep_dinode_header( > > + struct xfs_scrub *sc, > > + struct xfs_dinode *dip) > > +{ > > + trace_xrep_dinode_header(sc, dip); > > + > > + dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC); > > + if (!xfs_dinode_good_version(sc->mp, dip->di_version)) > > + dip->di_version = 3; > > Can we ever end up here for v4 file systems? Because in that case > the sane default inode version would be 2. No. xchk_validate_inputs will reject IFLAG_REPAIR on a V4 fs. Those are deprecated, there's no point in going back. > > + > > +/* Turn di_mode into /something/ recognizable. */ > > +STATIC void > > +xrep_dinode_mode( > > + struct xfs_scrub *sc, > > + struct xfs_dinode *dip) > > +{ > > + uint16_t mode; > > + > > + trace_xrep_dinode_mode(sc, dip); > > + > > + mode = be16_to_cpu(dip->di_mode); > > + if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN) > > This is a somewhat odd way to check for a valid mode, but it works, so.. :) > > + if (xfs_has_reflink(mp) && S_ISREG(mode)) > > + flags2 |= XFS_DIFLAG2_REFLINK; > > We set the reflink flag by default, because a later stage will clear > it if there aren't any shared blocks, right? 
Maybe add a comment to > avoid any future confusion. /* * For regular files on a reflink filesystem, set the REFLINK flag to * protect shared extents. A later stage will actually check those * extents and clear the flag if possible. */ > > > +STATIC void > > +xrep_dinode_zap_symlink( > > + struct xfs_scrub *sc, > > + struct xfs_dinode *dip) > > +{ > > + char *p; > > + > > + trace_xrep_dinode_zap_symlink(sc, dip); > > + > > + dip->di_format = XFS_DINODE_FMT_LOCAL; > > + dip->di_size = cpu_to_be64(1); > > + p = XFS_DFORK_PTR(dip, XFS_DATA_FORK); > > + *p = '.'; > > Hmm, changing a symlink to actually point somewhere seems very > surprising, but making it point to the current directory almost begs > for userspace code to run in loops. How about '🤷'? That's only four bytes. Or maybe a question mark. > > +} > > + > > +/* > > + * Blow out dir, make it point to the root. In the future repair will > > + * reconstruct this directory for us. Note that there's no in-core directory > > + * inode because the sf verifier tripped, so we don't have to worry about the > > + * dentry cache. > > + */ > > "make it point to root" isn't what I read in the code below. I parents > it in root I think. Yes. Changed to "Blow out dir, make the parent point to the root." > > +/* Make sure we don't have a garbage file size. */ > > +STATIC void > > +xrep_dinode_size( > > + struct xfs_scrub *sc, > > + struct xfs_dinode *dip) > > +{ > > + uint64_t size; > > + uint16_t mode; > > + > > + trace_xrep_dinode_size(sc, dip); > > + > > + mode = be16_to_cpu(dip->di_mode); > > + size = be64_to_cpu(dip->di_size); > > Any reason to not simplify initialize the variables at declaration > time? (Same for a while bunch of other functions / variables) No, not really. Will fix. > > + if (xfs_has_reflink(sc->mp)) { > > + ; /* data fork blockcount can exceed physical storage */ > > ... because we would be reflinking the same blocks into the same inode > at different offsets over and over again ... ? Yes. 
That's not a terribly functional file, but users can do such things if they want to pay for the cpu/metadata. > Still, shouldn't we limit the condition to xfs_is_reflink_inode? Yep. > > +/* Check for invalid uid/gid/prid. */ > > +STATIC void > > +xrep_inode_ids( > > + struct xfs_scrub *sc) > > +{ > > + bool dirty = false; > > + > > + trace_xrep_inode_ids(sc); > > + > > + if (i_uid_read(VFS_I(sc->ip)) == -1U) { > > What is invalid about all-F uid/gid/projid? I thought those were invalid, though apparently they're not now? uidgid.h says: static inline bool uid_valid(kuid_t uid) { return __kuid_val(uid) != (uid_t) -1; } Which is why I thought that it's not possible to have a uid of -1 on a file. Trying to set that uid on a file causes the kernel to reject the value, but OTOH I can apparently create inodes with a -1 UID via idmapping shenanigans. <shrug> > > + tstamp = inode_get_atime(inode); > > + xrep_clamp_timestamp(ip, &tstamp); > > + inode_set_atime_to_ts(inode, tstamp); > > Meh, I hate these new VFS timestamp access helper.. They're very clunky. > > + /* Find the last block before 32G; this is the dir size. */ > > + error = xfs_iread_extents(sc->tp, sc->ip, XFS_DATA_FORK); > > I think that comments needs to go down to the off asignment and > xfs_iext_lookup_extent_before call. Done. > > +/* > > + * Fix any irregularities in an inode's size now that we can iterate extent > > + * maps and access other regular inode data. > > + */ > > +STATIC void > > +xrep_inode_size( > > + struct xfs_scrub *sc) > > +{ > > + trace_xrep_inode_size(sc); > > + > > + /* > > + * Currently we only support fixing size on extents or btree format > > + * directories. Files can be any size and sizes for the other inode > > + * special types are fixed by xrep_dinode_size. 
> > + */ > > + if (!S_ISDIR(VFS_I(sc->ip)->i_mode)) > > + return; > > I think moving this check to the caller and renaming the function would > be a bit nicer, especially if we grow more file type specific checks > in the future. That's the only one, so I'll rename it to xrep_inode_dir_size and hoist this check to the caller. > Otherwise this looks reasonable to me. Woo, thanks for reading through all this. :) --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 3/7] xfs: repair inode records 2023-11-28 23:08 ` Darrick J. Wong @ 2023-11-29 6:02 ` Christoph Hellwig 2023-12-05 23:08 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 6:02 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Nov 28, 2023 at 03:08:48PM -0800, Darrick J. Wong wrote: > > > > We set the reflink flag by default, because a later stage will clear > > it if there aren't any shared blocks, right? Maybe add a comment to > > avoid any future confusion. > > /* > * For regular files on a reflink filesystem, set the REFLINK flag to > * protect shared extents. A later stage will actually check those > * extents and clear the flag if possible. > */ Sounds good. > > Hmm, changing a symlink to actually point somewhere seems very > > surprising, but making it point to the current directory almost begs > > for userspace code to run in loops. > > How about '🤷'? That's only four bytes. > > Or maybe a question mark. Heh. I guess a question mark seems a bit better, but the general idea of having new names / different pointing locations show up in the name space scares me a little. I wonder if something like the classic lost+found directory might be the right thing for anything we're not sure about, since people know about it. > > > + if (xfs_has_reflink(sc->mp)) { > > > + ; /* data fork blockcount can exceed physical storage */ > > > > ... because we would be reflinking the same blocks into the same inode > > at different offsets over and over again ... ? > > Yes. That's not a terribly functional file, but users can do such > things if they want to pay for the cpu/metadata. Yeah. But maybe expand the comment a bit - having spent a fair amount of time with the reflink code this was obvious to me, but for someone new the above might be a bit too cryptic. > > > + > > > + if (i_uid_read(VFS_I(sc->ip)) == -1U) { > > > > What is invalid about all-F uid/gid/projid?
> > I thought those were invalid, though apparently they're not now? > uidgid.h says: > > static inline bool uid_valid(kuid_t uid) > { > return __kuid_val(uid) != (uid_t) -1; > } > > Which is why I thought that it's not possible to have a uid of -1 on a > file. Trying to set that uid on a file causes the kernel to reject the > value, but OTOH I can apparently create inodes with a -1 UID via > idmapping shenanigans. Heh. Just wanted an explanation for the check. So a comment is fine, or finding a way to use the uid_valid and co. helpers to make it self-explanatory. ^ permalink raw reply [flat|nested] 156+ messages in thread
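[Editor's note: the uid_valid() helper quoted from uidgid.h above can be exercised in isolation to show why the all-ones id is the one reserved invalid value. The typedefs below are simplified userspace stand-ins for the kernel's kuid_t machinery, not the real definitions.]

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int uid_t_;		/* stand-in for the kernel's uid_t */
typedef struct { uid_t_ val; } kuid_t_;	/* stand-in for kuid_t */

static inline uid_t_ __kuid_val(kuid_t_ uid)
{
	return uid.val;
}

/* Same shape as the uidgid.h helper quoted in the thread. */
static inline bool uid_valid(kuid_t_ uid)
{
	return __kuid_val(uid) != (uid_t_)-1;
}
```

Every id except (uid_t)-1 passes, which is why a disk inode carrying an all-Fs uid is the specific case the repair code wants to zap, and why spelling the check as !uid_valid(...) is self-explanatory in a way that comparing against -1U is not.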
* Re: [PATCH 3/7] xfs: repair inode records 2023-11-29 6:02 ` Christoph Hellwig @ 2023-12-05 23:08 ` Darrick J. Wong 2023-12-06 5:16 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-12-05 23:08 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 10:02:42PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 03:08:48PM -0800, Darrick J. Wong wrote: > > > > > > We set the reflink flag by default, because a later stage will clear > > > it if there aren't any shared blocks, right? Maybe add a comment to > > > avoid any future confusion. > > > > /* > > * For regular files on a reflink filesystem, set the REFLINK flag to > > * protect shared extents. A later stage will actually check those > > * extents and clear the flag if possible. > > */ > > Sounds good. > > > > Hmm, changing a symlink to actually point somewhere seems very > > > surprising, but making it point to the current directory almost begs > > > for userspace code to run in loops. > > > > How about '🤷'? That's only four bytes. > > > > Or maybe a question mark. > > Heh. I guess question marks seems a bit better, but the general idea > of having new names / different pointing locations show up in the > name space scare me a little. I wonder if something like the classic > lost+found directory might be the right thing for anything we're not > sure about as people know about it. Hmm. I suppose a problem with "?" is that question-mark is a valid filename, which means that our zapped symlink could now suddenly point to a different file that a user created. "/lost+found" isn't different in that respect, but societal convention might at least provide for raised eyebrows. That said, mkfs.xfs doesn't create one for us like mke2fs does, so maybe a broken symlink to the orphanage is... well, now I'm bikeshedding my own creation. May I try to make a case for "🚽"? 
;) > > > > + if (xfs_has_reflink(sc->mp)) { > > > > + ; /* data fork blockcount can exceed physical storage */ > > > > > > ... because we would be reflinking the same blocks into the same inode > > > at different offsets over and over again ... ? > > > > Yes. That's not a terribly functional file, but users can do such > > things if they want to pay for the cpu/metadata. > > Yeah. But maybe expand the comment a bit - having spent a fair amout > of time with the reflink code this was obvious to me, but for someone > new the above might be a bit too cryptic. /* * data fork blockcount can exceed physical storage if a * user reflinks the same block over and over again. */ > > > > + > > > > + if (i_uid_read(VFS_I(sc->ip)) == -1U) { > > > > > > What is invalid about all-F uid/gid/projid? > > > > I thought those were invalid, though apparently they're not now? > > uidgid.h says: > > > > static inline bool uid_valid(kuid_t uid) > > { > > return __kuid_val(uid) != (uid_t) -1; > > } > > > > Which is why I thought that it's not possible to have a uid of -1 on a > > file. Trying to set that uid on a file causes the kernel to reject the > > value, but OTOH I can apparently create inodes with a -1 UID via > > idmapping shenanigans. > > Heh. Just wanted an explanation for the check. So a commnt is fine, > or finding a way to use the uid_valid and co helpers to make it > self-explanatory. Yeah, I think this converts easily to: if (!uid_valid(VFS_I(sc->ip)->i_uid)) { /* zap it */ } --D ^ permalink raw reply [flat|nested] 156+ messages in thread
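[Editor's note: the comment Darrick proposes above — data fork blockcount can exceed physical storage when a user reflinks the same block over and over — comes down to simple arithmetic: di_nblocks counts fork mappings, not unique physical blocks. A toy model, with the numbers chosen purely for illustration:]

```c
#include <stdint.h>

/*
 * Toy model: reflink the same single physical block into one file at
 * nmaps different logical offsets.  The fork blockcount (what
 * di_nblocks tracks) grows with every mapping, while the storage
 * actually consumed stays at one data block (plus bmbt metadata).
 */
static uint64_t di_nblocks_for(uint64_t nmaps)
{
	return nmaps;	/* one counted fork mapping per reflinked copy */
}

static uint64_t physical_data_blocks(void)
{
	return 1;	/* every mapping points at the same block */
}
```

So a 4 GiB-looking data fork (2^20 mappings of a 4 KiB block) can sit on a single physical block, which is why the repair-time blockcount check has to tolerate nblocks exceeding the device size on reflink filesystems.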
* Re: [PATCH 3/7] xfs: repair inode records 2023-12-05 23:08 ` Darrick J. Wong @ 2023-12-06 5:16 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-12-06 5:16 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Dec 05, 2023 at 03:08:43PM -0800, Darrick J. Wong wrote: > Hmm. I suppose a problem with "?" is that question-mark is a valid > filename, which means that our zapped symlink could now suddenly point > to a different file that a user created. "/lost+found" isn't different > in that respect, but societal convention might at least provide for > raised eyebrows. That said, mkfs.xfs doesn't create one for us like > mke2fs does, so maybe a broken symlink to the orphanage is... well, now > I'm bikeshedding my own creation. > > May I try to make a case for "🚽"? ;) Haha.. I suspect not allowing the link to be followed at all if it is marked sick is the best idea, i.e. the concept we've talked about for regular files. Make that consistent for all file types, and then we need to look into an expedited on-disk flag for that to make it persistent. > > /* > * data fork blockcount can exceed physical storage if a > * user reflinks the same block over and over again. > */ Yup. > if (!uid_valid(VFS_I(sc->ip)->i_uid)) { > /* zap it */ > } Perfect. ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 4/7] xfs: zap broken inode forks 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:51 ` [PATCH 3/7] xfs: repair inode records Darrick J. Wong @ 2023-11-24 23:52 ` Darrick J. Wong 2023-11-30 4:44 ` Christoph Hellwig 2023-11-24 23:52 ` [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong ` (2 subsequent siblings) 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:52 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Determine if inode fork damage is responsible for the inode being unable to pass the ifork verifiers in xfs_iget and zap the fork contents if this is true. Once this is done the fork will be empty but we'll be able to construct an in-core inode, and a subsequent call to the inode fork repair ioctl will search the rmapbt to rebuild the records that were in the fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_attr_leaf.c | 32 +- fs/xfs/libxfs/xfs_attr_leaf.h | 2 fs/xfs/libxfs/xfs_bmap.c | 22 + fs/xfs/libxfs/xfs_bmap.h | 2 fs/xfs/libxfs/xfs_dir2_priv.h | 2 fs/xfs/libxfs/xfs_dir2_sf.c | 29 +- fs/xfs/libxfs/xfs_shared.h | 1 fs/xfs/libxfs/xfs_symlink_remote.c | 21 + fs/xfs/scrub/inode_repair.c | 696 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/trace.h | 42 ++ 10 files changed, 812 insertions(+), 37 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index 2580ae47209a6..24d266c98bc97 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -1040,23 +1040,16 @@ xfs_attr_shortform_allfit( return xfs_attr_shortform_bytesfit(dp, bytes); } -/* Verify the consistency of an inline attribute fork. */ +/* Verify the consistency of a raw inline attribute fork. 
*/ xfs_failaddr_t -xfs_attr_shortform_verify( - struct xfs_inode *ip) +xfs_attr_shortform_verify_struct( + struct xfs_attr_shortform *sfp, + size_t size) { - struct xfs_attr_shortform *sfp; struct xfs_attr_sf_entry *sfep; struct xfs_attr_sf_entry *next_sfep; char *endp; - struct xfs_ifork *ifp; int i; - int64_t size; - - ASSERT(ip->i_af.if_format == XFS_DINODE_FMT_LOCAL); - ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK); - sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data; - size = ifp->if_bytes; /* * Give up if the attribute is way too short. @@ -1116,6 +1109,23 @@ xfs_attr_shortform_verify( return NULL; } +/* Verify the consistency of an inline attribute fork. */ +xfs_failaddr_t +xfs_attr_shortform_verify( + struct xfs_inode *ip) +{ + struct xfs_attr_shortform *sfp; + struct xfs_ifork *ifp; + int64_t size; + + ASSERT(ip->i_af.if_format == XFS_DINODE_FMT_LOCAL); + ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK); + sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data; + size = ifp->if_bytes; + + return xfs_attr_shortform_verify_struct(sfp, size); +} + /* * Convert a leaf attribute list to shortform attribute list */ diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h index 368f4d9fa1d59..0711a448f64ce 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.h +++ b/fs/xfs/libxfs/xfs_attr_leaf.h @@ -56,6 +56,8 @@ int xfs_attr_sf_findname(struct xfs_da_args *args, unsigned int *basep); int xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp); int xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes); +xfs_failaddr_t xfs_attr_shortform_verify_struct(struct xfs_attr_shortform *sfp, + size_t size); xfs_failaddr_t xfs_attr_shortform_verify(struct xfs_inode *ip); void xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp); diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 68be1dd4f0f26..9968a3a6e6d8d 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -6179,19 +6179,18 @@ xfs_bmap_finish_one( return 
error; } -/* Check that an inode's extent does not have invalid flags or bad ranges. */ +/* Check that an extent does not have invalid flags or bad ranges. */ xfs_failaddr_t -xfs_bmap_validate_extent( - struct xfs_inode *ip, +xfs_bmap_validate_extent_raw( + struct xfs_mount *mp, + bool rtfile, int whichfork, struct xfs_bmbt_irec *irec) { - struct xfs_mount *mp = ip->i_mount; - if (!xfs_verify_fileext(mp, irec->br_startoff, irec->br_blockcount)) return __this_address; - if (XFS_IS_REALTIME_INODE(ip) && whichfork == XFS_DATA_FORK) { + if (rtfile && whichfork == XFS_DATA_FORK) { if (!xfs_verify_rtbext(mp, irec->br_startblock, irec->br_blockcount)) return __this_address; @@ -6221,3 +6220,14 @@ xfs_bmap_intent_destroy_cache(void) kmem_cache_destroy(xfs_bmap_intent_cache); xfs_bmap_intent_cache = NULL; } + +/* Check that an inode's extent does not have invalid flags or bad ranges. */ +xfs_failaddr_t +xfs_bmap_validate_extent( + struct xfs_inode *ip, + int whichfork, + struct xfs_bmbt_irec *irec) +{ + return xfs_bmap_validate_extent_raw(ip->i_mount, + XFS_IS_REALTIME_INODE(ip), whichfork, irec); +} diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index e33470e39728d..8518324db2855 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -263,6 +263,8 @@ static inline uint32_t xfs_bmap_fork_to_state(int whichfork) } } +xfs_failaddr_t xfs_bmap_validate_extent_raw(struct xfs_mount *mp, bool rtfile, + int whichfork, struct xfs_bmbt_irec *irec); xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork, struct xfs_bmbt_irec *irec); int xfs_bmap_complain_bad_rec(struct xfs_inode *ip, int whichfork, diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h index 7404a9ff1a929..b10859a43776d 100644 --- a/fs/xfs/libxfs/xfs_dir2_priv.h +++ b/fs/xfs/libxfs/xfs_dir2_priv.h @@ -175,6 +175,8 @@ extern int xfs_dir2_sf_create(struct xfs_da_args *args, xfs_ino_t pino); extern int xfs_dir2_sf_lookup(struct xfs_da_args 
*args); extern int xfs_dir2_sf_removename(struct xfs_da_args *args); extern int xfs_dir2_sf_replace(struct xfs_da_args *args); +extern xfs_failaddr_t xfs_dir2_sf_verify_struct(struct xfs_mount *mp, + struct xfs_dir2_sf_hdr *sfp, int64_t size); extern xfs_failaddr_t xfs_dir2_sf_verify(struct xfs_inode *ip); int xfs_dir2_sf_entsize(struct xfs_mount *mp, struct xfs_dir2_sf_hdr *hdr, int len); diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c index 8cd37e6e9d387..0089046585247 100644 --- a/fs/xfs/libxfs/xfs_dir2_sf.c +++ b/fs/xfs/libxfs/xfs_dir2_sf.c @@ -706,12 +706,11 @@ xfs_dir2_sf_check( /* Verify the consistency of an inline directory. */ xfs_failaddr_t -xfs_dir2_sf_verify( - struct xfs_inode *ip) +xfs_dir2_sf_verify_struct( + struct xfs_mount *mp, + struct xfs_dir2_sf_hdr *sfp, + int64_t size) { - struct xfs_mount *mp = ip->i_mount; - struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); - struct xfs_dir2_sf_hdr *sfp; struct xfs_dir2_sf_entry *sfep; struct xfs_dir2_sf_entry *next_sfep; char *endp; @@ -719,15 +718,9 @@ xfs_dir2_sf_verify( int i; int i8count; int offset; - int64_t size; int error; uint8_t filetype; - ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); - - sfp = (struct xfs_dir2_sf_hdr *)ifp->if_u1.if_data; - size = ifp->if_bytes; - /* * Give up if the directory is way too short. */ @@ -803,6 +796,20 @@ xfs_dir2_sf_verify( return NULL; } +xfs_failaddr_t +xfs_dir2_sf_verify( + struct xfs_inode *ip) +{ + struct xfs_mount *mp = ip->i_mount; + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); + struct xfs_dir2_sf_hdr *sfp; + + ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); + + sfp = (struct xfs_dir2_sf_hdr *)ifp->if_u1.if_data; + return xfs_dir2_sf_verify_struct(mp, sfp, ifp->if_bytes); +} + /* * Create a new (shortform) directory. 
*/ diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h index c4381388c0c1a..57a52fa76a496 100644 --- a/fs/xfs/libxfs/xfs_shared.h +++ b/fs/xfs/libxfs/xfs_shared.h @@ -139,6 +139,7 @@ bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset, uint32_t size, struct xfs_buf *bp); void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp, struct xfs_inode *ip, struct xfs_ifork *ifp); +xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size); xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip); /* Computed inode geometry for the filesystem. */ diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c index bdc777b9ec4a6..7660a95b1ea97 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.c +++ b/fs/xfs/libxfs/xfs_symlink_remote.c @@ -201,16 +201,12 @@ xfs_symlink_local_to_remote( * does not do on-disk format checks. */ xfs_failaddr_t -xfs_symlink_shortform_verify( - struct xfs_inode *ip) +xfs_symlink_sf_verify_struct( + void *sfp, + int64_t size) { - struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); - char *sfp = (char *)ifp->if_u1.if_data; - int size = ifp->if_bytes; char *endp = sfp + size; - ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); - /* * Zero length symlinks should never occur in memory as they are * never allowed to exist on disk. 
@@ -231,3 +227,14 @@ xfs_symlink_shortform_verify( return __this_address; return NULL; } + +xfs_failaddr_t +xfs_symlink_shortform_verify( + struct xfs_inode *ip) +{ + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); + + ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); + + return xfs_symlink_sf_verify_struct(ifp->if_u1.if_data, ifp->if_bytes); +} diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c index 3967fe737fa9c..a73205702ffa5 100644 --- a/fs/xfs/scrub/inode_repair.c +++ b/fs/xfs/scrub/inode_repair.c @@ -22,8 +22,11 @@ #include "xfs_ialloc.h" #include "xfs_da_format.h" #include "xfs_reflink.h" +#include "xfs_alloc.h" #include "xfs_rmap.h" +#include "xfs_rmap_btree.h" #include "xfs_bmap.h" +#include "xfs_bmap_btree.h" #include "xfs_bmap_util.h" #include "xfs_dir2.h" #include "xfs_dir2_priv.h" @@ -31,6 +34,8 @@ #include "xfs_quota.h" #include "xfs_ag.h" #include "xfs_rtbitmap.h" +#include "xfs_attr_leaf.h" +#include "xfs_log_priv.h" #include "scrub/xfs_scrub.h" #include "scrub/scrub.h" #include "scrub/common.h" @@ -70,6 +75,16 @@ * * - Invalid user, group, or project IDs (aka -1U) will be reset to zero. * Setuid and setgid bits are cleared. + * + * - Data and attr forks are reset to extents format with zero extents if the + * fork data is inconsistent. It is necessary to run the bmapbtd or bmapbta + * repair functions to recover the space mapping. + * + * - ACLs will not be recovered if the attr fork is zapped or the extended + * attribute structure itself requires salvaging. + * + * - If the attr fork is zapped, the user and group ids are reset to root and + * the setuid and setgid bits are removed. */ /* @@ -82,6 +97,28 @@ struct xrep_inode { struct xfs_imap imap; struct xfs_scrub *sc; + + /* Blocks in use on the data device by data extents or bmbt blocks. */ + xfs_rfsblock_t data_blocks; + + /* Blocks in use on the rt device. */ + xfs_rfsblock_t rt_blocks; + + /* Blocks in use by the attr fork. 
*/ + xfs_rfsblock_t attr_blocks; + + /* Number of data device extents for the data fork. */ + xfs_extnum_t data_extents; + + /* + * Number of realtime device extents for the data fork. If + * data_extents and rt_extents indicate that the data fork has extents + * on both devices, we'll just back away slowly. + */ + xfs_extnum_t rt_extents; + + /* Number of (data device) extents for the attr fork. */ + xfs_aextnum_t attr_extents; }; /* Setup function for inode repair. */ @@ -209,7 +246,8 @@ xrep_dinode_mode( STATIC void xrep_dinode_flags( struct xfs_scrub *sc, - struct xfs_dinode *dip) + struct xfs_dinode *dip, + bool isrt) { struct xfs_mount *mp = sc->mp; uint64_t flags2; @@ -222,6 +260,11 @@ xrep_dinode_flags( flags = be16_to_cpu(dip->di_flags); flags2 = be64_to_cpu(dip->di_flags2); + if (isrt) + flags |= XFS_DIFLAG_REALTIME; + else + flags &= ~XFS_DIFLAG_REALTIME; + if (xfs_has_reflink(mp) && S_ISREG(mode)) flags2 |= XFS_DIFLAG2_REFLINK; else @@ -374,6 +417,649 @@ xrep_dinode_extsize_hints( } } +/* Count extents and blocks for an inode given an rmap. */ +STATIC int +xrep_dinode_walk_rmap( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + struct xrep_inode *ri = priv; + int error = 0; + + if (xchk_should_terminate(ri->sc, &error)) + return error; + + /* We only care about this inode. */ + if (rec->rm_owner != ri->sc->sm->sm_ino) + return 0; + + if (rec->rm_flags & XFS_RMAP_ATTR_FORK) { + ri->attr_blocks += rec->rm_blockcount; + if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK)) + ri->attr_extents++; + + return 0; + } + + ri->data_blocks += rec->rm_blockcount; + if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK)) + ri->data_extents++; + + return 0; +} + +/* Count extents and blocks for an inode from all AG rmap data. 
*/ +STATIC int +xrep_dinode_count_ag_rmaps( + struct xrep_inode *ri, + struct xfs_perag *pag) +{ + struct xfs_btree_cur *cur; + struct xfs_buf *agf; + int error; + + error = xfs_alloc_read_agf(pag, ri->sc->tp, 0, &agf); + if (error) + return error; + + cur = xfs_rmapbt_init_cursor(ri->sc->mp, ri->sc->tp, agf, pag); + error = xfs_rmap_query_all(cur, xrep_dinode_walk_rmap, ri); + xfs_btree_del_cursor(cur, error); + xfs_trans_brelse(ri->sc->tp, agf); + return error; +} + +/* Count extents and blocks for a given inode from all rmap data. */ +STATIC int +xrep_dinode_count_rmaps( + struct xrep_inode *ri) +{ + struct xfs_perag *pag; + xfs_agnumber_t agno; + int error; + + if (!xfs_has_rmapbt(ri->sc->mp) || xfs_has_realtime(ri->sc->mp)) + return -EOPNOTSUPP; + + for_each_perag(ri->sc->mp, agno, pag) { + error = xrep_dinode_count_ag_rmaps(ri, pag); + if (error) { + xfs_perag_rele(pag); + return error; + } + } + + /* Can't have extents on both the rt and the data device. */ + if (ri->data_extents && ri->rt_extents) + return -EFSCORRUPTED; + + trace_xrep_dinode_count_rmaps(ri->sc, + ri->data_blocks, ri->rt_blocks, ri->attr_blocks, + ri->data_extents, ri->rt_extents, ri->attr_extents); + return 0; +} + +/* Return true if this extents-format ifork looks like garbage. 
*/ +STATIC bool +xrep_dinode_bad_extents_fork( + struct xfs_scrub *sc, + struct xfs_dinode *dip, + unsigned int dfork_size, + int whichfork) +{ + struct xfs_bmbt_irec new; + struct xfs_bmbt_rec *dp; + xfs_extnum_t nex; + bool isrt; + unsigned int i; + + nex = xfs_dfork_nextents(dip, whichfork); + if (nex > dfork_size / sizeof(struct xfs_bmbt_rec)) + return true; + + dp = XFS_DFORK_PTR(dip, whichfork); + + isrt = dip->di_flags & cpu_to_be16(XFS_DIFLAG_REALTIME); + for (i = 0; i < nex; i++, dp++) { + xfs_failaddr_t fa; + + xfs_bmbt_disk_get_all(dp, &new); + fa = xfs_bmap_validate_extent_raw(sc->mp, isrt, whichfork, + &new); + if (fa) + return true; + } + + return false; +} + +/* Return true if this btree-format ifork looks like garbage. */ +STATIC bool +xrep_dinode_bad_bmbt_fork( + struct xfs_scrub *sc, + struct xfs_dinode *dip, + unsigned int dfork_size, + int whichfork) +{ + struct xfs_bmdr_block *dfp; + xfs_extnum_t nex; + unsigned int i; + unsigned int dmxr; + unsigned int nrecs; + unsigned int level; + + nex = xfs_dfork_nextents(dip, whichfork); + if (nex <= dfork_size / sizeof(struct xfs_bmbt_rec)) + return true; + + if (dfork_size < sizeof(struct xfs_bmdr_block)) + return true; + + dfp = XFS_DFORK_PTR(dip, whichfork); + nrecs = be16_to_cpu(dfp->bb_numrecs); + level = be16_to_cpu(dfp->bb_level); + + if (nrecs == 0 || XFS_BMDR_SPACE_CALC(nrecs) > dfork_size) + return true; + if (level == 0 || level >= XFS_BM_MAXLEVELS(sc->mp, whichfork)) + return true; + + dmxr = xfs_bmdr_maxrecs(dfork_size, 0); + for (i = 1; i <= nrecs; i++) { + struct xfs_bmbt_key *fkp; + xfs_bmbt_ptr_t *fpp; + xfs_fileoff_t fileoff; + xfs_fsblock_t fsbno; + + fkp = XFS_BMDR_KEY_ADDR(dfp, i); + fileoff = be64_to_cpu(fkp->br_startoff); + if (!xfs_verify_fileoff(sc->mp, fileoff)) + return true; + + fpp = XFS_BMDR_PTR_ADDR(dfp, i, dmxr); + fsbno = be64_to_cpu(*fpp); + if (!xfs_verify_fsbno(sc->mp, fsbno)) + return true; + } + + return false; +} + +/* + * Check the data fork for things that will 
fail the ifork verifiers or the + * ifork formatters. + */ +STATIC bool +xrep_dinode_check_dfork( + struct xfs_scrub *sc, + struct xfs_dinode *dip, + uint16_t mode) +{ + void *dfork_ptr; + int64_t data_size; + unsigned int fmt; + unsigned int dfork_size; + + /* + * Verifier functions take signed int64_t, so check for bogus negative + * values first. + */ + data_size = be64_to_cpu(dip->di_size); + if (data_size < 0) + return true; + + fmt = XFS_DFORK_FORMAT(dip, XFS_DATA_FORK); + switch (mode & S_IFMT) { + case S_IFIFO: + case S_IFCHR: + case S_IFBLK: + case S_IFSOCK: + if (fmt != XFS_DINODE_FMT_DEV) + return true; + break; + case S_IFREG: + if (fmt == XFS_DINODE_FMT_LOCAL) + return true; + fallthrough; + case S_IFLNK: + case S_IFDIR: + switch (fmt) { + case XFS_DINODE_FMT_LOCAL: + case XFS_DINODE_FMT_EXTENTS: + case XFS_DINODE_FMT_BTREE: + break; + default: + return true; + } + break; + default: + return true; + } + + dfork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_DATA_FORK); + dfork_ptr = XFS_DFORK_PTR(dip, XFS_DATA_FORK); + + switch (fmt) { + case XFS_DINODE_FMT_DEV: + break; + case XFS_DINODE_FMT_LOCAL: + /* dir/symlink structure cannot be larger than the fork */ + if (data_size > dfork_size) + return true; + /* directory structure must pass verification. */ + if (S_ISDIR(mode) && xfs_dir2_sf_verify_struct(sc->mp, + dfork_ptr, data_size) != NULL) + return true; + /* symlink structure must pass verification. 
*/ + if (S_ISLNK(mode) && xfs_symlink_sf_verify_struct(dfork_ptr, + data_size) != NULL) + return true; + break; + case XFS_DINODE_FMT_EXTENTS: + if (xrep_dinode_bad_extents_fork(sc, dip, dfork_size, + XFS_DATA_FORK)) + return true; + break; + case XFS_DINODE_FMT_BTREE: + if (xrep_dinode_bad_bmbt_fork(sc, dip, dfork_size, + XFS_DATA_FORK)) + return true; + break; + default: + return true; + } + + return false; +} + +static void +xrep_dinode_set_data_nextents( + struct xfs_dinode *dip, + xfs_extnum_t nextents) +{ + if (xfs_dinode_has_large_extent_counts(dip)) + dip->di_big_nextents = cpu_to_be64(nextents); + else + dip->di_nextents = cpu_to_be32(nextents); +} + +static void +xrep_dinode_set_attr_nextents( + struct xfs_dinode *dip, + xfs_extnum_t nextents) +{ + if (xfs_dinode_has_large_extent_counts(dip)) + dip->di_big_anextents = cpu_to_be32(nextents); + else + dip->di_anextents = cpu_to_be16(nextents); +} + +/* Reset the data fork to something sane. */ +STATIC void +xrep_dinode_zap_dfork( + struct xrep_inode *ri, + struct xfs_dinode *dip, + uint16_t mode) +{ + struct xfs_scrub *sc = ri->sc; + + trace_xrep_dinode_zap_dfork(sc, dip); + + xrep_dinode_set_data_nextents(dip, 0); + + /* Special files always get reset to DEV */ + switch (mode & S_IFMT) { + case S_IFIFO: + case S_IFCHR: + case S_IFBLK: + case S_IFSOCK: + dip->di_format = XFS_DINODE_FMT_DEV; + dip->di_size = 0; + return; + } + + /* + * If we have data extents, reset to an empty map and hope the user + * will run the bmapbtd checker next. + */ + if (ri->data_extents || ri->rt_extents || S_ISREG(mode)) { + dip->di_format = XFS_DINODE_FMT_EXTENTS; + return; + } + + /* Otherwise, reset the local format to the minimum. */ + switch (mode & S_IFMT) { + case S_IFLNK: + xrep_dinode_zap_symlink(sc, dip); + break; + case S_IFDIR: + xrep_dinode_zap_dir(sc, dip); + break; + } +} + +/* + * Check the attr fork for things that will fail the ifork verifiers or the + * ifork formatters. 
+ */ +STATIC bool +xrep_dinode_check_afork( + struct xfs_scrub *sc, + struct xfs_dinode *dip) +{ + struct xfs_attr_shortform *afork_ptr; + size_t attr_size; + unsigned int afork_size; + + if (XFS_DFORK_BOFF(dip) == 0) + return dip->di_aformat != XFS_DINODE_FMT_EXTENTS || + xfs_dfork_attr_extents(dip) != 0; + + afork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK); + afork_ptr = XFS_DFORK_PTR(dip, XFS_ATTR_FORK); + + switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) { + case XFS_DINODE_FMT_LOCAL: + /* Fork has to be large enough to extract the xattr size. */ + if (afork_size < sizeof(struct xfs_attr_sf_hdr)) + return true; + + /* xattr structure cannot be larger than the fork */ + attr_size = be16_to_cpu(afork_ptr->hdr.totsize); + if (attr_size > afork_size) + return true; + + /* xattr structure must pass verification. */ + return xfs_attr_shortform_verify_struct(afork_ptr, + attr_size) != NULL; + case XFS_DINODE_FMT_EXTENTS: + if (xrep_dinode_bad_extents_fork(sc, dip, afork_size, + XFS_ATTR_FORK)) + return true; + break; + case XFS_DINODE_FMT_BTREE: + if (xrep_dinode_bad_bmbt_fork(sc, dip, afork_size, + XFS_ATTR_FORK)) + return true; + break; + default: + return true; + } + + return false; +} + +/* + * Reset the attr fork to empty. Since the attr fork could have contained + * ACLs, make the file readable only by root. + */ +STATIC void +xrep_dinode_zap_afork( + struct xrep_inode *ri, + struct xfs_dinode *dip, + uint16_t mode) +{ + struct xfs_scrub *sc = ri->sc; + + trace_xrep_dinode_zap_afork(sc, dip); + + dip->di_aformat = XFS_DINODE_FMT_EXTENTS; + xrep_dinode_set_attr_nextents(dip, 0); + + /* + * If the data fork is in btree format, removing the attr fork entirely + * might cause verifier failures if the next level down in the bmbt + * could now fit in the data fork area. 
+ */ + if (dip->di_format != XFS_DINODE_FMT_BTREE) + dip->di_forkoff = 0; + dip->di_mode = cpu_to_be16(mode & ~0777); + dip->di_uid = 0; + dip->di_gid = 0; +} + +/* Make sure the fork offset is a sensible value. */ +STATIC void +xrep_dinode_ensure_forkoff( + struct xrep_inode *ri, + struct xfs_dinode *dip, + uint16_t mode) +{ + struct xfs_bmdr_block *bmdr; + struct xfs_scrub *sc = ri->sc; + xfs_extnum_t attr_extents, data_extents; + size_t bmdr_minsz = XFS_BMDR_SPACE_CALC(1); + unsigned int lit_sz = XFS_LITINO(sc->mp); + unsigned int afork_min, dfork_min; + + trace_xrep_dinode_ensure_forkoff(sc, dip); + + /* + * Before calling this function, xrep_dinode_core ensured that both + * forks actually fit inside their respective literal areas. If this + * was not the case, the fork was reset to FMT_EXTENTS with zero + * records. If the rmapbt scan found attr or data fork blocks, this + * will be noted in the dinode_stats, and we must leave enough room + * for the bmap repair code to reconstruct the mapping structure. + * + * First, compute the minimum space required for the attr fork. + */ + switch (dip->di_aformat) { + case XFS_DINODE_FMT_LOCAL: + /* + * If we still have a shortform xattr structure at all, that + * means the attr fork area was exactly large enough to fit + * the sf structure. + */ + afork_min = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK); + break; + case XFS_DINODE_FMT_EXTENTS: + attr_extents = xfs_dfork_attr_extents(dip); + if (attr_extents) { + /* + * We must maintain sufficient space to hold the entire + * extent map array in the data fork. Note that we + * previously zapped the fork if it had no chance of + * fitting in the inode. + */ + afork_min = sizeof(struct xfs_bmbt_rec) * attr_extents; + } else if (ri->attr_extents > 0) { + /* + * The attr fork thinks it has zero extents, but we + * found some xattr extents. 
We need to leave enough + * empty space here so that the incore attr fork will + * get created (and hence trigger the attr fork bmap + * repairer). + */ + afork_min = bmdr_minsz; + } else { + /* No extents on disk or found in rmapbt. */ + afork_min = 0; + } + break; + case XFS_DINODE_FMT_BTREE: + /* Must have space for btree header and key/pointers. */ + bmdr = XFS_DFORK_PTR(dip, XFS_ATTR_FORK); + afork_min = XFS_BMAP_BROOT_SPACE(sc->mp, bmdr); + break; + default: + /* We should never see any other formats. */ + afork_min = 0; + break; + } + + /* Compute the minimum space required for the data fork. */ + switch (dip->di_format) { + case XFS_DINODE_FMT_DEV: + dfork_min = sizeof(__be32); + break; + case XFS_DINODE_FMT_UUID: + dfork_min = sizeof(uuid_t); + break; + case XFS_DINODE_FMT_LOCAL: + /* + * If we still have a shortform data fork at all, that means + * the data fork area was large enough to fit whatever was in + * there. + */ + dfork_min = be64_to_cpu(dip->di_size); + break; + case XFS_DINODE_FMT_EXTENTS: + data_extents = xfs_dfork_data_extents(dip); + if (data_extents) { + /* + * We must maintain sufficient space to hold the entire + * extent map array in the data fork. Note that we + * previously zapped the fork if it had no chance of + * fitting in the inode. + */ + dfork_min = sizeof(struct xfs_bmbt_rec) * data_extents; + } else if (ri->data_extents > 0 || ri->rt_extents > 0) { + /* + * The data fork thinks it has zero extents, but we + * found some data extents. We need to leave enough + * empty space here so that the data fork bmap repair + * will recover the mappings. + */ + dfork_min = bmdr_minsz; + } else { + /* No extents on disk or found in rmapbt. */ + dfork_min = 0; + } + break; + case XFS_DINODE_FMT_BTREE: + /* Must have space for btree header and key/pointers. 
*/ + bmdr = XFS_DFORK_PTR(dip, XFS_DATA_FORK); + dfork_min = XFS_BMAP_BROOT_SPACE(sc->mp, bmdr); + break; + default: + dfork_min = 0; + break; + } + + /* + * Round all values up to the nearest 8 bytes, because that is the + * precision of di_forkoff. + */ + afork_min = roundup(afork_min, 8); + dfork_min = roundup(dfork_min, 8); + bmdr_minsz = roundup(bmdr_minsz, 8); + + ASSERT(dfork_min <= lit_sz); + ASSERT(afork_min <= lit_sz); + + /* + * If the data fork was zapped and we don't have enough space for the + * recovery fork, move the attr fork up. + */ + if (dip->di_format == XFS_DINODE_FMT_EXTENTS && + xfs_dfork_data_extents(dip) == 0 && + (ri->data_extents > 0 || ri->rt_extents > 0) && + bmdr_minsz > XFS_DFORK_DSIZE(dip, sc->mp)) { + if (bmdr_minsz + afork_min > lit_sz) { + /* + * The attr fork and the stub fork we need to recover + * the data fork won't both fit. Zap the attr fork. + */ + xrep_dinode_zap_afork(ri, dip, mode); + afork_min = bmdr_minsz; + } else { + void *before, *after; + + /* Otherwise, just slide the attr fork up. */ + before = XFS_DFORK_APTR(dip); + dip->di_forkoff = bmdr_minsz >> 3; + after = XFS_DFORK_APTR(dip); + memmove(after, before, XFS_DFORK_ASIZE(dip, sc->mp)); + } + } + + /* + * If the attr fork was zapped and we don't have enough space for the + * recovery fork, move the attr fork down. + */ + if (dip->di_aformat == XFS_DINODE_FMT_EXTENTS && + xfs_dfork_attr_extents(dip) == 0 && + ri->attr_extents > 0 && + bmdr_minsz > XFS_DFORK_ASIZE(dip, sc->mp)) { + if (dip->di_format == XFS_DINODE_FMT_BTREE) { + /* + * If the data fork is in btree format then we can't + * adjust forkoff because that runs the risk of + * violating the extents/btree format transition rules. + */ + } else if (bmdr_minsz + dfork_min > lit_sz) { + /* + * If we can't move the attr fork, too bad, we lose the + * attr fork and leak its blocks. + */ + xrep_dinode_zap_afork(ri, dip, mode); + } else { + /* + * Otherwise, just slide the attr fork down. 
The attr + * fork is empty, so we don't have any old contents to + * move here. + */ + dip->di_forkoff = (lit_sz - bmdr_minsz) >> 3; + } + } +} + +/* + * Zap the data/attr forks if we spot anything that isn't going to pass the + * ifork verifiers or the ifork formatters, because we need to get the inode + * into good enough shape that the higher level repair functions can run. + */ +STATIC void +xrep_dinode_zap_forks( + struct xrep_inode *ri, + struct xfs_dinode *dip) +{ + struct xfs_scrub *sc = ri->sc; + xfs_extnum_t data_extents; + xfs_extnum_t attr_extents; + xfs_filblks_t nblocks; + uint16_t mode; + bool zap_datafork = false; + bool zap_attrfork = false; + + trace_xrep_dinode_zap_forks(sc, dip); + + mode = be16_to_cpu(dip->di_mode); + + data_extents = xfs_dfork_data_extents(dip); + attr_extents = xfs_dfork_attr_extents(dip); + nblocks = be64_to_cpu(dip->di_nblocks); + + /* Inode counters don't make sense? */ + if (data_extents > nblocks) + zap_datafork = true; + if (attr_extents > nblocks) + zap_attrfork = true; + if (data_extents + attr_extents > nblocks) + zap_datafork = zap_attrfork = true; + + if (!zap_datafork) + zap_datafork = xrep_dinode_check_dfork(sc, dip, mode); + if (!zap_attrfork) + zap_attrfork = xrep_dinode_check_afork(sc, dip); + + /* Zap whatever's bad. */ + if (zap_attrfork) + xrep_dinode_zap_afork(ri, dip, mode); + if (zap_datafork) + xrep_dinode_zap_dfork(ri, dip, mode); + xrep_dinode_ensure_forkoff(ri, dip, mode); + dip->di_nblocks = 0; + if (!zap_attrfork) + be64_add_cpu(&dip->di_nblocks, ri->attr_blocks); + if (!zap_datafork) { + be64_add_cpu(&dip->di_nblocks, ri->data_blocks); + be64_add_cpu(&dip->di_nblocks, ri->rt_blocks); + } +} + /* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */ STATIC int xrep_dinode_core( @@ -385,6 +1071,11 @@ xrep_dinode_core( xfs_ino_t ino = sc->sm->sm_ino; int error; + /* Figure out what this inode had mapped in both forks. 
*/ + error = xrep_dinode_count_rmaps(ri); + if (error) + return error; + /* Read the inode cluster buffer. */ error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp, ri->imap.im_blkno, ri->imap.im_len, XBF_UNMAPPED, &bp, @@ -400,9 +1091,10 @@ xrep_dinode_core( dip = xfs_buf_offset(bp, ri->imap.im_boffset); xrep_dinode_header(sc, dip); xrep_dinode_mode(sc, dip); - xrep_dinode_flags(sc, dip); + xrep_dinode_flags(sc, dip, ri->rt_extents > 0); xrep_dinode_size(sc, dip); xrep_dinode_extsize_hints(sc, dip); + xrep_dinode_zap_forks(ri, dip); /* Write out the inode. */ trace_xrep_dinode_fixed(sc, dip); diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 4ab1e6c3e36bc..75f0d57088b29 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1469,6 +1469,10 @@ DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_extsize_hints); DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_symlink); DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_dir); DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_fixed); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_forks); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_dfork); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_afork); +DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_ensure_forkoff); DECLARE_EVENT_CLASS(xrep_inode_class, TP_PROTO(struct xfs_scrub *sc), @@ -1522,6 +1526,44 @@ DEFINE_REPAIR_INODE_EVENT(xrep_inode_sfdir_size); DEFINE_REPAIR_INODE_EVENT(xrep_inode_size); DEFINE_REPAIR_INODE_EVENT(xrep_inode_fixed); +TRACE_EVENT(xrep_dinode_count_rmaps, + TP_PROTO(struct xfs_scrub *sc, xfs_rfsblock_t data_blocks, + xfs_rfsblock_t rt_blocks, xfs_rfsblock_t attr_blocks, + xfs_extnum_t data_extents, xfs_extnum_t rt_extents, + xfs_aextnum_t attr_extents), + TP_ARGS(sc, data_blocks, rt_blocks, attr_blocks, data_extents, + rt_extents, attr_extents), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_rfsblock_t, data_blocks) + __field(xfs_rfsblock_t, rt_blocks) + __field(xfs_rfsblock_t, attr_blocks) + __field(xfs_extnum_t, data_extents) + 
__field(xfs_extnum_t, rt_extents) + __field(xfs_aextnum_t, attr_extents) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->ino = sc->sm->sm_ino; + __entry->data_blocks = data_blocks; + __entry->rt_blocks = rt_blocks; + __entry->attr_blocks = attr_blocks; + __entry->data_extents = data_extents; + __entry->rt_extents = rt_extents; + __entry->attr_extents = attr_extents; + ), + TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %llu rtextents %llu aextents %u", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->data_blocks, + __entry->rt_blocks, + __entry->attr_blocks, + __entry->data_extents, + __entry->rt_extents, + __entry->attr_extents) +); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */
* Re: [PATCH 4/7] xfs: zap broken inode forks 2023-11-24 23:52 ` [PATCH 4/7] xfs: zap broken inode forks Darrick J. Wong @ 2023-11-30 4:44 ` Christoph Hellwig 2023-11-30 21:08 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 4:44 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs > +/* Verify the consistency of an inline attribute fork. */ > +xfs_failaddr_t > +xfs_attr_shortform_verify( > + struct xfs_inode *ip) > +{ > + struct xfs_attr_shortform *sfp; > + struct xfs_ifork *ifp; > + int64_t size; > + > + ASSERT(ip->i_af.if_format == XFS_DINODE_FMT_LOCAL); > + ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK); > + sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data; > + size = ifp->if_bytes; > + > + return xfs_attr_shortform_verify_struct(sfp, size); Given that xfs_attr_shortform_verify only has a single caller in the kernel and no extra ones in xfsprogs, I'd just change the calling convention to pass the xfs_attr_shortform structure and size there and not bother with the wrapper. > +/* Check that an inode's extent does not have invalid flags or bad ranges. */ > +xfs_failaddr_t > +xfs_bmap_validate_extent( > + struct xfs_inode *ip, > + int whichfork, > + struct xfs_bmbt_irec *irec) > +{ > + return xfs_bmap_validate_extent_raw(ip->i_mount, > + XFS_IS_REALTIME_INODE(ip), whichfork, irec); > +} .. while this one has a bunch of callers, so I expect it's actually somewhat useful. > +extern xfs_failaddr_t xfs_dir2_sf_verify_struct(struct xfs_mount *mp, > + struct xfs_dir2_sf_hdr *sfp, int64_t size); It would be nice if we didn't add more pointless externs to function declarations in headers. 
> +xfs_failaddr_t > +xfs_dir2_sf_verify( > + struct xfs_inode *ip) > +{ > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); > + struct xfs_dir2_sf_hdr *sfp; > + > + ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); > + > + sfp = (struct xfs_dir2_sf_hdr *)ifp->if_u1.if_data; > + return xfs_dir2_sf_verify_struct(mp, sfp, ifp->if_bytes); > +} This one also only has a single caller in the kernel and user space combined, so I wouldn't bother with the wrapper. > +xfs_failaddr_t > +xfs_symlink_shortform_verify( > + struct xfs_inode *ip) > +{ > + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); > + > + ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); > + > + return xfs_symlink_sf_verify_struct(ifp->if_u1.if_data, ifp->if_bytes); > +} Same here. Once past these nitpicks the zapping functionality looks fine to me, but leaves me with a very high-level question: As far as I can tell the inodes with the zapped fork(s) remain in their normal place, normally accessible, and I think any read will return zeroes because i_size isn't reset. Which would change the data seen by an application using it. Don't we need to block access to it until it is fully repaired?
* Re: [PATCH 4/7] xfs: zap broken inode forks 2023-11-30 4:44 ` Christoph Hellwig @ 2023-11-30 21:08 ` Darrick J. Wong 2023-12-04 4:39 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-30 21:08 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Wed, Nov 29, 2023 at 08:44:59PM -0800, Christoph Hellwig wrote: > > +/* Verify the consistency of an inline attribute fork. */ > > +xfs_failaddr_t > > +xfs_attr_shortform_verify( > > + struct xfs_inode *ip) > > +{ > > + struct xfs_attr_shortform *sfp; > > + struct xfs_ifork *ifp; > > + int64_t size; > > + > > + ASSERT(ip->i_af.if_format == XFS_DINODE_FMT_LOCAL); > > + ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK); > > + sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data; > > + size = ifp->if_bytes; > > + > > + return xfs_attr_shortform_verify_struct(sfp, size); > > Given that xfs_attr_shortform_verify only has a single caller in the > kernel and no extra n in xfsprogs I'd just change the calling > convention to pass the xfs_attr_shortform structure and size there > and not bother with the wrapper. Ok. > > +/* Check that an inode's extent does not have invalid flags or bad ranges. */ > > +xfs_failaddr_t > > +xfs_bmap_validate_extent( > > + struct xfs_inode *ip, > > + int whichfork, > > + struct xfs_bmbt_irec *irec) > > +{ > > + return xfs_bmap_validate_extent_raw(ip->i_mount, > > + XFS_IS_REALTIME_INODE(ip), whichfork, irec); > > +} > > .. while this one has a bunch of caller so I expect it's actually > somewhat useful. Yep. :) > > +extern xfs_failaddr_t xfs_dir2_sf_verify_struct(struct xfs_mount *mp, > > + struct xfs_dir2_sf_hdr *sfp, int64_t size); > > It would be nice if we didn't add more pointless externs to function > declarations in heders. I'll get rid of the extern. 
> > +xfs_failaddr_t > > +xfs_dir2_sf_verify( > > + struct xfs_inode *ip) > > +{ > > + struct xfs_mount *mp = ip->i_mount; > > + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); > > + struct xfs_dir2_sf_hdr *sfp; > > + > > + ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); > > + > > + sfp = (struct xfs_dir2_sf_hdr *)ifp->if_u1.if_data; > > + return xfs_dir2_sf_verify_struct(mp, sfp, ifp->if_bytes); > > +} > > This one also only has a single caller in the kernel and user space > combined, so I wouldn't bother with the wrapper. <nod> > > +xfs_failaddr_t > > +xfs_symlink_shortform_verify( > > + struct xfs_inode *ip) > > +{ > > + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); > > + > > + ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); > > + > > + return xfs_symlink_sf_verify_struct(ifp->if_u1.if_data, ifp->if_bytes); > > +} > > Same here. Fixed. > Once past these nitpicks the zapping functionality looks fine to me, > but leaves me with a very high-level question: > > As far as I can tell the inodes with the zapped fork(s) remain in their > normal place, normally accessible, and I think any read will return > zeroes because i_size isn't reset. Which would change the data seen > by an application using it. Don't we need to block access to it until > it is fully repaired? Ideally, yes, we ought to do that. It's tricky to do this, however, because i_rwsem doesn't exist until iget succeeds, and we're doing surgery on the dinode buffer to get it into good enough shape that iget will work. Unfortunately for me, the usual locking order is i_rwsem -> tx freeze protection -> ILOCK. Lockdep will not be happy if I try to grab i_rwsem from within a transaction. Hence the current repair code commits the dinode cleaning function before it tries to iget the inode. But. trylock exists. Looking at that code again, the inode scrubber sets us up with the AGI buffer if it can't iget the inode. 
Repairs to the dinode core acquire the inode cluster buffer, which means that nobody else can be calling iget. So I think we can grab the inode in the same transaction as the inode core repairs. Nobody else should even be able to see that inode, so it should be safe to grab i_rwsem before committing the transaction. Even if I have to use trylock in a loop to make lockdep happy. I'll try that out and get back to you. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/7] xfs: zap broken inode forks 2023-11-30 21:08 ` Darrick J. Wong @ 2023-12-04 4:39 ` Christoph Hellwig 2023-12-04 20:43 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-12-04 4:39 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Thu, Nov 30, 2023 at 01:08:58PM -0800, Darrick J. Wong wrote: > So I think we can grab the inode in the same transaction as the inode > core repairs. Nobody else should even be able to see that inode, so it > should be safe to grab i_rwsem before committing the transaction. Even > if I have to use trylock in a loop to make lockdep happy. Hmm, I thought more of an inode flag that makes access to the inode outside of the scrubber return -EIO. I can also warm up to the idea of having all inodes that are broken in some way in lost+found.. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/7] xfs: zap broken inode forks 2023-12-04 4:39 ` Christoph Hellwig @ 2023-12-04 20:43 ` Darrick J. Wong 2023-12-05 4:28 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-12-04 20:43 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Sun, Dec 03, 2023 at 08:39:57PM -0800, Christoph Hellwig wrote: > On Thu, Nov 30, 2023 at 01:08:58PM -0800, Darrick J. Wong wrote: > > So I think we can grab the inode in the same transaction as the inode > > core repairs. Nobody else should even be able to see that inode, so it > > should be safe to grab i_rwsem before committing the transaction. Even > > if I have to use trylock in a loop to make lockdep happy. > > Hmm, I thought more of an inode flag that makes access to the inode > outside of the scrubber return -EIO. I can also warm up to the idea of > having all inodes that are broken in some way in lost+found.. Moving things around in the directory tree might be worse, since we'd now have to read the parent pointer(s) from the file to remove those directory connections and add the new ones to lost+found. I /think/ scouring around in a zapped data fork for a directory access will return EFSCORRUPTED anyway, though that might occur at a late enough stage in the process that the fs goes down, which isn't desirable. However, once xrep_inode massages the ondisk inode into good enough shape that iget starts working again, I could set XFS_SICK_INO_BMBTD (and XFS_SICK_INO_DIR as appropriate) after zapping the data fork so that the directory accesses would return EFSCORRUPTED instead of scouring around in the zapped fork. Once we start persisting the sick flags, the prevention will last until scrub or someone comes along to fix the inode, instead of being a purely incore flag. But, babysteps for now. I'll fix this patch to set the XFS_SICK_INO_* flags after zapping things, and the predicates to pick them up. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 4/7] xfs: zap broken inode forks 2023-12-04 20:43 ` Darrick J. Wong @ 2023-12-05 4:28 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-12-05 4:28 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Mon, Dec 04, 2023 at 12:43:51PM -0800, Darrick J. Wong wrote: > Moving things around in the directory tree might be worse, since we'd > now have to read the parent pointer(s) from the file to remove those > directory connections and add the new ones to lost+found. True. > I /think/ scouring around in a zapped data fork for a directory access > will return EFSCORRUPTED anyway, though that might occur at a late > enough stage in the process that the fs goes down, which isn't > desirable. > > However, once xrep_inode massages the ondisk inode into good enough > shape that iget starts working again, I could set XFS_SICK_INO_BMBTD (and > XFS_SICK_INO_DIR as appropriate) after zapping the data fork so that the > directory accesses would return EFSCORRUPTED instead of scouring around > in the zapped fork. > > Once we start persisting the sick flags, the prevention will last until > scrub or someone comes along to fix the inode, instead of being a purely > incore flag. But, babysteps for now. I'll fix this patch to set the > XFS_SICK_INO_* flags after zapping things, and the predicates to pick > them up. Sounds good. ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong ` (3 preceding siblings ...) 2023-11-24 23:52 ` [PATCH 4/7] xfs: zap broken inode forks Darrick J. Wong @ 2023-11-24 23:52 ` Darrick J. Wong 2023-11-30 4:47 ` Christoph Hellwig 2023-11-24 23:52 ` [PATCH 6/7] xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped Darrick J. Wong 2023-11-24 23:52 ` [PATCH 7/7] xfs: repair obviously broken inode modes Darrick J. Wong 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:52 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> In the previous patch, we added some code to perform sufficient repairs to an ondisk inode record such that the inode cache would be willing to load the inode. If the broken inode was a shortform directory, it will reset the directory to something plausible, which is to say an empty subdirectory of the root. The telltale sign that something is seriously wrong is the broken link count. Such directories look clean, but they shouldn't participate in a filesystem scan to find or confirm a directory parent pointer. Create a predicate that identifies such directories and abort the scrub. Found by fuzzing xfs/1554 with multithreaded xfs_scrub enabled and u3.bmx[0].startblock = zeroes. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> --- fs/xfs/scrub/common.c | 1 + fs/xfs/scrub/common.h | 2 ++ fs/xfs/scrub/dir.c | 21 +++++++++++++++++++++ fs/xfs/scrub/parent.c | 17 +++++++++++++++++ 4 files changed, 41 insertions(+) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 9b7d7010495b9..67ed4c55a27e3 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -26,6 +26,7 @@ #include "xfs_trans_priv.h" #include "xfs_da_format.h" #include "xfs_da_btree.h" +#include "xfs_dir2_priv.h" #include "xfs_attr.h" #include "xfs_reflink.h" #include "xfs_ag.h" diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 895918565df26..506b808b9fbb3 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -192,6 +192,8 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm) XFS_SCRUB_OFLAG_XCORRUPT); } +bool xchk_dir_looks_zapped(struct xfs_inode *dp); + #ifdef CONFIG_XFS_ONLINE_REPAIR /* Decide if a repair is required. */ static inline bool xchk_needs_repair(const struct xfs_scrub_metadata *sm) diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 0b491784b7594..acae43d20f387 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -788,3 +788,24 @@ xchk_directory( error = 0; return error; } + +/* + * Decide if this directory has been zapped to satisfy the inode and ifork + * verifiers. Checking and repairing should be postponed until the directory + * is fixed. + */ +bool +xchk_dir_looks_zapped( + struct xfs_inode *dp) +{ + /* + * If the dinode repair found a bad data fork, it will reset the fork + * to extents format with zero records and wait for the bmapbtd + * scrubber to reconstruct the block mappings. Directories always + * contain some content, so this is a clear sign of a zapped directory. 
+ */ + if (dp->i_df.if_format == XFS_DINODE_FMT_EXTENTS) + return dp->i_df.if_nextents == 0; + + return false; +} diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index e6155d86f7916..7db8736721461 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -156,6 +156,16 @@ xchk_parent_validate( goto out_rele; } + /* + * We cannot yet validate this parent pointer if the directory looks as + * though it has been zapped by the inode record repair code. + */ + if (xchk_dir_looks_zapped(dp)) { + error = -EBUSY; + xchk_set_incomplete(sc); + goto out_unlock; + } + /* Look for a directory entry in the parent pointing to the child. */ error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) @@ -217,6 +227,13 @@ xchk_parent( */ error = xchk_parent_validate(sc, parent_ino); } while (error == -EAGAIN); + if (error == -EBUSY) { + /* + * We could not scan a directory, so we marked the check + * incomplete. No further error return is necessary. + */ + return 0; + } return error; } ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory 2023-11-24 23:52 ` [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong @ 2023-11-30 4:47 ` Christoph Hellwig 2023-11-30 21:37 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 4:47 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:52:23PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > In the previous patch, we added some code to perform sufficient repairs > to an ondisk inode record such that the inode cache would be willing to > load the inode. This is now a few commits back. Maybe adjust this to be less specific. > If the broken inode was a shortform directory, it will > reset the directory to something plausible, which is to say an empty > subdirectory of the root. The telltale sign that something is > seriously wrong is the broken link count. > > Such directories look clean, but they shouldn't participate in a > filesystem scan to find or confirm a directory parent pointer. Create a > predicate that identifies such directories and abort the scrub. > > Found by fuzzing xfs/1554 with multithreaded xfs_scrub enabled and > u3.bmx[0].startblock = zeroes. This kind of ties into my comment on the previous patch, but needing heuristics to find zapped inodes or inode forks just seems to be asking for trouble. I suspect we'll need proper on-disk flags to notice the corrupted / half-rebuilt state. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory 2023-11-30 4:47 ` Christoph Hellwig @ 2023-11-30 21:37 ` Darrick J. Wong 2023-12-04 4:41 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-30 21:37 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Wed, Nov 29, 2023 at 08:47:34PM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:52:23PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > In the previous patch, we added some code to perform sufficient repairs > > to an ondisk inode record such that the inode cache would be willing to > > load the inode. > > This is now a few commits back. My adjust this to be less specific. > > > If the broken inode was a shortform directory, it will > > reset the directory to something plausible, which is to say an empty > > subdirectory of the root. The telltale signs that something is > > seriously wrong is the broken link count. > > > > Such directories look clean, but they shouldn't participate in a > > filesystem scan to find or confirm a directory parent pointer. Create a > > predicate that identifies such directories and abort the scrub. > > > > Found by fuzzing xfs/1554 with multithreaded xfs_scrub enabled and > > u3.bmx[0].startblock = zeroes. > > This kind of ties into my comment on the previous comment, but needing > heuristics to find zapped inodes or inode forks just seems to be asking > for trouble. I suspect we'll need proper on-disk flags to notice the > corrupted / half-rebuilt state. Hmm. A single "zapped" bit would be a good way to signal to xchk_dir_looks_zapped and xchk_bmap_want_check_rmaps that a file is probably broken. Clearing that bit would be harder though -- userspace would have to call back into the kernel after checking all the metadata. A simpler way might be to persist the entire per-inode sick state (both forks and the contents within, for three bits). 
That would be more to track, but each scrubber could clear its corresponding sick-state bit. A bit further on in this series is a big patchset to set the sick state every time the hot paths encounter an EFSCORRUPTED. IO operations could check the sick state bit and fail out to userspace, which would solve the problem of keeping programs away from a partially fixed file. The ondisk state tracking seems like an entire project on its own. Thoughts? --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory 2023-11-30 21:37 ` Darrick J. Wong @ 2023-12-04 4:41 ` Christoph Hellwig 2023-12-04 20:44 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-12-04 4:41 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Thu, Nov 30, 2023 at 01:37:09PM -0800, Darrick J. Wong wrote: > Hmm. A single "zapped" bit would be a good way to signal to > xchk_dir_looks_zapped and xchk_bmap_want_check_rmaps that a file is > probably broken. Clearing that bit would be harder though -- userspace > would have to call back into the kernel after checking all the metadata. Doesn't sound too horrible to have a special scrub call just for that. > A simpler way might be to persist the entire per-inode sick state (both > forks and the contents within, for three bits). That would be more to > track, but each scrubber could clear its corresponding sick-state bit. > A bit further on in this series is a big patchset to set the sick state > every time the hot paths encounter an EFSCORRUPTED. That does sound even better. > IO operations could check the sick state bit and fail out to userspace, > which would solve the problem of keeping programs away from a partially > fixed file. > > The ondisk state tracking like an entire project on its own. Thoughts? Incore for now sounds fine to me. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory 2023-12-04 4:41 ` Christoph Hellwig @ 2023-12-04 20:44 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-12-04 20:44 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Sun, Dec 03, 2023 at 08:41:09PM -0800, Christoph Hellwig wrote: > On Thu, Nov 30, 2023 at 01:37:09PM -0800, Darrick J. Wong wrote: > > Hmm. A single "zapped" bit would be a good way to signal to > > xchk_dir_looks_zapped and xchk_bmap_want_check_rmaps that a file is > > probably broken. Clearing that bit would be harder though -- userspace > > would have to call back into the kernel after checking all the metadata. > > Doesn't sound too horrible to have a special scrub call just for that. > > > A simpler way might be to persist the entire per-inode sick state (both > > forks and the contents within, for three bits). That would be more to > > track, but each scrubber could clear its corresponding sick-state bit. > > A bit further on in this series is a big patchset to set the sick state > > every time the hot paths encounter an EFSCORRUPTED. > > That does sound even better. > > > IO operations could check the sick state bit and fail out to userspace, > > which would solve the problem of keeping programs away from a partially > > fixed file. > > > > The ondisk state tracking like an entire project on its own. Thoughts? > > Incore for now sounds fine to me. Excellent! I'll go work on that for v28.2 or v29 or whatever the next version number is. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 6/7] xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong ` (4 preceding siblings ...) 2023-11-24 23:52 ` [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong @ 2023-11-24 23:52 ` Darrick J. Wong 2023-11-24 23:52 ` [PATCH 7/7] xfs: repair obviously broken inode modes Darrick J. Wong 6 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:52 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> The attribute fork scrubber can optionally scan the reverse mapping records of the filesystem to determine if the fork is missing mappings that it should have. However, this is a very expensive operation, so we only want to do this if we suspect that the fork is missing records. For attribute forks, the criterion for suspicion is that the attr fork is in EXTENTS format and has zero extents. However, there are several ways that a file can end up in this state through regular filesystem usage. For example, an LSM can set an s_security hook but then decide not to set an ACL; or an attr set can create the attr fork but then the actual set operation fails with ENOSPC; or we can delete all the attrs on a file whose data fork is in btree format, in which case we do not delete the attr fork. We don't want to run the expensive check for any case that can be arrived at through regular operations. However, when online inode repair decides to zap an attribute fork, it cannot determine if it is zapping ACL information. As a precaution it removes all the discretionary access control permissions and sets the user and group ids to zero. Check these three additional conditions to decide if we want to scan the rmap records. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> --- fs/xfs/scrub/bmap.c | 44 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 37 insertions(+), 7 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index f74bd2a97c7f7..c12ccc9141163 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -662,16 +662,46 @@ xchk_bmap_want_check_rmaps( * The inode repair code zaps broken inode forks by resetting them back * to EXTENTS format and zero extent records. If we encounter a fork * in this state along with evidence that the fork isn't supposed to be - * empty, we need to scan the reverse mappings to decide if we're going - * to rebuild the fork. Data forks with nonzero file size are scanned. - * xattr forks are never empty of content, so they are always scanned. + * empty, we might want to scan the reverse mappings to decide if we're + * going to rebuild the fork. */ ifp = xfs_ifork_ptr(sc->ip, info->whichfork); if (ifp->if_format == XFS_DINODE_FMT_EXTENTS && ifp->if_nextents == 0) { - if (info->whichfork == XFS_DATA_FORK && - i_size_read(VFS_I(sc->ip)) == 0) - return false; - + switch (info->whichfork) { + case XFS_DATA_FORK: + /* + * Data forks with zero file size are presumed not to + * have any written data blocks. Skip the scan. + */ + if (i_size_read(VFS_I(sc->ip)) == 0) + return false; + break; + case XFS_ATTR_FORK: + /* + * Files can have an attr fork in EXTENTS format with + * zero records for several reasons: + * + * a) an attr set created a fork but ran out of space + * b) attr replace deleted an old attr but failed + * during the set step + * c) the data fork was in btree format when all attrs + * were deleted, so the fork was left in place + * d) the inode repair code zapped the fork + * + * Only in case (d) do we want to scan the rmapbt to + * see if we need to rebuild the attr fork. The fork + * zap code clears all DAC permission bits and zeroes + * the uid and gid, so avoid the scan if any of those + * three conditions are not met. 
+ */ + if ((VFS_I(sc->ip)->i_mode & 0777) != 0) + return false; + if (!uid_eq(VFS_I(sc->ip)->i_uid, GLOBAL_ROOT_UID)) + return false; + if (!gid_eq(VFS_I(sc->ip)->i_gid, GLOBAL_ROOT_GID)) + return false; + break; + } return true; } ^ permalink raw reply related [flat|nested] 156+ messages in thread
* [PATCH 7/7] xfs: repair obviously broken inode modes 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong ` (5 preceding siblings ...) 2023-11-24 23:52 ` [PATCH 6/7] xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped Darrick J. Wong @ 2023-11-24 23:52 ` Darrick J. Wong 2023-11-30 4:49 ` Christoph Hellwig 6 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:52 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Building off the rmap scanner that we added in the previous patch, we can now find block 0 and try to use the information contained inside of it to guess the mode of an inode if it's totally improper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/inode_repair.c | 181 +++++++++++++++++++++++++++++++++++++++++-- fs/xfs/scrub/trace.h | 11 ++- 2 files changed, 179 insertions(+), 13 deletions(-) diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c index a73205702ffa5..96114ed889707 100644 --- a/fs/xfs/scrub/inode_repair.c +++ b/fs/xfs/scrub/inode_repair.c @@ -56,10 +56,13 @@ * fix things on live incore inodes. The inode repair functions make decisions * with security and usability implications when reviving a file: * - * - Files with zero di_mode or a garbage di_mode are converted to regular file - * that only root can read. This file may not actually contain user data, - * if the file was not previously a regular file. Setuid and setgid bits - * are cleared. + * - Files with zero di_mode or a garbage di_mode are converted to a file + * that only root can read. If the immediate data fork area or block 0 of + * the data fork look like a directory, the file type will be set to a + * directory. If the immediate data fork area has no nulls, it will be + * turned into a symbolic link. Otherwise, it is turned into a regular file. 
+ * This file may not actually contain user data, if the file was not + * previously a regular file. Setuid and setgid bits are cleared. * * - Zero-size directories can be truncated to look empty. It is necessary to * run the bmapbtd and directory repair functions to fully rebuild the @@ -107,6 +110,9 @@ struct xrep_inode { /* Blocks in use by the attr fork. */ xfs_rfsblock_t attr_blocks; + /* Physical block containing data block 0. */ + xfs_fsblock_t block0; + /* Number of data device extents for the data fork. */ xfs_extnum_t data_extents; @@ -146,6 +152,7 @@ xrep_setup_inode( ri = sc->buf; memcpy(&ri->imap, imap, sizeof(struct xfs_imap)); ri->sc = sc; + ri->block0 = NULLFSBLOCK; return 0; } @@ -221,12 +228,159 @@ xrep_dinode_header( dip->di_gen = cpu_to_be32(sc->sm->sm_gen); } +/* Parse enough of the directory block header to guess if this is a dir. */ +static inline bool +xrep_dinode_is_dir( + xfs_ino_t ino, + xfs_daddr_t daddr, + struct xfs_buf *bp) +{ + struct xfs_dir3_blk_hdr *hdr3 = bp->b_addr; + struct xfs_dir2_data_free *bf; + struct xfs_mount *mp = bp->b_mount; + xfs_lsn_t lsn = be64_to_cpu(hdr3->lsn); + + /* Does the dir3 header match the filesystem? */ + if (hdr3->magic != cpu_to_be32(XFS_DIR3_BLOCK_MAGIC) && + hdr3->magic != cpu_to_be32(XFS_DIR3_DATA_MAGIC)) + return false; + + if (be64_to_cpu(hdr3->owner) != ino) + return false; + + if (!uuid_equal(&hdr3->uuid, &mp->m_sb.sb_meta_uuid)) + return false; + + if (be64_to_cpu(hdr3->blkno) != daddr) + return false; + + /* Directory blocks are always logged and must have a valid LSN. */ + if (lsn == NULLCOMMITLSN) + return false; + if (!xlog_valid_lsn(mp->m_log, lsn)) + return false; + + /* + * bestfree information lives immediately after the end of the header, + * so we won't run off the end of the buffer. 
+ */ + bf = xfs_dir2_data_bestfree_p(mp, bp->b_addr); + if (!bf[0].length && bf[0].offset) + return false; + if (!bf[1].length && bf[1].offset) + return false; + if (!bf[2].length && bf[2].offset) + return false; + + if (be16_to_cpu(bf[0].length) < be16_to_cpu(bf[1].length)) + return false; + if (be16_to_cpu(bf[1].length) < be16_to_cpu(bf[2].length)) + return false; + + return true; +} + +/* Guess the mode of this file from the contents. */ +STATIC uint16_t +xrep_dinode_guess_mode( + struct xrep_inode *ri, + struct xfs_dinode *dip) +{ + struct xfs_buf *bp; + struct xfs_mount *mp = ri->sc->mp; + xfs_daddr_t daddr; + uint64_t fsize = be64_to_cpu(dip->di_size); + unsigned int dfork_sz = XFS_DFORK_DSIZE(dip, mp); + uint16_t mode = S_IFREG; + int error; + + switch (dip->di_format) { + case XFS_DINODE_FMT_LOCAL: + /* + * If the data fork is local format, the size of the data area + * is reasonable and is big enough to contain the entire file, + * we can guess the file type from the local data. + * + * If there are no nulls, guess this is a symbolic link. + * Otherwise, this is probably a shortform directory. + */ + if (dfork_sz <= XFS_LITINO(mp) && dfork_sz >= fsize) { + if (!memchr(XFS_DFORK_DPTR(dip), 0, fsize)) + return S_IFLNK; + return S_IFDIR; + } + + /* By default, we guess regular file. */ + return S_IFREG; + case XFS_DINODE_FMT_DEV: + /* + * If the data fork is dev format, the size of the data area is + * reasonable and large enough to store a dev_t, and the file + * size is zero, this could be a blockdev, a chardev, a fifo, + * or a socket. There is no solid way to distinguish between + * those choices, so we guess blockdev if the device number is + * nonzero and chardev if it's zero (aka whiteout). + */ + if (dfork_sz <= XFS_LITINO(mp) && + dfork_sz >= sizeof(__be32) && fsize == 0) { + xfs_dev_t dev = xfs_dinode_get_rdev(dip); + + return dev != 0 ? S_IFBLK : S_IFCHR; + } + + /* By default, we guess regular file. 
*/ + return S_IFREG; + case XFS_DINODE_FMT_EXTENTS: + case XFS_DINODE_FMT_BTREE: + /* There are data blocks to examine below. */ + break; + default: + /* Everything else is considered a regular file. */ + return S_IFREG; + } + + /* There are no zero-length directories. */ + if (fsize == 0) + return S_IFREG; + + /* + * If we didn't find a written mapping for file block zero, we'll guess + * that it's a sparse regular file. + */ + if (ri->block0 == NULLFSBLOCK) + return S_IFREG; + + /* Directories can't have rt extents. */ + if (ri->rt_extents > 0) + return S_IFREG; + + /* + * Read the first block of the file. Since we have no idea what kind + * of file geometry (e.g. dirblock size) we might be reading into, use + * an uncached buffer so that we don't pollute the buffer cache. We + * can't do uncached mapped buffers, so the best we can do is guess + * from the directory header. + */ + daddr = XFS_FSB_TO_DADDR(mp, ri->block0); + error = xfs_buf_read_uncached(mp->m_ddev_targp, daddr, + XFS_FSS_TO_BB(mp, 1), 0, &bp, NULL); + if (error) + return S_IFREG; + + if (xrep_dinode_is_dir(ri->sc->sm->sm_ino, daddr, bp)) + mode = S_IFDIR; + + xfs_buf_relse(bp); + return mode; +} + /* Turn di_mode into /something/ recognizable. 
*/ STATIC void xrep_dinode_mode( - struct xfs_scrub *sc, + struct xrep_inode *ri, struct xfs_dinode *dip) { + struct xfs_scrub *sc = ri->sc; uint16_t mode; trace_xrep_dinode_mode(sc, dip); @@ -236,7 +390,7 @@ xrep_dinode_mode( return; /* bad mode, so we set it to a file that only root can read */ - mode = S_IFREG; + mode = xrep_dinode_guess_mode(ri, dip); dip->di_mode = cpu_to_be16(mode); dip->di_uid = 0; dip->di_gid = 0; @@ -443,9 +597,17 @@ xrep_dinode_walk_rmap( } ri->data_blocks += rec->rm_blockcount; - if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK)) + if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK)) { ri->data_extents++; + if (rec->rm_offset == 0 && + !(rec->rm_flags & XFS_RMAP_UNWRITTEN)) { + if (ri->block0 != NULLFSBLOCK) + return -EFSCORRUPTED; + ri->block0 = rec->rm_startblock; + } + } + return 0; } @@ -496,7 +658,8 @@ xrep_dinode_count_rmaps( trace_xrep_dinode_count_rmaps(ri->sc, ri->data_blocks, ri->rt_blocks, ri->attr_blocks, - ri->data_extents, ri->rt_extents, ri->attr_extents); + ri->data_extents, ri->rt_extents, ri->attr_extents, + ri->block0); return 0; } @@ -1090,7 +1253,7 @@ xrep_dinode_core( /* Fix everything the verifier will complain about. 
*/ dip = xfs_buf_offset(bp, ri->imap.im_boffset); xrep_dinode_header(sc, dip); - xrep_dinode_mode(sc, dip); + xrep_dinode_mode(ri, dip); xrep_dinode_flags(sc, dip, ri->rt_extents > 0); xrep_dinode_size(sc, dip); xrep_dinode_extsize_hints(sc, dip); diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 75f0d57088b29..6cd5d04c0410c 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1530,9 +1530,9 @@ TRACE_EVENT(xrep_dinode_count_rmaps, TP_PROTO(struct xfs_scrub *sc, xfs_rfsblock_t data_blocks, xfs_rfsblock_t rt_blocks, xfs_rfsblock_t attr_blocks, xfs_extnum_t data_extents, xfs_extnum_t rt_extents, - xfs_aextnum_t attr_extents), + xfs_aextnum_t attr_extents, xfs_fsblock_t block0), TP_ARGS(sc, data_blocks, rt_blocks, attr_blocks, data_extents, - rt_extents, attr_extents), + rt_extents, attr_extents, block0), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_ino_t, ino) @@ -1542,6 +1542,7 @@ TRACE_EVENT(xrep_dinode_count_rmaps, __field(xfs_extnum_t, data_extents) __field(xfs_extnum_t, rt_extents) __field(xfs_aextnum_t, attr_extents) + __field(xfs_fsblock_t, block0) ), TP_fast_assign( __entry->dev = sc->mp->m_super->s_dev; @@ -1552,8 +1553,9 @@ TRACE_EVENT(xrep_dinode_count_rmaps, __entry->data_extents = data_extents; __entry->rt_extents = rt_extents; __entry->attr_extents = attr_extents; + __entry->block0 = block0; ), - TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %llu rtextents %llu aextents %u", + TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %llu rtextents %llu aextents %u startblock0 0x%llx", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->data_blocks, @@ -1561,7 +1563,8 @@ TRACE_EVENT(xrep_dinode_count_rmaps, __entry->attr_blocks, __entry->data_extents, __entry->rt_extents, - __entry->attr_extents) + __entry->attr_extents, + __entry->block0) ); #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ ^ permalink raw reply related [flat|nested] 156+ 
messages in thread
* Re: [PATCH 7/7] xfs: repair obviously broken inode modes 2023-11-24 23:52 ` [PATCH 7/7] xfs: repair obviously broken inode modes Darrick J. Wong @ 2023-11-30 4:49 ` Christoph Hellwig 2023-11-30 21:18 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 4:49 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:52:54PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Building off the rmap scanner that we added in the previous patch, we > can now find block 0 and try to use the information contained inside of > it to guess the mode of an inode if it's totally improper. Maybe I'm missing something important, but I don't see why a normal user couldn't construct a file that looks like an XFS directory, and that's a perfectly fine thing to do? ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 7/7] xfs: repair obviously broken inode modes 2023-11-30 4:49 ` Christoph Hellwig @ 2023-11-30 21:18 ` Darrick J. Wong 2023-12-04 4:42 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-30 21:18 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Wed, Nov 29, 2023 at 08:49:36PM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:52:54PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Building off the rmap scanner that we added in the previous patch, we > > can now find block 0 and try to use the information contained inside of > > it to guess the mode of an inode if it's totally improper. > > Maybe I'm missing something important, but I don't see why a normal > user couldn't construct a file that looks like an XFS directory, and > that's a perfectly fine thing to do? They could very well do that, and it might confuse the scanner. However, I'd like to draw your attention to xrep_dinode_mode, which will set the user/group to root with 0000 access mode. That at least will keep unprivileged users from seeing the potentially weird file until the higher level repair functions can deal with it (or the sysadmin deletes it). Hmm. That code really ought to zap the attr fork because there could be ACLs attached to the file. Let me go do that. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 7/7] xfs: repair obviously broken inode modes 2023-11-30 21:18 ` Darrick J. Wong @ 2023-12-04 4:42 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-12-04 4:42 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Thu, Nov 30, 2023 at 01:18:56PM -0800, Darrick J. Wong wrote: > > Maybe I'm missing something important, but I don't see why a normal > > user couldn't construct a file that looks like an XFS directory, and > > that's a perfectly fine thing to do? > > They could very well do that, and it might confuse the scanner. > However, I'd like to draw your attention to xrep_dinode_mode, which will > set the user/group to root with 0000 access mode. That at least will > keep unprivileged users from seeing the potentially weird file until the > higher level repair functions can deal with it (or the sysadmin deletes > it). Having a perfectly valid (but weird) file cause repair action just seems like a really bad idea. ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings 2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong ` (4 preceding siblings ...) 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong @ 2023-11-24 23:46 ` Darrick J. Wong 2023-11-24 23:53 ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong ` (4 more replies) 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong 7 siblings, 5 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:46 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, In this series, online repair gains the ability to rebuild data and attr fork mappings from the reverse mapping information. It is at this point where we reintroduce the ability to reap file extents. Repair of CoW forks is a little different -- on disk, CoW staging extents are owned by the refcount btree and cannot be mapped back to individual files. Hence we can only detect staging extents that don't quite look right (missing reverse mappings, shared staging extents) and replace them with fresh allocations. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-file-mappings fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-file-mappings --- fs/xfs/Makefile | 2 fs/xfs/libxfs/xfs_bmap_btree.c | 119 ++++- fs/xfs/libxfs/xfs_bmap_btree.h | 5 fs/xfs/libxfs/xfs_btree_staging.c | 11 fs/xfs/libxfs/xfs_btree_staging.h | 2 fs/xfs/libxfs/xfs_iext_tree.c | 23 + fs/xfs/libxfs/xfs_inode_fork.c | 1 fs/xfs/libxfs/xfs_inode_fork.h | 3 fs/xfs/libxfs/xfs_refcount.c | 41 ++ fs/xfs/libxfs/xfs_refcount.h | 10 fs/xfs/scrub/bitmap.h | 56 ++ fs/xfs/scrub/bmap.c | 18 + fs/xfs/scrub/bmap_repair.c | 846 +++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/common.h | 6 fs/xfs/scrub/cow_repair.c | 612 +++++++++++++++++++++++++++ fs/xfs/scrub/reap.c | 152 ++++++- fs/xfs/scrub/reap.h | 2 fs/xfs/scrub/repair.c | 50 ++ fs/xfs/scrub/repair.h | 11 fs/xfs/scrub/scrub.c | 20 - fs/xfs/scrub/trace.h | 118 +++++ fs/xfs/xfs_trans.c | 95 ++++ fs/xfs/xfs_trans.h | 4 23 files changed, 2155 insertions(+), 52 deletions(-) create mode 100644 fs/xfs/scrub/bmap_repair.c create mode 100644 fs/xfs/scrub/cow_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
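A core step in rebuilding a fork from reverse mappings (done later in this series by xrep_bmap_sort_records over an xfarray) is sorting the gathered records by file offset and treating any overlap as corruption. A minimal userspace sketch of that sort-and-verify step, with `struct mapping` and `sort_and_check` as invented stand-ins for the kernel types:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified bmap record: file offset + length, both in fs blocks. */
struct mapping {
	uint64_t startoff;
	uint64_t blockcount;
};

static int mapping_cmp(const void *a, const void *b)
{
	const struct mapping *ma = a, *mb = b;

	if (ma->startoff > mb->startoff)
		return 1;
	if (ma->startoff < mb->startoff)
		return -1;
	return 0;
}

/*
 * Sort by file offset, then walk the records once to ensure each one
 * starts at or after the end of the previous one.  An overlap means
 * the source data (here, the rmapbt) was inconsistent.
 */
static bool sort_and_check(struct mapping *recs, size_t nr)
{
	uint64_t next_off = 0;

	qsort(recs, nr, sizeof(*recs), mapping_cmp);
	for (size_t i = 0; i < nr; i++) {
		if (recs[i].startoff < next_off)
			return false;	/* overlapping mappings: corrupt */
		next_off = recs[i].startoff + recs[i].blockcount;
	}
	return true;
}
```

The kernel version returns -EFSCORRUPTED instead of false and must also check for pending fatal signals mid-walk, but the invariant it enforces is the same.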
* [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong @ 2023-11-24 23:53 ` Darrick J. Wong 2023-11-30 4:53 ` Christoph Hellwig 2023-11-24 23:53 ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:53 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Reintroduce to xrep_reap_extents the ability to reap extents from any AG. We dropped this before because it was buggy, but in the next patch we will gain the ability to reap old bmap btrees, which can have blocks in any AG. To do this, we require that sc->sa is uninitialized, so that we can use it to hold all the per-AG context for a given extent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/bitmap.h | 28 +++++++++++ fs/xfs/scrub/reap.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++-- fs/xfs/scrub/reap.h | 2 + fs/xfs/scrub/repair.h | 1 4 files changed, 147 insertions(+), 4 deletions(-) diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 9cdc41b6cb02d..1356a76710ede 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -121,4 +121,32 @@ int xagb_bitmap_set_btblocks(struct xagb_bitmap *bitmap, int xagb_bitmap_set_btcur_path(struct xagb_bitmap *bitmap, struct xfs_btree_cur *cur); +/* Bitmaps, but for type-checked for xfs_fsblock_t */ + +struct xfsb_bitmap { + struct xbitmap fsbitmap; +}; + +static inline void xfsb_bitmap_init(struct xfsb_bitmap *bitmap) +{ + xbitmap_init(&bitmap->fsbitmap); +} + +static inline void xfsb_bitmap_destroy(struct xfsb_bitmap *bitmap) +{ + xbitmap_destroy(&bitmap->fsbitmap); +} + +static inline int xfsb_bitmap_set(struct xfsb_bitmap *bitmap, + xfs_fsblock_t start, xfs_filblks_t len) +{ + return xbitmap_set(&bitmap->fsbitmap, start, len); +} 
+ +static inline int xfsb_bitmap_walk(struct xfsb_bitmap *bitmap, + xbitmap_walk_fn fn, void *priv) +{ + return xbitmap_walk(&bitmap->fsbitmap, fn, priv); +} + #endif /* __XFS_SCRUB_BITMAP_H__ */ diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c index c8c8e3f9bc7a4..35794df952bbe 100644 --- a/fs/xfs/scrub/reap.c +++ b/fs/xfs/scrub/reap.c @@ -74,10 +74,10 @@ * with only the same rmap owner but the block is not owned by something with * the same rmap owner, the block will be freed. * - * The caller is responsible for locking the AG headers for the entire rebuild - * operation so that nothing else can sneak in and change the AG state while - * we're not looking. We must also invalidate any buffers associated with - * @bitmap. + * The caller is responsible for locking the AG headers/inode for the entire + * rebuild operation so that nothing else can sneak in and change the incore + * state while we're not looking. We must also invalidate any buffers + * associated with @bitmap. */ /* Information about reaping extents after a repair. */ @@ -500,3 +500,115 @@ xrep_reap_agblocks( return 0; } + +/* + * Break a file metadata extent into sub-extents by fate (crosslinked, not + * crosslinked), and dispose of each sub-extent separately. The extent must + * not cross an AG boundary. + */ +STATIC int +xreap_fsmeta_extent( + uint64_t fsbno, + uint64_t len, + void *priv) +{ + struct xreap_state *rs = priv; + struct xfs_scrub *sc = rs->sc; + xfs_agnumber_t agno = XFS_FSB_TO_AGNO(sc->mp, fsbno); + xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno); + xfs_agblock_t agbno_next = agbno + len; + int error = 0; + + ASSERT(len <= XFS_MAX_BMBT_EXTLEN); + ASSERT(sc->ip != NULL); + ASSERT(!sc->sa.pag); + + /* + * We're reaping blocks after repairing file metadata, which means that + * we have to init the xchk_ag structure ourselves. 
+ */ + sc->sa.pag = xfs_perag_get(sc->mp, agno); + if (!sc->sa.pag) + return -EFSCORRUPTED; + + error = xfs_alloc_read_agf(sc->sa.pag, sc->tp, 0, &sc->sa.agf_bp); + if (error) + goto out_pag; + + while (agbno < agbno_next) { + xfs_extlen_t aglen; + bool crosslinked; + + error = xreap_agextent_select(rs, agbno, agbno_next, + &crosslinked, &aglen); + if (error) + goto out_agf; + + error = xreap_agextent_iter(rs, agbno, &aglen, crosslinked); + if (error) + goto out_agf; + + if (xreap_want_defer_finish(rs)) { + /* + * Holds the AGF buffer across the deferred chain + * processing. + */ + error = xrep_defer_finish(sc); + if (error) + goto out_agf; + xreap_defer_finish_reset(rs); + } else if (xreap_want_roll(rs)) { + /* + * Hold the AGF buffer across the transaction roll so + * that we don't have to reattach it to the scrub + * context. + */ + xfs_trans_bhold(sc->tp, sc->sa.agf_bp); + error = xfs_trans_roll_inode(&sc->tp, sc->ip); + xfs_trans_bjoin(sc->tp, sc->sa.agf_bp); + if (error) + goto out_agf; + xreap_reset(rs); + } + + agbno += aglen; + } + +out_agf: + xfs_trans_brelse(sc->tp, sc->sa.agf_bp); + sc->sa.agf_bp = NULL; +out_pag: + xfs_perag_put(sc->sa.pag); + sc->sa.pag = NULL; + return error; +} + +/* + * Dispose of every block of every fs metadata extent in the bitmap. + * Do not use this to dispose of the mappings in an ondisk inode fork. 
+ */ +int +xrep_reap_fsblocks( + struct xfs_scrub *sc, + struct xfsb_bitmap *bitmap, + const struct xfs_owner_info *oinfo) +{ + struct xreap_state rs = { + .sc = sc, + .oinfo = oinfo, + .resv = XFS_AG_RESV_NONE, + }; + int error; + + ASSERT(xfs_has_rmapbt(sc->mp)); + ASSERT(sc->ip != NULL); + + error = xfsb_bitmap_walk(bitmap, xreap_fsmeta_extent, &rs); + if (error) + return error; + + if (xreap_dirty(&rs)) + return xrep_defer_finish(sc); + + return 0; +} diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h index fe24626af1649..5e710be44b4b1 100644 --- a/fs/xfs/scrub/reap.h +++ b/fs/xfs/scrub/reap.h @@ -8,5 +8,7 @@ int xrep_reap_agblocks(struct xfs_scrub *sc, struct xagb_bitmap *bitmap, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type); +int xrep_reap_fsblocks(struct xfs_scrub *sc, struct xfsb_bitmap *bitmap, + const struct xfs_owner_info *oinfo); #endif /* __XFS_SCRUB_REAP_H__ */ diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 70a6b18e5ad3c..46bf841524f8f 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -48,6 +48,7 @@ xrep_trans_commit( struct xbitmap; struct xagb_bitmap; +struct xfsb_bitmap; int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink); ^ permalink raw reply related [flat|nested] 156+ messages in thread
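xreap_fsmeta_extent above begins by splitting each fsblock-addressed extent into an AG number and an AG-relative block via XFS_FSB_TO_AGNO and XFS_FSB_TO_AGBNO. Those conversions boil down to a shift and mask by the superblock's agblklog (log2 of the AG size in blocks, rounded up). A standalone sketch of the encoding — the helper names are invented and the agblklog value in the usage below is illustrative:

```c
#include <stdint.h>

/*
 * XFS packs an AG number and an AG-relative block number into a single
 * segmented "fsblock" address: agno lives in the high bits, agbno in
 * the low agblklog bits.
 */
static uint32_t fsb_to_agno(uint64_t fsbno, unsigned int agblklog)
{
	return (uint32_t)(fsbno >> agblklog);
}

static uint32_t fsb_to_agbno(uint64_t fsbno, unsigned int agblklog)
{
	return (uint32_t)(fsbno & ((1ULL << agblklog) - 1));
}

/* Inverse: rebuild the fsblock address from its two components. */
static uint64_t agb_to_fsb(uint32_t agno, uint32_t agbno,
			   unsigned int agblklog)
{
	return ((uint64_t)agno << agblklog) | agbno;
}
```

This segmented addressing is why reaping can process a bitmap of fsblocks one AG at a time: consecutive addresses within one AG share the same high bits, and an extent that would cross an AG boundary is rejected up front.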
* Re: [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents 2023-11-24 23:53 ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong @ 2023-11-30 4:53 ` Christoph Hellwig 2023-11-30 21:48 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 4:53 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:53:09PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Reintroduce to xrep_reap_extents the ability to reap extents from any > AG. We dropped this before because it was buggy, but in the next patch > we will gain the ability to reap old bmap btrees, which can have blocks > in any AG. To do this, we require that sc->sa is uninitialized, so that > we can use it to hold all the per-AG context for a given extent. Can you expand a bit on why it was buggy, in what commit it was dropped, and what we're doing better this time around? > > #endif /* __XFS_SCRUB_REAP_H__ */ > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h > index 70a6b18e5ad3c..46bf841524f8f 100644 > --- a/fs/xfs/scrub/repair.h > +++ b/fs/xfs/scrub/repair.h > @@ -48,6 +48,7 @@ xrep_trans_commit( > > struct xbitmap; > struct xagb_bitmap; > +struct xfsb_bitmap; You might need the forward declaration in reap.h, but definitely not here :) Otherwise looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
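The xfsb_bitmap under review is a single-member wrapper around the generic xbitmap, so bitmaps keyed by AG-relative blocks and bitmaps keyed by global fsblocks become distinct types that the compiler refuses to mix. A toy demonstration of the pattern — the `*_toy` names and fixed-size bit store are invented; the real xbitmap is a far more capable structure:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy generic bitmap over a small key range (stand-in for xbitmap). */
#define NBITS 256
struct xbitmap_toy {
	unsigned char bits[NBITS / 8];
};

static void xbitmap_toy_set(struct xbitmap_toy *b, uint64_t start,
			    uint64_t len)
{
	for (uint64_t i = start; i < start + len; i++)
		b->bits[i / 8] |= 1u << (i % 8);
}

static bool xbitmap_toy_test(const struct xbitmap_toy *b, uint64_t bit)
{
	return b->bits[bit / 8] & (1u << (bit % 8));
}

/*
 * Single-member wrappers: identical layout and zero runtime cost, but
 * passing an AG-block bitmap where a fsblock bitmap is expected is now
 * a compile-time type error.
 */
struct agb_bitmap_toy { struct xbitmap_toy agbitmap; };	/* AG-relative */
struct fsb_bitmap_toy { struct xbitmap_toy fsbitmap; };	/* global fsb */

static void fsb_bitmap_toy_set(struct fsb_bitmap_toy *b, uint64_t start,
			       uint64_t len)
{
	xbitmap_toy_set(&b->fsbitmap, start, len);
}
```

The inline wrappers compile away entirely, which is why the patch can add a whole parallel API surface for the price of a struct definition and a handful of one-line functions.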
* Re: [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents 2023-11-30 4:53 ` Christoph Hellwig @ 2023-11-30 21:48 ` Darrick J. Wong 2023-12-04 4:42 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-30 21:48 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Wed, Nov 29, 2023 at 08:53:19PM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:53:09PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Reintroduce to xrep_reap_extents the ability to reap extents from any > > AG. We dropped this before because it was buggy, but in the next patch > > we will gain the ability to reap old bmap btrees, which can have blocks > > in any AG. To do this, we require that sc->sa is uninitialized, so that > > we can use it to hold all the per-AG context for a given extent. > > Can you expand a bit on why it was buggy, in what commit is was dropped > and what we're doing better this time around? Oh! We merged that one! Let me change the commit message: "Back in commit a55e07308831b ("xfs: only allow reaping of per-AG blocks in xrep_reap_extents"), we removed from the reaping code the ability to handle bmbt blocks. At the time, the reaping code only walked single blocks, didn't correctly detect crosslinked blocks, and the special casing made the function hard to understand. It was easier to remove unneeded functionality prior to fixing all the bugs. "Now that we've fixed the problems, we want again the ability to reap file metadata blocks. Reintroduce the per-file reaping functionality atop the current implementation. We require that sc->sa is uninitialized, so that we can use it to hold all the per-AG context for a given extent." 
> > > > > #endif /* __XFS_SCRUB_REAP_H__ */ > > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h > > index 70a6b18e5ad3c..46bf841524f8f 100644 > > --- a/fs/xfs/scrub/repair.h > > +++ b/fs/xfs/scrub/repair.h > > @@ -48,6 +48,7 @@ xrep_trans_commit( > > > > struct xbitmap; > > struct xagb_bitmap; > > +struct xfsb_bitmap; > > Your might need the forward declaration in reap.h, but definitively > not here :) > > Otherwise looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> Thanks! --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents 2023-11-30 21:48 ` Darrick J. Wong @ 2023-12-04 4:42 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-12-04 4:42 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Thu, Nov 30, 2023 at 01:48:24PM -0800, Darrick J. Wong wrote: > > Can you expand a bit on why it was buggy, in what commit is was dropped > > and what we're doing better this time around? > > Oh! We merged that one! Let me change the commit message: > > "Back in commit a55e07308831b ("xfs: only allow reaping of per-AG > blocks in xrep_reap_extents"), we removed from the reaping code the > ability to handle bmbt blocks. At the time, the reaping code only > walked single blocks, didn't correctly detect crosslinked blocks, and > the special casing made the function hard to understand. It was easier > to remove unneeded functionality prior to fixing all the bugs. > > "Now that we've fixed the problems, we want again the ability to reap > file metadata blocks. Reintroduce the per-file reaping functionality > atop the current implementation. We require that sc->sa is > uninitialized, so that we can use it to hold all the per-AG context for > a given extent." That looks much better, thanks! ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 2/5] xfs: repair inode fork block mapping data structures 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong 2023-11-24 23:53 ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong @ 2023-11-24 23:53 ` Darrick J. Wong 2023-11-30 5:07 ` Christoph Hellwig 2023-11-24 23:53 ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong ` (2 subsequent siblings) 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:53 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Use the reverse-mapping btree information to rebuild an inode block map. Update the btree bulk loading code as necessary to support inode rooted btrees and fix some bitrot problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_bmap_btree.c | 119 ++++- fs/xfs/libxfs/xfs_bmap_btree.h | 5 fs/xfs/libxfs/xfs_btree_staging.c | 11 fs/xfs/libxfs/xfs_btree_staging.h | 2 fs/xfs/libxfs/xfs_iext_tree.c | 23 + fs/xfs/libxfs/xfs_inode_fork.c | 1 fs/xfs/libxfs/xfs_inode_fork.h | 3 fs/xfs/scrub/bmap.c | 18 + fs/xfs/scrub/bmap_repair.c | 846 +++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/common.h | 6 fs/xfs/scrub/repair.c | 28 + fs/xfs/scrub/repair.h | 6 fs/xfs/scrub/scrub.c | 4 fs/xfs/scrub/trace.h | 34 + fs/xfs/xfs_trans.c | 95 ++++ fs/xfs/xfs_trans.h | 4 17 files changed, 1172 insertions(+), 34 deletions(-) create mode 100644 fs/xfs/scrub/bmap_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 0d86d75422f60..f62351d63b147 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -182,6 +182,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) xfs-y += $(addprefix scrub/, \ agheader_repair.o \ alloc_repair.o \ + bmap_repair.o \ ialloc_repair.o \ inode_repair.o \ newbt.o \ diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c index 8360256cff168..626486f632e2c 
100644 --- a/fs/xfs/libxfs/xfs_bmap_btree.c +++ b/fs/xfs/libxfs/xfs_bmap_btree.c @@ -15,6 +15,7 @@ #include "xfs_trans.h" #include "xfs_alloc.h" #include "xfs_btree.h" +#include "xfs_btree_staging.h" #include "xfs_bmap_btree.h" #include "xfs_bmap.h" #include "xfs_error.h" @@ -288,10 +289,7 @@ xfs_bmbt_get_minrecs( int level) { if (level == cur->bc_nlevels - 1) { - struct xfs_ifork *ifp; - - ifp = xfs_ifork_ptr(cur->bc_ino.ip, - cur->bc_ino.whichfork); + struct xfs_ifork *ifp = xfs_btree_ifork_ptr(cur); return xfs_bmbt_maxrecs(cur->bc_mp, ifp->if_broot_bytes, level == 0) / 2; @@ -306,10 +304,7 @@ xfs_bmbt_get_maxrecs( int level) { if (level == cur->bc_nlevels - 1) { - struct xfs_ifork *ifp; - - ifp = xfs_ifork_ptr(cur->bc_ino.ip, - cur->bc_ino.whichfork); + struct xfs_ifork *ifp = xfs_btree_ifork_ptr(cur); return xfs_bmbt_maxrecs(cur->bc_mp, ifp->if_broot_bytes, level == 0); @@ -543,23 +538,19 @@ static const struct xfs_btree_ops xfs_bmbt_ops = { .keys_contiguous = xfs_bmbt_keys_contiguous, }; -/* - * Allocate a new bmap btree cursor. 
- */ -struct xfs_btree_cur * /* new bmap btree cursor */ -xfs_bmbt_init_cursor( - struct xfs_mount *mp, /* file system mount point */ - struct xfs_trans *tp, /* transaction pointer */ - struct xfs_inode *ip, /* inode owning the btree */ - int whichfork) /* data or attr fork */ +static struct xfs_btree_cur * +xfs_bmbt_init_common( + struct xfs_mount *mp, + struct xfs_trans *tp, + struct xfs_inode *ip, + int whichfork) { - struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork); struct xfs_btree_cur *cur; + ASSERT(whichfork != XFS_COW_FORK); cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP, mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache); - cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1; cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2); cur->bc_ops = &xfs_bmbt_ops; @@ -567,10 +558,30 @@ xfs_bmbt_init_cursor( if (xfs_has_crc(mp)) cur->bc_flags |= XFS_BTREE_CRC_BLOCKS; - cur->bc_ino.forksize = xfs_inode_fork_size(ip, whichfork); cur->bc_ino.ip = ip; cur->bc_ino.allocated = 0; cur->bc_ino.flags = 0; + + return cur; +} + +/* + * Allocate a new bmap btree cursor. + */ +struct xfs_btree_cur * +xfs_bmbt_init_cursor( + struct xfs_mount *mp, + struct xfs_trans *tp, + struct xfs_inode *ip, + int whichfork) +{ + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork); + struct xfs_btree_cur *cur; + + cur = xfs_bmbt_init_common(mp, tp, ip, whichfork); + + cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1; + cur->bc_ino.forksize = xfs_inode_fork_size(ip, whichfork); cur->bc_ino.whichfork = whichfork; return cur; @@ -587,6 +598,74 @@ xfs_bmbt_block_maxrecs( return blocklen / (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t)); } +/* + * Allocate a new bmap btree cursor for reloading an inode block mapping data + * structure. Note that callers can use the staged cursor to reload extents + * format inode forks if they rebuild the iext tree and commit the staged + * cursor immediately. 
+ */ +struct xfs_btree_cur * +xfs_bmbt_stage_cursor( + struct xfs_mount *mp, + struct xfs_inode *ip, + struct xbtree_ifakeroot *ifake) +{ + struct xfs_btree_cur *cur; + struct xfs_btree_ops *ops; + + cur = xfs_bmbt_init_common(mp, NULL, ip, ifake->if_whichfork); + cur->bc_nlevels = ifake->if_levels; + cur->bc_ino.forksize = ifake->if_fork_size; + /* Don't let anyone think we're attached to the real fork yet. */ + cur->bc_ino.whichfork = -1; + xfs_btree_stage_ifakeroot(cur, ifake, &ops); + ops->update_cursor = NULL; + return cur; +} + +/* + * Swap in the new inode fork root. Once we pass this point the newly rebuilt + * mappings are in place and we have to kill off any old btree blocks. + */ +void +xfs_bmbt_commit_staged_btree( + struct xfs_btree_cur *cur, + struct xfs_trans *tp, + int whichfork) +{ + struct xbtree_ifakeroot *ifake = cur->bc_ino.ifake; + struct xfs_ifork *ifp; + static const short brootflag[2] = {XFS_ILOG_DBROOT, XFS_ILOG_ABROOT}; + static const short extflag[2] = {XFS_ILOG_DEXT, XFS_ILOG_AEXT}; + int flags = XFS_ILOG_CORE; + + ASSERT(cur->bc_flags & XFS_BTREE_STAGING); + ASSERT(whichfork != XFS_COW_FORK); + + /* + * Free any resources hanging off the real fork, then shallow-copy the + * staging fork's contents into the real fork to transfer everything + * we just built. + */ + ifp = xfs_ifork_ptr(cur->bc_ino.ip, whichfork); + xfs_idestroy_fork(ifp); + memcpy(ifp, ifake->if_fork, sizeof(struct xfs_ifork)); + + switch (ifp->if_format) { + case XFS_DINODE_FMT_EXTENTS: + flags |= extflag[whichfork]; + break; + case XFS_DINODE_FMT_BTREE: + flags |= brootflag[whichfork]; + break; + default: + ASSERT(0); + break; + } + xfs_trans_log_inode(tp, cur->bc_ino.ip, flags); + xfs_btree_commit_ifakeroot(cur, tp, whichfork, &xfs_bmbt_ops); +} + /* * Calculate number of records in a bmap btree block. 
*/ diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h index 3e7a40a83835c..151b8491f60ee 100644 --- a/fs/xfs/libxfs/xfs_bmap_btree.h +++ b/fs/xfs/libxfs/xfs_bmap_btree.h @@ -11,6 +11,7 @@ struct xfs_btree_block; struct xfs_mount; struct xfs_inode; struct xfs_trans; +struct xbtree_ifakeroot; /* * Btree block header size depends on a superblock flag. @@ -106,6 +107,10 @@ extern int xfs_bmbt_change_owner(struct xfs_trans *tp, struct xfs_inode *ip, extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *, struct xfs_trans *, struct xfs_inode *, int); +struct xfs_btree_cur *xfs_bmbt_stage_cursor(struct xfs_mount *mp, + struct xfs_inode *ip, struct xbtree_ifakeroot *ifake); +void xfs_bmbt_commit_staged_btree(struct xfs_btree_cur *cur, + struct xfs_trans *tp, int whichfork); extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp, unsigned long long len); diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c index 6fd6ea8e6fbd7..4cdf7976b7bf5 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.c +++ b/fs/xfs/libxfs/xfs_btree_staging.c @@ -399,7 +399,7 @@ xfs_btree_bload_prep_block( ASSERT(*bpp == NULL); /* Allocate a new incore btree root block. */ - new_size = bbl->iroot_size(cur, nr_this_block, priv); + new_size = bbl->iroot_size(cur, level, nr_this_block, priv); ifp->if_broot = kmem_zalloc(new_size, 0); ifp->if_broot_bytes = (int)new_size; @@ -585,7 +585,14 @@ xfs_btree_bload_level_geometry( unsigned int desired_npb; unsigned int maxnr; - maxnr = cur->bc_ops->get_maxrecs(cur, level); + /* + * Compute the absolute maximum number of records that we can store in + * the ondisk block or inode root. 
+ */ + if (cur->bc_ops->get_dmaxrecs) + maxnr = cur->bc_ops->get_dmaxrecs(cur, level); + else + maxnr = cur->bc_ops->get_maxrecs(cur, level); /* * Compute the number of blocks we need to fill each block with the diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h index d2eaf4fdc6032..439d3490c878a 100644 --- a/fs/xfs/libxfs/xfs_btree_staging.h +++ b/fs/xfs/libxfs/xfs_btree_staging.h @@ -56,7 +56,7 @@ typedef int (*xfs_btree_bload_get_records_fn)(struct xfs_btree_cur *cur, typedef int (*xfs_btree_bload_claim_block_fn)(struct xfs_btree_cur *cur, union xfs_btree_ptr *ptr, void *priv); typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur, - unsigned int nr_this_level, void *priv); + unsigned int level, unsigned int nr_this_level, void *priv); struct xfs_btree_bload { /* diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c index 773cf43494286..d062794cc7957 100644 --- a/fs/xfs/libxfs/xfs_iext_tree.c +++ b/fs/xfs/libxfs/xfs_iext_tree.c @@ -622,13 +622,11 @@ static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp) } void -xfs_iext_insert( - struct xfs_inode *ip, +xfs_iext_insert_raw( + struct xfs_ifork *ifp, struct xfs_iext_cursor *cur, - struct xfs_bmbt_irec *irec, - int state) + struct xfs_bmbt_irec *irec) { - struct xfs_ifork *ifp = xfs_iext_state_to_fork(ip, state); xfs_fileoff_t offset = irec->br_startoff; struct xfs_iext_leaf *new = NULL; int nr_entries, i; @@ -662,12 +660,23 @@ xfs_iext_insert( xfs_iext_set(cur_rec(cur), irec); ifp->if_bytes += sizeof(struct xfs_iext_rec); - trace_xfs_iext_insert(ip, cur, state, _RET_IP_); - if (new) xfs_iext_insert_node(ifp, xfs_iext_leaf_key(new, 0), new, 2); } +void +xfs_iext_insert( + struct xfs_inode *ip, + struct xfs_iext_cursor *cur, + struct xfs_bmbt_irec *irec, + int state) +{ + struct xfs_ifork *ifp = xfs_iext_state_to_fork(ip, state); + + xfs_iext_insert_raw(ifp, cur, irec); + trace_xfs_iext_insert(ip, cur, state, _RET_IP_); +} + static struct 
xfs_iext_node * xfs_iext_rebalance_node( struct xfs_iext_node *parent, diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c index 5a2e7ddfa76d6..2390884e0075b 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.c +++ b/fs/xfs/libxfs/xfs_inode_fork.c @@ -520,6 +520,7 @@ xfs_idata_realloc( ifp->if_bytes = new_size; } +/* Free all memory and reset a fork back to its initial state. */ void xfs_idestroy_fork( struct xfs_ifork *ifp) diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index 96d307784c85b..535be5c036899 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -180,6 +180,9 @@ void xfs_init_local_fork(struct xfs_inode *ip, int whichfork, const void *data, int64_t size); xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp); +void xfs_iext_insert_raw(struct xfs_ifork *ifp, + struct xfs_iext_cursor *cur, + struct xfs_bmbt_irec *irec); void xfs_iext_insert(struct xfs_inode *, struct xfs_iext_cursor *cur, struct xfs_bmbt_irec *, int); void xfs_iext_remove(struct xfs_inode *, struct xfs_iext_cursor *, diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index c12ccc9141163..737ab982a2d7a 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -48,9 +48,18 @@ xchk_setup_inode_bmap( if (S_ISREG(VFS_I(sc->ip)->i_mode) && sc->sm->sm_type != XFS_SCRUB_TYPE_BMBTA) { struct address_space *mapping = VFS_I(sc->ip)->i_mapping; + bool is_repair = xchk_could_repair(sc); xchk_ilock(sc, XFS_MMAPLOCK_EXCL); + /* Break all our leases, we're going to mess with things. */ + if (is_repair) { + error = xfs_break_layouts(VFS_I(sc->ip), + &sc->ilock_flags, BREAK_WRITE); + if (error) + goto out; + } + inode_dio_wait(VFS_I(sc->ip)); /* @@ -71,6 +80,15 @@ xchk_setup_inode_bmap( error = filemap_fdatawait_keep_errors(mapping); if (error && (error != -ENOSPC && error != -EIO)) goto out; + + /* Drop the page cache if we're repairing block mappings. 
*/ + if (is_repair) { + error = invalidate_inode_pages2( + VFS_I(sc->ip)->i_mapping); + if (error) + goto out; + } + } /* Got the inode, lock it and we're ready to go. */ diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c new file mode 100644 index 0000000000000..2c593eebb1fc4 --- /dev/null +++ b/fs/xfs/scrub/bmap_repair.c @@ -0,0 +1,846 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_inode_fork.h" +#include "xfs_alloc.h" +#include "xfs_rtalloc.h" +#include "xfs_bmap.h" +#include "xfs_bmap_util.h" +#include "xfs_bmap_btree.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_refcount.h" +#include "xfs_quota.h" +#include "xfs_ialloc.h" +#include "xfs_ag.h" +#include "xfs_reflink.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/newbt.h" +#include "scrub/reap.h" + +/* + * Inode Fork Block Mapping (BMBT) Repair + * ====================================== + * + * Gather all the rmap records for the inode and fork we're fixing, reset the + * incore fork, then recreate the btree. + */ +struct xrep_bmap { + /* Old bmbt blocks */ + struct xfsb_bitmap old_bmbt_blocks; + + /* New fork. */ + struct xrep_newbt new_bmapbt; + + /* List of new bmap records. */ + struct xfarray *bmap_records; + + struct xfs_scrub *sc; + + /* How many blocks did we find allocated to this file? 
*/ + xfs_rfsblock_t nblocks; + + /* How many bmbt blocks did we find for this fork? */ + xfs_rfsblock_t old_bmbt_block_count; + + /* get_records()'s position in the free space record array. */ + xfarray_idx_t array_cur; + + /* How many real (non-hole, non-delalloc) mappings do we have? */ + uint64_t real_mappings; + + /* Which fork are we fixing? */ + int whichfork; + + /* Are there shared extents? */ + bool shared_extents; +}; + +/* Is this space extent shared? Flag the inode if it is. */ +STATIC int +xrep_bmap_discover_shared( + struct xrep_bmap *rb, + xfs_fsblock_t startblock, + xfs_filblks_t blockcount, + bool unwritten) +{ + struct xfs_scrub *sc = rb->sc; + xfs_agblock_t agbno; + xfs_agblock_t fbno; + xfs_extlen_t flen; + int error; + + /* + * Only investigate if we need to set the shared extents flag if we are + * adding a written extent mapping to the data fork of a regular file + * on reflink filesystem. + */ + if (rb->shared_extents) + return 0; + if (unwritten) + return 0; + if (rb->whichfork != XFS_DATA_FORK) + return 0; + if (!S_ISREG(VFS_I(sc->ip)->i_mode)) + return 0; + if (!xfs_has_reflink(sc->mp)) + return 0; + if (XFS_IS_REALTIME_INODE(sc->ip)) + return 0; + + agbno = XFS_FSB_TO_AGBNO(sc->mp, startblock); + error = xfs_refcount_find_shared(sc->sa.refc_cur, agbno, blockcount, + &fbno, &flen, false); + if (error) + return error; + + if (fbno != NULLAGBLOCK) + rb->shared_extents = true; + + return 0; +} + +/* Remember this reverse-mapping as a series of bmap records. */ +STATIC int +xrep_bmap_from_rmap( + struct xrep_bmap *rb, + xfs_fileoff_t startoff, + xfs_fsblock_t startblock, + xfs_filblks_t blockcount, + bool unwritten) +{ + struct xfs_bmbt_irec irec = { + .br_startoff = startoff, + .br_startblock = startblock, + .br_state = unwritten ? 
XFS_EXT_UNWRITTEN : XFS_EXT_NORM, + }; + struct xfs_bmbt_rec rbe; + struct xfs_scrub *sc = rb->sc; + int error = 0; + + /* + * If we're repairing the data fork of a non-reflinked regular file on + * a reflink filesystem, we need to figure out if this space extent is + * shared. + */ + error = xrep_bmap_discover_shared(rb, startblock, blockcount, + unwritten); + if (error) + return error; + + do { + xfs_failaddr_t fa; + + irec.br_blockcount = min_t(xfs_filblks_t, blockcount, + XFS_MAX_BMBT_EXTLEN); + + fa = xfs_bmap_validate_extent(sc->ip, rb->whichfork, &irec); + if (fa) + return -EFSCORRUPTED; + + xfs_bmbt_disk_set_all(&rbe, &irec); + + trace_xrep_bmap_found(sc->ip, rb->whichfork, &irec); + + if (xchk_should_terminate(sc, &error)) + return error; + + error = xfarray_append(rb->bmap_records, &rbe); + if (error) + return error; + + rb->real_mappings++; + + irec.br_startblock += irec.br_blockcount; + irec.br_startoff += irec.br_blockcount; + blockcount -= irec.br_blockcount; + } while (blockcount > 0); + + return 0; +} + +/* Check for any obvious errors or conflicts in the file mapping. */ +STATIC int +xrep_bmap_check_fork_rmap( + struct xrep_bmap *rb, + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec) +{ + struct xfs_scrub *sc = rb->sc; + enum xbtree_recpacking outcome; + int error; + + /* + * Data extents for rt files are never stored on the data device, but + * everything else (xattrs, bmbt blocks) can be. + */ + if (XFS_IS_REALTIME_INODE(sc->ip) && + !(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) + return -EFSCORRUPTED; + + /* Check that this is within the AG. */ + if (!xfs_verify_agbext(cur->bc_ag.pag, rec->rm_startblock, + rec->rm_blockcount)) + return -EFSCORRUPTED; + + /* Check the file offset range. */ + if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) && + !xfs_verify_fileext(sc->mp, rec->rm_offset, rec->rm_blockcount)) + return -EFSCORRUPTED; + + /* No contradictory flags. 
*/ + if ((rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK)) && + (rec->rm_flags & XFS_RMAP_UNWRITTEN)) + return -EFSCORRUPTED; + + /* Make sure this isn't free space. */ + error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rm_startblock, + rec->rm_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + /* Must not be an inode chunk. */ + error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur, + rec->rm_startblock, rec->rm_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + return 0; +} + +/* Record extents that belong to this inode's fork. */ +STATIC int +xrep_bmap_walk_rmap( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + struct xrep_bmap *rb = priv; + struct xfs_mount *mp = cur->bc_mp; + xfs_fsblock_t fsbno; + int error = 0; + + if (xchk_should_terminate(rb->sc, &error)) + return error; + + if (rec->rm_owner != rb->sc->ip->i_ino) + return 0; + + error = xrep_bmap_check_fork_rmap(rb, cur, rec); + if (error) + return error; + + /* + * Record all blocks allocated to this file even if the extent isn't + * for the fork we're rebuilding so that we can reset di_nblocks later. + */ + rb->nblocks += rec->rm_blockcount; + + /* If this rmap isn't for the fork we want, we're done. */ + if (rb->whichfork == XFS_DATA_FORK && + (rec->rm_flags & XFS_RMAP_ATTR_FORK)) + return 0; + if (rb->whichfork == XFS_ATTR_FORK && + !(rec->rm_flags & XFS_RMAP_ATTR_FORK)) + return 0; + + fsbno = XFS_AGB_TO_FSB(mp, cur->bc_ag.pag->pag_agno, + rec->rm_startblock); + + if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) { + rb->old_bmbt_block_count += rec->rm_blockcount; + return xfsb_bitmap_set(&rb->old_bmbt_blocks, fsbno, + rec->rm_blockcount); + } + + return xrep_bmap_from_rmap(rb, rec->rm_offset, fsbno, + rec->rm_blockcount, + rec->rm_flags & XFS_RMAP_UNWRITTEN); +} + +/* + * Compare two block mapping records. 
We want to sort in order of increasing + * file offset. + */ +static int +xrep_bmap_extent_cmp( + const void *a, + const void *b) +{ + xfs_fileoff_t ao; + xfs_fileoff_t bo; + + ao = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)a); + bo = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)b); + + if (ao > bo) + return 1; + else if (ao < bo) + return -1; + return 0; +} + +/* + * Sort the bmap extents by fork offset or else the records will be in the + * wrong order. Ensure there are no overlaps in the file offset ranges. + */ +STATIC int +xrep_bmap_sort_records( + struct xrep_bmap *rb) +{ + struct xfs_bmbt_irec irec; + xfs_fileoff_t next_off = 0; + xfarray_idx_t array_cur; + int error; + + error = xfarray_sort(rb->bmap_records, xrep_bmap_extent_cmp, + XFARRAY_SORT_KILLABLE); + if (error) + return error; + + foreach_xfarray_idx(rb->bmap_records, array_cur) { + struct xfs_bmbt_rec rec; + + if (xchk_should_terminate(rb->sc, &error)) + return error; + + error = xfarray_load(rb->bmap_records, array_cur, &rec); + if (error) + return error; + + xfs_bmbt_disk_get_all(&rec, &irec); + + if (irec.br_startoff < next_off) + return -EFSCORRUPTED; + + next_off = irec.br_startoff + irec.br_blockcount; + } + + return 0; +} + +/* Scan one AG for reverse mappings that we can turn into extent maps. */ +STATIC int +xrep_bmap_scan_ag( + struct xrep_bmap *rb, + struct xfs_perag *pag) +{ + struct xfs_scrub *sc = rb->sc; + int error; + + error = xrep_ag_init(sc, pag, &sc->sa); + if (error) + return error; + + error = xfs_rmap_query_all(sc->sa.rmap_cur, xrep_bmap_walk_rmap, rb); + xchk_ag_free(sc, &sc->sa); + return error; +} + +/* Find the delalloc extents from the old incore extent tree. 
*/ +STATIC int +xrep_bmap_find_delalloc( + struct xrep_bmap *rb) +{ + struct xfs_bmbt_irec irec; + struct xfs_iext_cursor icur; + struct xfs_bmbt_rec rbe; + struct xfs_inode *ip = rb->sc->ip; + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, rb->whichfork); + int error = 0; + + /* + * Skip this scan if we don't expect to find delayed allocation + * reservations in this fork. + */ + if (rb->whichfork == XFS_ATTR_FORK || ip->i_delayed_blks == 0) + return 0; + + for_each_xfs_iext(ifp, &icur, &irec) { + if (!isnullstartblock(irec.br_startblock)) + continue; + + xfs_bmbt_disk_set_all(&rbe, &irec); + + trace_xrep_bmap_found(ip, rb->whichfork, &irec); + + if (xchk_should_terminate(rb->sc, &error)) + return error; + + error = xfarray_append(rb->bmap_records, &rbe); + if (error) + return error; + } + + return 0; +} + +/* + * Collect block mappings for this fork of this inode and decide if we have + * enough space to rebuild. Caller is responsible for cleaning up the list if + * anything goes wrong. + */ +STATIC int +xrep_bmap_find_mappings( + struct xrep_bmap *rb) +{ + struct xfs_scrub *sc = rb->sc; + struct xfs_perag *pag; + xfs_agnumber_t agno; + int error = 0; + + /* Iterate the rmaps for extents. */ + for_each_perag(sc->mp, agno, pag) { + error = xrep_bmap_scan_ag(rb, pag); + if (error) { + xfs_perag_rele(pag); + return error; + } + } + + return xrep_bmap_find_delalloc(rb); +} + +/* Retrieve real extent mappings for bulk loading the bmap btree. 
*/ +STATIC int +xrep_bmap_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_bmbt_rec rec; + struct xfs_bmbt_irec *irec = &cur->bc_rec.b; + struct xrep_bmap *rb = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + do { + error = xfarray_load(rb->bmap_records, rb->array_cur++, + &rec); + if (error) + return error; + + xfs_bmbt_disk_get_all(&rec, irec); + } while (isnullstartblock(irec->br_startblock)); + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Feed one of the new btree blocks to the bulk loader. */ +STATIC int +xrep_bmap_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_bmap *rb = priv; + + return xrep_newbt_claim_block(cur, &rb->new_bmapbt, ptr); +} + +/* Figure out how much space we need to create the incore btree root block. */ +STATIC size_t +xrep_bmap_iroot_size( + struct xfs_btree_cur *cur, + unsigned int level, + unsigned int nr_this_level, + void *priv) +{ + ASSERT(level > 0); + + return XFS_BMAP_BROOT_SPACE_CALC(cur->bc_mp, nr_this_level); +} + +/* Update the inode counters. */ +STATIC int +xrep_bmap_reset_counters( + struct xrep_bmap *rb) +{ + struct xfs_scrub *sc = rb->sc; + struct xbtree_ifakeroot *ifake = &rb->new_bmapbt.ifake; + int64_t delta; + + if (rb->shared_extents) + sc->ip->i_diflags2 |= XFS_DIFLAG2_REFLINK; + + /* + * Update the inode block counts to reflect the extents we found in the + * rmapbt. + */ + delta = ifake->if_blocks - rb->old_bmbt_block_count; + sc->ip->i_nblocks = rb->nblocks + delta; + xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE); + + /* + * Adjust the quota counts by the difference in size between the old + * and new bmbt. 
+ */ + xfs_trans_mod_dquot_byino(sc->tp, sc->ip, XFS_TRANS_DQ_BCOUNT, delta); + return 0; +} + +/* + * Create a new iext tree and load it with block mappings. If the inode is + * in extents format, that's all we need to do to commit the new mappings. + * If it is in btree format, this takes care of preloading the incore tree. + */ +STATIC int +xrep_bmap_extents_load( + struct xrep_bmap *rb) +{ + struct xfs_iext_cursor icur; + struct xfs_bmbt_irec irec; + struct xfs_ifork *ifp = rb->new_bmapbt.ifake.if_fork; + xfarray_idx_t array_cur; + int error; + + ASSERT(ifp->if_bytes == 0); + + /* Add all the mappings (incl. delalloc) to the incore extent tree. */ + xfs_iext_first(ifp, &icur); + foreach_xfarray_idx(rb->bmap_records, array_cur) { + struct xfs_bmbt_rec rec; + + error = xfarray_load(rb->bmap_records, array_cur, &rec); + if (error) + return error; + + xfs_bmbt_disk_get_all(&rec, &irec); + + xfs_iext_insert_raw(ifp, &icur, &irec); + if (!isnullstartblock(irec.br_startblock)) + ifp->if_nextents++; + + xfs_iext_next(ifp, &icur); + } + + return xrep_ino_ensure_extent_count(rb->sc, rb->whichfork, + ifp->if_nextents); +} + +/* + * Reserve new btree blocks, bulk load the bmap records into the ondisk btree, + * and load the incore extent tree. + */ +STATIC int +xrep_bmap_btree_load( + struct xrep_bmap *rb, + struct xfs_btree_cur *bmap_cur) +{ + struct xfs_scrub *sc = rb->sc; + int error; + + /* Compute how many blocks we'll need. */ + error = xfs_btree_bload_compute_geometry(bmap_cur, + &rb->new_bmapbt.bload, rb->real_mappings); + if (error) + return error; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + return error; + + /* + * Guess how many blocks we're going to need to rebuild an entire bmap + * from the number of extents we found, and pump up our transaction to + * have sufficient block reservation. We're allowed to exceed file + * quota to repair inconsistent metadata. 
+ */ + error = xfs_trans_reserve_more_inode(sc->tp, sc->ip, + rb->new_bmapbt.bload.nr_blocks, 0, true); + if (error) + return error; + + /* Reserve the space we'll need for the new btree. */ + error = xrep_newbt_alloc_blocks(&rb->new_bmapbt, + rb->new_bmapbt.bload.nr_blocks); + if (error) + return error; + + /* Add all observed bmap records. */ + rb->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(bmap_cur, &rb->new_bmapbt.bload, rb); + if (error) + return error; + + /* + * Load the new bmap records into the new incore extent tree to + * preserve delalloc reservations for regular files. The directory + * code loads the extent tree during xfs_dir_open and assumes + * thereafter that it remains loaded, so we must not violate that + * assumption. + */ + return xrep_bmap_extents_load(rb); +} + +/* + * Use the collected bmap information to stage a new bmap fork. If this is + * successful we'll return with the new fork information logged to the repair + * transaction but not yet committed. The caller must ensure that the inode + * is joined to the transaction; the inode will be joined to a clean + * transaction when the function returns. + */ +STATIC int +xrep_bmap_build_new_fork( + struct xrep_bmap *rb) +{ + struct xfs_owner_info oinfo; + struct xfs_scrub *sc = rb->sc; + struct xfs_btree_cur *bmap_cur; + struct xbtree_ifakeroot *ifake = &rb->new_bmapbt.ifake; + int error; + + error = xrep_bmap_sort_records(rb); + if (error) + return error; + + /* + * Prepare to construct the new fork by initializing the new btree + * structure and creating a fake ifork in the ifakeroot structure. 
+ */ + xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, rb->whichfork); + error = xrep_newbt_init_inode(&rb->new_bmapbt, sc, rb->whichfork, + &oinfo); + if (error) + return error; + + rb->new_bmapbt.bload.get_records = xrep_bmap_get_records; + rb->new_bmapbt.bload.claim_block = xrep_bmap_claim_block; + rb->new_bmapbt.bload.iroot_size = xrep_bmap_iroot_size; + bmap_cur = xfs_bmbt_stage_cursor(sc->mp, sc->ip, ifake); + + /* + * Figure out the size and format of the new fork, then fill it with + * all the bmap records we've found. Join the inode to the transaction + * so that we can roll the transaction while holding the inode locked. + */ + if (rb->real_mappings <= XFS_IFORK_MAXEXT(sc->ip, rb->whichfork)) { + ifake->if_fork->if_format = XFS_DINODE_FMT_EXTENTS; + error = xrep_bmap_extents_load(rb); + } else { + ifake->if_fork->if_format = XFS_DINODE_FMT_BTREE; + error = xrep_bmap_btree_load(rb, bmap_cur); + } + if (error) + goto err_cur; + + /* + * Install the new fork in the inode. After this point the old mapping + * data are no longer accessible and the new tree is live. We delete + * the cursor immediately after committing the staged root because the + * staged fork might be in extents format. + */ + xfs_bmbt_commit_staged_btree(bmap_cur, sc->tp, rb->whichfork); + xfs_btree_del_cursor(bmap_cur, 0); + + /* Reset the inode counters now that we've changed the fork. */ + error = xrep_bmap_reset_counters(rb); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. */ + error = xrep_newbt_commit(&rb->new_bmapbt); + if (error) + return error; + + return xrep_roll_trans(sc); + +err_cur: + if (bmap_cur) + xfs_btree_del_cursor(bmap_cur, error); +err_newbt: + xrep_newbt_cancel(&rb->new_bmapbt); + return error; +} + +/* + * Now that we've logged the new inode btree, invalidate all of the old blocks + * and free them, if there were any. 
+ */ +STATIC int +xrep_bmap_remove_old_tree( + struct xrep_bmap *rb) +{ + struct xfs_scrub *sc = rb->sc; + struct xfs_owner_info oinfo; + + /* Free the old bmbt blocks if they're not in use. */ + xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, rb->whichfork); + return xrep_reap_fsblocks(sc, &rb->old_bmbt_blocks, &oinfo); +} + +/* Check for garbage inputs. Returns -ECANCELED if there's nothing to do. */ +STATIC int +xrep_bmap_check_inputs( + struct xfs_scrub *sc, + int whichfork) +{ + struct xfs_ifork *ifp = xfs_ifork_ptr(sc->ip, whichfork); + + ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK); + + if (!xfs_has_rmapbt(sc->mp)) + return -EOPNOTSUPP; + + /* No fork means nothing to rebuild. */ + if (!ifp) + return -ECANCELED; + + /* + * We only know how to repair extent mappings, which is to say that we + * only support extents and btree fork format. Repairs to a local + * format fork require a higher level repair function, so we do not + * have any work to do here. + */ + switch (ifp->if_format) { + case XFS_DINODE_FMT_DEV: + case XFS_DINODE_FMT_LOCAL: + case XFS_DINODE_FMT_UUID: + return -ECANCELED; + case XFS_DINODE_FMT_EXTENTS: + case XFS_DINODE_FMT_BTREE: + break; + default: + return -EFSCORRUPTED; + } + + if (whichfork == XFS_ATTR_FORK) + return 0; + + /* Only files, symlinks, and directories get to have data forks. */ + switch (VFS_I(sc->ip)->i_mode & S_IFMT) { + case S_IFREG: + case S_IFDIR: + case S_IFLNK: + /* ok */ + break; + default: + return -EINVAL; + } + + /* Don't know how to rebuild realtime data forks. */ + if (XFS_IS_REALTIME_INODE(sc->ip)) + return -EOPNOTSUPP; + + return 0; +} + +/* Repair an inode fork. 
*/ +STATIC int +xrep_bmap( + struct xfs_scrub *sc, + int whichfork) +{ + struct xrep_bmap *rb; + char *descr; + unsigned int max_bmbt_recs; + bool large_extcount; + int error = 0; + + error = xrep_bmap_check_inputs(sc, whichfork); + if (error == -ECANCELED) + return 0; + if (error) + return error; + + rb = kzalloc(sizeof(struct xrep_bmap), XCHK_GFP_FLAGS); + if (!rb) + return -ENOMEM; + rb->sc = sc; + rb->whichfork = whichfork; + + /* + * No need to waste time scanning for shared extents if the inode is + * already marked. + */ + if (whichfork == XFS_DATA_FORK && xfs_is_reflink_inode(sc->ip)) + rb->shared_extents = true; + + /* Set up enough storage to handle the max records for this fork. */ + large_extcount = xfs_has_large_extent_counts(sc->mp); + max_bmbt_recs = xfs_iext_max_nextents(large_extcount, whichfork); + descr = xchk_xfile_ino_descr(sc, "%s fork mapping records", + whichfork == XFS_DATA_FORK ? "data" : "attr"); + error = xfarray_create(descr, max_bmbt_recs, + sizeof(struct xfs_bmbt_rec), &rb->bmap_records); + kfree(descr); + if (error) + goto out_rb; + + /* Collect all reverse mappings for this fork's extents. */ + xfsb_bitmap_init(&rb->old_bmbt_blocks); + error = xrep_bmap_find_mappings(rb); + if (error) + goto out_bitmap; + + xfs_trans_ijoin(sc->tp, sc->ip, 0); + + /* Rebuild the bmap information. */ + error = xrep_bmap_build_new_fork(rb); + if (error) + goto out_bitmap; + + /* Kill the old tree. */ + error = xrep_bmap_remove_old_tree(rb); + if (error) + goto out_bitmap; + +out_bitmap: + xfsb_bitmap_destroy(&rb->old_bmbt_blocks); + xfarray_destroy(rb->bmap_records); +out_rb: + kfree(rb); + return error; +} + +/* Repair an inode's data fork. */ +int +xrep_bmap_data( + struct xfs_scrub *sc) +{ + return xrep_bmap(sc, XFS_DATA_FORK); +} + +/* Repair an inode's attr fork. 
*/ +int +xrep_bmap_attr( + struct xfs_scrub *sc) +{ + return xrep_bmap(sc, XFS_ATTR_FORK); +} diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 506b808b9fbb3..1033d8172be62 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -233,7 +233,11 @@ int xchk_metadata_inode_forks(struct xfs_scrub *sc); (sc)->mp->m_super->s_id, \ (sc)->sa.pag ? (sc)->sa.pag->pag_agno : (sc)->sm->sm_agno, \ ##__VA_ARGS__) - +#define xchk_xfile_ino_descr(sc, fmt, ...) \ + kasprintf(XCHK_GFP_FLAGS, "XFS (%s): inode 0x%llx " fmt, \ + (sc)->mp->m_super->s_id, \ + (sc)->ip ? (sc)->ip->i_ino : (sc)->sm->sm_ino, \ + ##__VA_ARGS__) /* * Setting up a hook to wait for intents to drain is costly -- we have to take diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 82c9760776248..4d5bfb2e4cf08 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -880,6 +880,34 @@ xrep_reinit_pagi( return 0; } +/* + * Given an active reference to a perag structure, load AG headers and cursors. + * This should only be called to scan an AG while repairing file-based metadata. + */ +int +xrep_ag_init( + struct xfs_scrub *sc, + struct xfs_perag *pag, + struct xchk_ag *sa) +{ + int error; + + ASSERT(!sa->pag); + + error = xfs_ialloc_read_agi(pag, sc->tp, &sa->agi_bp); + if (error) + return error; + + error = xfs_alloc_read_agf(pag, sc->tp, 0, &sa->agf_bp); + if (error) + return error; + + /* Grab our own passive reference from the caller's ref. */ + sa->pag = xfs_perag_hold(pag); + xrep_ag_btcur_init(sc, sa); + return 0; +} + /* Reinitialize the per-AG block reservation for the AG we just fixed. 
*/ int xrep_reset_perag_resv( diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 46bf841524f8f..9f0c77b38ae28 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -89,6 +89,8 @@ struct xfs_imap; int xrep_setup_inode(struct xfs_scrub *sc, struct xfs_imap *imap); void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa); +int xrep_ag_init(struct xfs_scrub *sc, struct xfs_perag *pag, + struct xchk_ag *sa); /* Metadata revalidators */ @@ -106,6 +108,8 @@ int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); int xrep_refcountbt(struct xfs_scrub *sc); int xrep_inode(struct xfs_scrub *sc); +int xrep_bmap_data(struct xfs_scrub *sc); +int xrep_bmap_attr(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -165,6 +169,8 @@ xrep_setup_nothing( #define xrep_iallocbt xrep_notsupported #define xrep_refcountbt xrep_notsupported #define xrep_inode xrep_notsupported +#define xrep_bmap_data xrep_notsupported +#define xrep_bmap_attr xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index b9edda17ab64b..52a09e0652693 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -290,13 +290,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_INODE, .setup = xchk_setup_inode_bmap, .scrub = xchk_bmap_data, - .repair = xrep_notsupported, + .repair = xrep_bmap_data, }, [XFS_SCRUB_TYPE_BMBTA] = { /* inode attr fork */ .type = ST_INODE, .setup = xchk_setup_inode_bmap, .scrub = xchk_bmap_attr, - .repair = xrep_notsupported, + .repair = xrep_bmap_attr, }, [XFS_SCRUB_TYPE_BMBTC] = { /* inode CoW fork */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 6cd5d04c0410c..3d55f65c00835 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1175,7 +1175,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \ TP_ARGS(mp, agno, agbno, len, owner, offset, flags)) 
DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap); DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn); -DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn); +DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_walk_rmap); TRACE_EVENT(xrep_abt_found, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, @@ -1260,6 +1260,38 @@ TRACE_EVENT(xrep_refc_found, __entry->refcount) ) +TRACE_EVENT(xrep_bmap_found, + TP_PROTO(struct xfs_inode *ip, int whichfork, + struct xfs_bmbt_irec *irec), + TP_ARGS(ip, whichfork, irec), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(int, whichfork) + __field(xfs_fileoff_t, lblk) + __field(xfs_filblks_t, len) + __field(xfs_fsblock_t, pblk) + __field(int, state) + ), + TP_fast_assign( + __entry->dev = VFS_I(ip)->i_sb->s_dev; + __entry->ino = ip->i_ino; + __entry->whichfork = whichfork; + __entry->lblk = irec->br_startoff; + __entry->len = irec->br_blockcount; + __entry->pblk = irec->br_startblock; + __entry->state = irec->br_state; + ), + TP_printk("dev %d:%d ino 0x%llx whichfork %s fileoff 0x%llx fsbcount 0x%llx startblock 0x%llx state %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS), + __entry->lblk, + __entry->len, + __entry->pblk, + __entry->state) +); + TRACE_EVENT(xrep_findroot_block, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno, uint32_t magic, uint16_t level), diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 305c9d07bf1b2..11e3d50078be9 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -132,6 +132,62 @@ xfs_trans_dup( return ntp; } +/* + * Try to reserve more blocks for a transaction. + * + * This is for callers that need to attach resources to a transaction, scan + * those resources to determine the space reservation requirements, and then + * modify the attached resources. In other words, online repair. 
This can + * fail due to ENOSPC, so the caller must be able to cancel the transaction + * without shutting down the fs. + */ +int +xfs_trans_reserve_more( + struct xfs_trans *tp, + unsigned int blocks, + unsigned int rtextents) +{ + struct xfs_mount *mp = tp->t_mountp; + bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0; + int error = 0; + + ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); + + /* + * Attempt to reserve the needed disk blocks by decrementing + * the number needed from the number available. This will + * fail if the count would go below zero. + */ + if (blocks > 0) { + error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd); + if (error) + return -ENOSPC; + tp->t_blk_res += blocks; + } + + /* + * Attempt to reserve the needed realtime extents by decrementing + * the number needed from the number available. This will + * fail if the count would go below zero. + */ + if (rtextents > 0) { + error = xfs_mod_frextents(mp, -((int64_t)rtextents)); + if (error) { + error = -ENOSPC; + goto out_blocks; + } + tp->t_rtx_res += rtextents; + } + + return 0; +out_blocks: + if (blocks > 0) { + xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd); + tp->t_blk_res -= blocks; + } + return error; +} + /* * This is called to reserve free disk blocks and log space for the * given transaction. This must be done before allocating any resources @@ -1236,6 +1292,45 @@ xfs_trans_alloc_inode( return error; } + +/* Try to reserve more blocks and file quota for a transaction. 
*/ +int +xfs_trans_reserve_more_inode( + struct xfs_trans *tp, + struct xfs_inode *ip, + unsigned int dblocks, + unsigned int rblocks, + bool force_quota) +{ + struct xfs_mount *mp = ip->i_mount; + unsigned int rtx = xfs_extlen_to_rtxlen(mp, rblocks); + int error; + + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); + + error = xfs_trans_reserve_more(tp, dblocks, rtx); + if (error) + return error; + + if (!XFS_IS_QUOTA_ON(mp) || xfs_is_quota_inode(&mp->m_sb, ip->i_ino)) + return 0; + + if (tp->t_flags & XFS_TRANS_RESERVE) + force_quota = true; + + error = xfs_trans_reserve_quota_nblks(tp, ip, dblocks, rblocks, + force_quota); + if (!error) + return 0; + + /* Quota failed, give back the new reservation. */ + xfs_mod_fdblocks(mp, dblocks, tp->t_flags & XFS_TRANS_RESERVE); + tp->t_blk_res -= dblocks; + xfs_mod_frextents(mp, rtx); + tp->t_rtx_res -= rtx; + return error; +} + /* * Allocate an transaction in preparation for inode creation by reserving quota * against the given dquots. Callers are not required to hold any inode locks. 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index 6e3646d524ceb..d32abdd1e0149 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -168,6 +168,8 @@ typedef struct xfs_trans { int xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp, uint blocks, uint rtextents, uint flags, struct xfs_trans **tpp); +int xfs_trans_reserve_more(struct xfs_trans *tp, + unsigned int blocks, unsigned int rtextents); int xfs_trans_alloc_empty(struct xfs_mount *mp, struct xfs_trans **tpp); void xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t); @@ -260,6 +262,8 @@ struct xfs_dquot; int xfs_trans_alloc_inode(struct xfs_inode *ip, struct xfs_trans_res *resv, unsigned int dblocks, unsigned int rblocks, bool force, struct xfs_trans **tpp); +int xfs_trans_reserve_more_inode(struct xfs_trans *tp, struct xfs_inode *ip, + unsigned int dblocks, unsigned int rblocks, bool force_quota); int xfs_trans_alloc_icreate(struct xfs_mount *mp, struct xfs_trans_res *resv, struct xfs_dquot *udqp, struct xfs_dquot *gdqp, struct xfs_dquot *pdqp, unsigned int dblocks, ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 2/5] xfs: repair inode fork block mapping data structures 2023-11-24 23:53 ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong @ 2023-11-30 5:07 ` Christoph Hellwig 2023-12-01 1:38 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:07 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:53:25PM -0800, Darrick J. Wong wrote: > +static int > +xrep_bmap_extent_cmp( > + const void *a, > + const void *b) > +{ > + xfs_fileoff_t ao; > + xfs_fileoff_t bo; > + > + ao = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)a); > + bo = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)b); It would be nice if we could just have local variables for the xfs_bmbt_recs and not need casts. I guess for that xfs_bmbt_disk_get_startoff would need to take a const argument? Probably something for later. > + if (whichfork == XFS_ATTR_FORK) > + return 0; Nit: I'd probably just split the data fork specific validation into a separate helper to keep things nicely organized. > + /* > + * No need to waste time scanning for shared extents if the inode is > + * already marked. > + */ > + if (whichfork == XFS_DATA_FORK && xfs_is_reflink_inode(sc->ip)) > + rb->shared_extents = true; The comment doesn't seem to match the code. > +/* > + * Try to reserve more blocks for a transaction. > + * > + * This is for callers that need to attach resources to a transaction, scan > + * those resources to determine the space reservation requirements, and then > + * modify the attached resources. In other words, online repair. This can > + * fail due to ENOSPC, so the caller must be able to cancel the transaction > + * without shutting down the fs. > + */ > +int > +xfs_trans_reserve_more( > + struct xfs_trans *tp, > + unsigned int blocks, > + unsigned int rtextents) This basically seems to duplicate xfs_trans_reserve except that it skips the log reservation. 
What about just allowing to pass a NULL resp agument to xfs_trans_reserve to skip the log reservation and reuse the code? Otherwise this looks good to me. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/5] xfs: repair inode fork block mapping data structures 2023-11-30 5:07 ` Christoph Hellwig @ 2023-12-01 1:38 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-12-01 1:38 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Wed, Nov 29, 2023 at 09:07:56PM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:53:25PM -0800, Darrick J. Wong wrote: > > +static int > > +xrep_bmap_extent_cmp( > > + const void *a, > > + const void *b) > > +{ > > + xfs_fileoff_t ao; > > + xfs_fileoff_t bo; > > + > > + ao = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)a); > > + bo = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)b); > > It would be nice if we could just have local variables for the > xfs_bmbt_recs and not need casts. I guess for that > xfs_bmbt_disk_get_startoff would need to take a const argument? > > Probably something for later. Oh! Apparently xfs_bmbt_disk_get_startoff does take a const struct pointer now. > > + if (whichfork == XFS_ATTR_FORK) > > + return 0; > > Nit: I'd probably just split the data fork specific validation > into a separate helper to keep things nicely organized. > > > + /* > > + * No need to waste time scanning for shared extents if the inode is > > + * already marked. > > + */ > > + if (whichfork == XFS_DATA_FORK && xfs_is_reflink_inode(sc->ip)) > > + rb->shared_extents = true; > > The comment doesn't seem to match the code. Ooof, that state handling needs tightening up. There are three states, really -- "irrelevant to this repair", "set the iflag", and "no idea, do a scan". That first state is open-coded in _discover_shared, which is wasteful because that is decidable in xrep_bmap. > > > +/* > > + * Try to reserve more blocks for a transaction. > > + * > > + * This is for callers that need to attach resources to a transaction, scan > > + * those resources to determine the space reservation requirements, and then > > + * modify the attached resources. 
In other words, online repair. This can > > + * fail due to ENOSPC, so the caller must be able to cancel the transaction > > + * without shutting down the fs. > > + */ > > +int > > +xfs_trans_reserve_more( > > + struct xfs_trans *tp, > > + unsigned int blocks, > > + unsigned int rtextents) > > This basically seems to duplicate xfs_trans_reserve except that it skips > the log reservation. What about just allowing to pass a NULL resp > argument to xfs_trans_reserve to skip the log reservation and reuse the > code? Hmm. Maybe not a NULL resp, but an empty one looks like it would work fine with less code duplication. > Otherwise this looks good to me. Cool! Thanks for reviewing! --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong 2023-11-24 23:53 ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong 2023-11-24 23:53 ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong @ 2023-11-24 23:53 ` Darrick J. Wong 2023-11-28 14:20 ` Christoph Hellwig 2023-11-24 23:53 ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong 2023-11-24 23:54 ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:53 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> There are a couple of conditions that userspace can set to force repairs of metadata. These really belong in the repair code and not open-coded into the check code, so refactor them into a helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/repair.c | 22 ++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 ++ fs/xfs/scrub/scrub.c | 14 +------------- 3 files changed, 25 insertions(+), 13 deletions(-) diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 4d5bfb2e4cf08..0f8dc25ef998b 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -27,6 +27,8 @@ #include "xfs_quota.h" #include "xfs_qm.h" #include "xfs_defer.h" +#include "xfs_errortag.h" +#include "xfs_error.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -937,3 +939,23 @@ xrep_reset_perag_resv( out: return error; } + +/* Decide if we are going to call the repair function for a scrub type. */ +bool +xrep_will_attempt( + struct xfs_scrub *sc) +{ + /* Userspace asked us to rebuild the structure regardless. 
*/ + if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD) + return true; + + /* Let debug users force us into the repair routines. */ + if (XFS_TEST_ERROR(false, sc->mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR)) + return true; + + /* Metadata is corrupt or failed cross-referencing. */ + if (xchk_needs_repair(sc->sm)) + return true; + + return false; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 9f0c77b38ae28..73ac3eca1a781 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -28,6 +28,7 @@ static inline int xrep_notsupported(struct xfs_scrub *sc) /* Repair helpers */ int xrep_attempt(struct xfs_scrub *sc, struct xchk_stats_run *run); +bool xrep_will_attempt(struct xfs_scrub *sc); void xrep_failure(struct xfs_mount *mp); int xrep_roll_ag_trans(struct xfs_scrub *sc); int xrep_roll_trans(struct xfs_scrub *sc); @@ -117,6 +118,7 @@ int xrep_reinit_pagi(struct xfs_scrub *sc); #else #define xrep_ino_dqattach(sc) (0) +#define xrep_will_attempt(sc) (false) static inline int xrep_attempt( diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 52a09e0652693..8397d1dce25fa 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -14,8 +14,6 @@ #include "xfs_inode.h" #include "xfs_quota.h" #include "xfs_qm.h" -#include "xfs_errortag.h" -#include "xfs_error.h" #include "xfs_scrub.h" #include "xfs_btree.h" #include "xfs_btree_staging.h" @@ -552,21 +550,11 @@ xfs_scrub_metadata( xchk_update_health(sc); if (xchk_could_repair(sc)) { - bool needs_fix = xchk_needs_repair(sc->sm); - - /* Userspace asked us to rebuild the structure regardless. */ - if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD) - needs_fix = true; - - /* Let debug users force us into the repair routines. */ - if (XFS_TEST_ERROR(needs_fix, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR)) - needs_fix = true; - /* * If userspace asked for a repair but it wasn't necessary, * report that back to userspace. 
*/ - if (!needs_fix) { + if (!xrep_will_attempt(sc)) { sc->sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED; goto out_nofix; } ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper 2023-11-24 23:53 ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong @ 2023-11-28 14:20 ` Christoph Hellwig 2023-11-29 5:42 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 14:20 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:53:41PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > There are a couple of conditions that userspace can set to force repairs > of metadata. These really belong in the repair code and not open-coded > into the check code, so refactor them into a helper. Just ramblings from someone who is trying to get into the scrub and repair code: I find this code organization where the check helpers are in foo.c, repair helpers in foo_repair.c and then both are used in scrub.c to fill out ops really annoying to follow. My normal taste would expect a single file that has all the methods, and which then registers the ops vector. But it's probably too late for that now.. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper 2023-11-28 14:20 ` Christoph Hellwig @ 2023-11-29 5:42 ` Darrick J. Wong 2023-11-29 6:03 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-29 5:42 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 06:20:19AM -0800, Christoph Hellwig wrote: > On Fri, Nov 24, 2023 at 03:53:41PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > There are a couple of conditions that userspace can set to force repairs > > of metadata. These really belong in the repair code and not open-coded > > into the check code, so refactor them into a helper. > > Just ramblings from someone who is trying to get into the scrub and > repair code: > > I find this code organization where the check helpers are in foo.c, > repair helpers in foo_repair.c and then both are used in scrub.c > to fill out ops really annoying to follow. My normal taste would > expect a single file that has all the methods, and which then > registers the ops vector. But it's probably too late for that now.. Not really, in theory I could respin the whole series to move FOO_repair.c into FOO.c surrounded by a giant #ifdef CONFIG_XFS_ONLINE_REPAIR block; and change agheader_repair.c. OTOH I thought it was cleaner to elide the repair code via Makefiles instead of preprocessor directives that we could get lost in. Longer term I've struggled with whether or not (for example) the alloc.c/alloc_repair.c declarations should go in alloc.h. That's cleaner IMHO but explodes the number of files for usually not that much gain. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper 2023-11-29 5:42 ` Darrick J. Wong @ 2023-11-29 6:03 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 6:03 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Nov 28, 2023 at 09:42:01PM -0800, Darrick J. Wong wrote: > > I find this code organization where the check helpers are in foo.c, > > repair helpers in foo_repair.c and then both are used in scrub.c > > to fill out ops really annoying to follow. My normal taste would > > expect a single file that has all the methods, and which then > > registers the ops vector. But it's probably too late for that now.. > > Not really, in theory I could respin the whole series to move > FOO_repair.c into FOO.c surrounded by a giant #ifdef > CONFIG_XFS_ONLINE_REPAIR block; and change agheader_repair.c. > > OTOH I thought it was cleaner to elide the repair code via Makefiles > instead of preprocessor directives that we could get lost in. > > Longer term I've struggled with whether or not (for example) the > alloc.c/alloc_repair.c declarations should go in alloc.h. That's > cleaner IMHO but explodes the number of files for usually not that much > gain. Heh, and I wondered if the check/repair code should just live with the code implementing the functionality it is checking/repairing so things can be kept nicely static. I really do not have a good answer here, I just noticed that it requires quite a lot of cycling through files to understand the repair code. ^ permalink raw reply [flat|nested] 156+ messages in thread

* [PATCH 4/5] xfs: create a ranged query function for refcount btrees 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:53 ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong @ 2023-11-24 23:53 ` Darrick J. Wong 2023-11-28 13:59 ` Christoph Hellwig 2023-11-24 23:54 ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:53 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Implement ranged queries for refcount records. The next patch will use this to scan refcount data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_refcount.c | 41 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_refcount.h | 10 ++++++++++ 2 files changed, 51 insertions(+) diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index 5fa1a6f32c17d..b7311122bd489 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -2039,6 +2039,47 @@ xfs_refcount_has_records( return xfs_btree_has_records(cur, &low, &high, NULL, outcome); } +struct xfs_refcount_query_range_info { + xfs_refcount_query_range_fn fn; + void *priv; +}; + +/* Format btree record and pass to our callback. */ +STATIC int +xfs_refcount_query_range_helper( + struct xfs_btree_cur *cur, + const union xfs_btree_rec *rec, + void *priv) +{ + struct xfs_refcount_query_range_info *query = priv; + struct xfs_refcount_irec irec; + xfs_failaddr_t fa; + + xfs_refcount_btrec_to_irec(rec, &irec); + fa = xfs_refcount_check_irec(cur, &irec); + if (fa) + return xfs_refcount_complain_bad_rec(cur, fa, &irec); + + return query->fn(cur, &irec, query->priv); +} + +/* Find all refcount records between two keys. 
*/ +int +xfs_refcount_query_range( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *low_rec, + const struct xfs_refcount_irec *high_rec, + xfs_refcount_query_range_fn fn, + void *priv) +{ + union xfs_btree_irec low_brec = { .rc = *low_rec }; + union xfs_btree_irec high_brec = { .rc = *high_rec }; + struct xfs_refcount_query_range_info query = { .priv = priv, .fn = fn }; + + return xfs_btree_query_range(cur, &low_brec, &high_brec, + xfs_refcount_query_range_helper, &query); +} + int __init xfs_refcount_intent_init_cache(void) { diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index 2d6fecb258bb1..9563eb91be172 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -129,4 +129,14 @@ extern struct kmem_cache *xfs_refcount_intent_cache; int __init xfs_refcount_intent_init_cache(void); void xfs_refcount_intent_destroy_cache(void); +typedef int (*xfs_refcount_query_range_fn)( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *rec, + void *priv); + +int xfs_refcount_query_range(struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *low_rec, + const struct xfs_refcount_irec *high_rec, + xfs_refcount_query_range_fn fn, void *priv); + #endif /* __XFS_REFCOUNT_H__ */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 4/5] xfs: create a ranged query function for refcount btrees 2023-11-24 23:53 ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong @ 2023-11-28 13:59 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 13:59 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/5] xfs: repair problems in CoW forks 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong ` (3 preceding siblings ...) 2023-11-24 23:53 ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong @ 2023-11-24 23:54 ` Darrick J. Wong 2023-11-30 5:10 ` Christoph Hellwig 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:54 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Try to repair errors that we see in file CoW forks so that we don't do stupid things like remap garbage into a file. There's not a lot we can do with the COW fork -- the ondisk metadata records only that the COW staging extents are owned by the refcount btree, which effectively means that we can't reconstruct this incore structure from scratch. Actually, this is even worse -- we can't touch written extents, because those map space that is actively under writeback, and there's not much to do with delalloc reservations. Hence we can only detect crosslinked unwritten extents and fix them by punching out the problematic parts and replacing them with delalloc extents. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> --- fs/xfs/Makefile | 1 fs/xfs/scrub/bitmap.h | 28 ++ fs/xfs/scrub/cow_repair.c | 612 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/reap.c | 32 ++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 84 ++++++ 7 files changed, 760 insertions(+), 1 deletion(-) create mode 100644 fs/xfs/scrub/cow_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index f62351d63b147..71a76f8ac5e47 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -183,6 +183,7 @@ xfs-y += $(addprefix scrub/, \ agheader_repair.o \ alloc_repair.o \ bmap_repair.o \ + cow_repair.o \ ialloc_repair.o \ inode_repair.o \ newbt.o \ diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 1356a76710ede..e470215232ef0 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -149,4 +149,32 @@ static inline int xfsb_bitmap_walk(struct xfsb_bitmap *bitmap, return xbitmap_walk(&bitmap->fsbitmap, fn, priv); } +/* Bitmaps, but for type-checked for xfs_fileoff_t */ + +struct xoff_bitmap { + struct xbitmap offbitmap; +}; + +static inline void xoff_bitmap_init(struct xoff_bitmap *bitmap) +{ + xbitmap_init(&bitmap->offbitmap); +} + +static inline void xoff_bitmap_destroy(struct xoff_bitmap *bitmap) +{ + xbitmap_destroy(&bitmap->offbitmap); +} + +static inline int xoff_bitmap_set(struct xoff_bitmap *bitmap, + xfs_fileoff_t off, xfs_filblks_t len) +{ + return xbitmap_set(&bitmap->offbitmap, off, len); +} + +static inline int xoff_bitmap_walk(struct xoff_bitmap *bitmap, + xbitmap_walk_fn fn, void *priv) +{ + return xbitmap_walk(&bitmap->offbitmap, fn, priv); +} + #endif /* __XFS_SCRUB_BITMAP_H__ */ diff --git a/fs/xfs/scrub/cow_repair.c b/fs/xfs/scrub/cow_repair.c new file mode 100644 index 0000000000000..9decff69f4583 --- /dev/null +++ b/fs/xfs/scrub/cow_repair.c @@ -0,0 +1,612 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_inode.h" +#include "xfs_inode_fork.h" +#include "xfs_alloc.h" +#include "xfs_bmap.h" +#include "xfs_rmap.h" +#include "xfs_refcount.h" +#include "xfs_quota.h" +#include "xfs_ialloc.h" +#include "xfs_ag.h" +#include "xfs_error.h" +#include "xfs_errortag.h" +#include "xfs_icache.h" +#include "xfs_refcount_btree.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/reap.h" + +/* + * CoW Fork Mapping Repair + * ======================= + * + * Although CoW staging extents are owned by incore CoW inode forks, on disk + * they are owned by the refcount btree. The ondisk metadata does not record + * any ownership information, which limits what we can do to repair the + * mappings in the CoW fork. At most, we can replace ifork mappings that lack + * an entry in the refcount btree or are described by a reverse mapping record + * whose owner is not OWN_COW. + * + * Replacing extents is also tricky -- we can't touch written CoW fork extents + * since they are undergoing writeback, and delalloc extents do not require + * repair since they only exist incore. Hence the most we can do is find the + * bad parts of unwritten mappings, allocate a replacement set of blocks, and + * replace the incore mapping. We use the regular reaping process to unmap + * or free the discarded blocks, as appropriate. + */ +struct xrep_cow { + struct xfs_scrub *sc; + + /* Bitmap of file offset ranges that need replacing. */ + struct xoff_bitmap bad_fileoffs; + + /* Bitmap of fsblocks that were removed from the CoW fork. 
*/ + struct xfsb_bitmap old_cowfork_fsblocks; + + /* CoW fork mappings used to scan for bad CoW staging extents. */ + struct xfs_bmbt_irec irec; + + /* refcount btree block number of irec.br_startblock */ + unsigned int irec_startbno; + + /* refcount btree block number of the next refcount record we expect */ + unsigned int next_bno; +}; + +/* CoW staging extent. */ +struct xrep_cow_extent { + xfs_fsblock_t fsbno; + xfs_extlen_t len; +}; + +/* + * Mark the part of the file range that corresponds to the given physical + * space. Caller must ensure that the physical range is within xc->irec. + */ +STATIC int +xrep_cow_mark_file_range( + struct xrep_cow *xc, + xfs_fsblock_t startblock, + xfs_filblks_t blockcount) +{ + xfs_fileoff_t startoff; + + startoff = xc->irec.br_startoff + + (startblock - xc->irec.br_startblock); + + trace_xrep_cow_mark_file_range(xc->sc->ip, startblock, startoff, + blockcount); + + return xoff_bitmap_set(&xc->bad_fileoffs, startoff, blockcount); +} + +/* + * Trim @src to fit within the CoW fork mapping being examined, and put the + * result in @dst. + */ +static inline void +xrep_cow_trim_refcount( + struct xrep_cow *xc, + struct xfs_refcount_irec *dst, + const struct xfs_refcount_irec *src) +{ + unsigned int adj; + + memcpy(dst, src, sizeof(*dst)); + + if (dst->rc_startblock < xc->irec_startbno) { + adj = xc->irec_startbno - dst->rc_startblock; + dst->rc_blockcount -= adj; + dst->rc_startblock += adj; + } + + if (dst->rc_startblock + dst->rc_blockcount > + xc->irec_startbno + xc->irec.br_blockcount) { + adj = (dst->rc_startblock + dst->rc_blockcount) - + (xc->irec_startbno + xc->irec.br_blockcount); + dst->rc_blockcount -= adj; + } +} + +/* Mark any shared CoW staging extents. 
*/ +STATIC int +xrep_cow_mark_shared_staging( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *rec, + void *priv) +{ + struct xrep_cow *xc = priv; + struct xfs_refcount_irec rrec; + xfs_fsblock_t fsbno; + + if (!xfs_refcount_check_domain(rec) || + rec->rc_domain != XFS_REFC_DOMAIN_SHARED) + return -EFSCORRUPTED; + + xrep_cow_trim_refcount(xc, &rrec, rec); + + fsbno = XFS_AGB_TO_FSB(xc->sc->mp, cur->bc_ag.pag->pag_agno, + rrec.rc_startblock); + return xrep_cow_mark_file_range(xc, fsbno, rrec.rc_blockcount); +} + +/* + * Mark any portion of the CoW fork file offset range where there is not a CoW + * staging extent record in the refcountbt, and keep a record of where we did + * find correct refcountbt records. Staging records are always cleaned out at + * mount time, so any two inodes trying to map the same staging area would have + * already taken the fs down due to refcount btree verifier errors. Hence this + * inode should be the sole creator of the staging extent records ondisk. + */ +STATIC int +xrep_cow_mark_missing_staging( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *rec, + void *priv) +{ + struct xrep_cow *xc = priv; + struct xfs_refcount_irec rrec; + int error; + + if (!xfs_refcount_check_domain(rec) || + rec->rc_domain != XFS_REFC_DOMAIN_COW) + return -EFSCORRUPTED; + + xrep_cow_trim_refcount(xc, &rrec, rec); + + if (xc->next_bno >= rrec.rc_startblock) + goto next; + + error = xrep_cow_mark_file_range(xc, + XFS_AGB_TO_FSB(xc->sc->mp, cur->bc_ag.pag->pag_agno, + xc->next_bno), + rrec.rc_startblock - xc->next_bno); + if (error) + return error; + +next: + xc->next_bno = rrec.rc_startblock + rrec.rc_blockcount; + return 0; +} + +/* + * Mark any area that does not correspond to a CoW staging rmap. These are + * cross-linked areas that must be avoided. 
+ */ +STATIC int +xrep_cow_mark_missing_staging_rmap( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + struct xrep_cow *xc = priv; + xfs_fsblock_t fsbno; + xfs_agblock_t rec_bno; + xfs_extlen_t rec_len; + unsigned int adj; + + if (rec->rm_owner == XFS_RMAP_OWN_COW) + return 0; + + rec_bno = rec->rm_startblock; + rec_len = rec->rm_blockcount; + if (rec_bno < xc->irec_startbno) { + adj = xc->irec_startbno - rec_bno; + rec_len -= adj; + rec_bno += adj; + } + + if (rec_bno + rec_len > xc->irec_startbno + xc->irec.br_blockcount) { + adj = (rec_bno + rec_len) - + (xc->irec_startbno + xc->irec.br_blockcount); + rec_len -= adj; + } + + fsbno = XFS_AGB_TO_FSB(xc->sc->mp, cur->bc_ag.pag->pag_agno, rec_bno); + return xrep_cow_mark_file_range(xc, fsbno, rec_len); +} + +/* + * Find any part of the CoW fork mapping that isn't a single-owner CoW staging + * extent and mark the corresponding part of the file range in the bitmap. + */ +STATIC int +xrep_cow_find_bad( + struct xrep_cow *xc) +{ + struct xfs_refcount_irec rc_low = { 0 }; + struct xfs_refcount_irec rc_high = { 0 }; + struct xfs_rmap_irec rm_low = { 0 }; + struct xfs_rmap_irec rm_high = { 0 }; + struct xfs_perag *pag; + struct xfs_scrub *sc = xc->sc; + xfs_agnumber_t agno; + int error; + + agno = XFS_FSB_TO_AGNO(sc->mp, xc->irec.br_startblock); + xc->irec_startbno = XFS_FSB_TO_AGBNO(sc->mp, xc->irec.br_startblock); + + pag = xfs_perag_get(sc->mp, agno); + if (!pag) + return -EFSCORRUPTED; + + error = xrep_ag_init(sc, pag, &sc->sa); + if (error) + goto out_pag; + + /* Mark any CoW fork extents that are shared. 
*/ + rc_low.rc_startblock = xc->irec_startbno; + rc_high.rc_startblock = xc->irec_startbno + xc->irec.br_blockcount - 1; + rc_low.rc_domain = rc_high.rc_domain = XFS_REFC_DOMAIN_SHARED; + error = xfs_refcount_query_range(sc->sa.refc_cur, &rc_low, &rc_high, + xrep_cow_mark_shared_staging, xc); + if (error) + goto out_sa; + + /* Make sure there are CoW staging extents for the whole mapping. */ + rc_low.rc_startblock = xc->irec_startbno; + rc_high.rc_startblock = xc->irec_startbno + xc->irec.br_blockcount - 1; + rc_low.rc_domain = rc_high.rc_domain = XFS_REFC_DOMAIN_COW; + xc->next_bno = xc->irec_startbno; + error = xfs_refcount_query_range(sc->sa.refc_cur, &rc_low, &rc_high, + xrep_cow_mark_missing_staging, xc); + if (error) + goto out_sa; + + if (xc->next_bno < xc->irec_startbno + xc->irec.br_blockcount) { + error = xrep_cow_mark_file_range(xc, + XFS_AGB_TO_FSB(sc->mp, pag->pag_agno, + xc->next_bno), + xc->irec_startbno + xc->irec.br_blockcount - + xc->next_bno); + if (error) + goto out_sa; + } + + /* Mark any area that has an rmap that isn't a COW staging extent. */ + rm_low.rm_startblock = xc->irec_startbno; + memset(&rm_high, 0xFF, sizeof(rm_high)); + rm_high.rm_startblock = xc->irec_startbno + xc->irec.br_blockcount - 1; + error = xfs_rmap_query_range(sc->sa.rmap_cur, &rm_low, &rm_high, + xrep_cow_mark_missing_staging_rmap, xc); + if (error) + goto out_sa; + + /* + * If userspace is forcing us to rebuild the CoW fork or someone turned + * on the debugging knob, replace everything in the CoW fork. + */ + if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD) || + XFS_TEST_ERROR(false, sc->mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR)) { + error = xrep_cow_mark_file_range(xc, xc->irec.br_startblock, + xc->irec.br_blockcount); + if (error) + return error; + } + +out_sa: + xchk_ag_free(sc, &sc->sa); +out_pag: + xfs_perag_put(pag); + return error; +} + +/* + * Allocate a replacement CoW staging extent of up to the given number of + * blocks, and fill out the mapping. 
+ */ +STATIC int +xrep_cow_alloc( + struct xfs_scrub *sc, + xfs_extlen_t maxlen, + struct xrep_cow_extent *repl) +{ + struct xfs_alloc_arg args = { + .tp = sc->tp, + .mp = sc->mp, + .oinfo = XFS_RMAP_OINFO_SKIP_UPDATE, + .minlen = 1, + .maxlen = maxlen, + .prod = 1, + .resv = XFS_AG_RESV_NONE, + .datatype = XFS_ALLOC_USERDATA, + }; + int error; + + error = xfs_trans_reserve_more(sc->tp, maxlen, 0); + if (error) + return error; + + error = xfs_alloc_vextent_start_ag(&args, + XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino)); + if (error) + return error; + if (args.fsbno == NULLFSBLOCK) + return -ENOSPC; + + xfs_refcount_alloc_cow_extent(sc->tp, args.fsbno, args.len); + + repl->fsbno = args.fsbno; + repl->len = args.len; + return 0; +} + +/* + * Look up the current CoW fork mapping so that we only allocate enough to + * replace a single mapping. If we don't find a mapping that covers the start + * of the file range, or we find a delalloc or written extent, something is + * seriously wrong, since we didn't drop the ILOCK. + */ +static inline int +xrep_cow_find_mapping( + struct xrep_cow *xc, + struct xfs_iext_cursor *icur, + xfs_fileoff_t startoff, + struct xfs_bmbt_irec *got) +{ + struct xfs_inode *ip = xc->sc->ip; + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_COW_FORK); + + if (!xfs_iext_lookup_extent(ip, ifp, startoff, icur, got)) + goto bad; + + if (got->br_startoff > startoff) + goto bad; + + if (got->br_blockcount == 0) + goto bad; + + if (isnullstartblock(got->br_startblock)) + goto bad; + + if (xfs_bmap_is_written_extent(got)) + goto bad; + + return 0; +bad: + ASSERT(0); + return -EFSCORRUPTED; +} + +#define REPLACE_LEFT_SIDE (1U << 0) +#define REPLACE_RIGHT_SIDE (1U << 1) + +/* + * Given a CoW fork mapping @got and a replacement mapping @repl, remap the + * beginning of @got with the space described by @rep. 
+ */ +static inline void +xrep_cow_replace_mapping( + struct xfs_inode *ip, + struct xfs_iext_cursor *icur, + const struct xfs_bmbt_irec *got, + const struct xrep_cow_extent *repl) +{ + struct xfs_bmbt_irec new = *got; /* struct copy */ + + ASSERT(repl->len > 0); + ASSERT(!isnullstartblock(got->br_startblock)); + + trace_xrep_cow_replace_mapping(ip, got, repl->fsbno, repl->len); + + if (got->br_blockcount == repl->len) { + /* + * The new extent is a complete replacement for the existing + * extent. Update the COW fork record. + */ + new.br_startblock = repl->fsbno; + xfs_iext_update_extent(ip, BMAP_COWFORK, icur, &new); + return; + } + + /* + * The new extent can replace the beginning of the COW fork record. + * Move the left side of @got upwards, then insert the new record. + */ + new.br_startoff += repl->len; + new.br_startblock += repl->len; + new.br_blockcount -= repl->len; + xfs_iext_update_extent(ip, BMAP_COWFORK, icur, &new); + + new.br_startoff = got->br_startoff; + new.br_startblock = repl->fsbno; + new.br_blockcount = repl->len; + xfs_iext_insert(ip, icur, &new, BMAP_COWFORK); +} + +/* + * Replace the unwritten CoW staging extent backing the given file range with a + * new space extent that isn't as problematic. + */ +STATIC int +xrep_cow_replace_range( + struct xrep_cow *xc, + xfs_fileoff_t startoff, + xfs_extlen_t *blockcount) +{ + struct xfs_iext_cursor icur; + struct xrep_cow_extent repl; + struct xfs_bmbt_irec got; + struct xfs_scrub *sc = xc->sc; + xfs_fileoff_t nextoff; + xfs_extlen_t alloc_len; + int error; + + /* + * Put the existing CoW fork mapping in @got. If @got ends before + * @rep, truncate @rep so we only replace one extent mapping at a time. + */ + error = xrep_cow_find_mapping(xc, &icur, startoff, &got); + if (error) + return error; + nextoff = min(startoff + *blockcount, + got.br_startoff + got.br_blockcount); + + /* + * Allocate a replacement extent. 
If we don't fill all the blocks, + * shorten the quantity that will be deleted in this step. + */ + alloc_len = min_t(xfs_fileoff_t, XFS_MAX_BMBT_EXTLEN, + nextoff - startoff); + error = xrep_cow_alloc(sc, alloc_len, &repl); + if (error) + return error; + + /* + * Replace the old mapping with the new one, and commit the metadata + * changes made so far. + */ + xrep_cow_replace_mapping(sc->ip, &icur, &got, &repl); + + xfs_inode_set_cowblocks_tag(sc->ip); + error = xfs_defer_finish(&sc->tp); + if (error) + return error; + + /* Note the old CoW staging extents; we'll reap them all later. */ + error = xfsb_bitmap_set(&xc->old_cowfork_fsblocks, got.br_startblock, + repl.len); + if (error) + return error; + + *blockcount = repl.len; + return 0; +} + +/* + * Replace a bad part of an unwritten CoW staging extent with a fresh delalloc + * reservation. + */ +STATIC int +xrep_cow_replace( + uint64_t startoff, + uint64_t blockcount, + void *priv) +{ + struct xrep_cow *xc = priv; + int error = 0; + + while (blockcount > 0) { + xfs_extlen_t len = min_t(xfs_filblks_t, blockcount, + XFS_MAX_BMBT_EXTLEN); + + error = xrep_cow_replace_range(xc, startoff, &len); + if (error) + break; + + blockcount -= len; + startoff += len; + } + + return error; +} + +/* + * Repair an inode's CoW fork. The CoW fork is an in-core structure, so + * there's no btree to rebuild. Instead, we replace any mappings that are + * cross-linked or lack ondisk CoW fork records in the refcount btree. + */ +int +xrep_bmap_cow( + struct xfs_scrub *sc) +{ + struct xrep_cow *xc; + struct xfs_iext_cursor icur; + struct xfs_ifork *ifp = xfs_ifork_ptr(sc->ip, XFS_COW_FORK); + int error; + + if (!xfs_has_rmapbt(sc->mp) || !xfs_has_reflink(sc->mp)) + return -EOPNOTSUPP; + + if (!ifp) + return 0; + + /* realtime files aren't supported yet */ + if (XFS_IS_REALTIME_INODE(sc->ip)) + return -EOPNOTSUPP; + + /* + * If we're somehow not in extents format, then reinitialize it to + * an empty extent mapping fork and exit. 
+ */ + if (ifp->if_format != XFS_DINODE_FMT_EXTENTS) { + ifp->if_format = XFS_DINODE_FMT_EXTENTS; + ifp->if_nextents = 0; + return 0; + } + + xc = kzalloc(sizeof(struct xrep_cow), XCHK_GFP_FLAGS); + if (!xc) + return -ENOMEM; + + xfs_trans_ijoin(sc->tp, sc->ip, 0); + + xc->sc = sc; + xoff_bitmap_init(&xc->bad_fileoffs); + xfsb_bitmap_init(&xc->old_cowfork_fsblocks); + + for_each_xfs_iext(ifp, &icur, &xc->irec) { + if (xchk_should_terminate(sc, &error)) + goto out_bitmap; + + /* + * delalloc reservations only exist incore, so there is no + * ondisk metadata that we can examine. Hence we leave them + * alone. + */ + if (isnullstartblock(xc->irec.br_startblock)) + continue; + + /* + * COW fork extents are only in the written state if writeback + * is actively writing to disk. We cannot restart the write + * at a different disk address since we've already issued the + * IO, so we leave these alone and hope for the best. + */ + if (xfs_bmap_is_written_extent(&xc->irec)) + continue; + + error = xrep_cow_find_bad(xc); + if (error) + goto out_bitmap; + } + + /* Replace any bad unwritten mappings with fresh reservations. */ + error = xoff_bitmap_walk(&xc->bad_fileoffs, xrep_cow_replace, xc); + if (error) + goto out_bitmap; + + /* + * Reap as many of the old CoW blocks as we can. They are owned ondisk + * by the refcount btree, not the inode, so it is correct to treat them + * like inode metadata. 
+ */ + error = xrep_reap_fsblocks(sc, &xc->old_cowfork_fsblocks, + &XFS_RMAP_OINFO_COW); + if (error) + goto out_bitmap; + +out_bitmap: + xfsb_bitmap_destroy(&xc->old_cowfork_fsblocks); + xoff_bitmap_destroy(&xc->bad_fileoffs); + kmem_free(xc); + return error; +} diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c index 35794df952bbe..1305e82e3df13 100644 --- a/fs/xfs/scrub/reap.c +++ b/fs/xfs/scrub/reap.c @@ -20,6 +20,7 @@ #include "xfs_ialloc_btree.h" #include "xfs_rmap.h" #include "xfs_rmap_btree.h" +#include "xfs_refcount.h" #include "xfs_refcount_btree.h" #include "xfs_extent_busy.h" #include "xfs_ag.h" @@ -378,6 +379,17 @@ xreap_agextent_iter( trace_xreap_dispose_unmap_extent(sc->sa.pag, agbno, *aglenp); rs->force_roll = true; + + if (rs->oinfo == &XFS_RMAP_OINFO_COW) { + /* + * If we're unmapping CoW staging extents, remove the + * records from the refcountbt, which will remove the + * rmap record as well. + */ + xfs_refcount_free_cow_extent(sc->tp, fsbno, *aglenp); + return 0; + } + return xfs_rmap_free(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno, *aglenp, rs->oinfo); } @@ -396,6 +408,26 @@ xreap_agextent_iter( return 0; } + /* + * If we're getting rid of CoW staging extents, use deferred work items + * to remove the refcountbt records (which removes the rmap records) + * and free the extent. We're not worried about the system going down + * here because log recovery walks the refcount btree to clean out the + * CoW staging extents. + */ + if (rs->oinfo == &XFS_RMAP_OINFO_COW) { + ASSERT(rs->resv == XFS_AG_RESV_NONE); + + xfs_refcount_free_cow_extent(sc->tp, fsbno, *aglenp); + error = xfs_free_extent_later(sc->tp, fsbno, *aglenp, NULL, + rs->resv, true); + if (error) + return error; + + rs->force_roll = true; + return 0; + } + /* Put blocks back on the AGFL one at a time. 
*/ if (rs->resv == XFS_AG_RESV_AGFL) { ASSERT(*aglenp == 1); diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 73ac3eca1a781..be3585b8f4364 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -111,6 +111,7 @@ int xrep_refcountbt(struct xfs_scrub *sc); int xrep_inode(struct xfs_scrub *sc); int xrep_bmap_data(struct xfs_scrub *sc); int xrep_bmap_attr(struct xfs_scrub *sc); +int xrep_bmap_cow(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -173,6 +174,7 @@ xrep_setup_nothing( #define xrep_inode xrep_notsupported #define xrep_bmap_data xrep_notsupported #define xrep_bmap_attr xrep_notsupported +#define xrep_bmap_cow xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 8397d1dce25fa..bc70a91f8b1bf 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -300,7 +300,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_INODE, .setup = xchk_setup_inode_bmap, .scrub = xchk_bmap_cow, - .repair = xrep_notsupported, + .repair = xrep_bmap_cow, }, [XFS_SCRUB_TYPE_DIR] = { /* directory */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 3d55f65c00835..8b4d3e5f60616 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1599,6 +1599,90 @@ TRACE_EVENT(xrep_dinode_count_rmaps, __entry->block0) ); +TRACE_EVENT(xrep_cow_mark_file_range, + TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t startblock, + xfs_fileoff_t startoff, xfs_filblks_t blockcount), + TP_ARGS(ip, startblock, startoff, blockcount), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_fsblock_t, startblock) + __field(xfs_fileoff_t, startoff) + __field(xfs_filblks_t, blockcount) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->ino = ip->i_ino; + __entry->startoff = startoff; + __entry->startblock = startblock; + __entry->blockcount = 
blockcount; + ), + TP_printk("dev %d:%d ino 0x%llx fileoff 0x%llx startblock 0x%llx fsbcount 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->startoff, + __entry->startblock, + __entry->blockcount) +); + +TRACE_EVENT(xrep_cow_replace_mapping, + TP_PROTO(struct xfs_inode *ip, const struct xfs_bmbt_irec *irec, + xfs_fsblock_t new_startblock, xfs_extlen_t new_blockcount), + TP_ARGS(ip, irec, new_startblock, new_blockcount), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_fsblock_t, startblock) + __field(xfs_fileoff_t, startoff) + __field(xfs_filblks_t, blockcount) + __field(xfs_exntst_t, state) + __field(xfs_fsblock_t, new_startblock) + __field(xfs_extlen_t, new_blockcount) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->ino = ip->i_ino; + __entry->startoff = irec->br_startoff; + __entry->startblock = irec->br_startblock; + __entry->blockcount = irec->br_blockcount; + __entry->state = irec->br_state; + __entry->new_startblock = new_startblock; + __entry->new_blockcount = new_blockcount; + ), + TP_printk("dev %d:%d ino 0x%llx startoff 0x%llx startblock 0x%llx fsbcount 0x%llx state 0x%x new_startblock 0x%llx new_fsbcount 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->startoff, + __entry->startblock, + __entry->blockcount, + __entry->state, + __entry->new_startblock, + __entry->new_blockcount) +); + +TRACE_EVENT(xrep_cow_free_staging, + TP_PROTO(struct xfs_perag *pag, xfs_agblock_t agbno, + xfs_extlen_t blockcount), + TP_ARGS(pag, agbno, blockcount), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agblock_t, agbno) + __field(xfs_extlen_t, blockcount) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->agbno = agbno; + __entry->blockcount = blockcount; + ), + TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x", + MAJOR(__entry->dev), 
MINOR(__entry->dev), + __entry->agno, + __entry->agbno, + __entry->blockcount) +); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair problems in CoW forks 2023-11-24 23:54 ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong @ 2023-11-30 5:10 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:10 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file 2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong ` (5 preceding siblings ...) 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong @ 2023-11-24 23:46 ` Darrick J. Wong 2023-11-24 23:54 ` [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly Darrick J. Wong ` (5 more replies) 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong 7 siblings, 6 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:46 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, Add in the necessary infrastructure to check the inode and data forks of metadata files, then apply that to the realtime bitmap file. We won't be able to reconstruct the contents of the rtbitmap file until rmapbt is added for realtime volumes, but we can at least get the basics started. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtbitmap xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-rtbitmap --- fs/xfs/Makefile | 4 + fs/xfs/libxfs/xfs_bmap.c | 39 ++++++++ fs/xfs/libxfs/xfs_bmap.h | 2 fs/xfs/scrub/bmap_repair.c | 17 +++ fs/xfs/scrub/repair.c | 151 ++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 9 ++ fs/xfs/scrub/rtbitmap.c | 101 +++++++++++++++++--- fs/xfs/scrub/rtbitmap.h | 22 ++++ fs/xfs/scrub/rtbitmap_repair.c | 202 ++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/rtsummary.c | 132 +++++++++++++++++++++----- fs/xfs/scrub/scrub.c | 4 - fs/xfs/xfs_inode.c | 24 +---- 12 files changed, 637 insertions(+), 70 deletions(-) create mode 100644 fs/xfs/scrub/rtbitmap.h create mode 100644 fs/xfs/scrub/rtbitmap_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong @ 2023-11-24 23:54 ` Darrick J. Wong 2023-11-28 14:04 ` Christoph Hellwig 2023-11-24 23:54 ` [PATCH 2/6] xfs: check rt summary " Darrick J. Wong ` (4 subsequent siblings) 5 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:54 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> I forgot that the superblock tracks the number of blocks that are in the realtime bitmap, and that the rt bitmap file can have more blocks mapped to the data fork than sb_rbmblocks if growfsrt fails. So. Add to the rtbitmap scrubber an explicit check that sb_rextents and sb_rbmblocks are correct, then adjust the rtbitmap i_size checks to allow for the growfsrt failure case. Finally, flag post-eof blocks in the rtbitmap. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/rtbitmap.c | 97 ++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 82 insertions(+), 15 deletions(-) diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c index d509a08d3fc3e..3b5b62fbf4e0a 100644 --- a/fs/xfs/scrub/rtbitmap.c +++ b/fs/xfs/scrub/rtbitmap.c @@ -14,16 +14,30 @@ #include "xfs_rtbitmap.h" #include "xfs_inode.h" #include "xfs_bmap.h" +#include "xfs_bit.h" #include "scrub/scrub.h" #include "scrub/common.h" +struct xchk_rtbitmap { + uint64_t rextents; + uint64_t rbmblocks; + unsigned int rextslog; +}; + /* Set us up with the realtime metadata locked. 
*/ int xchk_setup_rtbitmap( struct xfs_scrub *sc) { + struct xfs_mount *mp = sc->mp; + struct xchk_rtbitmap *rtb; int error; + rtb = kzalloc(sizeof(struct xchk_rtbitmap), XCHK_GFP_FLAGS); + if (!rtb) + return -ENOMEM; + sc->buf = rtb; + error = xchk_trans_alloc(sc, 0); if (error) return error; @@ -37,6 +51,15 @@ xchk_setup_rtbitmap( return error; xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP); + + /* + * Now that we've locked the rtbitmap, we can't race with growfsrt + * trying to expand the bitmap or change the size of the rt volume. + * Hence it is safe to compute and check the geometry values. + */ + rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); + rtb->rextslog = rtb->rextents ? xfs_highbit32(rtb->rextents) : 0; + rtb->rbmblocks = xfs_rtbitmap_blockcount(mp, rtb->rextents); return 0; } @@ -67,21 +90,30 @@ STATIC int xchk_rtbitmap_check_extents( struct xfs_scrub *sc) { - struct xfs_mount *mp = sc->mp; struct xfs_bmbt_irec map; - xfs_rtblock_t off; - int nmap; + struct xfs_iext_cursor icur; + struct xfs_mount *mp = sc->mp; + struct xfs_inode *ip = sc->ip; + xfs_fileoff_t off = 0; + xfs_fileoff_t endoff; int error = 0; - for (off = 0; off < mp->m_sb.sb_rbmblocks;) { + /* Mappings may not cross or lie beyond EOF. */ + endoff = XFS_B_TO_FSB(mp, ip->i_disk_size); + if (xfs_iext_lookup_extent(ip, &ip->i_df, endoff, &icur, &map)) { + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, endoff); + return 0; + } + + while (off < endoff) { + int nmap = 1; + if (xchk_should_terminate(sc, &error) || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) break; /* Make sure we have a written extent. 
*/ - nmap = 1; - error = xfs_bmapi_read(mp->m_rbmip, off, - mp->m_sb.sb_rbmblocks - off, &map, &nmap, + error = xfs_bmapi_read(ip, off, endoff - off, &map, &nmap, XFS_DATA_FORK); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, off, &error)) break; @@ -102,12 +134,48 @@ int xchk_rtbitmap( struct xfs_scrub *sc) { + struct xfs_mount *mp = sc->mp; + struct xchk_rtbitmap *rtb = sc->buf; int error; - /* Is the size of the rtbitmap correct? */ - if (sc->mp->m_rbmip->i_disk_size != - XFS_FSB_TO_B(sc->mp, sc->mp->m_sb.sb_rbmblocks)) { - xchk_ino_set_corrupt(sc, sc->mp->m_rbmip->i_ino); + /* Is sb_rextents correct? */ + if (mp->m_sb.sb_rextents != rtb->rextents) { + xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino); + return 0; + } + + /* Is sb_rextslog correct? */ + if (mp->m_sb.sb_rextslog != rtb->rextslog) { + xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino); + return 0; + } + + /* + * Is sb_rbmblocks large enough to handle the current rt volume? In no + * case can we exceed 4bn bitmap blocks since the super field is a u32. + */ + if (rtb->rbmblocks > U32_MAX) { + xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino); + return 0; + } + if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks) { + xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino); + return 0; + } + + /* The bitmap file length must be aligned to an fsblock. */ + if (mp->m_rbmip->i_disk_size & mp->m_blockmask) { + xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino); + return 0; + } + + /* + * Is the bitmap file itself large enough to handle the rt volume? + * growfsrt expands the bitmap file before updating sb_rextents, so the + * file can be larger than sb_rbmblocks. 
+ */ + if (mp->m_rbmip->i_disk_size < XFS_FSB_TO_B(mp, rtb->rbmblocks)) { + xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino); return 0; } @@ -120,12 +188,11 @@ xchk_rtbitmap( if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) return error; - error = xfs_rtalloc_query_all(sc->mp, sc->tp, xchk_rtbitmap_rec, sc); + error = xfs_rtalloc_query_all(mp, sc->tp, xchk_rtbitmap_rec, sc); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out; + return error; -out: - return error; + return 0; } /* xref check that the extent is not free in the rtbitmap */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly 2023-11-24 23:54 ` [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly Darrick J. Wong @ 2023-11-28 14:04 ` Christoph Hellwig 2023-11-28 23:27 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 14:04 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs > + /* > + * Now that we've locked the rtbitmap, we can't race with growfsrt > + * trying to expand the bitmap or change the size of the rt volume. > + * Hence it is safe to compute and check the geometry values. > + */ > + rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > + rtb->rextslog = rtb->rextents ? xfs_highbit32(rtb->rextents) : 0; > + rtb->rbmblocks = xfs_rtbitmap_blockcount(mp, rtb->rextents); All these will be 0 if mp->m_sb.sb_rblocks, and rtb is zeroed allocation right above, so calculating the values seems a bit odd. Why not simply: if (mp->m_sb.sb_rblocks) { rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); rtb->rextslog = xfs_highbit32(rtb->rextents); rtb->rbmblocks = xfs_rtbitmap_blockcount(mp, rtb->rextents); } ? Otherwise looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly 2023-11-28 14:04 ` Christoph Hellwig @ 2023-11-28 23:27 ` Darrick J. Wong 2023-11-29 6:05 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 23:27 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 06:04:26AM -0800, Christoph Hellwig wrote: > > + /* > > + * Now that we've locked the rtbitmap, we can't race with growfsrt > > + * trying to expand the bitmap or change the size of the rt volume. > > + * Hence it is safe to compute and check the geometry values. > > + */ > > + rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > > + rtb->rextslog = rtb->rextents ? xfs_highbit32(rtb->rextents) : 0; > > + rtb->rbmblocks = xfs_rtbitmap_blockcount(mp, rtb->rextents); > > All these will be 0 if mp->m_sb.sb_rblocks, and rtb is zeroed allocation > right above, so calculating the values seems a bit odd. Why not simply: > > if (mp->m_sb.sb_rblocks) { > rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > rtb->rextslog = xfs_highbit32(rtb->rextents); Well... xfs_highbit32 returns -1 if its argument is zero, which is possible for the nasty edge case of (say) a 64k block device and a realtime extent size of 1MB, which results in rblocks > 0 and rextents == 0. So I'll still have to do: if (rtb->rextents) rtb->rextslog = xfs_highbit32() but otherwise this is fine. > rtb->rbmblocks = xfs_rtbitmap_blockcount(mp, rtb->rextents); > } > > ? > > Otherwise looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> Thank you! --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly 2023-11-28 23:27 ` Darrick J. Wong @ 2023-11-29 6:05 ` Christoph Hellwig 2023-11-29 6:20 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 6:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Nov 28, 2023 at 03:27:40PM -0800, Darrick J. Wong wrote: > > All these will be 0 if mp->m_sb.sb_rblocks, and rtb is zeroed allocation > > right above, so calculating the values seems a bit odd. Why not simply: > > > > if (mp->m_sb.sb_rblocks) { > > rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > > rtb->rextslog = xfs_highbit32(rtb->rextents); > > Well... xfs_highbit32 returns -1 if its argument is zero, which is > possible for the nasty edge case of (say) a 64k block device and a > realtime extent size of 1MB, which results in rblocks > 0 and > rextents == 0. Eww. How do we even allow creating and mounting that? Such a configuration doesn't make any sense. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly 2023-11-29 6:05 ` Christoph Hellwig @ 2023-11-29 6:20 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-29 6:20 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 10:05:06PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 03:27:40PM -0800, Darrick J. Wong wrote: > > > All these will be 0 if mp->m_sb.sb_rblocks, and rtb is zeroed allocation > > > right above, so calculating the values seems a bit odd. Why not simply: > > > > > > if (mp->m_sb.sb_rblocks) { > > > rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > > > rtb->rextslog = xfs_highbit32(rtb->rextents); > > > > Well... xfs_highbit32 returns -1 if its argument is zero, which is > > possible for the nasty edge case of (say) a 64k block device and a > > realtime extent size of 1MB, which results in rblocks > 0 and > > rextents == 0. > > Eww. How do we even allow creating a mounting that? Such a > configuration doesn't make any sense. $ truncate -s 64k /tmp/realtime $ truncate -s 1g /tmp/data $ mkfs.xfs -f /tmp/data -r rtdev=/tmp/realtime,extsize=1m Pre 4.19 mkfs.xfs would actually write out the fs and pre-4.19 kernels would mount it (and ENOSPC). Since then, due to buggy sb validation code on my part now it just fails verifiers and crashes/doesn't mount. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong 2023-11-24 23:54 ` [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly Darrick J. Wong @ 2023-11-24 23:54 ` Darrick J. Wong 2023-11-28 14:05 ` Christoph Hellwig 2023-11-24 23:54 ` [PATCH 3/6] xfs: always check the rtbitmap and rtsummary files Darrick J. Wong ` (3 subsequent siblings) 5 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:54 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> I forgot that the xfs_mount tracks the size and number of levels in the realtime summary file, and that the rt summary file can have more blocks mapped to the data fork than m_rsumsize implies if growfsrt fails. So. Add to the rtsummary scrubber an explicit check that all the summary geometry values are correct, then adjust the rtsummary i_size checks to allow for the growfsrt failure case. Finally, flag post-eof blocks in the summary file. While we're at it, split the extent map checking so that we only call xfs_bmapi_read once per extent instead of once per rtsummary block. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/rtsummary.c | 132 +++++++++++++++++++++++++++++++++++++--------- 1 file changed, 105 insertions(+), 27 deletions(-) diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c index f94800a029f35..41f64158c8626 100644 --- a/fs/xfs/scrub/rtsummary.c +++ b/fs/xfs/scrub/rtsummary.c @@ -31,6 +31,18 @@ * (potentially large) amount of data in pageable memory. */ +struct xchk_rtsummary { + struct xfs_rtalloc_args args; + + uint64_t rextents; + uint64_t rbmblocks; + uint64_t rsumsize; + unsigned int rsumlevels; + + /* Memory buffer for the summary comparison. */ + union xfs_suminfo_raw words[]; +}; + /* Set us up to check the rtsummary file. 
*/ int xchk_setup_rtsummary( @@ -38,8 +50,16 @@ xchk_setup_rtsummary( { struct xfs_mount *mp = sc->mp; char *descr; + struct xchk_rtsummary *rts; + xfs_filblks_t rsumblocks; int error; + rts = kvzalloc(struct_size(rts, words, mp->m_blockwsize), + XCHK_GFP_FLAGS); + if (!rts) + return -ENOMEM; + sc->buf = rts; + /* * Create an xfile to construct a new rtsummary file. The xfile allows * us to avoid pinning kernel memory for this purpose. @@ -54,11 +74,6 @@ xchk_setup_rtsummary( if (error) return error; - /* Allocate a memory buffer for the summary comparison. */ - sc->buf = kvmalloc(mp->m_sb.sb_blocksize, XCHK_GFP_FLAGS); - if (!sc->buf) - return -ENOMEM; - error = xchk_install_live_inode(sc, mp->m_rsumip); if (error) return error; @@ -75,13 +90,23 @@ xchk_setup_rtsummary( */ xfs_ilock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP); xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM); + + /* + * Now that we've locked the rtbitmap and rtsummary, we can't race with + * growfsrt trying to expand the summary or change the size of the rt + * volume. Hence it is safe to compute and check the geometry values. + */ + rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); + rts->rbmblocks = xfs_rtbitmap_blockcount(mp, rts->rextents); + rts->rsumlevels = rts->rextents ? xfs_highbit32(rts->rextents) + 1 : 0; + rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels, + rts->rbmblocks); + rts->rsumsize = XFS_FSB_TO_B(mp, rsumblocks); return 0; } /* Helper functions to record suminfo words in an xfile. 
*/ -typedef unsigned int xchk_rtsumoff_t; - static inline int xfsum_load( struct xfs_scrub *sc, @@ -192,19 +217,29 @@ STATIC int xchk_rtsum_compare( struct xfs_scrub *sc) { - struct xfs_rtalloc_args args = { - .mp = sc->mp, - .tp = sc->tp, - }; - struct xfs_mount *mp = sc->mp; struct xfs_bmbt_irec map; - xfs_fileoff_t off; - xchk_rtsumoff_t sumoff = 0; - int nmap; + struct xfs_iext_cursor icur; - for (off = 0; off < XFS_B_TO_FSB(mp, mp->m_rsumsize); off++) { - union xfs_suminfo_raw *ondisk_info; - int error = 0; + struct xfs_mount *mp = sc->mp; + struct xfs_inode *ip = sc->ip; + struct xchk_rtsummary *rts = sc->buf; + xfs_fileoff_t off = 0; + xfs_fileoff_t endoff; + xfs_rtsumoff_t sumoff = 0; + int error = 0; + + rts->args.mp = sc->mp; + rts->args.tp = sc->tp; + + /* Mappings may not cross or lie beyond EOF. */ + endoff = XFS_B_TO_FSB(mp, ip->i_disk_size); + if (xfs_iext_lookup_extent(ip, &ip->i_df, endoff, &icur, &map)) { + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, endoff); + return 0; + } + + while (off < endoff) { + int nmap = 1; if (xchk_should_terminate(sc, &error)) return error; @@ -212,8 +247,7 @@ xchk_rtsum_compare( return 0; /* Make sure we have a written extent. */ - nmap = 1; - error = xfs_bmapi_read(mp->m_rsumip, off, 1, &map, &nmap, + error = xfs_bmapi_read(ip, off, endoff - off, &map, &nmap, XFS_DATA_FORK); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, off, &error)) return error; @@ -223,24 +257,33 @@ xchk_rtsum_compare( return 0; } + off += map.br_blockcount; + } + + for (off = 0; off < endoff; off++) { + union xfs_suminfo_raw *ondisk_info; + /* Read a block's worth of ondisk rtsummary file. */ - error = xfs_rtsummary_read_buf(&args, off); + error = xfs_rtsummary_read_buf(&rts->args, off); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, off, &error)) return error; /* Read a block's worth of computed rtsummary file. 
*/ - error = xfsum_copyout(sc, sumoff, sc->buf, mp->m_blockwsize); + error = xfsum_copyout(sc, sumoff, rts->words, mp->m_blockwsize); if (error) { - xfs_rtbuf_cache_relse(&args); + xfs_rtbuf_cache_relse(&rts->args); return error; } - ondisk_info = xfs_rsumblock_infoptr(&args, 0); - if (memcmp(ondisk_info, sc->buf, - mp->m_blockwsize << XFS_WORDLOG) != 0) + ondisk_info = xfs_rsumblock_infoptr(&rts->args, 0); + if (memcmp(ondisk_info, rts->words, + mp->m_blockwsize << XFS_WORDLOG) != 0) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, off); + xfs_rtbuf_cache_relse(&rts->args); + return error; + } - xfs_rtbuf_cache_relse(&args); + xfs_rtbuf_cache_relse(&rts->args); sumoff += mp->m_blockwsize; } @@ -253,8 +296,43 @@ xchk_rtsummary( struct xfs_scrub *sc) { struct xfs_mount *mp = sc->mp; + struct xchk_rtsummary *rts = sc->buf; int error = 0; + /* Is sb_rextents correct? */ + if (mp->m_sb.sb_rextents != rts->rextents) { + xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino); + goto out_rbm; + } + + /* Is m_rsumlevels correct? */ + if (mp->m_rsumlevels != rts->rsumlevels) { + xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino); + goto out_rbm; + } + + /* Is m_rsumsize correct? */ + if (mp->m_rsumsize != rts->rsumsize) { + xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino); + goto out_rbm; + } + + /* The summary file length must be aligned to an fsblock. */ + if (mp->m_rsumip->i_disk_size & mp->m_blockmask) { + xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino); + goto out_rbm; + } + + /* + * Is the summary file itself large enough to handle the rt volume? + * growfsrt expands the summary file before updating sb_rextents, so + * the file can be larger than rsumsize. + */ + if (mp->m_rsumip->i_disk_size < rts->rsumsize) { + xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino); + goto out_rbm; + } + /* Invoke the fork scrubber. */ error = xchk_metadata_inode_forks(sc); if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-24 23:54 ` [PATCH 2/6] xfs: check rt summary " Darrick J. Wong @ 2023-11-28 14:05 ` Christoph Hellwig 2023-11-28 23:30 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 14:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs > + /* > + * Now that we've locked the rtbitmap and rtsummary, we can't race with > + * growfsrt trying to expand the summary or change the size of the rt > + * volume. Hence it is safe to compute and check the geometry values. > + */ > + rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > + rts->rbmblocks = xfs_rtbitmap_blockcount(mp, rts->rextents); > + rts->rsumlevels = rts->rextents ? xfs_highbit32(rts->rextents) + 1 : 0; > + rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels, > + rts->rbmblocks); > + rts->rsumsize = XFS_FSB_TO_B(mp, rsumblocks); Same nitpick as for the last patch. Otherwise looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-28 14:05 ` Christoph Hellwig @ 2023-11-28 23:30 ` Darrick J. Wong 2023-11-29 1:23 ` Darrick J. Wong 2023-11-29 6:05 ` Christoph Hellwig 0 siblings, 2 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-28 23:30 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 06:05:48AM -0800, Christoph Hellwig wrote: > > + /* > > + * Now that we've locked the rtbitmap and rtsummary, we can't race with > > + * growfsrt trying to expand the summary or change the size of the rt > > + * volume. Hence it is safe to compute and check the geometry values. > > + */ > > + rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > > + rts->rbmblocks = xfs_rtbitmap_blockcount(mp, rts->rextents); > > + rts->rsumlevels = rts->rextents ? xfs_highbit32(rts->rextents) + 1 : 0; > > + rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels, > > + rts->rbmblocks); > > + rts->rsumsize = XFS_FSB_TO_B(mp, rsumblocks); > > Same nitpick as for the last patch. LOL so I just tried a 64k rt volume with a 1M rextsize and mkfs crashed. I guess I'll go sort out what's going on there... > Otherwise looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> Thanks! --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-28 23:30 ` Darrick J. Wong @ 2023-11-29 1:23 ` Darrick J. Wong 2023-11-29 6:05 ` Christoph Hellwig 1 sibling, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-29 1:23 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 03:30:09PM -0800, Darrick J. Wong wrote: > On Tue, Nov 28, 2023 at 06:05:48AM -0800, Christoph Hellwig wrote: > > > + /* > > > + * Now that we've locked the rtbitmap and rtsummary, we can't race with > > > + * growfsrt trying to expand the summary or change the size of the rt > > > + * volume. Hence it is safe to compute and check the geometry values. > > > + */ > > > + rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); > > > + rts->rbmblocks = xfs_rtbitmap_blockcount(mp, rts->rextents); > > > + rts->rsumlevels = rts->rextents ? xfs_highbit32(rts->rextents) + 1 : 0; > > > + rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels, > > > + rts->rbmblocks); > > > + rts->rsumsize = XFS_FSB_TO_B(mp, rsumblocks); > > > > Same nitpick as for the last patch. > > LOL so I just tried a 64k rt volume with a 1M rextsize and mkfs crashed. > I guess I'll go sort out what's going on there... Oh good, XFS has been broken since the beginning of git history for the stupid corner case where the rtblocks < rextsize. In this case, mkfs will set sb_rextents and sb_rextslog both to zero: sbp->sb_rextslog = (uint8_t)(rtextents ? libxfs_highbit32((unsigned int)rtextents) : 0); However, that's not the check that xfs_repair uses for nonzero rtblocks: if (sb->sb_rextslog != libxfs_highbit32((unsigned int)sb->sb_rextents)) The difference here is that xfs_highbit32 returns -1 if its argument is zero, which means that for a runt rt volume, repair thinks the "correct" value of rextslog is -1, even though mkfs wrote it as 0 and flags a freshly formatted filesystem as corrupt. 
Because mkfs has been writing ondisk artifacts like this for decades, we have to accept that as "correct". TBH, zero rextslog for zero rtextents makes more sense to me anyway. Regrettably, the superblock verifier checks created in commit copied xfs_repair even though mkfs has been writing out such filesystems for ages. In testing /that/ fix, I discovered that the logic above is wrong -- rsumlevels is always rextslog + 1 when rblocks > 0, even if rextents == 0. IOWs, this logic needs to be: /* * Now that we've locked the rtbitmap and rtsummary, we can't race with * growfsrt trying to expand the summary or change the size of the rt * volume. Hence it is safe to compute and check the geometry values. */ if (mp->m_sb.sb_rblocks) { xfs_filblks_t rsumblocks; int rextslog = 0; rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks); if (rts->rextents) rextslog = xfs_highbit32(rts->rextents); rts->rsumlevels = rextslog + 1; rts->rbmblocks = xfs_rtbitmap_blockcount(mp, rts->rextents); rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels, rts->rbmblocks); rts->rsumsize = XFS_FSB_TO_B(mp, rsumblocks); } Yay winning. --D > > Otherwise looks good: > > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > Thanks! > > --D > ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-28 23:30 ` Darrick J. Wong 2023-11-29 1:23 ` Darrick J. Wong @ 2023-11-29 6:05 ` Christoph Hellwig 2023-11-29 6:21 ` Darrick J. Wong 1 sibling, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 6:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Nov 28, 2023 at 03:30:09PM -0800, Darrick J. Wong wrote: > LOL so I just tried a 64k rt volume with a 1M rextsize and mkfs crashed. > I guess I'll go sort out what's going on there... I think we should just reject rt device size < rtextsize configs in the kernel and all tools. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-29 6:05 ` Christoph Hellwig @ 2023-11-29 6:21 ` Darrick J. Wong 2023-11-29 6:23 ` Christoph Hellwig 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-29 6:21 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 10:05:49PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 03:30:09PM -0800, Darrick J. Wong wrote: > > LOL so I just tried a 64k rt volume with a 1M rextsize and mkfs crashed. > > I guess I'll go sort out what's going on there... > > I think we should just reject rt device size < rtextsize configs in > the kernel and all tools. "But that could break old weirdass customer filesystems." The design of rtgroups prohibits that, so we're ok going forward. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-29 6:21 ` Darrick J. Wong @ 2023-11-29 6:23 ` Christoph Hellwig 2023-11-30 0:10 ` Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Christoph Hellwig @ 2023-11-29 6:23 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs On Tue, Nov 28, 2023 at 10:21:55PM -0800, Darrick J. Wong wrote: > On Tue, Nov 28, 2023 at 10:05:49PM -0800, Christoph Hellwig wrote: > > On Tue, Nov 28, 2023 at 03:30:09PM -0800, Darrick J. Wong wrote: > > > LOL so I just tried a 64k rt volume with a 1M rextsize and mkfs crashed. > > > I guess I'll go sort out what's going on there... > > > > I think we should just reject rt device size < rtextsize configs in > > the kernel and all tools. > > "But that could break old weirdass customer filesystems." > > The design of rtgroups prohibits that, so we're ok going forward. Well, as you just said it hasn't mounted for a long time, and really this is a corner case that just doesn't make any sense. I'd really prefer to cleanly reject it, and if someone really complains with a good reason we can revisit the decisions. But I strongly doubt it's ever going to happen. ^ permalink raw reply [flat|nested] 156+ messages in thread
* Re: [PATCH 2/6] xfs: check rt summary file geometry more thoroughly 2023-11-29 6:23 ` Christoph Hellwig @ 2023-11-30 0:10 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-30 0:10 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs On Tue, Nov 28, 2023 at 10:23:41PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 10:21:55PM -0800, Darrick J. Wong wrote: > > On Tue, Nov 28, 2023 at 10:05:49PM -0800, Christoph Hellwig wrote: > > > On Tue, Nov 28, 2023 at 03:30:09PM -0800, Darrick J. Wong wrote: > > > > LOL so I just tried a 64k rt volume with a 1M rextsize and mkfs crashed. > > > > I guess I'll go sort out what's going on there... > > > > > > I think we should just reject rt device size < rtextsize configs in > > > the kernel and all tools. > > > > "But that could break old weirdass customer filesystems." > > > > The design of rtgroups prohibits that, so we're ok going forward. > > Well, as you just said it hasn't mounted for a long time, and really > this is a corner case that just doesn't make any sense. I'd really > prefer to cleanly reject it, and if someone really complains with a good > reason we can revisit the decisions. But I strongly doubt it's ever > going to happen. Oh, even better, Dave and I noticed today that if you format a 17G realtime volume (> 2^32 rt extents) then mkfs fails because there's an integer overflow: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/mkfs/xfs_mkfs.c#n3739 Based on your observation that rt free space never exceeds the group length with rtgroups turned on, I'll tweak the sb_rextslog computation so that it's computed with (rgblocks / rextsize) instead of (rblocks / rextsize) which will fix that problem for future filesystems. --D ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 3/6] xfs: always check the rtbitmap and rtsummary files 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong 2023-11-24 23:54 ` [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly Darrick J. Wong 2023-11-24 23:54 ` [PATCH 2/6] xfs: check rt summary " Darrick J. Wong @ 2023-11-24 23:54 ` Darrick J. Wong 2023-11-28 14:06 ` Christoph Hellwig 2023-11-24 23:55 ` [PATCH 4/6] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong ` (2 subsequent siblings) 5 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:54 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> XFS filesystems always have a realtime bitmap and summary file, even if there has never been a realtime volume attached. Always check them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/scrub.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index bc70a91f8b1bf..89ce6d2f9ad14 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -330,14 +330,12 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_FS, .setup = xchk_setup_rtbitmap, .scrub = xchk_rtbitmap, - .has = xfs_has_realtime, .repair = xrep_notsupported, }, [XFS_SCRUB_TYPE_RTSUM] = { /* realtime summary */ .type = ST_FS, .setup = xchk_setup_rtsummary, .scrub = xchk_rtsummary, - .has = xfs_has_realtime, .repair = xrep_notsupported, }, [XFS_SCRUB_TYPE_UQUOTA] = { /* user quota */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 3/6] xfs: always check the rtbitmap and rtsummary files 2023-11-24 23:54 ` [PATCH 3/6] xfs: always check the rtbitmap and rtsummary files Darrick J. Wong @ 2023-11-28 14:06 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-28 14:06 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 4/6] xfs: repair the inode core and forks of a metadata inode 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:54 ` [PATCH 3/6] xfs: always check the rtbitmap and rtsummary files Darrick J. Wong @ 2023-11-24 23:55 ` Darrick J. Wong 2023-11-30 5:12 ` Christoph Hellwig 2023-11-24 23:55 ` [PATCH 5/6] xfs: create a new inode fork block unmap helper Darrick J. Wong 2023-11-24 23:55 ` [PATCH 6/6] xfs: online repair of realtime bitmaps Darrick J. Wong 5 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:55 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a helper function to repair the core and forks of a metadata inode, so that we can move on to the task of repairing higher level metadata that lives in an inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/bmap_repair.c | 17 ++++- fs/xfs/scrub/repair.c | 151 ++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 + 3 files changed, 166 insertions(+), 4 deletions(-) diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c index 2c593eebb1fc4..e55f3e52f36fc 100644 --- a/fs/xfs/scrub/bmap_repair.c +++ b/fs/xfs/scrub/bmap_repair.c @@ -78,6 +78,9 @@ struct xrep_bmap { /* Are there shared extents? */ bool shared_extents; + + /* Do we allow unwritten extents? */ + bool allow_unwritten; }; /* Is this space extent shared? Flag the inode if it is. */ @@ -272,6 +275,10 @@ xrep_bmap_walk_rmap( !(rec->rm_flags & XFS_RMAP_ATTR_FORK)) return 0; + /* Reject unwritten extents if we don't allow those. */ + if ((rec->rm_flags & XFS_RMAP_UNWRITTEN) && !rb->allow_unwritten) + return -EFSCORRUPTED; + fsbno = XFS_AGB_TO_FSB(mp, cur->bc_ag.pag->pag_agno, rec->rm_startblock); @@ -762,10 +769,11 @@ xrep_bmap_check_inputs( } /* Repair an inode fork. 
*/ -STATIC int +int xrep_bmap( struct xfs_scrub *sc, - int whichfork) + int whichfork, + bool allow_unwritten) { struct xrep_bmap *rb; char *descr; @@ -784,6 +792,7 @@ xrep_bmap( return -ENOMEM; rb->sc = sc; rb->whichfork = whichfork; + rb->allow_unwritten = allow_unwritten; /* * No need to waste time scanning for shared extents if the inode is @@ -834,7 +843,7 @@ int xrep_bmap_data( struct xfs_scrub *sc) { - return xrep_bmap(sc, XFS_DATA_FORK); + return xrep_bmap(sc, XFS_DATA_FORK, true); } /* Repair an inode's attr fork. */ @@ -842,5 +851,5 @@ int xrep_bmap_attr( struct xfs_scrub *sc) { - return xrep_bmap(sc, XFS_ATTR_FORK); + return xrep_bmap(sc, XFS_ATTR_FORK, false); } diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 0f8dc25ef998b..cd13ba9b4f345 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -29,6 +29,7 @@ #include "xfs_defer.h" #include "xfs_errortag.h" #include "xfs_error.h" +#include "xfs_reflink.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -959,3 +960,153 @@ xrep_will_attempt( return false; } + +/* Try to fix some part of a metadata inode by calling another scrubber. */ +STATIC int +xrep_metadata_inode_subtype( + struct xfs_scrub *sc, + unsigned int scrub_type) +{ + __u32 smtype = sc->sm->sm_type; + __u32 smflags = sc->sm->sm_flags; + int error; + + /* + * Let's see if the inode needs repair. We're going to open-code calls + * to the scrub and repair functions so that we can hang on to the + * resources that we already acquired instead of using the standard + * setup/teardown routines. 
+ */ + sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; + sc->sm->sm_type = scrub_type; + + switch (scrub_type) { + case XFS_SCRUB_TYPE_INODE: + error = xchk_inode(sc); + break; + case XFS_SCRUB_TYPE_BMBTD: + error = xchk_bmap_data(sc); + break; + case XFS_SCRUB_TYPE_BMBTA: + error = xchk_bmap_attr(sc); + break; + default: + ASSERT(0); + error = -EFSCORRUPTED; + } + if (error) + goto out; + + if (!xrep_will_attempt(sc)) + goto out; + + /* + * Repair some part of the inode. This will potentially join the inode + * to the transaction. + */ + switch (scrub_type) { + case XFS_SCRUB_TYPE_INODE: + error = xrep_inode(sc); + break; + case XFS_SCRUB_TYPE_BMBTD: + error = xrep_bmap(sc, XFS_DATA_FORK, false); + break; + case XFS_SCRUB_TYPE_BMBTA: + error = xrep_bmap(sc, XFS_ATTR_FORK, false); + break; + } + if (error) + goto out; + + /* + * Finish all deferred intent items and then roll the transaction so + * that the inode will not be joined to the transaction when we exit + * the function. + */ + error = xfs_defer_finish(&sc->tp); + if (error) + goto out; + error = xfs_trans_roll(&sc->tp); + if (error) + goto out; + + /* + * Clear the corruption flags and re-check the metadata that we just + * repaired. + */ + sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; + + switch (scrub_type) { + case XFS_SCRUB_TYPE_INODE: + error = xchk_inode(sc); + break; + case XFS_SCRUB_TYPE_BMBTD: + error = xchk_bmap_data(sc); + break; + case XFS_SCRUB_TYPE_BMBTA: + error = xchk_bmap_attr(sc); + break; + } + if (error) + goto out; + + /* If corruption persists, the repair has failed. */ + if (xchk_needs_repair(sc->sm)) { + error = -EFSCORRUPTED; + goto out; + } +out: + sc->sm->sm_type = smtype; + sc->sm->sm_flags = smflags; + return error; +} + +/* + * Repair the ondisk forks of a metadata inode. The caller must ensure that + * sc->ip points to the metadata inode and the ILOCK is held on that inode. + * The inode must not be joined to the transaction before the call, and will + * not be afterwards. 
+ */ +int +xrep_metadata_inode_forks( + struct xfs_scrub *sc) +{ + bool dirty = false; + int error; + + /* Repair the inode record and the data fork. */ + error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_INODE); + if (error) + return error; + + error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTD); + if (error) + return error; + + /* Make sure the attr fork looks ok before we delete it. */ + error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTA); + if (error) + return error; + + /* Clear the reflink flag since metadata never shares. */ + if (xfs_is_reflink_inode(sc->ip)) { + dirty = true; + xfs_trans_ijoin(sc->tp, sc->ip, 0); + error = xfs_reflink_clear_inode_flag(sc->ip, &sc->tp); + if (error) + return error; + } + + /* + * If we modified the inode, roll the transaction but don't rejoin the + * inode to the new transaction because xrep_bmap_data can do that. + */ + if (dirty) { + error = xfs_trans_roll(&sc->tp); + if (error) + return error; + dirty = false; + } + + return 0; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index be3585b8f4364..b7ddd35e753eb 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -82,6 +82,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc); int xrep_ino_ensure_extent_count(struct xfs_scrub *sc, int whichfork, xfs_extnum_t nextents); int xrep_reset_perag_resv(struct xfs_scrub *sc); +int xrep_bmap(struct xfs_scrub *sc, int whichfork, bool allow_unwritten); +int xrep_metadata_inode_forks(struct xfs_scrub *sc); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 4/6] xfs: repair the inode core and forks of a metadata inode 2023-11-24 23:55 ` [PATCH 4/6] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong @ 2023-11-30 5:12 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:12 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/6] xfs: create a new inode fork block unmap helper 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong ` (3 preceding siblings ...) 2023-11-24 23:55 ` [PATCH 4/6] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong @ 2023-11-24 23:55 ` Darrick J. Wong 2023-11-25 6:17 ` Christoph Hellwig 2023-11-24 23:55 ` [PATCH 6/6] xfs: online repair of realtime bitmaps Darrick J. Wong 5 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:55 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a new helper to unmap blocks from an inode's fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_bmap.c | 39 +++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_bmap.h | 2 ++ fs/xfs/xfs_inode.c | 24 ++++-------------------- 3 files changed, 45 insertions(+), 20 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 9968a3a6e6d8d..cebf490daf43e 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -6231,3 +6231,42 @@ xfs_bmap_validate_extent( return xfs_bmap_validate_extent_raw(ip->i_mount, XFS_IS_REALTIME_INODE(ip), whichfork, irec); } + +/* + * Used in xfs_itruncate_extents(). This is the maximum number of extents + * freed from a file in a single transaction. + */ +#define XFS_ITRUNC_MAX_EXTENTS 2 + +/* + * Unmap every extent in part of an inode's fork. We don't do any higher level + * invalidation work at all. 
+ */ +int +xfs_bunmapi_range( + struct xfs_trans **tpp, + struct xfs_inode *ip, + uint32_t flags, + xfs_fileoff_t startoff, + xfs_fileoff_t endoff) +{ + xfs_filblks_t unmap_len = endoff - startoff + 1; + int error = 0; + + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); + + while (unmap_len > 0) { + ASSERT((*tpp)->t_highest_agno == NULLAGNUMBER); + error = __xfs_bunmapi(*tpp, ip, startoff, &unmap_len, flags, + XFS_ITRUNC_MAX_EXTENTS); + if (error) + goto out; + + /* free the just unmapped extents */ + error = xfs_defer_finish(tpp); + if (error) + goto out; + } +out: + return error; +} diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index 8518324db2855..9bc78c717ecf7 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -273,6 +273,8 @@ int xfs_bmap_complain_bad_rec(struct xfs_inode *ip, int whichfork, int xfs_bmapi_remap(struct xfs_trans *tp, struct xfs_inode *ip, xfs_fileoff_t bno, xfs_filblks_t len, xfs_fsblock_t startblock, uint32_t flags); +int xfs_bunmapi_range(struct xfs_trans **tpp, struct xfs_inode *ip, + uint32_t flags, xfs_fileoff_t startoff, xfs_fileoff_t endoff); extern struct kmem_cache *xfs_bmap_intent_cache; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index c0f1c89786c2a..424b03628b7cf 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -40,12 +40,6 @@ struct kmem_cache *xfs_inode_cache; -/* - * Used in xfs_itruncate_extents(). This is the maximum number of extents - * freed from a file in a single transaction. 
- */ -#define XFS_ITRUNC_MAX_EXTENTS 2 - STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *); STATIC int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag, struct xfs_inode *); @@ -1339,7 +1333,6 @@ xfs_itruncate_extents_flags( struct xfs_mount *mp = ip->i_mount; struct xfs_trans *tp = *tpp; xfs_fileoff_t first_unmap_block; - xfs_filblks_t unmap_len; int error = 0; ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); @@ -1371,19 +1364,10 @@ xfs_itruncate_extents_flags( return 0; } - unmap_len = XFS_MAX_FILEOFF - first_unmap_block + 1; - while (unmap_len > 0) { - ASSERT(tp->t_highest_agno == NULLAGNUMBER); - error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len, - flags, XFS_ITRUNC_MAX_EXTENTS); - if (error) - goto out; - - /* free the just unmapped extents */ - error = xfs_defer_finish(&tp); - if (error) - goto out; - } + error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block, + XFS_MAX_FILEOFF); + if (error) + goto out; if (whichfork == XFS_DATA_FORK) { /* Remove all pending CoW reservations. */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 5/6] xfs: create a new inode fork block unmap helper 2023-11-24 23:55 ` [PATCH 5/6] xfs: create a new inode fork block unmap helper Darrick J. Wong @ 2023-11-25 6:17 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-25 6:17 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:55:30PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Create a new helper to unmap blocks from an inode's fork. Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> With this, __xfs_bunmapi can be marked static in xfs_bmap.c, as there is no user outside of it (nor in xfsprogs). ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 6/6] xfs: online repair of realtime bitmaps 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong ` (4 preceding siblings ...) 2023-11-24 23:55 ` [PATCH 5/6] xfs: create a new inode fork block unmap helper Darrick J. Wong @ 2023-11-24 23:55 ` Darrick J. Wong 2023-11-30 5:16 ` Christoph Hellwig 5 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:55 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Fix all the file metadata surrounding the realtime bitmap file, which includes the rt geometry, file size, forks, and space mappings. The bitmap contents themselves cannot be fixed without rt rmap, so that will come later. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/Makefile | 4 + fs/xfs/scrub/repair.h | 7 + fs/xfs/scrub/rtbitmap.c | 16 ++- fs/xfs/scrub/rtbitmap.h | 22 ++++ fs/xfs/scrub/rtbitmap_repair.c | 202 ++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/scrub.c | 2 6 files changed, 245 insertions(+), 8 deletions(-) create mode 100644 fs/xfs/scrub/rtbitmap.h create mode 100644 fs/xfs/scrub/rtbitmap_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 71a76f8ac5e47..36e7bc7d147e2 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -191,5 +191,9 @@ xfs-y += $(addprefix scrub/, \ refcount_repair.o \ repair.o \ ) + +xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \ + rtbitmap_repair.o \ + ) endif endif diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index b7ddd35e753eb..f54dff9268bcc 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -115,6 +115,12 @@ int xrep_bmap_data(struct xfs_scrub *sc); int xrep_bmap_attr(struct xfs_scrub *sc); int xrep_bmap_cow(struct xfs_scrub *sc); +#ifdef CONFIG_XFS_RT +int xrep_rtbitmap(struct xfs_scrub *sc); +#else +# define xrep_rtbitmap xrep_notsupported +#endif /* CONFIG_XFS_RT */ + int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); 
@@ -177,6 +183,7 @@ xrep_setup_nothing( #define xrep_bmap_data xrep_notsupported #define xrep_bmap_attr xrep_notsupported #define xrep_bmap_cow xrep_notsupported +#define xrep_rtbitmap xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c index 3b5b62fbf4e0a..92058eb545344 100644 --- a/fs/xfs/scrub/rtbitmap.c +++ b/fs/xfs/scrub/rtbitmap.c @@ -17,12 +17,8 @@ #include "xfs_bit.h" #include "scrub/scrub.h" #include "scrub/common.h" - -struct xchk_rtbitmap { - uint64_t rextents; - uint64_t rbmblocks; - unsigned int rextslog; -}; +#include "scrub/repair.h" +#include "scrub/rtbitmap.h" /* Set us up with the realtime metadata locked. */ int @@ -38,7 +34,13 @@ xchk_setup_rtbitmap( return -ENOMEM; sc->buf = rtb; - error = xchk_trans_alloc(sc, 0); + if (xchk_could_repair(sc)) { + error = xrep_setup_rtbitmap(sc, rtb); + if (error) + return error; + } + + error = xchk_trans_alloc(sc, rtb->resblks); if (error) return error; diff --git a/fs/xfs/scrub/rtbitmap.h b/fs/xfs/scrub/rtbitmap.h new file mode 100644 index 0000000000000..85304ff019e1d --- /dev/null +++ b/fs/xfs/scrub/rtbitmap.h @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_RTBITMAP_H__ +#define __XFS_SCRUB_RTBITMAP_H__ + +struct xchk_rtbitmap { + uint64_t rextents; + uint64_t rbmblocks; + unsigned int rextslog; + unsigned int resblks; +}; + +#ifdef CONFIG_XFS_ONLINE_REPAIR +int xrep_setup_rtbitmap(struct xfs_scrub *sc, struct xchk_rtbitmap *rtb); +#else +# define xrep_setup_rtbitmap(sc, rtb) (0) +#endif /* CONFIG_XFS_ONLINE_REPAIR */ + +#endif /* __XFS_SCRUB_RTBITMAP_H__ */ diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c new file mode 100644 index 0000000000000..46f5d5f605c91 --- /dev/null +++ b/fs/xfs/scrub/rtbitmap_repair.c @@ -0,0 +1,202 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2020-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_btree.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_inode.h" +#include "xfs_bit.h" +#include "xfs_bmap.h" +#include "xfs_bmap_btree.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/xfile.h" +#include "scrub/rtbitmap.h" + +/* Set up to repair the realtime bitmap file metadata. */ +int +xrep_setup_rtbitmap( + struct xfs_scrub *sc, + struct xchk_rtbitmap *rtb) +{ + struct xfs_mount *mp = sc->mp; + unsigned long long blocks = 0; + + /* + * Reserve enough blocks to write out a completely new bmbt for a + * maximally fragmented bitmap file. We do not hold the rtbitmap + * ILOCK yet, so this is entirely speculative. + */ + blocks = xfs_bmbt_calc_size(mp, mp->m_sb.sb_rbmblocks); + if (blocks > UINT_MAX) + return -EOPNOTSUPP; + + rtb->resblks += blocks; + return 0; +} + +/* + * Make sure that the given range of the data fork of the realtime file is + * mapped to written blocks. 
The caller must ensure that the inode is joined + * to the transaction. + */ +STATIC int +xrep_rtbitmap_data_mappings( + struct xfs_scrub *sc, + xfs_filblks_t len) +{ + struct xfs_bmbt_irec map; + xfs_fileoff_t off = 0; + int error; + + ASSERT(sc->ip != NULL); + + while (off < len) { + int nmaps = 1; + + /* + * If we have a real extent mapping this block then we're + * in ok shape. + */ + error = xfs_bmapi_read(sc->ip, off, len - off, &map, &nmaps, + XFS_DATA_FORK); + if (error) + return error; + if (nmaps == 0) { + ASSERT(nmaps != 0); + return -EFSCORRUPTED; + } + + /* + * Written extents are ok. Holes are not filled because we + * do not know the freespace information. + */ + if (xfs_bmap_is_written_extent(&map) || + map.br_startblock == HOLESTARTBLOCK) { + off = map.br_startoff + map.br_blockcount; + continue; + } + + /* + * If we find a delalloc reservation then something is very + * very wrong. Bail out. + */ + if (map.br_startblock == DELAYSTARTBLOCK) + return -EFSCORRUPTED; + + /* Make sure we're really converting an unwritten extent. */ + if (map.br_state != XFS_EXT_UNWRITTEN) { + ASSERT(map.br_state == XFS_EXT_UNWRITTEN); + return -EFSCORRUPTED; + } + + /* Make sure this block has a real zeroed extent mapped. */ + nmaps = 1; + error = xfs_bmapi_write(sc->tp, sc->ip, map.br_startoff, + map.br_blockcount, + XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO, + 0, &map, &nmaps); + if (error) + return error; + if (nmaps != 1) + return -EFSCORRUPTED; + + /* Commit new extent and all deferred work. */ + error = xrep_defer_finish(sc); + if (error) + return error; + + off = map.br_startoff + map.br_blockcount; + } + + return 0; +} + +/* Fix broken rt volume geometry. 
*/ +STATIC int +xrep_rtbitmap_geometry( + struct xfs_scrub *sc, + struct xchk_rtbitmap *rtb) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_trans *tp = sc->tp; + + /* Superblock fields */ + if (mp->m_sb.sb_rextents != rtb->rextents) + xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS, + rtb->rextents - mp->m_sb.sb_rextents); + + if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks) + xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS, + rtb->rbmblocks - mp->m_sb.sb_rbmblocks); + + if (mp->m_sb.sb_rextslog != rtb->rextslog) + xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG, + rtb->rextslog - mp->m_sb.sb_rextslog); + + /* Fix broken isize */ + sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size, + mp->m_sb.sb_blocksize); + + if (sc->ip->i_disk_size < XFS_FSB_TO_B(mp, rtb->rbmblocks)) + sc->ip->i_disk_size = XFS_FSB_TO_B(mp, rtb->rbmblocks); + + xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE); + return xrep_roll_trans(sc); +} + +/* Repair the realtime bitmap file metadata. */ +int +xrep_rtbitmap( + struct xfs_scrub *sc) +{ + struct xchk_rtbitmap *rtb = sc->buf; + struct xfs_mount *mp = sc->mp; + unsigned long long blocks = 0; + int error; + + /* Impossibly large rtbitmap means we can't touch the filesystem. */ + if (rtb->rbmblocks > U32_MAX) + return 0; + + /* + * If the size of the rt bitmap file is larger than what we reserved, + * figure out if we need to adjust the block reservation in the + * transaction. + */ + blocks = xfs_bmbt_calc_size(mp, rtb->rbmblocks); + if (blocks > UINT_MAX) + return -EOPNOTSUPP; + if (blocks > rtb->resblks) { + error = xfs_trans_reserve_more(sc->tp, blocks, 0); + if (error) + return error; + + rtb->resblks += blocks; + } + + /* Fix inode core and forks. */ + error = xrep_metadata_inode_forks(sc); + if (error) + return error; + + xfs_trans_ijoin(sc->tp, sc->ip, 0); + + /* Ensure no unwritten extents. 
*/ + error = xrep_rtbitmap_data_mappings(sc, rtb->rbmblocks); + if (error) + return error; + + /* Fix inconsistent bitmap geometry */ + return xrep_rtbitmap_geometry(sc, rtb); +} diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 89ce6d2f9ad14..9982b626bfc33 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -330,7 +330,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_FS, .setup = xchk_setup_rtbitmap, .scrub = xchk_rtbitmap, - .repair = xrep_notsupported, + .repair = xrep_rtbitmap, }, [XFS_SCRUB_TYPE_RTSUM] = { /* realtime summary */ .type = ST_FS, ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 6/6] xfs: online repair of realtime bitmaps 2023-11-24 23:55 ` [PATCH 6/6] xfs: online repair of realtime bitmaps Darrick J. Wong @ 2023-11-30 5:16 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:16 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files 2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong ` (6 preceding siblings ...) 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong @ 2023-11-24 23:46 ` Darrick J. Wong 2023-11-24 23:56 ` [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot Darrick J. Wong ` (4 more replies) 7 siblings, 5 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:46 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, XFS stores quota records and free space bitmap information in files. Add the necessary infrastructure to enable repairing metadata inodes and their forks, and then make it so that we can repair the file metadata for the rtbitmap. Repairing the bitmap contents (and the summary file) is left for subsequent patchsets. We also add the ability to repair the file metadata of the quota files. As part of these repairs, we also reinitialize the ondisk dquot records as necessary to get the incore dquots working. We can also correct obviously bad dquot record attributes, but we leave checking the resource usage counts for the next patchsets. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-quota fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-quota --- fs/xfs/Makefile | 9 + fs/xfs/libxfs/xfs_format.h | 3 fs/xfs/scrub/dqiterate.c | 211 ++++++++++++++++ fs/xfs/scrub/quota.c | 107 +++++++- fs/xfs/scrub/quota.h | 36 +++ fs/xfs/scrub/quota_repair.c | 575 +++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 7 + fs/xfs/scrub/scrub.c | 6 fs/xfs/scrub/trace.c | 3 fs/xfs/scrub/trace.h | 78 ++++++ fs/xfs/xfs_dquot.c | 37 --- fs/xfs/xfs_dquot.h | 8 - 12 files changed, 1026 insertions(+), 54 deletions(-) create mode 100644 fs/xfs/scrub/dqiterate.c create mode 100644 fs/xfs/scrub/quota.h create mode 100644 fs/xfs/scrub/quota_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong @ 2023-11-24 23:56 ` Darrick J. Wong 2023-11-30 5:17 ` Christoph Hellwig 2023-11-24 23:56 ` [PATCH 2/5] xfs: check dquot resource timers Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:56 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Each xfs_dquot object caches the file offset and daddr of the ondisk block that backs the dquot. Make sure these cached values are the same as the bmapi data, and that the block state is written. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/quota.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+) diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index 5671c81534335..59350cd7a325b 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -6,6 +6,7 @@ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" +#include "xfs_bit.h" #include "xfs_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" @@ -75,6 +76,47 @@ struct xchk_quota_info { xfs_dqid_t last_id; }; +/* There's a written block backing this dquot, right? 
*/ +STATIC int +xchk_quota_item_bmap( + struct xfs_scrub *sc, + struct xfs_dquot *dq, + xfs_fileoff_t offset) +{ + struct xfs_bmbt_irec irec; + struct xfs_mount *mp = sc->mp; + int nmaps = 1; + int error; + + if (!xfs_verify_fileoff(mp, offset)) { + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return 0; + } + + if (dq->q_fileoffset != offset) { + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return 0; + } + + error = xfs_bmapi_read(sc->ip, offset, 1, &irec, &nmaps, 0); + if (error) + return error; + + if (nmaps != 1) { + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return 0; + } + + if (!xfs_verify_fsbno(mp, irec.br_startblock)) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + if (XFS_FSB_TO_DADDR(mp, irec.br_startblock) != dq->q_blkno) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + if (!xfs_bmap_is_written_extent(&irec)) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + + return 0; +} + /* Scrub the fields in an individual quota item. */ STATIC int xchk_quota_item( @@ -93,6 +135,17 @@ xchk_quota_item( if (xchk_should_terminate(sc, &error)) return error; + /* + * We want to validate the bmap record for the storage backing this + * dquot, so we need to lock the dquot and the quota file. For quota + * operations, the locking order is first the ILOCK and then the dquot. + * However, dqiterate gave us a locked dquot, so drop the dquot lock to + * get the ILOCK. + */ + xfs_dqunlock(dq); + xchk_ilock(sc, XFS_ILOCK_SHARED); + xfs_dqlock(dq); + /* * Except for the root dquot, the actual dquot we got must either have * the same or higher id as we saw before. @@ -103,6 +156,11 @@ xchk_quota_item( sqi->last_id = dq->q_id; + error = xchk_quota_item_bmap(sc, dq, offset); + xchk_iunlock(sc, XFS_ILOCK_SHARED); + if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, offset, &error)) + return error; + /* * Warn if the hard limits are larger than the fs. 
* Administrators can do this, though in production this seems ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot 2023-11-24 23:56 ` [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot Darrick J. Wong @ 2023-11-30 5:17 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:17 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 2/5] xfs: check dquot resource timers 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong 2023-11-24 23:56 ` [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot Darrick J. Wong @ 2023-11-24 23:56 ` Darrick J. Wong 2023-11-30 5:17 ` Christoph Hellwig 2023-11-24 23:56 ` [PATCH 3/5] xfs: pull xfs_qm_dqiterate back into scrub Darrick J. Wong ` (2 subsequent siblings) 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:56 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> For each dquot resource, ensure either (a) the resource usage is over the soft limit and there is a nonzero timer; or (b) usage is at or under the soft limit and the timer is unset. (a) is redundant with the dquot buffer verifier, but (b) isn't checked anywhere. Found by fuzzing xfs/426 and noticing that diskdq.btimer = add didn't trip any kind of warning for having a timer set even with no limits. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/scrub/quota.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index 59350cd7a325b..49835d2840b4a 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -117,6 +117,23 @@ xchk_quota_item_bmap( return 0; } +/* Complain if a quota timer is incorrectly set. */ +static inline void +xchk_quota_item_timer( + struct xfs_scrub *sc, + xfs_fileoff_t offset, + const struct xfs_dquot_res *res) +{ + if ((res->softlimit && res->count > res->softlimit) || + (res->hardlimit && res->count > res->hardlimit)) { + if (!res->timer) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + } else { + if (res->timer) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + } +} + /* Scrub the fields in an individual quota item. 
*/ STATIC int xchk_quota_item( @@ -224,6 +241,10 @@ xchk_quota_item( dq->q_rtb.count > dq->q_rtb.hardlimit) xchk_fblock_set_warning(sc, XFS_DATA_FORK, offset); + xchk_quota_item_timer(sc, offset, &dq->q_blk); + xchk_quota_item_timer(sc, offset, &dq->q_ino); + xchk_quota_item_timer(sc, offset, &dq->q_rtb); + out: if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) return -ECANCELED; ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 2/5] xfs: check dquot resource timers 2023-11-24 23:56 ` [PATCH 2/5] xfs: check dquot resource timers Darrick J. Wong @ 2023-11-30 5:17 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:17 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 3/5] xfs: pull xfs_qm_dqiterate back into scrub 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong 2023-11-24 23:56 ` [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot Darrick J. Wong 2023-11-24 23:56 ` [PATCH 2/5] xfs: check dquot resource timers Darrick J. Wong @ 2023-11-24 23:56 ` Darrick J. Wong 2023-11-30 5:22 ` Christoph Hellwig 2023-11-24 23:56 ` [PATCH 4/5] xfs: improve dquot iteration for scrub Darrick J. Wong 2023-11-24 23:57 ` [PATCH 5/5] xfs: repair quotas Darrick J. Wong 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:56 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> There aren't any other users of this code outside of online fsck, so pull it back in there. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/Makefile | 5 ++++ fs/xfs/scrub/dqiterate.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/quota.c | 3 ++- fs/xfs/scrub/quota.h | 14 ++++++++++++ fs/xfs/xfs_dquot.c | 31 --------------------------- fs/xfs/xfs_dquot.h | 5 ---- 6 files changed, 72 insertions(+), 38 deletions(-) create mode 100644 fs/xfs/scrub/dqiterate.c create mode 100644 fs/xfs/scrub/quota.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 36e7bc7d147e2..91008db406fb2 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -175,7 +175,10 @@ xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \ rtsummary.o \ ) -xfs-$(CONFIG_XFS_QUOTA) += scrub/quota.o +xfs-$(CONFIG_XFS_QUOTA) += $(addprefix scrub/, \ + dqiterate.o \ + quota.o \ + ) # online repair ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) diff --git a/fs/xfs/scrub/dqiterate.c b/fs/xfs/scrub/dqiterate.c new file mode 100644 index 0000000000000..83bb483aafb39 --- /dev/null +++ b/fs/xfs/scrub/dqiterate.c @@ -0,0 +1,52 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_bit.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_inode.h" +#include "xfs_quota.h" +#include "xfs_qm.h" +#include "xfs_bmap.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/quota.h" + +/* + * Iterate every dquot of a particular type. The caller must ensure that the + * particular quota type is active. iter_fn can return negative error codes, + * or -ECANCELED to indicate that it wants to stop iterating. + */ +int +xchk_dqiterate( + struct xfs_mount *mp, + xfs_dqtype_t type, + xchk_dqiterate_fn iter_fn, + void *priv) +{ + struct xfs_dquot *dq; + xfs_dqid_t id = 0; + int error; + + do { + error = xfs_qm_dqget_next(mp, id, type, &dq); + if (error == -ENOENT) + return 0; + if (error) + return error; + + error = iter_fn(dq, type, priv); + id = dq->q_id + 1; + xfs_qm_dqput(dq); + } while (error == 0 && id != 0); + + return error; +} diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index 49835d2840b4a..f142ca6646061 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -18,6 +18,7 @@ #include "xfs_bmap.h" #include "scrub/scrub.h" #include "scrub/common.h" +#include "scrub/quota.h" /* Convert a scrub type code to a DQ flag, or return 0 if error. */ static inline xfs_dqtype_t @@ -320,7 +321,7 @@ xchk_quota( xchk_iunlock(sc, sc->ilock_flags); sqi.sc = sc; sqi.last_id = 0; - error = xfs_qm_dqiterate(mp, dqtype, xchk_quota_item, &sqi); + error = xchk_dqiterate(mp, dqtype, xchk_quota_item, &sqi); xchk_ilock(sc, XFS_ILOCK_EXCL); if (error == -ECANCELED) error = 0; diff --git a/fs/xfs/scrub/quota.h b/fs/xfs/scrub/quota.h new file mode 100644 index 0000000000000..0d7b3b01436e6 --- /dev/null +++ b/fs/xfs/scrub/quota.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. 
All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_QUOTA_H__ +#define __XFS_SCRUB_QUOTA_H__ + +typedef int (*xchk_dqiterate_fn)(struct xfs_dquot *dq, + xfs_dqtype_t type, void *priv); +int xchk_dqiterate(struct xfs_mount *mp, xfs_dqtype_t type, + xchk_dqiterate_fn iter_fn, void *priv); + +#endif /* __XFS_SCRUB_QUOTA_H__ */ diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index a013b87ab8d5e..60ec401e26ffd 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -1362,34 +1362,3 @@ xfs_qm_exit(void) kmem_cache_destroy(xfs_dqtrx_cache); kmem_cache_destroy(xfs_dquot_cache); } - -/* - * Iterate every dquot of a particular type. The caller must ensure that the - * particular quota type is active. iter_fn can return negative error codes, - * or -ECANCELED to indicate that it wants to stop iterating. - */ -int -xfs_qm_dqiterate( - struct xfs_mount *mp, - xfs_dqtype_t type, - xfs_qm_dqiterate_fn iter_fn, - void *priv) -{ - struct xfs_dquot *dq; - xfs_dqid_t id = 0; - int error; - - do { - error = xfs_qm_dqget_next(mp, id, type, &dq); - if (error == -ENOENT) - return 0; - if (error) - return error; - - error = iter_fn(dq, type, priv); - id = dq->q_id + 1; - xfs_qm_dqput(dq); - } while (error == 0 && id != 0); - - return error; -} diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h index 80c8f851a2f3b..8d9d4b0d979d0 100644 --- a/fs/xfs/xfs_dquot.h +++ b/fs/xfs/xfs_dquot.h @@ -234,11 +234,6 @@ static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp) return dqp; } -typedef int (*xfs_qm_dqiterate_fn)(struct xfs_dquot *dq, - xfs_dqtype_t type, void *priv); -int xfs_qm_dqiterate(struct xfs_mount *mp, xfs_dqtype_t type, - xfs_qm_dqiterate_fn iter_fn, void *priv); - time64_t xfs_dquot_set_timeout(struct xfs_mount *mp, time64_t timeout); time64_t xfs_dquot_set_grace_period(time64_t grace); ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 3/5] xfs: pull xfs_qm_dqiterate back into scrub 2023-11-24 23:56 ` [PATCH 3/5] xfs: pull xfs_qm_dqiterate back into scrub Darrick J. Wong @ 2023-11-30 5:22 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:22 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:56:33PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > There aren't any other users of this code outside of online fsck, so > pull it back in there. The move itself looks fine, but what about just open coding it and getting rid of the functional indirection? ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 4/5] xfs: improve dquot iteration for scrub 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong ` (2 preceding siblings ...) 2023-11-24 23:56 ` [PATCH 3/5] xfs: pull xfs_qm_dqiterate back into scrub Darrick J. Wong @ 2023-11-24 23:56 ` Darrick J. Wong 2023-11-30 5:25 ` Christoph Hellwig 2023-11-24 23:57 ` [PATCH 5/5] xfs: repair quotas Darrick J. Wong 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:56 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Upon a closer inspection of the quota record scrubber, I noticed that dqiterate wasn't actually walking all possible dquots for the mapped blocks in the quota file. This is due to xfs_qm_dqget_next skipping all XFS_IS_DQUOT_UNINITIALIZED dquots. For a fsck program, we really want to look at all the dquots, even if all counters and limits in the dquot record are zero. Rewrite the implementation to do this, as well as switching to an iterator paradigm to reduce the number of indirect calls. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/libxfs/xfs_format.h | 3 + fs/xfs/scrub/dqiterate.c | 195 ++++++++++++++++++++++++++++++++++++++++---- fs/xfs/scrub/quota.c | 24 +++-- fs/xfs/scrub/quota.h | 28 +++++- fs/xfs/scrub/trace.c | 2 fs/xfs/scrub/trace.h | 49 +++++++++++ 6 files changed, 270 insertions(+), 31 deletions(-) diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h index 9dd3b21434314..1b2becaac0b7f 100644 --- a/fs/xfs/libxfs/xfs_format.h +++ b/fs/xfs/libxfs/xfs_format.h @@ -1273,6 +1273,9 @@ static inline time64_t xfs_dq_bigtime_to_unix(uint32_t ondisk_seconds) #define XFS_DQ_GRACE_MIN ((int64_t)0) #define XFS_DQ_GRACE_MAX ((int64_t)U32_MAX) +/* Maximum id value for a quota record */ +#define XFS_DQ_ID_MAX (U32_MAX) + /* * This is the main portion of the on-disk representation of quota information * for a user. 
We pad this with some more expansion room to construct the on diff --git a/fs/xfs/scrub/dqiterate.c b/fs/xfs/scrub/dqiterate.c index 83bb483aafb39..20c4daedd48df 100644 --- a/fs/xfs/scrub/dqiterate.c +++ b/fs/xfs/scrub/dqiterate.c @@ -19,34 +19,193 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/quota.h" +#include "scrub/trace.h" + +/* Initialize a dquot iteration cursor. */ +void +xchk_dqiter_init( + struct xchk_dqiter *cursor, + struct xfs_scrub *sc, + xfs_dqtype_t dqtype) +{ + cursor->sc = sc; + cursor->bmap.br_startoff = NULLFILEOFF; + cursor->dqtype = dqtype & XFS_DQTYPE_REC_MASK; + cursor->quota_ip = xfs_quota_inode(sc->mp, cursor->dqtype); + cursor->id = 0; +} /* - * Iterate every dquot of a particular type. The caller must ensure that the - * particular quota type is active. iter_fn can return negative error codes, - * or -ECANCELED to indicate that it wants to stop iterating. + * Ensure that the cached data fork mapping for the dqiter cursor is fresh and + * covers the dquot pointed to by the scan cursor. */ -int -xchk_dqiterate( - struct xfs_mount *mp, - xfs_dqtype_t type, - xchk_dqiterate_fn iter_fn, - void *priv) +STATIC int +xchk_dquot_iter_revalidate_bmap( + struct xchk_dqiter *cursor) { - struct xfs_dquot *dq; - xfs_dqid_t id = 0; + struct xfs_quotainfo *qi = cursor->sc->mp->m_quotainfo; + struct xfs_ifork *ifp = xfs_ifork_ptr(cursor->quota_ip, + XFS_DATA_FORK); + xfs_fileoff_t fileoff; + xfs_dqid_t this_id = cursor->id; + int nmaps = 1; int error; + fileoff = this_id / qi->qi_dqperchunk; + + /* + * If we have a mapping for cursor->id and it's still fresh, there's + * no need to reread the bmbt. + */ + if (cursor->bmap.br_startoff != NULLFILEOFF && + cursor->if_seq == ifp->if_seq && + cursor->bmap.br_startoff + cursor->bmap.br_blockcount > fileoff) + return 0; + + /* Look up the data fork mapping for the dquot id of interest. 
*/ + error = xfs_bmapi_read(cursor->quota_ip, fileoff, + XFS_MAX_FILEOFF - fileoff, &cursor->bmap, &nmaps, 0); + if (error) + return error; + if (!nmaps) { + ASSERT(nmaps > 0); + return -EFSCORRUPTED; + } + if (cursor->bmap.br_startoff > fileoff) { + ASSERT(cursor->bmap.br_startoff == fileoff); + return -EFSCORRUPTED; + } + + cursor->if_seq = ifp->if_seq; + trace_xchk_dquot_iter_revalidate_bmap(cursor, cursor->id); + return 0; +} + +/* Advance the dqiter cursor to the next non-sparse region of the quota file. */ +STATIC int +xchk_dquot_iter_advance_bmap( + struct xchk_dqiter *cursor, + uint64_t *next_ondisk_id) +{ + struct xfs_quotainfo *qi = cursor->sc->mp->m_quotainfo; + struct xfs_ifork *ifp = xfs_ifork_ptr(cursor->quota_ip, + XFS_DATA_FORK); + xfs_fileoff_t fileoff; + uint64_t next_id; + int nmaps = 1; + int error; + + /* Find the dquot id for the next non-hole mapping. */ do { - error = xfs_qm_dqget_next(mp, id, type, &dq); - if (error == -ENOENT) + fileoff = cursor->bmap.br_startoff + cursor->bmap.br_blockcount; + if (fileoff > XFS_DQ_ID_MAX / qi->qi_dqperchunk) { + /* The hole goes beyond the max dquot id, we're done */ + *next_ondisk_id = -1ULL; return 0; + } + + error = xfs_bmapi_read(cursor->quota_ip, fileoff, + XFS_MAX_FILEOFF - fileoff, &cursor->bmap, + &nmaps, 0); if (error) return error; + if (!nmaps) { + /* Must have reached the end of the mappings. */ + *next_ondisk_id = -1ULL; + return 0; + } + if (cursor->bmap.br_startoff > fileoff) { + ASSERT(cursor->bmap.br_startoff == fileoff); + return -EFSCORRUPTED; + } + } while (!xfs_bmap_is_real_extent(&cursor->bmap)); - error = iter_fn(dq, type, priv); - id = dq->q_id + 1; - xfs_qm_dqput(dq); - } while (error == 0 && id != 0); + next_id = cursor->bmap.br_startoff * qi->qi_dqperchunk; + if (next_id > XFS_DQ_ID_MAX) { + /* The hole goes beyond the max dquot id, we're done */ + *next_ondisk_id = -1ULL; + return 0; + } - return error; + /* Propose jumping forward to the dquot in the next allocated block. 
*/ + *next_ondisk_id = next_id; + cursor->if_seq = ifp->if_seq; + trace_xchk_dquot_iter_advance_bmap(cursor, *next_ondisk_id); + return 0; +} + +/* + * Find the id of the next highest incore dquot. Normally this will correspond + * exactly with the quota file block mappings, but repair might have erased a + * mapping because it was crosslinked; in that case, we need to re-allocate the + * space so that we can reset q_blkno. + */ +STATIC void +xchk_dquot_iter_advance_incore( + struct xchk_dqiter *cursor, + uint64_t *next_incore_id) +{ + struct xfs_quotainfo *qi = cursor->sc->mp->m_quotainfo; + struct radix_tree_root *tree = xfs_dquot_tree(qi, cursor->dqtype); + struct xfs_dquot *dq; + unsigned int nr_found; + + *next_incore_id = -1ULL; + + mutex_lock(&qi->qi_tree_lock); + nr_found = radix_tree_gang_lookup(tree, (void **)&dq, cursor->id, 1); + if (nr_found) + *next_incore_id = dq->q_id; + mutex_unlock(&qi->qi_tree_lock); + + trace_xchk_dquot_iter_advance_incore(cursor, *next_incore_id); +} + +/* + * Walk all incore dquots of this filesystem. Caller must set *@cursorp to + * zero before the first call, and must not hold the quota file ILOCK. + * Returns 1 and a valid *@dqpp; 0 and *@dqpp == NULL when there are no more + * dquots to iterate; or a negative errno. + */ +int +xchk_dquot_iter( + struct xchk_dqiter *cursor, + struct xfs_dquot **dqpp) +{ + struct xfs_mount *mp = cursor->sc->mp; + struct xfs_dquot *dq = NULL; + uint64_t next_ondisk, next_incore = -1ULL; + unsigned int lock_mode; + int error = 0; + + if (cursor->id > XFS_DQ_ID_MAX) + return 0; + next_ondisk = cursor->id; + + /* Revalidate and/or advance the cursor. 
*/ + lock_mode = xfs_ilock_data_map_shared(cursor->quota_ip); + error = xchk_dquot_iter_revalidate_bmap(cursor); + if (!error && !xfs_bmap_is_real_extent(&cursor->bmap)) + error = xchk_dquot_iter_advance_bmap(cursor, &next_ondisk); + xfs_iunlock(cursor->quota_ip, lock_mode); + if (error) + return error; + + if (next_ondisk > cursor->id) + xchk_dquot_iter_advance_incore(cursor, &next_incore); + + /* Pick the next dquot in the sequence and return it. */ + cursor->id = min(next_ondisk, next_incore); + if (cursor->id > XFS_DQ_ID_MAX) + return 0; + + trace_xchk_dquot_iter(cursor, cursor->id); + + error = xfs_qm_dqget(mp, cursor->id, cursor->dqtype, false, &dq); + if (error) + return error; + + cursor->id = dq->q_id + 1; + *dqpp = dq; + return 1; } diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index f142ca6646061..1a65a75025276 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -138,11 +138,9 @@ xchk_quota_item_timer( /* Scrub the fields in an individual quota item. */ STATIC int xchk_quota_item( - struct xfs_dquot *dq, - xfs_dqtype_t dqtype, - void *priv) + struct xchk_quota_info *sqi, + struct xfs_dquot *dq) { - struct xchk_quota_info *sqi = priv; struct xfs_scrub *sc = sqi->sc; struct xfs_mount *mp = sc->mp; struct xfs_quotainfo *qi = mp->m_quotainfo; @@ -271,7 +269,7 @@ xchk_quota_data_fork( return error; /* Check for data fork problems that apply only to quota files. */ - max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk; + max_dqid_off = XFS_DQ_ID_MAX / qi->qi_dqperchunk; ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK); for_each_xfs_iext(ifp, &icur, &irec) { if (xchk_should_terminate(sc, &error)) @@ -298,9 +296,11 @@ int xchk_quota( struct xfs_scrub *sc) { - struct xchk_quota_info sqi; + struct xchk_dqiter cursor = { }; + struct xchk_quota_info sqi = { .sc = sc }; struct xfs_mount *mp = sc->mp; struct xfs_quotainfo *qi = mp->m_quotainfo; + struct xfs_dquot *dq; xfs_dqtype_t dqtype; int error = 0; @@ -319,9 +319,15 @@ xchk_quota( * functions. 
*/ xchk_iunlock(sc, sc->ilock_flags); - sqi.sc = sc; - sqi.last_id = 0; - error = xchk_dqiterate(mp, dqtype, xchk_quota_item, &sqi); + + /* Now look for things that the quota verifiers won't complain about. */ + xchk_dqiter_init(&cursor, sc, dqtype); + while ((error = xchk_dquot_iter(&cursor, &dq)) == 1) { + error = xchk_quota_item(&sqi, dq); + xfs_qm_dqput(dq); + if (error) + break; + } xchk_ilock(sc, XFS_ILOCK_EXCL); if (error == -ECANCELED) error = 0; diff --git a/fs/xfs/scrub/quota.h b/fs/xfs/scrub/quota.h index 0d7b3b01436e6..5056b7766c4a2 100644 --- a/fs/xfs/scrub/quota.h +++ b/fs/xfs/scrub/quota.h @@ -6,9 +6,29 @@ #ifndef __XFS_SCRUB_QUOTA_H__ #define __XFS_SCRUB_QUOTA_H__ -typedef int (*xchk_dqiterate_fn)(struct xfs_dquot *dq, - xfs_dqtype_t type, void *priv); -int xchk_dqiterate(struct xfs_mount *mp, xfs_dqtype_t type, - xchk_dqiterate_fn iter_fn, void *priv); +/* dquot iteration code */ + +struct xchk_dqiter { + struct xfs_scrub *sc; + + /* Quota file that we're walking. */ + struct xfs_inode *quota_ip; + + /* Cached data fork mapping for the dquot. */ + struct xfs_bmbt_irec bmap; + + /* The next dquot to scan. */ + uint64_t id; + + /* Quota type (user/group/project). */ + xfs_dqtype_t dqtype; + + /* Data fork sequence number to detect stale mappings. */ + unsigned int if_seq; +}; + +void xchk_dqiter_init(struct xchk_dqiter *cursor, struct xfs_scrub *sc, + xfs_dqtype_t dqtype); +int xchk_dquot_iter(struct xchk_dqiter *cursor, struct xfs_dquot **dqpp); #endif /* __XFS_SCRUB_QUOTA_H__ */ diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c index 29afa48512355..4641522fd9070 100644 --- a/fs/xfs/scrub/trace.c +++ b/fs/xfs/scrub/trace.c @@ -14,9 +14,11 @@ #include "xfs_btree.h" #include "xfs_ag.h" #include "xfs_rtbitmap.h" +#include "xfs_quota.h" #include "scrub/scrub.h" #include "scrub/xfile.h" #include "scrub/xfarray.h" +#include "scrub/quota.h" /* Figure out which block the btree cursor was pointing to. 
*/ static inline xfs_fsblock_t diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 8b4d3e5f60616..3bfd53b4e8d0b 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -19,6 +19,7 @@ struct xfile; struct xfarray; struct xfarray_sortinfo; +struct xchk_dqiter; /* * ftrace's __print_symbolic requires that all enum values be wrapped in the @@ -348,6 +349,54 @@ DEFINE_EVENT(xchk_fblock_error_class, name, \ DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_error); DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_warning); +#ifdef CONFIG_XFS_QUOTA +DECLARE_EVENT_CLASS(xchk_dqiter_class, + TP_PROTO(struct xchk_dqiter *cursor, uint64_t id), + TP_ARGS(cursor, id), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_dqtype_t, dqtype) + __field(xfs_ino_t, ino) + __field(unsigned long long, cur_id) + __field(unsigned long long, id) + __field(xfs_fileoff_t, startoff) + __field(xfs_fsblock_t, startblock) + __field(xfs_filblks_t, blockcount) + __field(xfs_exntst_t, state) + ), + TP_fast_assign( + __entry->dev = cursor->sc->ip->i_mount->m_super->s_dev; + __entry->dqtype = cursor->dqtype; + __entry->ino = cursor->quota_ip->i_ino; + __entry->cur_id = cursor->id; + __entry->startoff = cursor->bmap.br_startoff; + __entry->startblock = cursor->bmap.br_startblock; + __entry->blockcount = cursor->bmap.br_blockcount; + __entry->state = cursor->bmap.br_state; + __entry->id = id; + ), + TP_printk("dev %d:%d dquot type %s ino 0x%llx cursor_id 0x%llx startoff 0x%llx startblock 0x%llx blockcount 0x%llx state %u id 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __print_symbolic(__entry->dqtype, XFS_DQTYPE_STRINGS), + __entry->ino, + __entry->cur_id, + __entry->startoff, + __entry->startblock, + __entry->blockcount, + __entry->state, + __entry->id) +); + +#define DEFINE_SCRUB_DQITER_EVENT(name) \ +DEFINE_EVENT(xchk_dqiter_class, name, \ + TP_PROTO(struct xchk_dqiter *cursor, uint64_t id), \ + TP_ARGS(cursor, id)) +DEFINE_SCRUB_DQITER_EVENT(xchk_dquot_iter_revalidate_bmap); 
+DEFINE_SCRUB_DQITER_EVENT(xchk_dquot_iter_advance_bmap); +DEFINE_SCRUB_DQITER_EVENT(xchk_dquot_iter_advance_incore); +DEFINE_SCRUB_DQITER_EVENT(xchk_dquot_iter); +#endif /* CONFIG_XFS_QUOTA */ + TRACE_EVENT(xchk_incomplete, TP_PROTO(struct xfs_scrub *sc, void *ret_ip), TP_ARGS(sc, ret_ip), ^ permalink raw reply related [flat|nested] 156+ messages in thread
* Re: [PATCH 4/5] xfs: improve dquot iteration for scrub 2023-11-24 23:56 ` [PATCH 4/5] xfs: improve dquot iteration for scrub Darrick J. Wong @ 2023-11-30 5:25 ` Christoph Hellwig 0 siblings, 0 replies; 156+ messages in thread From: Christoph Hellwig @ 2023-11-30 5:25 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Nov 24, 2023 at 03:56:48PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Upon a closer inspection of the quota record scrubber, I noticed that > dqiterate wasn't actually walking all possible dquots for the mapped > blocks in the quota file. This is due to xfs_qm_dqget_next skipping all > XFS_IS_DQUOT_UNINITIALIZED dquots. > > For a fsck program, we really want to look at all the dquots, even if > all counters and limits in the dquot record are zero. Rewrite the > implementation to do this, as well as switching to an iterator paradigm > to reduce the number of indirect calls. Heh, this basically ends up doing what I suggested in the last patch (and a lot more). I'd just fold the previous patch into this one, as there's no point in moving the function just to remove it immediately. Otherwise looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/5] xfs: repair quotas 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. Wong ` (3 preceding siblings ...) 2023-11-24 23:56 ` [PATCH 4/5] xfs: improve dquot iteration for scrub Darrick J. Wong @ 2023-11-24 23:57 ` Darrick J. Wong 2023-11-30 5:33 ` Christoph Hellwig 4 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-11-24 23:57 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Fix anything that causes the quota verifiers to fail. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/Makefile | 4 fs/xfs/scrub/quota.c | 3 fs/xfs/scrub/quota.h | 2 fs/xfs/scrub/quota_repair.c | 575 +++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 7 + fs/xfs/scrub/scrub.c | 6 fs/xfs/scrub/trace.c | 1 fs/xfs/scrub/trace.h | 29 ++ fs/xfs/xfs_dquot.c | 6 fs/xfs/xfs_dquot.h | 3 10 files changed, 628 insertions(+), 8 deletions(-) create mode 100644 fs/xfs/scrub/quota_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 91008db406fb2..8d10fe02054db 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -198,5 +198,9 @@ xfs-y += $(addprefix scrub/, \ xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \ rtbitmap_repair.o \ ) + +xfs-$(CONFIG_XFS_QUOTA) += $(addprefix scrub/, \ + quota_repair.o \ + ) endif endif diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index 1a65a75025276..183d531875eae 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -21,7 +21,7 @@ #include "scrub/quota.h" /* Convert a scrub type code to a DQ flag, or return 0 if error. 
*/ -static inline xfs_dqtype_t +xfs_dqtype_t xchk_quota_to_dqtype( struct xfs_scrub *sc) { @@ -328,7 +328,6 @@ xchk_quota( if (error) break; } - xchk_ilock(sc, XFS_ILOCK_EXCL); if (error == -ECANCELED) error = 0; if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, diff --git a/fs/xfs/scrub/quota.h b/fs/xfs/scrub/quota.h index 5056b7766c4a2..6c7134ce2385e 100644 --- a/fs/xfs/scrub/quota.h +++ b/fs/xfs/scrub/quota.h @@ -6,6 +6,8 @@ #ifndef __XFS_SCRUB_QUOTA_H__ #define __XFS_SCRUB_QUOTA_H__ +xfs_dqtype_t xchk_quota_to_dqtype(struct xfs_scrub *sc); + /* dquot iteration code */ struct xchk_dqiter { diff --git a/fs/xfs/scrub/quota_repair.c b/fs/xfs/scrub/quota_repair.c new file mode 100644 index 0000000000000..0bab4c30cb85a --- /dev/null +++ b/fs/xfs/scrub/quota_repair.c @@ -0,0 +1,575 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_bit.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_inode_fork.h" +#include "xfs_alloc.h" +#include "xfs_bmap.h" +#include "xfs_quota.h" +#include "xfs_qm.h" +#include "xfs_dquot.h" +#include "xfs_dquot_item.h" +#include "xfs_reflink.h" +#include "xfs_bmap_btree.h" +#include "xfs_trans_space.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/quota.h" +#include "scrub/trace.h" +#include "scrub/repair.h" + +/* + * Quota Repair + * ============ + * + * Quota repairs are fairly simplistic; we fix everything that the dquot + * verifiers complain about, cap any counters or limits that make no sense, + * and schedule a quotacheck if we had to fix anything. 
We also repair any + * data fork extent records that don't apply to metadata files. + */ + +struct xrep_quota_info { + struct xfs_scrub *sc; + bool need_quotacheck; +}; + +/* + * Allocate a new block into a sparse hole in the quota file backing this + * dquot, initialize the block, and commit the whole mess. + */ +STATIC int +xrep_quota_item_fill_bmap_hole( + struct xfs_scrub *sc, + struct xfs_dquot *dq, + struct xfs_bmbt_irec *irec) +{ + struct xfs_buf *bp; + struct xfs_mount *mp = sc->mp; + int nmaps = 1; + int error; + + xfs_trans_ijoin(sc->tp, sc->ip, 0); + + /* Map a block into the file. */ + error = xfs_trans_reserve_more(sc->tp, XFS_QM_DQALLOC_SPACE_RES(mp), + 0); + if (error) + return error; + + error = xfs_bmapi_write(sc->tp, sc->ip, dq->q_fileoffset, + XFS_DQUOT_CLUSTER_SIZE_FSB, XFS_BMAPI_METADATA, 0, + irec, &nmaps); + if (error) + return error; + if (nmaps != 1) + return -ENOSPC; + + dq->q_blkno = XFS_FSB_TO_DADDR(mp, irec->br_startblock); + + trace_xrep_dquot_item_fill_bmap_hole(sc->mp, dq->q_type, dq->q_id); + + /* Initialize the new block. */ + error = xfs_trans_get_buf(sc->tp, mp->m_ddev_targp, dq->q_blkno, + mp->m_quotainfo->qi_dqchunklen, 0, &bp); + if (error) + return error; + bp->b_ops = &xfs_dquot_buf_ops; + + xfs_qm_init_dquot_blk(sc->tp, dq->q_id, dq->q_type, bp); + xfs_buf_set_ref(bp, XFS_DQUOT_REF); + + /* + * Finish the mapping transactions and roll one more time to + * disconnect sc->ip from sc->tp. + */ + error = xrep_defer_finish(sc); + if (error) + return error; + return xfs_trans_roll(&sc->tp); +} + +/* Make sure there's a written block backing this dquot */ +STATIC int +xrep_quota_item_bmap( + struct xfs_scrub *sc, + struct xfs_dquot *dq, + bool *dirty) +{ + struct xfs_bmbt_irec irec; + struct xfs_mount *mp = sc->mp; + struct xfs_quotainfo *qi = mp->m_quotainfo; + xfs_fileoff_t offset = dq->q_id / qi->qi_dqperchunk; + int nmaps = 1; + int error; + + /* The computed file offset should always be valid. 
*/ + if (!xfs_verify_fileoff(mp, offset)) { + ASSERT(xfs_verify_fileoff(mp, offset)); + return -EFSCORRUPTED; + } + dq->q_fileoffset = offset; + + error = xfs_bmapi_read(sc->ip, offset, 1, &irec, &nmaps, 0); + if (error) + return error; + + if (nmaps < 1 || !xfs_bmap_is_real_extent(&irec)) { + /* Hole/delalloc extent; allocate a real block. */ + error = xrep_quota_item_fill_bmap_hole(sc, dq, &irec); + if (error) + return error; + } else if (irec.br_state != XFS_EXT_NORM) { + /* Unwritten extent, which we already took care of? */ + ASSERT(irec.br_state == XFS_EXT_NORM); + return -EFSCORRUPTED; + } else if (dq->q_blkno != XFS_FSB_TO_DADDR(mp, irec.br_startblock)) { + /* + * If the cached daddr is incorrect, repair probably punched a + * hole out of the quota file and filled it back in with a new + * block. Update the block mapping in the dquot. + */ + dq->q_blkno = XFS_FSB_TO_DADDR(mp, irec.br_startblock); + } + + *dirty = true; + return 0; +} + +/* Reset quota timers if incorrectly set. */ +static inline void +xrep_quota_item_timer( + struct xfs_scrub *sc, + const struct xfs_dquot_res *res, + bool *dirty) +{ + if ((res->softlimit && res->count > res->softlimit) || + (res->hardlimit && res->count > res->hardlimit)) { + if (!res->timer) + *dirty = true; + } else { + if (res->timer) + *dirty = true; + } +} + +/* Scrub the fields in an individual quota item. */ +STATIC int +xrep_quota_item( + struct xrep_quota_info *rqi, + struct xfs_dquot *dq) +{ + struct xfs_scrub *sc = rqi->sc; + struct xfs_mount *mp = sc->mp; + xfs_ino_t fs_icount; + bool dirty = false; + int error = 0; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + return error; + + /* + * We might need to fix holes in the bmap record for the storage + * backing this dquot, so we need to lock the dquot and the quota file. + * dqiterate gave us a locked dquot, so drop the dquot lock to get the + * ILOCK_EXCL. 
+ */ + xfs_dqunlock(dq); + xchk_ilock(sc, XFS_ILOCK_EXCL); + xfs_dqlock(dq); + + error = xrep_quota_item_bmap(sc, dq, &dirty); + xchk_iunlock(sc, XFS_ILOCK_EXCL); + if (error) + return error; + + /* Check the limits. */ + if (dq->q_blk.softlimit > dq->q_blk.hardlimit) { + dq->q_blk.softlimit = dq->q_blk.hardlimit; + dirty = true; + } + + if (dq->q_ino.softlimit > dq->q_ino.hardlimit) { + dq->q_ino.softlimit = dq->q_ino.hardlimit; + dirty = true; + } + + if (dq->q_rtb.softlimit > dq->q_rtb.hardlimit) { + dq->q_rtb.softlimit = dq->q_rtb.hardlimit; + dirty = true; + } + + /* + * Check that usage doesn't exceed physical limits. However, on + * a reflink filesystem we're allowed to exceed physical space + * if there are no quota limits. We don't know what the real number + * is, but we can make quotacheck find out for us. + */ + if (!xfs_has_reflink(mp) && dq->q_blk.count > mp->m_sb.sb_dblocks) { + dq->q_blk.reserved -= dq->q_blk.count; + dq->q_blk.reserved += mp->m_sb.sb_dblocks; + dq->q_blk.count = mp->m_sb.sb_dblocks; + rqi->need_quotacheck = true; + dirty = true; + } + fs_icount = percpu_counter_sum(&mp->m_icount); + if (dq->q_ino.count > fs_icount) { + dq->q_ino.reserved -= dq->q_ino.count; + dq->q_ino.reserved += fs_icount; + dq->q_ino.count = fs_icount; + rqi->need_quotacheck = true; + dirty = true; + } + if (dq->q_rtb.count > mp->m_sb.sb_rblocks) { + dq->q_rtb.reserved -= dq->q_rtb.count; + dq->q_rtb.reserved += mp->m_sb.sb_rblocks; + dq->q_rtb.count = mp->m_sb.sb_rblocks; + rqi->need_quotacheck = true; + dirty = true; + } + + xrep_quota_item_timer(sc, &dq->q_blk, &dirty); + xrep_quota_item_timer(sc, &dq->q_ino, &dirty); + xrep_quota_item_timer(sc, &dq->q_rtb, &dirty); + + if (!dirty) + return 0; + + trace_xrep_dquot_item(sc->mp, dq->q_type, dq->q_id); + + dq->q_flags |= XFS_DQFLAG_DIRTY; + xfs_trans_dqjoin(sc->tp, dq); + if (dq->q_id) { + xfs_qm_adjust_dqlimits(dq); + xfs_qm_adjust_dqtimers(dq); + } + xfs_trans_log_dquot(sc->tp, dq); + error = 
xfs_trans_roll(&sc->tp); + xfs_dqlock(dq); + return error; +} + +/* Fix a quota timer so that we can pass the verifier. */ +STATIC void +xrep_quota_fix_timer( + struct xfs_mount *mp, + const struct xfs_disk_dquot *ddq, + __be64 softlimit, + __be64 countnow, + __be32 *timer, + time64_t timelimit) +{ + uint64_t soft = be64_to_cpu(softlimit); + uint64_t count = be64_to_cpu(countnow); + time64_t new_timer; + uint32_t t; + + if (!soft || count <= soft || *timer != 0) + return; + + new_timer = xfs_dquot_set_timeout(mp, + ktime_get_real_seconds() + timelimit); + if (ddq->d_type & XFS_DQTYPE_BIGTIME) + t = xfs_dq_unix_to_bigtime(new_timer); + else + t = new_timer; + + *timer = cpu_to_be32(t); +} + +/* Fix anything the verifiers complain about. */ +STATIC int +xrep_quota_block( + struct xfs_scrub *sc, + xfs_daddr_t daddr, + xfs_dqtype_t dqtype, + xfs_dqid_t id) +{ + struct xfs_dqblk *dqblk; + struct xfs_disk_dquot *ddq; + struct xfs_quotainfo *qi = sc->mp->m_quotainfo; + struct xfs_def_quota *defq = xfs_get_defquota(qi, dqtype); + struct xfs_buf *bp = NULL; + enum xfs_blft buftype = 0; + int i; + int error; + + error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp, daddr, + qi->qi_dqchunklen, 0, &bp, &xfs_dquot_buf_ops); + switch (error) { + case -EFSBADCRC: + case -EFSCORRUPTED: + /* Failed verifier, retry read with no ops. */ + error = xfs_trans_read_buf(sc->mp, sc->tp, + sc->mp->m_ddev_targp, daddr, qi->qi_dqchunklen, + 0, &bp, NULL); + if (error) + return error; + break; + case 0: + dqblk = bp->b_addr; + ddq = &dqblk[0].dd_diskdq; + + /* + * If there's nothing that would impede a dqiterate, we're + * done. + */ + if ((ddq->d_type & XFS_DQTYPE_REC_MASK) != dqtype || + id == be32_to_cpu(ddq->d_id)) { + xfs_trans_brelse(sc->tp, bp); + return 0; + } + break; + default: + return error; + } + + /* Something's wrong with the block, fix the whole thing. 
*/ + dqblk = bp->b_addr; + bp->b_ops = &xfs_dquot_buf_ops; + for (i = 0; i < qi->qi_dqperchunk; i++, dqblk++) { + ddq = &dqblk->dd_diskdq; + + trace_xrep_disk_dquot(sc->mp, dqtype, id + i); + + ddq->d_magic = cpu_to_be16(XFS_DQUOT_MAGIC); + ddq->d_version = XFS_DQUOT_VERSION; + ddq->d_type = dqtype; + ddq->d_id = cpu_to_be32(id + i); + + if (xfs_has_bigtime(sc->mp) && ddq->d_id) + ddq->d_type |= XFS_DQTYPE_BIGTIME; + + xrep_quota_fix_timer(sc->mp, ddq, ddq->d_blk_softlimit, + ddq->d_bcount, &ddq->d_btimer, + defq->blk.time); + + xrep_quota_fix_timer(sc->mp, ddq, ddq->d_ino_softlimit, + ddq->d_icount, &ddq->d_itimer, + defq->ino.time); + + xrep_quota_fix_timer(sc->mp, ddq, ddq->d_rtb_softlimit, + ddq->d_rtbcount, &ddq->d_rtbtimer, + defq->rtb.time); + + /* We only support v5 filesystems so always set these. */ + uuid_copy(&dqblk->dd_uuid, &sc->mp->m_sb.sb_meta_uuid); + xfs_update_cksum((char *)dqblk, sizeof(struct xfs_dqblk), + XFS_DQUOT_CRC_OFF); + dqblk->dd_lsn = 0; + } + switch (dqtype) { + case XFS_DQTYPE_USER: + buftype = XFS_BLFT_UDQUOT_BUF; + break; + case XFS_DQTYPE_GROUP: + buftype = XFS_BLFT_GDQUOT_BUF; + break; + case XFS_DQTYPE_PROJ: + buftype = XFS_BLFT_PDQUOT_BUF; + break; + } + xfs_trans_buf_set_type(sc->tp, bp, buftype); + xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1); + return xrep_roll_trans(sc); +} + +/* + * Repair a quota file's data fork. The function returns with the inode + * joined. + */ +STATIC int +xrep_quota_data_fork( + struct xfs_scrub *sc, + xfs_dqtype_t dqtype) +{ + struct xfs_bmbt_irec irec = { 0 }; + struct xfs_iext_cursor icur; + struct xfs_quotainfo *qi = sc->mp->m_quotainfo; + struct xfs_ifork *ifp; + xfs_fileoff_t max_dqid_off; + xfs_fileoff_t off; + xfs_fsblock_t fsbno; + bool truncate = false; + bool joined = false; + int error = 0; + + error = xrep_metadata_inode_forks(sc); + if (error) + goto out; + + /* Check for data fork problems that apply only to quota files. 
*/ + max_dqid_off = XFS_DQ_ID_MAX / qi->qi_dqperchunk; + ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK); + for_each_xfs_iext(ifp, &icur, &irec) { + if (isnullstartblock(irec.br_startblock)) { + error = -EFSCORRUPTED; + goto out; + } + + if (irec.br_startoff > max_dqid_off || + irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) { + truncate = true; + break; + } + + /* Convert unwritten extents to real ones. */ + if (irec.br_state == XFS_EXT_UNWRITTEN) { + struct xfs_bmbt_irec nrec; + int nmap = 1; + + if (!joined) { + xfs_trans_ijoin(sc->tp, sc->ip, 0); + joined = true; + } + + error = xfs_bmapi_write(sc->tp, sc->ip, + irec.br_startoff, irec.br_blockcount, + XFS_BMAPI_CONVERT, 0, &nrec, &nmap); + if (error) + goto out; + if (nmap != 1) { + error = -ENOSPC; + goto out; + } + ASSERT(nrec.br_startoff == irec.br_startoff); + ASSERT(nrec.br_blockcount == irec.br_blockcount); + + error = xfs_defer_finish(&sc->tp); + if (error) + goto out; + } + } + + if (!joined) { + xfs_trans_ijoin(sc->tp, sc->ip, 0); + joined = true; + } + + if (truncate) { + /* Erase everything after the block containing the max dquot */ + error = xfs_bunmapi_range(&sc->tp, sc->ip, 0, + max_dqid_off * sc->mp->m_sb.sb_blocksize, + XFS_MAX_FILEOFF); + if (error) + goto out; + + /* Remove all CoW reservations. */ + error = xfs_reflink_cancel_cow_blocks(sc->ip, &sc->tp, 0, + XFS_MAX_FILEOFF, true); + if (error) + goto out; + sc->ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK; + + /* + * Always re-log the inode so that our permanent transaction + * can keep on rolling it forward in the log. + */ + xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE); + } + + /* Now go fix anything that fails the verifiers. 
*/ + for_each_xfs_iext(ifp, &icur, &irec) { + for (fsbno = irec.br_startblock, off = irec.br_startoff; + fsbno < irec.br_startblock + irec.br_blockcount; + fsbno += XFS_DQUOT_CLUSTER_SIZE_FSB, + off += XFS_DQUOT_CLUSTER_SIZE_FSB) { + error = xrep_quota_block(sc, + XFS_FSB_TO_DADDR(sc->mp, fsbno), + dqtype, off * qi->qi_dqperchunk); + if (error) + goto out; + } + } + +out: + return error; +} + +/* + * Go fix anything in the quota items that we could have been mad about. Now + * that we've checked the quota inode data fork we have to drop ILOCK_EXCL to + * use the regular dquot functions. + */ +STATIC int +xrep_quota_problems( + struct xfs_scrub *sc, + xfs_dqtype_t dqtype) +{ + struct xchk_dqiter cursor = { }; + struct xrep_quota_info rqi = { .sc = sc }; + struct xfs_dquot *dq; + int error; + + xchk_dqiter_init(&cursor, sc, dqtype); + while ((error = xchk_dquot_iter(&cursor, &dq)) == 1) { + error = xrep_quota_item(&rqi, dq); + xfs_qm_dqput(dq); + if (error) + break; + } + if (error) + return error; + + /* Make a quotacheck happen. */ + if (rqi.need_quotacheck) + xrep_force_quotacheck(sc, dqtype); + return 0; +} + +/* Repair all of a quota type's items. */ +int +xrep_quota( + struct xfs_scrub *sc) +{ + xfs_dqtype_t dqtype; + int error; + + dqtype = xchk_quota_to_dqtype(sc); + + /* + * Re-take the ILOCK so that we can fix any problems that we found + * with the data fork mappings, or with the dquot bufs themselves. + */ + if (!(sc->ilock_flags & XFS_ILOCK_EXCL)) + xchk_ilock(sc, XFS_ILOCK_EXCL); + error = xrep_quota_data_fork(sc, dqtype); + if (error) + return error; + + /* + * Finish deferred items and roll the transaction to unjoin the quota + * inode from transaction so that we can unlock the quota inode; we + * play only with dquots from now on. 
+ */ + error = xrep_defer_finish(sc); + if (error) + return error; + error = xfs_trans_roll(&sc->tp); + if (error) + return error; + xchk_iunlock(sc, sc->ilock_flags); + + /* Fix anything the dquot verifiers don't complain about. */ + error = xrep_quota_problems(sc, dqtype); + if (error) + return error; + + return xrep_trans_commit(sc); +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index f54dff9268bcc..da0866f9b4525 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -121,6 +121,12 @@ int xrep_rtbitmap(struct xfs_scrub *sc); # define xrep_rtbitmap xrep_notsupported #endif /* CONFIG_XFS_RT */ +#ifdef CONFIG_XFS_QUOTA +int xrep_quota(struct xfs_scrub *sc); +#else +# define xrep_quota xrep_notsupported +#endif /* CONFIG_XFS_QUOTA */ + int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -184,6 +190,7 @@ xrep_setup_nothing( #define xrep_bmap_attr xrep_notsupported #define xrep_bmap_cow xrep_notsupported #define xrep_rtbitmap xrep_notsupported +#define xrep_quota xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 9982b626bfc33..0fbfed522d656 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -342,19 +342,19 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_FS, .setup = xchk_setup_quota, .scrub = xchk_quota, - .repair = xrep_notsupported, + .repair = xrep_quota, }, [XFS_SCRUB_TYPE_GQUOTA] = { /* group quota */ .type = ST_FS, .setup = xchk_setup_quota, .scrub = xchk_quota, - .repair = xrep_notsupported, + .repair = xrep_quota, }, [XFS_SCRUB_TYPE_PQUOTA] = { /* project quota */ .type = ST_FS, .setup = xchk_setup_quota, .scrub = xchk_quota, - .repair = xrep_notsupported, + .repair = xrep_quota, }, [XFS_SCRUB_TYPE_FSCOUNTERS] = { /* fs summary counters */ .type = ST_FS, diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c index 4641522fd9070..d0e24ffaf7547 100644 --- a/fs/xfs/scrub/trace.c +++ 
b/fs/xfs/scrub/trace.c @@ -15,6 +15,7 @@ #include "xfs_ag.h" #include "xfs_rtbitmap.h" #include "xfs_quota.h" +#include "xfs_quota_defs.h" #include "scrub/scrub.h" #include "scrub/xfile.h" #include "scrub/xfarray.h" diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 3bfd53b4e8d0b..f8e316357f56f 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1732,6 +1732,35 @@ TRACE_EVENT(xrep_cow_free_staging, __entry->blockcount) ); +#ifdef CONFIG_XFS_QUOTA +DECLARE_EVENT_CLASS(xrep_dquot_class, + TP_PROTO(struct xfs_mount *mp, uint8_t type, uint32_t id), + TP_ARGS(mp, type, id), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(uint8_t, type) + __field(uint32_t, id) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->id = id; + __entry->type = type; + ), + TP_printk("dev %d:%d type %s id 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __print_flags(__entry->type, "|", XFS_DQTYPE_STRINGS), + __entry->id) +); + +#define DEFINE_XREP_DQUOT_EVENT(name) \ +DEFINE_EVENT(xrep_dquot_class, name, \ + TP_PROTO(struct xfs_mount *mp, uint8_t type, uint32_t id), \ + TP_ARGS(mp, type, id)) +DEFINE_XREP_DQUOT_EVENT(xrep_dquot_item); +DEFINE_XREP_DQUOT_EVENT(xrep_disk_dquot); +DEFINE_XREP_DQUOT_EVENT(xrep_dquot_item_fill_bmap_hole); +#endif /* CONFIG_XFS_QUOTA */ + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index 60ec401e26ffd..a93ad76f23c56 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -172,14 +172,14 @@ xfs_qm_adjust_dqtimers( /* * initialize a buffer full of dquots and log the whole thing */ -STATIC void +void xfs_qm_init_dquot_blk( struct xfs_trans *tp, - struct xfs_mount *mp, xfs_dqid_t id, xfs_dqtype_t type, struct xfs_buf *bp) { + struct xfs_mount *mp = tp->t_mountp; struct xfs_quotainfo *q = mp->m_quotainfo; struct xfs_dqblk *d; xfs_dqid_t curid; @@ -353,7 +353,7 @@ xfs_dquot_disk_alloc( * Make a chunk of dquots out of this 
buffer and log * the entire thing. */ - xfs_qm_init_dquot_blk(tp, mp, dqp->q_id, qtype, bp); + xfs_qm_init_dquot_blk(tp, dqp->q_id, qtype, bp); xfs_buf_set_ref(bp, XFS_DQUOT_REF); /* diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h index 8d9d4b0d979d0..956272d9b302f 100644 --- a/fs/xfs/xfs_dquot.h +++ b/fs/xfs/xfs_dquot.h @@ -237,4 +237,7 @@ static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp) time64_t xfs_dquot_set_timeout(struct xfs_mount *mp, time64_t timeout); time64_t xfs_dquot_set_grace_period(time64_t grace); +void xfs_qm_init_dquot_blk(struct xfs_trans *tp, xfs_dqid_t id, xfs_dqtype_t + type, struct xfs_buf *bp); + #endif /* __XFS_DQUOT_H__ */ ^ permalink raw reply related [flat|nested] 156+ messages in thread
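The grace-period rule that xrep_quota_fix_timer() implements in the patch above is easy to state: start a timer only when a soft limit is set, current usage exceeds it, and no timer is already running. Below is a minimal userspace sketch of that rule; plain host-endian integers stand in for the on-disk big-endian fields, the bigtime encoding and xfs_dquot_set_timeout() clamping are ignored, and the helper name is mine, not part of the patch:

```c
#include <stdint.h>

/*
 * Decide whether a quota grace-period timer needs to be started.
 * Mirrors the check in xrep_quota_fix_timer(): do nothing unless a
 * soft limit exists, usage is over it, and no timer runs yet.
 * Returns the (possibly updated) timer expiration.
 */
static int64_t quota_timer_fixup(uint64_t softlimit, uint64_t count,
				 int64_t timer, int64_t grace_period,
				 int64_t now)
{
	if (!softlimit || count <= softlimit || timer != 0)
		return timer;	/* nothing to fix */
	return now + grace_period;
}
```

A dquot that is over its soft limit with a zero timer would otherwise fail the dquot verifier, which is why the repair code backfills a plausible expiration instead of leaving the field zero.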
* Re: [PATCH 5/5] xfs: repair quotas
  2023-11-24 23:57 ` [PATCH 5/5] xfs: repair quotas Darrick J. Wong
@ 2023-11-30  5:33 ` Christoph Hellwig
  2023-11-30 22:10   ` Darrick J. Wong
  0 siblings, 1 reply; 156+ messages in thread
From: Christoph Hellwig @ 2023-11-30  5:33 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs

> @@ -328,7 +328,6 @@ xchk_quota(
>  		if (error)
>  			break;
>  	}
> -	xchk_ilock(sc, XFS_ILOCK_EXCL);
>  	if (error == -ECANCELED)
>  		error = 0;
>  	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,

What is the replacement for this lock?  The call in xrep_quota_item?

I'm a little confused on how locking works - about all flags in
sc->ilock_flags are released, and here we used just lock the
exclusive ilock directly without tracking it.

^ permalink raw reply	[flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair quotas
  2023-11-30  5:33 ` Christoph Hellwig
@ 2023-11-30 22:10   ` Darrick J. Wong
  2023-12-04  4:48     ` Christoph Hellwig
  2023-12-04  4:49     ` Christoph Hellwig
  0 siblings, 2 replies; 156+ messages in thread
From: Darrick J. Wong @ 2023-11-30 22:10 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Nov 29, 2023 at 09:33:41PM -0800, Christoph Hellwig wrote:
> > @@ -328,7 +328,6 @@ xchk_quota(
> >  		if (error)
> >  			break;
> >  	}
> > -	xchk_ilock(sc, XFS_ILOCK_EXCL);
> >  	if (error == -ECANCELED)
> >  		error = 0;
> >  	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,
> 
> What is the replacement for this lock?  The call in xrep_quota_item?

The replacement is the conditional re-lock at the start of xrep_quota.
I could have left this alone, though for the scrub-only case it reduces
lock cycling by one.

> I'm a little confused on how locking works - about all flags in
> sc->ilock_flags are released, and here we used just lock the
> exclusive ilock directly without tracking it.

Fixing quota files is a bit of a locking mess because dqget needs to
take the ILOCK to read the data fork if the dquot isn't already in
memory.  As a result, the scrub functions have to take the lock to
inspect the data fork for mapping problems only to drop it before
iterating the incore dquots.  For each dquot, we then have to take the
ILOCK *again* to check that its q_fileoffset and q_blkno fields
actually match the bmap.

Similarly, repair has to retake the lock to fix any problems that were
found with the mapping before dropping once again to walk the incore
dquots.  Then we cycle ILOCK for every dquot to fix the mapping.

Not sure what you meant about "we used just lock the exclusive lock
directly without tracking it" -- both files call xchk_{ilock,iunlock}.
The telemetry data I've collected shows that quota file checking is
sorta slow, so perhaps it would be justified to create a special
no-alloc dqget function where the caller is allowed to pre-acquire the
ILOCK.

--D

^ permalink raw reply	[flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair quotas
  2023-11-30 22:10 ` Darrick J. Wong
@ 2023-12-04  4:48   ` Christoph Hellwig
  2023-12-04 20:52     ` Darrick J. Wong
  1 sibling, 1 reply; 156+ messages in thread
From: Christoph Hellwig @ 2023-12-04  4:48 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Thu, Nov 30, 2023 at 02:10:15PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 29, 2023 at 09:33:41PM -0800, Christoph Hellwig wrote:
> > > @@ -328,7 +328,6 @@ xchk_quota(
> > >  		if (error)
> > >  			break;
> > >  	}
> > > -	xchk_ilock(sc, XFS_ILOCK_EXCL);
> > >  	if (error == -ECANCELED)
> > >  		error = 0;
> > >  	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,
> > 
> > What is the replacement for this lock?  The call in xrep_quota_item?
> 
> The replacement is the conditional re-lock at the start of xrep_quota.

Hmm. but not all scrub calls do even end up in the repair callbacks,
do they?  Ok, I guess the xchk_iunlock call in xchk_teardown would have
just released it a bit later and we skip the cycle.  Would have been
a lot easier to understand if this was in a well-explained
self-contained patch..

> Not sure what you meant about "we used just lock the exclusive lock
> directly without tracking it" -- both files call xchk_{ilock,iunlock}.
> The telemetry data I've collected shows that quota file checking is
> sorta slow, so perhaps it would be justified to create a special
> no-alloc dqget function where the caller is allowed to pre-acquire the
> ILOCK.

My confusions was more about checking/using sc->ilock_flags in the
callers, while it is maintained by the locking helpers.  Probably not
*THAT* unusual, but I might have simply been too tired to understand it.

^ permalink raw reply	[flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair quotas
  2023-12-04  4:48 ` Christoph Hellwig
@ 2023-12-04 20:52   ` Darrick J. Wong
  2023-12-05  4:27     ` Christoph Hellwig
  0 siblings, 1 reply; 156+ messages in thread
From: Darrick J. Wong @ 2023-12-04 20:52 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs

On Sun, Dec 03, 2023 at 08:48:20PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 30, 2023 at 02:10:15PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 29, 2023 at 09:33:41PM -0800, Christoph Hellwig wrote:
> > > > @@ -328,7 +328,6 @@ xchk_quota(
> > > >  		if (error)
> > > >  			break;
> > > >  	}
> > > > -	xchk_ilock(sc, XFS_ILOCK_EXCL);
> > > >  	if (error == -ECANCELED)
> > > >  		error = 0;
> > > >  	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,
> > > 
> > > What is the replacement for this lock?  The call in xrep_quota_item?
> > 
> > The replacement is the conditional re-lock at the start of xrep_quota.
> 
> Hmm. but not all scrub calls do even end up in the repair callbacks,
> do they?  Ok, I guess the xchk_iunlock call in xchk_teardown would have
> just released it a bit later and we skip the cycle.  Would have been
> a lot easier to understand if this was in a well-explained
> self-contained patch..

How about I not remove the xchk_ilock call, then?  Repair is already
smart enough to take the lock if it doesn't have it, so it's not
strictly necessary for correct operation.

> > Not sure what you meant about "we used just lock the exclusive lock
> > directly without tracking it" -- both files call xchk_{ilock,iunlock}.
> > The telemetry data I've collected shows that quota file checking is
> > sorta slow, so perhaps it would be justified to create a special
> > no-alloc dqget function where the caller is allowed to pre-acquire the
> > ILOCK.
> 
> My confusions was more about checking/using sc->ilock_flags in the
> callers, while it is maintained by the locking helpers.  Probably not
> *THAT* unusual, but I might have simply been too tired to understand it.

Ah, got it.

I'll ponder a no-alloc dqget in the meantime.

--D

^ permalink raw reply	[flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair quotas
  2023-12-04 20:52 ` Darrick J. Wong
@ 2023-12-05  4:27   ` Christoph Hellwig
  2023-12-05  5:20     ` Darrick J. Wong
  0 siblings, 1 reply; 156+ messages in thread
From: Christoph Hellwig @ 2023-12-05  4:27 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Mon, Dec 04, 2023 at 12:52:14PM -0800, Darrick J. Wong wrote:
> > > > > -	xchk_ilock(sc, XFS_ILOCK_EXCL);
> > > > >  	if (error == -ECANCELED)
> > > > >  		error = 0;
> > > > >  	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,
> > > > 
> > > > What is the replacement for this lock?  The call in xrep_quota_item?
> > > 
> > > The replacement is the conditional re-lock at the start of xrep_quota.
> > 
> > Hmm. but not all scrub calls do even end up in the repair callbacks,
> > do they?  Ok, I guess the xchk_iunlock call in xchk_teardown would have
> > just released it a bit later and we skip the cycle.  Would have been
> > a lot easier to understand if this was in a well-explained
> > self-contained patch..
> 
> How about I not remove the xchk_ilock call, then?  Repair is already
> smart enough to take the lock if it doesn't have it, so it's not
> strictly necessary for correct operation.

No, please keep this hunk.  As I said I would have preferred to have
it in a separate hunk to understand it, but it understand it now, and it
does seems useful.

^ permalink raw reply	[flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair quotas
  2023-12-05  4:27 ` Christoph Hellwig
@ 2023-12-05  5:20   ` Darrick J. Wong
  0 siblings, 0 replies; 156+ messages in thread
From: Darrick J. Wong @ 2023-12-05  5:20 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Dec 04, 2023 at 08:27:50PM -0800, Christoph Hellwig wrote:
> On Mon, Dec 04, 2023 at 12:52:14PM -0800, Darrick J. Wong wrote:
> > > > > > -	xchk_ilock(sc, XFS_ILOCK_EXCL);
> > > > > >  	if (error == -ECANCELED)
> > > > > >  		error = 0;
> > > > > >  	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,
> > > > > 
> > > > > What is the replacement for this lock?  The call in xrep_quota_item?
> > > > 
> > > > The replacement is the conditional re-lock at the start of xrep_quota.
> > > 
> > > Hmm. but not all scrub calls do even end up in the repair callbacks,
> > > do they?  Ok, I guess the xchk_iunlock call in xchk_teardown would have
> > > just released it a bit later and we skip the cycle.  Would have been
> > > a lot easier to understand if this was in a well-explained
> > > self-contained patch..
> > 
> > How about I not remove the xchk_ilock call, then?  Repair is already
> > smart enough to take the lock if it doesn't have it, so it's not
> > strictly necessary for correct operation.
> 
> No, please keep this hunk.  As I said I would have preferred to have
> it in a separate hunk to understand it, but it understand it now, and it
> does seems useful.

Ok, I'll keep it then.

--D

^ permalink raw reply	[flat|nested] 156+ messages in thread
* Re: [PATCH 5/5] xfs: repair quotas
  2023-11-30 22:10 ` Darrick J. Wong
  2023-12-04  4:48 ` Christoph Hellwig
@ 2023-12-04  4:49 ` Christoph Hellwig
  1 sibling, 0 replies; 156+ messages in thread
From: Christoph Hellwig @ 2023-12-04  4:49 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Thu, Nov 30, 2023 at 02:10:15PM -0800, Darrick J. Wong wrote:
> sorta slow, so perhaps it would be justified to create a special
> no-alloc dqget function where the caller is allowed to pre-acquire the
> ILOCK.

That being said, a no-alloc dqget does sound like a sensible idea in
general.

^ permalink raw reply	[flat|nested] 156+ messages in thread
* [PATCHSET v26.0 0/5] xfs: online repair of AG btrees @ 2023-07-27 22:20 Darrick J. Wong 2023-07-27 22:31 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-07-27 22:20 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs Hi all, Now that we've spent a lot of time reworking common code in online fsck, we're ready to start rebuilding the AG space btrees. This series implements repair functions for the free space, inode, and refcount btrees. Rebuilding the reverse mapping btree is much more intense and is left for a subsequent patchset. The fstests counterpart of this patchset implements stress testing of repair. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-ag-btrees --- fs/xfs/Makefile | 3 fs/xfs/libxfs/xfs_ag.h | 10 fs/xfs/libxfs/xfs_ag_resv.c | 2 fs/xfs/libxfs/xfs_alloc.c | 18 - fs/xfs/libxfs/xfs_alloc.h | 2 fs/xfs/libxfs/xfs_alloc_btree.c | 13 - fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_ialloc.c | 41 +- fs/xfs/libxfs/xfs_ialloc.h | 3 fs/xfs/libxfs/xfs_refcount.c | 18 - fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 - fs/xfs/libxfs/xfs_types.h | 7 fs/xfs/scrub/alloc.c | 14 - fs/xfs/scrub/alloc_repair.c | 912 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/common.c | 153 ++++++ fs/xfs/scrub/common.h | 22 + fs/xfs/scrub/ialloc.c | 3 fs/xfs/scrub/ialloc_repair.c | 882 +++++++++++++++++++++++++++++++++++ 
fs/xfs/scrub/newbt.c | 45 ++ fs/xfs/scrub/newbt.h | 6 fs/xfs/scrub/refcount_repair.c | 796 +++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.c | 128 +++++ fs/xfs/scrub/repair.h | 43 ++ fs/xfs/scrub/scrub.c | 22 + fs/xfs/scrub/scrub.h | 9 fs/xfs/scrub/trace.h | 134 ++++- fs/xfs/scrub/xfarray.h | 22 + fs/xfs/xfs_extent_busy.c | 13 + fs/xfs/xfs_extent_busy.h | 2 fs/xfs/xfs_icache.c | 38 -- fs/xfs/xfs_icache.h | 4 33 files changed, 3277 insertions(+), 131 deletions(-) create mode 100644 fs/xfs/scrub/alloc_repair.c create mode 100644 fs/xfs/scrub/ialloc_repair.c create mode 100644 fs/xfs/scrub/refcount_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/5] xfs: repair refcount btrees 2023-07-27 22:20 [PATCHSET v26.0 0/5] xfs: online repair of AG btrees Darrick J. Wong @ 2023-07-27 22:31 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-07-27 22:31 UTC (permalink / raw) To: djwong; +Cc: Dave Chinner, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Reconstruct the refcount data from the rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_ag.h | 1 fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_refcount.c | 18 + fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 + fs/xfs/scrub/refcount_repair.c | 796 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 22 + 11 files changed, 867 insertions(+), 18 deletions(-) create mode 100644 fs/xfs/scrub/refcount_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 7fed0e706cfa0..a6f708dc56cc2 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -185,6 +185,7 @@ xfs-y += $(addprefix scrub/, \ ialloc_repair.o \ newbt.o \ reap.o \ + refcount_repair.o \ repair.o \ ) endif diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 686f4eadd5743..616812911a23f 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -87,6 +87,7 @@ struct xfs_perag { * verifiers while rebuilding the AG btrees. */ uint8_t pagf_alt_levels[XFS_BTNUM_AGF]; + uint8_t pagf_alt_refcount_level; #endif spinlock_t pag_state_lock; diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index c100e92140be1..ea8d3659df208 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -5212,3 +5212,29 @@ xfs_btree_destroy_cur_caches(void) xfs_rmapbt_destroy_cur_cache(); xfs_refcountbt_destroy_cur_cache(); } + +/* Move the btree cursor before the first record. 
*/ +int +xfs_btree_goto_left_edge( + struct xfs_btree_cur *cur) +{ + int stat = 0; + int error; + + memset(&cur->bc_rec, 0, sizeof(cur->bc_rec)); + error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat); + if (error) + return error; + if (!stat) + return 0; + + error = xfs_btree_decrement(cur, 0, &stat); + if (error) + return error; + if (stat != 0) { + ASSERT(0); + return -EFSCORRUPTED; + } + + return 0; +} diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index e0875cec49392..d906324e25c86 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -738,4 +738,6 @@ xfs_btree_alloc_cursor( int __init xfs_btree_init_cur_caches(void); void xfs_btree_destroy_cur_caches(void); +int xfs_btree_goto_left_edge(struct xfs_btree_cur *cur); + #endif /* __XFS_BTREE_H__ */ diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index 646b3fa362ad0..8db7b6163e55f 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -120,14 +120,11 @@ xfs_refcount_btrec_to_irec( irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount); } -/* Simple checks for refcount records. */ -xfs_failaddr_t -xfs_refcount_check_irec( - struct xfs_btree_cur *cur, +inline xfs_failaddr_t +xfs_refcount_check_perag_irec( + struct xfs_perag *pag, const struct xfs_refcount_irec *irec) { - struct xfs_perag *pag = cur->bc_ag.pag; - if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) return __this_address; @@ -144,6 +141,15 @@ xfs_refcount_check_irec( return NULL; } +/* Simple checks for refcount records. 
*/ +xfs_failaddr_t +xfs_refcount_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *irec) +{ + return xfs_refcount_check_perag_irec(cur->bc_ag.pag, irec); +} + static inline int xfs_refcount_complain_bad_rec( struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index 783cd89ca1951..2d6fecb258bb1 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -117,6 +117,8 @@ extern int xfs_refcount_has_records(struct xfs_btree_cur *cur, union xfs_btree_rec; extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_refcount_irec *irec); +xfs_failaddr_t xfs_refcount_check_perag_irec(struct xfs_perag *pag, + const struct xfs_refcount_irec *irec); xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur, const struct xfs_refcount_irec *irec); extern int xfs_refcount_insert(struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index 5c3987d8dc242..50fe789efc938 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -226,7 +226,18 @@ xfs_refcountbt_verify( level = be16_to_cpu(block->bb_level); if (pag && xfs_perag_initialised_agf(pag)) { - if (level >= pag->pagf_refcount_level) + unsigned int maxlevel = pag->pagf_refcount_level; + +#ifdef CONFIG_XFS_ONLINE_REPAIR + /* + * Online repair could be rewriting the refcount btree, so + * we'll validate against the larger of either tree while this + * is going on. 
+ */ + maxlevel = max_t(unsigned int, maxlevel, + pag->pagf_alt_refcount_level); +#endif + if (level >= maxlevel) return __this_address; } else if (level >= mp->m_refc_maxlevels) return __this_address; diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c new file mode 100644 index 0000000000000..23d0dacc1d15a --- /dev/null +++ b/fs/xfs/scrub/refcount_repair.c @@ -0,0 +1,796 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_inode.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_alloc.h" +#include "xfs_ialloc.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_refcount.h" +#include "xfs_refcount_btree.h" +#include "xfs_error.h" +#include "xfs_ag.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/newbt.h" +#include "scrub/reap.h" + +/* + * Rebuilding the Reference Count Btree + * ==================================== + * + * This algorithm is "borrowed" from xfs_repair. Imagine the rmap + * entries as rectangles representing extents of physical blocks, and + * that the rectangles can be laid down to allow them to overlap each + * other; then we know that we must emit a refcnt btree entry wherever + * the amount of overlap changes, i.e. 
the emission stimulus is + * level-triggered: + * + * - --- + * -- ----- ---- --- ------ + * -- ---- ----------- ---- --------- + * -------------------------------- ----------- + * ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + * 2 1 23 21 3 43 234 2123 1 01 2 3 0 + * + * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner). + * + * Note that in the actual refcnt btree we don't store the refcount < 2 + * cases because the bnobt tells us which blocks are free; single-use + * blocks aren't recorded in the bnobt or the refcntbt. If the rmapbt + * supports storing multiple entries covering a given block we could + * theoretically dispense with the refcntbt and simply count rmaps, but + * that's inefficient in the (hot) write path, so we'll take the cost of + * the extra tree to save time. Also there's no guarantee that rmap + * will be enabled. + * + * Given an array of rmaps sorted by physical block number, a starting + * physical block (sp), a bag to hold rmaps that cover sp, and the next + * physical block where the level changes (np), we can reconstruct the + * refcount btree as follows: + * + * While there are still unprocessed rmaps in the array, + * - Set sp to the physical block (pblk) of the next unprocessed rmap. + * - Add to the bag all rmaps in the array where startblock == sp. + * - Set np to the physical block where the bag size will change. This + * is the minimum of (the pblk of the next unprocessed rmap) and + * (startblock + len of each rmap in the bag). + * - Record the bag size as old_bag_size. + * + * - While the bag isn't empty, + * - Remove from the bag all rmaps where startblock + len == np. + * - Add to the bag all rmaps in the array where startblock == np. + * - If the bag size isn't old_bag_size, store the refcount entry + * (sp, np - sp, bag_size) in the refcnt btree. + * - If the bag is empty, break out of the inner loop. + * - Set old_bag_size to the bag size + * - Set sp = np. 
+ * - Set np to the physical block where the bag size will change. + * This is the minimum of (the pblk of the next unprocessed rmap) + * and (startblock + len of each rmap in the bag). + * + * Like all the other repairers, we make a list of all the refcount + * records we need, then reinitialize the refcount btree root and + * insert all the records. + */ + +/* The only parts of the rmap that we care about for computing refcounts. */ +struct xrep_refc_rmap { + xfs_agblock_t startblock; + xfs_extlen_t blockcount; +} __packed; + +struct xrep_refc { + /* refcount extents */ + struct xfarray *refcount_records; + + /* new refcountbt information */ + struct xrep_newbt new_btree; + + /* old refcountbt blocks */ + struct xagb_bitmap old_refcountbt_blocks; + + struct xfs_scrub *sc; + + /* get_records()'s position in the refcount record array. */ + xfarray_idx_t array_cur; + + /* # of refcountbt blocks */ + xfs_extlen_t btblocks; +}; + +/* Check for any obvious conflicts with this shared/CoW staging extent. */ +STATIC int +xrep_refc_check_ext( + struct xfs_scrub *sc, + const struct xfs_refcount_irec *rec) +{ + enum xbtree_recpacking outcome; + int error; + + if (xfs_refcount_check_perag_irec(sc->sa.pag, rec) != NULL) + return -EFSCORRUPTED; + + /* Make sure this isn't free space. */ + error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rc_startblock, + rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + /* Must not be an inode chunk. */ + error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur, + rec->rc_startblock, rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + return 0; +} + +/* Record a reference count extent. 
*/ +STATIC int +xrep_refc_stash( + struct xrep_refc *rr, + enum xfs_refc_domain domain, + xfs_agblock_t agbno, + xfs_extlen_t len, + uint64_t refcount) +{ + struct xfs_refcount_irec irec = { + .rc_startblock = agbno, + .rc_blockcount = len, + .rc_domain = domain, + }; + struct xfs_scrub *sc = rr->sc; + int error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + irec.rc_refcount = min_t(uint64_t, MAXREFCOUNT, refcount); + + error = xrep_refc_check_ext(rr->sc, &irec); + if (error) + return error; + + trace_xrep_refc_found(sc->sa.pag, &irec); + + return xfarray_append(rr->refcount_records, &irec); +} + +/* Record a CoW staging extent. */ +STATIC int +xrep_refc_stash_cow( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + return xrep_refc_stash(rr, XFS_REFC_DOMAIN_COW, agbno, len, 1); +} + +/* Decide if an rmap could describe a shared extent. */ +static inline bool +xrep_refc_rmap_shareable( + struct xfs_mount *mp, + const struct xfs_rmap_irec *rmap) +{ + /* AG metadata are never sharable */ + if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner)) + return false; + + /* Metadata in files are never shareable */ + if (xfs_internal_inum(mp, rmap->rm_owner)) + return false; + + /* Metadata and unwritten file blocks are not shareable. */ + if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK | + XFS_RMAP_UNWRITTEN)) + return false; + + return true; +} + +/* + * Walk along the reverse mapping records until we find one that could describe + * a shared extent. + */ +STATIC int +xrep_refc_walk_rmaps( + struct xrep_refc *rr, + struct xrep_refc_rmap *rrm, + bool *have_rec) +{ + struct xfs_rmap_irec rmap; + struct xfs_btree_cur *cur = rr->sc->sa.rmap_cur; + struct xfs_mount *mp = cur->bc_mp; + int have_gt; + int error = 0; + + *have_rec = false; + + /* + * Loop through the remaining rmaps. Remember CoW staging + * extents and the refcountbt blocks from the old tree for later + * disposal. 
We can only share written data fork extents, so + * keep looping until we find an rmap for one. + */ + do { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfs_btree_increment(cur, 0, &have_gt); + if (error) + return error; + if (!have_gt) + return 0; + + error = xfs_rmap_get_rec(cur, &rmap, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(mp, !have_gt)) + return -EFSCORRUPTED; + + if (rmap.rm_owner == XFS_RMAP_OWN_COW) { + error = xrep_refc_stash_cow(rr, rmap.rm_startblock, + rmap.rm_blockcount); + if (error) + return error; + } else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) { + /* refcountbt block, dump it when we're done. */ + rr->btblocks += rmap.rm_blockcount; + error = xagb_bitmap_set(&rr->old_refcountbt_blocks, + rmap.rm_startblock, rmap.rm_blockcount); + if (error) + return error; + } + } while (!xrep_refc_rmap_shareable(mp, &rmap)); + + rrm->startblock = rmap.rm_startblock; + rrm->blockcount = rmap.rm_blockcount; + *have_rec = true; + return 0; +} + +static inline uint32_t +xrep_refc_encode_startblock( + const struct xfs_refcount_irec *irec) +{ + uint32_t start; + + start = irec->rc_startblock & ~XFS_REFC_COWFLAG; + if (irec->rc_domain == XFS_REFC_DOMAIN_COW) + start |= XFS_REFC_COWFLAG; + + return start; +} + +/* Sort in the same order as the ondisk records. */ +static int +xrep_refc_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_refcount_irec *ap = a; + const struct xfs_refcount_irec *bp = b; + uint32_t sa, sb; + + sa = xrep_refc_encode_startblock(ap); + sb = xrep_refc_encode_startblock(bp); + + if (sa > sb) + return 1; + if (sa < sb) + return -1; + return 0; +} + +/* + * Sort the refcount extents by startblock or else the btree records will be in + * the wrong order. Make sure the records do not overlap in physical space. 
+ */ +STATIC int +xrep_refc_sort_records( + struct xrep_refc *rr) +{ + struct xfs_refcount_irec irec; + xfarray_idx_t cur; + enum xfs_refc_domain dom = XFS_REFC_DOMAIN_SHARED; + xfs_agblock_t next_agbno = 0; + int error; + + error = xfarray_sort(rr->refcount_records, xrep_refc_extent_cmp, + XFARRAY_SORT_KILLABLE); + if (error) + return error; + + foreach_xfarray_idx(rr->refcount_records, cur) { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfarray_load(rr->refcount_records, cur, &irec); + if (error) + return error; + + if (dom == XFS_REFC_DOMAIN_SHARED && + irec.rc_domain == XFS_REFC_DOMAIN_COW) { + dom = irec.rc_domain; + next_agbno = 0; + } + + if (dom != irec.rc_domain) + return -EFSCORRUPTED; + if (irec.rc_startblock < next_agbno) + return -EFSCORRUPTED; + + next_agbno = irec.rc_startblock + irec.rc_blockcount; + } + + return error; +} + +#define RRM_NEXT(r) ((r).startblock + (r).blockcount) +/* + * Find the next block where the refcount changes, given the next rmap we + * looked at and the ones we're already tracking. + */ +static inline int +xrep_refc_next_edge( + struct xfarray *rmap_bag, + struct xrep_refc_rmap *next_rrm, + bool next_valid, + xfs_agblock_t *nbnop) +{ + struct xrep_refc_rmap rrm; + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + xfs_agblock_t nbno = NULLAGBLOCK; + int error; + + if (next_valid) + nbno = next_rrm->startblock; + + while ((error = xfarray_iter(rmap_bag, &array_cur, &rrm)) == 1) + nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm)); + + if (error) + return error; + + /* + * We should have found /something/ because either next_rrm is the next + * interesting rmap to look at after emitting this refcount extent, or + * there are other rmaps in rmap_bag contributing to the current + * sharing count. But if something is seriously wrong, bail out. 
+ */ + if (nbno == NULLAGBLOCK) + return -EFSCORRUPTED; + + *nbnop = nbno; + return 0; +} + +/* + * Walk forward through the rmap btree to collect all rmaps starting at + * @bno in @rmap_bag. These represent the file(s) that share ownership of + * the current block. Upon return, the rmap cursor points to the last record + * satisfying the startblock constraint. + */ +static int +xrep_refc_push_rmaps_at( + struct xrep_refc *rr, + struct xfarray *rmap_bag, + xfs_agblock_t bno, + struct xrep_refc_rmap *rrm, + bool *have, + uint64_t *stack_sz) +{ + struct xfs_scrub *sc = rr->sc; + int have_gt; + int error; + + while (*have && rrm->startblock == bno) { + error = xfarray_store_anywhere(rmap_bag, rrm); + if (error) + return error; + (*stack_sz)++; + error = xrep_refc_walk_rmaps(rr, rrm, have); + if (error) + return error; + } + + error = xfs_btree_decrement(sc->sa.rmap_cur, 0, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(sc->mp, !have_gt)) + return -EFSCORRUPTED; + + return 0; +} + +/* Iterate all the rmap records to generate reference count data. */ +STATIC int +xrep_refc_find_refcounts( + struct xrep_refc *rr) +{ + struct xrep_refc_rmap rrm; + struct xfs_scrub *sc = rr->sc; + struct xfarray *rmap_bag; + char *descr; + uint64_t old_stack_sz; + uint64_t stack_sz = 0; + xfs_agblock_t sbno; + xfs_agblock_t cbno; + xfs_agblock_t nbno; + bool have; + int error; + + xrep_ag_btcur_init(sc, &sc->sa); + + /* + * Set up a sparse array to store all the rmap records that we're + * tracking to generate a reference count record. If this exceeds + * MAXREFCOUNT, we clamp rc_refcount. + */ + descr = xchk_xfile_ag_descr(sc, "rmap record bag"); + error = xfarray_create(descr, 0, sizeof(struct xrep_refc_rmap), + &rmap_bag); + kfree(descr); + if (error) + goto out_cur; + + /* Start the rmapbt cursor to the left of all records. */ + error = xfs_btree_goto_left_edge(sc->sa.rmap_cur); + if (error) + goto out_bag; + + /* Process reverse mappings into refcount data. 
*/ + while (xfs_btree_has_more_records(sc->sa.rmap_cur)) { + /* Push all rmaps with pblk == sbno onto the stack */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (!have) + break; + sbno = cbno = rrm.startblock; + error = xrep_refc_push_rmaps_at(rr, rmap_bag, sbno, + &rrm, &have, &stack_sz); + if (error) + goto out_bag; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + old_stack_sz = stack_sz; + + /* While stack isn't empty... */ + while (stack_sz) { + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + + /* Pop all rmaps that end at nbno */ + while ((error = xfarray_iter(rmap_bag, &array_cur, + &rrm)) == 1) { + if (RRM_NEXT(rrm) != nbno) + continue; + error = xfarray_unset(rmap_bag, array_cur - 1); + if (error) + goto out_bag; + stack_sz--; + } + if (error) + goto out_bag; + + /* Push array items that start at nbno */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (have) { + error = xrep_refc_push_rmaps_at(rr, rmap_bag, + nbno, &rrm, &have, &stack_sz); + if (error) + goto out_bag; + } + + /* Emit refcount if necessary */ + ASSERT(nbno > cbno); + if (stack_sz != old_stack_sz) { + if (old_stack_sz > 1) { + error = xrep_refc_stash(rr, + XFS_REFC_DOMAIN_SHARED, + cbno, nbno - cbno, + old_stack_sz); + if (error) + goto out_bag; + } + cbno = nbno; + } + + /* Stack empty, go find the next rmap */ + if (stack_sz == 0) + break; + old_stack_sz = stack_sz; + sbno = nbno; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, + &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + } + } + + ASSERT(stack_sz == 0); +out_bag: + xfarray_destroy(rmap_bag); +out_cur: + xchk_ag_btcur_free(&sc->sa); + return error; +} +#undef RRM_NEXT + +/* Retrieve refcountbt data for bulk load. 
*/ +STATIC int +xrep_refc_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_refcount_irec *irec = &cur->bc_rec.rc; + struct xrep_refc *rr = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + error = xfarray_load(rr->refcount_records, rr->array_cur++, + irec); + if (error) + return error; + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Feed one of the new btree blocks to the bulk loader. */ +STATIC int +xrep_refc_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_refc *rr = priv; + int error; + + error = xrep_newbt_relog_autoreap(&rr->new_btree); + if (error) + return error; + + return xrep_newbt_claim_block(cur, &rr->new_btree, ptr); +} + +/* Update the AGF counters. */ +STATIC int +xrep_refc_reset_counters( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + + /* + * After we commit the new btree to disk, it is possible that the + * process to reap the old btree blocks will race with the AIL trying + * to checkpoint the old btree blocks into the filesystem. If the new + * tree is shorter than the old one, the refcountbt write verifier will + * fail and the AIL will shut down the filesystem. + * + * To avoid this, save the old incore btree height values as the alt + * height values before re-initializing the perag info from the updated + * AGF to capture all the new values. + */ + pag->pagf_alt_refcount_level = pag->pagf_refcount_level; + + /* Reinitialize with the values we just logged. */ + return xrep_reinit_pagf(sc); +} + +/* + * Use the collected refcount information to stage a new refcount btree. 
If + * this is successful we'll return with the new btree root information logged + * to the repair transaction but not yet committed. + */ +STATIC int +xrep_refc_build_new_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_btree_cur *refc_cur; + struct xfs_perag *pag = sc->sa.pag; + xfs_fsblock_t fsbno; + int error; + + error = xrep_refc_sort_records(rr); + if (error) + return error; + + /* + * Prepare to construct the new btree by reserving disk space for the + * new btree and setting up all the accounting information we'll need + * to root the new btree while it's under construction and before we + * attach it to the AG header. + */ + fsbno = XFS_AGB_TO_FSB(sc->mp, pag->pag_agno, xfs_refc_block(sc->mp)); + xrep_newbt_init_ag(&rr->new_btree, sc, &XFS_RMAP_OINFO_REFC, fsbno, + XFS_AG_RESV_METADATA); + rr->new_btree.bload.get_records = xrep_refc_get_records; + rr->new_btree.bload.claim_block = xrep_refc_claim_block; + + /* Compute how many blocks we'll need. */ + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, &rr->new_btree.afake, + pag); + error = xfs_btree_bload_compute_geometry(refc_cur, + &rr->new_btree.bload, + xfarray_length(rr->refcount_records)); + if (error) + goto err_cur; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + goto err_cur; + + /* Reserve the space we'll need for the new btree. */ + error = xrep_newbt_alloc_blocks(&rr->new_btree, + rr->new_btree.bload.nr_blocks); + if (error) + goto err_cur; + + /* + * Due to btree slack factors, it's possible for a new btree to be one + * level taller than the old btree. Update the incore btree height so + * that we don't trip the verifiers when writing the new btree blocks + * to disk. + */ + pag->pagf_alt_refcount_level = rr->new_btree.bload.btree_height; + + /* Add all observed refcount records. 
*/ + rr->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(refc_cur, &rr->new_btree.bload, rr); + if (error) + goto err_level; + + /* + * Install the new btree in the AG header. After this point the old + * btree is no longer accessible and the new tree is live. + */ + xfs_refcountbt_commit_staged_btree(refc_cur, sc->tp, sc->sa.agf_bp); + xfs_btree_del_cursor(refc_cur, 0); + + /* Reset the AGF counters now that we've changed the btree shape. */ + error = xrep_refc_reset_counters(rr); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. */ + error = xrep_newbt_commit(&rr->new_btree); + if (error) + return error; + + return xrep_roll_ag_trans(sc); + +err_level: + pag->pagf_alt_refcount_level = 0; +err_cur: + xfs_btree_del_cursor(refc_cur, error); +err_newbt: + xrep_newbt_cancel(&rr->new_btree); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. + */ +STATIC int +xrep_refc_remove_old_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + int error; + + /* Free the old refcountbt blocks if they're not in use. */ + error = xrep_reap_agblocks(sc, &rr->old_refcountbt_blocks, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); + if (error) + return error; + + /* + * Now that we've zapped all the old refcountbt blocks we can turn off + * the alternate height mechanism and reset the per-AG space + * reservations. + */ + pag->pagf_alt_refcount_level = 0; + sc->flags |= XREP_RESET_PERAG_RESV; + return 0; +} + +/* Rebuild the refcount btree. */ +int +xrep_refcountbt( + struct xfs_scrub *sc) +{ + struct xrep_refc *rr; + struct xfs_mount *mp = sc->mp; + char *descr; + int error; + + /* We require the rmapbt to rebuild anything. 
*/ + if (!xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; + + rr = kzalloc(sizeof(struct xrep_refc), XCHK_GFP_FLAGS); + if (!rr) + return -ENOMEM; + rr->sc = sc; + + /* Set up enough storage to handle one refcount record per block. */ + descr = xchk_xfile_ag_descr(sc, "reference count records"); + error = xfarray_create(descr, mp->m_sb.sb_agblocks, + sizeof(struct xfs_refcount_irec), + &rr->refcount_records); + kfree(descr); + if (error) + goto out_rr; + + /* Collect all reference counts. */ + xagb_bitmap_init(&rr->old_refcountbt_blocks); + error = xrep_refc_find_refcounts(rr); + if (error) + goto out_bitmap; + + /* Rebuild the refcount information. */ + error = xrep_refc_build_new_tree(rr); + if (error) + goto out_bitmap; + + /* Kill the old tree. */ + error = xrep_refc_remove_old_tree(rr); + +out_bitmap: + xagb_bitmap_destroy(&rr->old_refcountbt_blocks); + xfarray_destroy(rr->refcount_records); +out_rr: + kfree(rr); + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 3ff5e37316685..42325305d29d9 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -73,6 +73,7 @@ int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); +int xrep_refcountbt(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -126,6 +127,7 @@ xrep_setup_nothing( #define xrep_agi xrep_notsupported #define xrep_allocbt xrep_notsupported #define xrep_iallocbt xrep_notsupported +#define xrep_refcountbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 57f2db883792e..71aee7e3dd43a 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -276,7 +276,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .setup = xchk_setup_ag_refcountbt, .scrub = xchk_refcountbt, .has = xfs_has_reflink, - .repair = xrep_notsupported, + .repair = 
xrep_refcountbt, }, [XFS_SCRUB_TYPE_INODE] = { /* inode record */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 79d1316b288ed..358c7ddbf14e2 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1205,27 +1205,29 @@ TRACE_EVENT(xrep_ibt_found, __entry->freemask) ) -TRACE_EVENT(xrep_refcount_extent_fn, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - struct xfs_refcount_irec *irec), - TP_ARGS(mp, agno, irec), +TRACE_EVENT(xrep_refc_found, + TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *rec), + TP_ARGS(pag, rec), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) + __field(enum xfs_refc_domain, domain) __field(xfs_agblock_t, startblock) __field(xfs_extlen_t, blockcount) __field(xfs_nlink_t, refcount) ), TP_fast_assign( - __entry->dev = mp->m_super->s_dev; - __entry->agno = agno; - __entry->startblock = irec->rc_startblock; - __entry->blockcount = irec->rc_blockcount; - __entry->refcount = irec->rc_refcount; + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->domain = rec->rc_domain; + __entry->startblock = rec->rc_startblock; + __entry->blockcount = rec->rc_blockcount; + __entry->refcount = rec->rc_refcount; ), - TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u", + TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, + __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS), __entry->startblock, __entry->blockcount, __entry->refcount) ^ permalink raw reply related [flat|nested] 156+ messages in thread
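[Editorial note: the reconstruction loop in the patch above can be modeled in a few dozen lines of userspace C. The sketch below follows the bag-based sweep from the "Rebuilding the Reference Count Btree" comment under simplifying assumptions: a fixed-size in-memory bag instead of an xfarray, plain uint32_t block numbers, no CoW domain, and no MAXREFCOUNT clamping. The names compute_refcounts, struct rmap, and struct refc are illustrative, not kernel API.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAXBAG		64
#define NULLBLK		UINT32_MAX

struct rmap { uint32_t start, len; };		/* like xrep_refc_rmap */
struct refc { uint32_t start, len, count; };	/* like xfs_refcount_irec */

/*
 * Sweep a sorted rmap array, keeping a bag of the rmaps that overlap the
 * current block, and emit a refcount record spanning [cp, np) whenever the
 * bag size changes and the old size was at least 2.
 */
static size_t
compute_refcounts(const struct rmap *rmaps, size_t nr,
		  struct refc *out, size_t outmax)
{
	struct rmap bag[MAXBAG];
	size_t bag_sz = 0, old_sz, i = 0, nout = 0;

	while (i < nr) {
		uint32_t sp = rmaps[i].start, cp, np;

		/* Add to the bag all rmaps starting at sp. */
		while (i < nr && rmaps[i].start == sp)
			bag[bag_sz++] = rmaps[i++];
		cp = sp;
		old_sz = bag_sz;

		while (bag_sz) {
			size_t j;

			/* np = next block where the bag contents change. */
			np = (i < nr) ? rmaps[i].start : NULLBLK;
			for (j = 0; j < bag_sz; j++)
				if (bag[j].start + bag[j].len < np)
					np = bag[j].start + bag[j].len;

			/* Remove rmaps ending at np... */
			for (j = 0; j < bag_sz; )
				if (bag[j].start + bag[j].len == np)
					bag[j] = bag[--bag_sz];
				else
					j++;
			/* ...and add rmaps starting at np. */
			while (i < nr && rmaps[i].start == np)
				bag[bag_sz++] = rmaps[i++];

			/* Level change: emit shared extents (count >= 2). */
			if (bag_sz != old_sz) {
				if (old_sz > 1 && nout < outmax) {
					out[nout].start = cp;
					out[nout].len = np - cp;
					out[nout].count = (uint32_t)old_sz;
					nout++;
				}
				cp = np;
			}
			old_sz = bag_sz;
		}
	}
	return nout;
}
```

The kernel version differs mainly in that the bag is an xfarray on an xfile (so it scales past memory), the next-edge computation lives in xrep_refc_next_edge(), and refcounts are clamped to MAXREFCOUNT; the level-triggered emission logic is the same.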
* [PATCHSET v25.0 0/5] xfs: online repair of AG btrees @ 2023-05-26 0:29 Darrick J. Wong 2023-05-26 0:52 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2023-05-26 0:29 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, Now that we've spent a lot of time reworking common code in online fsck, we're ready to start rebuilding the AG space btrees. This series implements repair functions for the free space, inode, and refcount btrees. Rebuilding the reverse mapping btree is much more intense and is left for a subsequent patchset. The fstests counterpart of this patchset implements stress testing of repair. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-ag-btrees --- fs/xfs/Makefile | 3 fs/xfs/libxfs/xfs_ag.h | 10 fs/xfs/libxfs/xfs_ag_resv.c | 2 fs/xfs/libxfs/xfs_alloc.c | 18 - fs/xfs/libxfs/xfs_alloc.h | 2 fs/xfs/libxfs/xfs_alloc_btree.c | 13 - fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_ialloc.c | 41 +- fs/xfs/libxfs/xfs_ialloc.h | 3 fs/xfs/libxfs/xfs_refcount.c | 18 - fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 - fs/xfs/libxfs/xfs_types.h | 7 fs/xfs/scrub/agheader_repair.c | 5 fs/xfs/scrub/alloc.c | 14 - fs/xfs/scrub/alloc_repair.c | 910 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/common.c | 1 fs/xfs/scrub/common.h | 13 + fs/xfs/scrub/ialloc_repair.c | 872 ++++++++++++++++++++++++++++++++++ fs/xfs/scrub/newbt.c | 44 ++ 
fs/xfs/scrub/newbt.h | 6 fs/xfs/scrub/reap.c | 17 + fs/xfs/scrub/refcount_repair.c | 791 +++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.c | 128 +++++ fs/xfs/scrub/repair.h | 43 ++ fs/xfs/scrub/scrub.c | 22 + fs/xfs/scrub/scrub.h | 9 fs/xfs/scrub/trace.h | 112 +++- fs/xfs/scrub/xfarray.h | 22 + fs/xfs/xfs_extent_busy.c | 13 + fs/xfs/xfs_extent_busy.h | 2 fs/xfs/xfs_icache.c | 127 ++++- fs/xfs/xfs_trace.h | 22 + 34 files changed, 3223 insertions(+), 110 deletions(-) create mode 100644 fs/xfs/scrub/alloc_repair.c create mode 100644 fs/xfs/scrub/ialloc_repair.c create mode 100644 fs/xfs/scrub/refcount_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
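[Editorial note: the refcount repair in patch 5/5 below sorts its staged records with a key that folds the CoW flag into the high bit of the startblock, so every shared-domain record sorts before every CoW staging record, matching the ondisk record order. A minimal userspace model of that comparator follows; the struct and helper names are illustrative, and COWFLAG stands in for the kernel's XFS_REFC_COWFLAG (bit 31 of the ondisk startblock).]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Ondisk CoW flag: the high bit of the 32-bit startblock key. */
#define COWFLAG		(1U << 31)

enum refc_domain { DOMAIN_SHARED, DOMAIN_COW };

struct refc_rec {
	uint32_t		start;
	enum refc_domain	dom;
};

/* Fold the domain into the startblock, as the ondisk format does. */
static uint32_t
encode_startblock(const struct refc_rec *r)
{
	uint32_t key = r->start & ~COWFLAG;

	if (r->dom == DOMAIN_COW)
		key |= COWFLAG;
	return key;
}

/* qsort comparator: shared records first, each domain by startblock. */
static int
refc_rec_cmp(const void *a, const void *b)
{
	uint32_t ka = encode_startblock(a);
	uint32_t kb = encode_startblock(b);

	if (ka > kb)
		return 1;
	if (ka < kb)
		return -1;
	return 0;
}
```

Because the key is a single uint32_t, the two-domain ordering falls out of an ordinary unsigned comparison, which is exactly why xrep_refc_sort_records() can then make one linear pass to verify that the domains appear in order and that records within a domain do not overlap.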
* [PATCH 5/5] xfs: repair refcount btrees 2023-05-26 0:29 [PATCHSET v25.0 0/5] xfs: online repair of AG btrees Darrick J. Wong @ 2023-05-26 0:52 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2023-05-26 0:52 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Reconstruct the refcount data from the rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_ag.h | 1 fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_refcount.c | 18 + fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 + fs/xfs/scrub/refcount_repair.c | 791 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 22 + 11 files changed, 862 insertions(+), 18 deletions(-) create mode 100644 fs/xfs/scrub/refcount_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 192bb14cf6ab..cc74e2fe850e 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -183,6 +183,7 @@ xfs-y += $(addprefix scrub/, \ ialloc_repair.o \ newbt.o \ reap.o \ + refcount_repair.o \ repair.o \ ) endif diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 686f4eadd574..616812911a23 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -87,6 +87,7 @@ struct xfs_perag { * verifiers while rebuilding the AG btrees. */ uint8_t pagf_alt_levels[XFS_BTNUM_AGF]; + uint8_t pagf_alt_refcount_level; #endif spinlock_t pag_state_lock; diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index c100e92140be..ea8d3659df20 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -5212,3 +5212,29 @@ xfs_btree_destroy_cur_caches(void) xfs_rmapbt_destroy_cur_cache(); xfs_refcountbt_destroy_cur_cache(); } + +/* Move the btree cursor before the first record. 
*/ +int +xfs_btree_goto_left_edge( + struct xfs_btree_cur *cur) +{ + int stat = 0; + int error; + + memset(&cur->bc_rec, 0, sizeof(cur->bc_rec)); + error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat); + if (error) + return error; + if (!stat) + return 0; + + error = xfs_btree_decrement(cur, 0, &stat); + if (error) + return error; + if (stat != 0) { + ASSERT(0); + return -EFSCORRUPTED; + } + + return 0; +} diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 2db03f0ae961..5525d3715d57 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -738,4 +738,6 @@ xfs_btree_alloc_cursor( int __init xfs_btree_init_cur_caches(void); void xfs_btree_destroy_cur_caches(void); +int xfs_btree_goto_left_edge(struct xfs_btree_cur *cur); + #endif /* __XFS_BTREE_H__ */ diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index cd21cb1c9058..c834078449ed 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -120,14 +120,11 @@ xfs_refcount_btrec_to_irec( irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount); } -/* Simple checks for refcount records. */ -xfs_failaddr_t -xfs_refcount_check_irec( - struct xfs_btree_cur *cur, +inline xfs_failaddr_t +xfs_refcount_check_perag_irec( + struct xfs_perag *pag, const struct xfs_refcount_irec *irec) { - struct xfs_perag *pag = cur->bc_ag.pag; - if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) return __this_address; @@ -144,6 +141,15 @@ xfs_refcount_check_irec( return NULL; } +/* Simple checks for refcount records. 
*/ +xfs_failaddr_t +xfs_refcount_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *irec) +{ + return xfs_refcount_check_perag_irec(cur->bc_ag.pag, irec); +} + static inline int xfs_refcount_complain_bad_rec( struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index 783cd89ca195..2d6fecb258bb 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -117,6 +117,8 @@ extern int xfs_refcount_has_records(struct xfs_btree_cur *cur, union xfs_btree_rec; extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_refcount_irec *irec); +xfs_failaddr_t xfs_refcount_check_perag_irec(struct xfs_perag *pag, + const struct xfs_refcount_irec *irec); xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur, const struct xfs_refcount_irec *irec); extern int xfs_refcount_insert(struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index d4afc5f4e6a5..efe22aa1c906 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -232,7 +232,18 @@ xfs_refcountbt_verify( level = be16_to_cpu(block->bb_level); if (pag && xfs_perag_initialised_agf(pag)) { - if (level >= pag->pagf_refcount_level) + unsigned int maxlevel = pag->pagf_refcount_level; + +#ifdef CONFIG_XFS_ONLINE_REPAIR + /* + * Online repair could be rewriting the refcount btree, so + * we'll validate against the larger of either tree while this + * is going on. 
+ */ + maxlevel = max_t(unsigned int, maxlevel, + pag->pagf_alt_refcount_level); +#endif + if (level >= maxlevel) return __this_address; } else if (level >= mp->m_refc_maxlevels) return __this_address; diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c new file mode 100644 index 000000000000..5f3a641c38b6 --- /dev/null +++ b/fs/xfs/scrub/refcount_repair.c @@ -0,0 +1,791 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_inode.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_alloc.h" +#include "xfs_ialloc.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_refcount.h" +#include "xfs_refcount_btree.h" +#include "xfs_error.h" +#include "xfs_ag.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/newbt.h" +#include "scrub/reap.h" + +/* + * Rebuilding the Reference Count Btree + * ==================================== + * + * This algorithm is "borrowed" from xfs_repair. Imagine the rmap + * entries as rectangles representing extents of physical blocks, and + * that the rectangles can be laid down to allow them to overlap each + * other; then we know that we must emit a refcnt btree entry wherever + * the amount of overlap changes, i.e. 
the emission stimulus is + * level-triggered: + * + * - --- + * -- ----- ---- --- ------ + * -- ---- ----------- ---- --------- + * -------------------------------- ----------- + * ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + * 2 1 23 21 3 43 234 2123 1 01 2 3 0 + * + * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner). + * + * Note that in the actual refcnt btree we don't store the refcount < 2 + * cases because the bnobt tells us which blocks are free; single-use + * blocks aren't recorded in the bnobt or the refcntbt. If the rmapbt + * supports storing multiple entries covering a given block we could + * theoretically dispense with the refcntbt and simply count rmaps, but + * that's inefficient in the (hot) write path, so we'll take the cost of + * the extra tree to save time. Also there's no guarantee that rmap + * will be enabled. + * + * Given an array of rmaps sorted by physical block number, a starting + * physical block (sp), a bag to hold rmaps that cover sp, and the next + * physical block where the level changes (np), we can reconstruct the + * refcount btree as follows: + * + * While there are still unprocessed rmaps in the array, + * - Set sp to the physical block (pblk) of the next unprocessed rmap. + * - Add to the bag all rmaps in the array where startblock == sp. + * - Set np to the physical block where the bag size will change. This + * is the minimum of (the pblk of the next unprocessed rmap) and + * (startblock + len of each rmap in the bag). + * - Record the bag size as old_bag_size. + * + * - While the bag isn't empty, + * - Remove from the bag all rmaps where startblock + len == np. + * - Add to the bag all rmaps in the array where startblock == np. + * - If the bag size isn't old_bag_size, store the refcount entry + * (sp, np - sp, bag_size) in the refcnt btree. + * - If the bag is empty, break out of the inner loop. + * - Set old_bag_size to the bag size + * - Set sp = np. 
+ * - Set np to the physical block where the bag size will change. + * This is the minimum of (the pblk of the next unprocessed rmap) + * and (startblock + len of each rmap in the bag). + * + * Like all the other repairers, we make a list of all the refcount + * records we need, then reinitialize the refcount btree root and + * insert all the records. + */ + +/* The only parts of the rmap that we care about for computing refcounts. */ +struct xrep_refc_rmap { + xfs_agblock_t startblock; + xfs_extlen_t blockcount; +} __packed; + +struct xrep_refc { + /* refcount extents */ + struct xfarray *refcount_records; + + /* new refcountbt information */ + struct xrep_newbt new_btree; + + /* old refcountbt blocks */ + struct xagb_bitmap old_refcountbt_blocks; + + struct xfs_scrub *sc; + + /* get_records()'s position in the refcount record array. */ + xfarray_idx_t array_cur; + + /* # of refcountbt blocks */ + xfs_extlen_t btblocks; +}; + +/* Check for any obvious conflicts with this shared/CoW staging extent. */ +STATIC int +xrep_refc_check_ext( + struct xfs_scrub *sc, + const struct xfs_refcount_irec *rec) +{ + enum xbtree_recpacking outcome; + int error; + + if (xfs_refcount_check_perag_irec(sc->sa.pag, rec) != NULL) + return -EFSCORRUPTED; + + /* Make sure this isn't free space. */ + error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rc_startblock, + rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + /* Must not be an inode chunk. */ + error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur, + rec->rc_startblock, rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + return 0; +} + +/* Record a reference count extent. 
*/ +STATIC int +xrep_refc_stash( + struct xrep_refc *rr, + enum xfs_refc_domain domain, + xfs_agblock_t agbno, + xfs_extlen_t len, + uint64_t refcount) +{ + struct xfs_refcount_irec irec = { + .rc_startblock = agbno, + .rc_blockcount = len, + .rc_domain = domain, + }; + struct xfs_scrub *sc = rr->sc; + int error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + irec.rc_refcount = min_t(uint64_t, MAXREFCOUNT, refcount); + + error = xrep_refc_check_ext(rr->sc, &irec); + if (error) + return error; + + trace_xrep_refc_found(sc->sa.pag, &irec); + + return xfarray_append(rr->refcount_records, &irec); +} + +/* Record a CoW staging extent. */ +STATIC int +xrep_refc_stash_cow( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + return xrep_refc_stash(rr, XFS_REFC_DOMAIN_COW, agbno, len, 1); +} + +/* Decide if an rmap could describe a shared extent. */ +static inline bool +xrep_refc_rmap_shareable( + struct xfs_mount *mp, + const struct xfs_rmap_irec *rmap) +{ + /* AG metadata are never shareable */ + if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner)) + return false; + + /* Metadata in files are never shareable */ + if (xfs_internal_inum(mp, rmap->rm_owner)) + return false; + + /* Metadata and unwritten file blocks are not shareable. */ + if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK | + XFS_RMAP_UNWRITTEN)) + return false; + + return true; +} + +/* + * Walk along the reverse mapping records until we find one that could describe + * a shared extent. + */ +STATIC int +xrep_refc_walk_rmaps( + struct xrep_refc *rr, + struct xrep_refc_rmap *rrm, + bool *have_rec) +{ + struct xfs_rmap_irec rmap; + struct xfs_btree_cur *cur = rr->sc->sa.rmap_cur; + struct xfs_mount *mp = cur->bc_mp; + int have_gt; + int error = 0; + + *have_rec = false; + + /* + * Loop through the remaining rmaps. Remember CoW staging + * extents and the refcountbt blocks from the old tree for later + * disposal. 
We can only share written data fork extents, so + * keep looping until we find an rmap for one. + */ + do { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfs_btree_increment(cur, 0, &have_gt); + if (error) + return error; + if (!have_gt) + return 0; + + error = xfs_rmap_get_rec(cur, &rmap, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(mp, !have_gt)) + return -EFSCORRUPTED; + + if (rmap.rm_owner == XFS_RMAP_OWN_COW) { + error = xrep_refc_stash_cow(rr, rmap.rm_startblock, + rmap.rm_blockcount); + if (error) + return error; + } else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) { + /* refcountbt block, dump it when we're done. */ + rr->btblocks += rmap.rm_blockcount; + error = xagb_bitmap_set(&rr->old_refcountbt_blocks, + rmap.rm_startblock, rmap.rm_blockcount); + if (error) + return error; + } + } while (!xrep_refc_rmap_shareable(mp, &rmap)); + + rrm->startblock = rmap.rm_startblock; + rrm->blockcount = rmap.rm_blockcount; + *have_rec = true; + return 0; +} + +static inline uint32_t +xrep_refc_encode_startblock( + const struct xfs_refcount_irec *irec) +{ + uint32_t start; + + start = irec->rc_startblock & ~XFS_REFC_COWFLAG; + if (irec->rc_domain == XFS_REFC_DOMAIN_COW) + start |= XFS_REFC_COWFLAG; + + return start; +} + +/* Sort in the same order as the ondisk records. */ +static int +xrep_refc_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_refcount_irec *ap = a; + const struct xfs_refcount_irec *bp = b; + uint32_t sa, sb; + + sa = xrep_refc_encode_startblock(ap); + sb = xrep_refc_encode_startblock(bp); + + if (sa > sb) + return 1; + if (sa < sb) + return -1; + return 0; +} + +/* + * Sort the refcount extents by startblock or else the btree records will be in + * the wrong order. Make sure the records do not overlap in physical space. 
+ */ +STATIC int +xrep_refc_sort_records( + struct xrep_refc *rr) +{ + struct xfs_refcount_irec irec; + xfarray_idx_t cur; + enum xfs_refc_domain dom = XFS_REFC_DOMAIN_SHARED; + xfs_agblock_t next_agbno = 0; + int error; + + error = xfarray_sort(rr->refcount_records, xrep_refc_extent_cmp, + XFARRAY_SORT_KILLABLE); + if (error) + return error; + + foreach_xfarray_idx(rr->refcount_records, cur) { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfarray_load(rr->refcount_records, cur, &irec); + if (error) + return error; + + if (dom == XFS_REFC_DOMAIN_SHARED && + irec.rc_domain == XFS_REFC_DOMAIN_COW) { + dom = irec.rc_domain; + next_agbno = 0; + } + + if (dom != irec.rc_domain) + return -EFSCORRUPTED; + if (irec.rc_startblock < next_agbno) + return -EFSCORRUPTED; + + next_agbno = irec.rc_startblock + irec.rc_blockcount; + } + + return error; +} + +#define RRM_NEXT(r) ((r).startblock + (r).blockcount) +/* + * Find the next block where the refcount changes, given the next rmap we + * looked at and the ones we're already tracking. + */ +static inline int +xrep_refc_next_edge( + struct xfarray *rmap_bag, + struct xrep_refc_rmap *next_rrm, + bool next_valid, + xfs_agblock_t *nbnop) +{ + struct xrep_refc_rmap rrm; + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + xfs_agblock_t nbno = NULLAGBLOCK; + int error; + + if (next_valid) + nbno = next_rrm->startblock; + + while ((error = xfarray_iter(rmap_bag, &array_cur, &rrm)) == 1) + nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm)); + + if (error) + return error; + + /* + * We should have found /something/ because either next_rrm is the next + * interesting rmap to look at after emitting this refcount extent, or + * there are other rmaps in rmap_bag contributing to the current + * sharing count. But if something is seriously wrong, bail out. 
+ */ + if (nbno == NULLAGBLOCK) + return -EFSCORRUPTED; + + *nbnop = nbno; + return 0; +} + +/* + * Walk forward through the rmap btree to collect all rmaps starting at + * @bno in @rmap_bag. These represent the file(s) that share ownership of + * the current block. Upon return, the rmap cursor points to the last record + * satisfying the startblock constraint. + */ +static int +xrep_refc_push_rmaps_at( + struct xrep_refc *rr, + struct xfarray *rmap_bag, + xfs_agblock_t bno, + struct xrep_refc_rmap *rrm, + bool *have, + uint64_t *stack_sz) +{ + struct xfs_scrub *sc = rr->sc; + int have_gt; + int error; + + while (*have && rrm->startblock == bno) { + error = xfarray_store_anywhere(rmap_bag, rrm); + if (error) + return error; + (*stack_sz)++; + error = xrep_refc_walk_rmaps(rr, rrm, have); + if (error) + return error; + } + + error = xfs_btree_decrement(sc->sa.rmap_cur, 0, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(sc->mp, !have_gt)) + return -EFSCORRUPTED; + + return 0; +} + +/* Iterate all the rmap records to generate reference count data. */ +STATIC int +xrep_refc_find_refcounts( + struct xrep_refc *rr) +{ + struct xrep_refc_rmap rrm; + struct xfs_scrub *sc = rr->sc; + struct xfarray *rmap_bag; + uint64_t old_stack_sz; + uint64_t stack_sz = 0; + xfs_agblock_t sbno; + xfs_agblock_t cbno; + xfs_agblock_t nbno; + bool have; + int error; + + xrep_ag_btcur_init(sc, &sc->sa); + + /* + * Set up a sparse array to store all the rmap records that we're + * tracking to generate a reference count record. If this exceeds + * MAXREFCOUNT, we clamp rc_refcount. + */ + error = xfarray_create(sc->mp, "rmap bag", 0, + sizeof(struct xrep_refc_rmap), &rmap_bag); + if (error) + goto out_cur; + + /* Start the rmapbt cursor to the left of all records. */ + error = xfs_btree_goto_left_edge(sc->sa.rmap_cur); + if (error) + goto out_bag; + + /* Process reverse mappings into refcount data. 
*/ + while (xfs_btree_has_more_records(sc->sa.rmap_cur)) { + /* Push all rmaps with pblk == sbno onto the stack */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (!have) + break; + sbno = cbno = rrm.startblock; + error = xrep_refc_push_rmaps_at(rr, rmap_bag, sbno, + &rrm, &have, &stack_sz); + if (error) + goto out_bag; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + old_stack_sz = stack_sz; + + /* While stack isn't empty... */ + while (stack_sz) { + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + + /* Pop all rmaps that end at nbno */ + while ((error = xfarray_iter(rmap_bag, &array_cur, + &rrm)) == 1) { + if (RRM_NEXT(rrm) != nbno) + continue; + error = xfarray_unset(rmap_bag, array_cur - 1); + if (error) + goto out_bag; + stack_sz--; + } + if (error) + goto out_bag; + + /* Push array items that start at nbno */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (have) { + error = xrep_refc_push_rmaps_at(rr, rmap_bag, + nbno, &rrm, &have, &stack_sz); + if (error) + goto out_bag; + } + + /* Emit refcount if necessary */ + ASSERT(nbno > cbno); + if (stack_sz != old_stack_sz) { + if (old_stack_sz > 1) { + error = xrep_refc_stash(rr, + XFS_REFC_DOMAIN_SHARED, + cbno, nbno - cbno, + old_stack_sz); + if (error) + goto out_bag; + } + cbno = nbno; + } + + /* Stack empty, go find the next rmap */ + if (stack_sz == 0) + break; + old_stack_sz = stack_sz; + sbno = nbno; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, + &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + } + } + + ASSERT(stack_sz == 0); +out_bag: + xfarray_destroy(rmap_bag); +out_cur: + xchk_ag_btcur_free(&sc->sa); + return error; +} +#undef RRM_NEXT + +/* Retrieve refcountbt data for bulk load. 
*/ +STATIC int +xrep_refc_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_refcount_irec *irec = &cur->bc_rec.rc; + struct xrep_refc *rr = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + error = xfarray_load(rr->refcount_records, rr->array_cur++, + irec); + if (error) + return error; + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Feed one of the new btree blocks to the bulk loader. */ +STATIC int +xrep_refc_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_refc *rr = priv; + int error; + + error = xrep_newbt_relog_autoreap(&rr->new_btree); + if (error) + return error; + + return xrep_newbt_claim_block(cur, &rr->new_btree, ptr); +} + +/* Update the AGF counters. */ +STATIC int +xrep_refc_reset_counters( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + + /* + * After we commit the new btree to disk, it is possible that the + * process to reap the old btree blocks will race with the AIL trying + * to checkpoint the old btree blocks into the filesystem. If the new + * tree is shorter than the old one, the refcountbt write verifier will + * fail and the AIL will shut down the filesystem. + * + * To avoid this, save the old incore btree height values as the alt + * height values before re-initializing the perag info from the updated + * AGF to capture all the new values. + */ + pag->pagf_alt_refcount_level = pag->pagf_refcount_level; + + /* Reinitialize with the values we just logged. */ + return xrep_reinit_pagf(sc); +} + +/* + * Use the collected refcount information to stage a new refcount btree. 
If + * this is successful we'll return with the new btree root information logged + * to the repair transaction but not yet committed. + */ +STATIC int +xrep_refc_build_new_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_btree_cur *refc_cur; + struct xfs_perag *pag = sc->sa.pag; + xfs_fsblock_t fsbno; + int error; + + error = xrep_refc_sort_records(rr); + if (error) + return error; + + /* + * Prepare to construct the new btree by reserving disk space for the + * new btree and setting up all the accounting information we'll need + * to root the new btree while it's under construction and before we + * attach it to the AG header. + */ + fsbno = XFS_AGB_TO_FSB(sc->mp, pag->pag_agno, xfs_refc_block(sc->mp)); + xrep_newbt_init_ag(&rr->new_btree, sc, &XFS_RMAP_OINFO_REFC, fsbno, + XFS_AG_RESV_METADATA); + rr->new_btree.bload.get_records = xrep_refc_get_records; + rr->new_btree.bload.claim_block = xrep_refc_claim_block; + + /* Compute how many blocks we'll need. */ + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, &rr->new_btree.afake, + pag); + error = xfs_btree_bload_compute_geometry(refc_cur, + &rr->new_btree.bload, + xfarray_length(rr->refcount_records)); + if (error) + goto err_cur; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + goto err_cur; + + /* Reserve the space we'll need for the new btree. */ + error = xrep_newbt_alloc_blocks(&rr->new_btree, + rr->new_btree.bload.nr_blocks); + if (error) + goto err_cur; + + /* + * Due to btree slack factors, it's possible for a new btree to be one + * level taller than the old btree. Update the incore btree height so + * that we don't trip the verifiers when writing the new btree blocks + * to disk. + */ + pag->pagf_alt_refcount_level = rr->new_btree.bload.btree_height; + + /* Add all observed refcount records. 
*/ + rr->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(refc_cur, &rr->new_btree.bload, rr); + if (error) + goto err_level; + + /* + * Install the new btree in the AG header. After this point the old + * btree is no longer accessible and the new tree is live. + */ + xfs_refcountbt_commit_staged_btree(refc_cur, sc->tp, sc->sa.agf_bp); + xfs_btree_del_cursor(refc_cur, 0); + + /* Reset the AGF counters now that we've changed the btree shape. */ + error = xrep_refc_reset_counters(rr); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. */ + error = xrep_newbt_commit(&rr->new_btree); + if (error) + return error; + + return xrep_roll_ag_trans(sc); + +err_level: + pag->pagf_alt_refcount_level = 0; +err_cur: + xfs_btree_del_cursor(refc_cur, error); +err_newbt: + xrep_newbt_cancel(&rr->new_btree); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. + */ +STATIC int +xrep_refc_remove_old_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + int error; + + /* Free the old refcountbt blocks if they're not in use. */ + error = xrep_reap_agblocks(sc, &rr->old_refcountbt_blocks, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); + if (error) + return error; + + /* + * Now that we've zapped all the old refcountbt blocks we can turn off + * the alternate height mechanism and reset the per-AG space + * reservations. + */ + pag->pagf_alt_refcount_level = 0; + sc->flags |= XREP_RESET_PERAG_RESV; + return 0; +} + +/* Rebuild the refcount btree. */ +int +xrep_refcountbt( + struct xfs_scrub *sc) +{ + struct xrep_refc *rr; + struct xfs_mount *mp = sc->mp; + int error; + + /* We require the rmapbt to rebuild anything. 
*/ + if (!xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; + + rr = kzalloc(sizeof(struct xrep_refc), XCHK_GFP_FLAGS); + if (!rr) + return -ENOMEM; + rr->sc = sc; + + /* Set up enough storage to handle one refcount record per block. */ + error = xfarray_create(mp, "refcount records", + mp->m_sb.sb_agblocks, + sizeof(struct xfs_refcount_irec), + &rr->refcount_records); + if (error) + goto out_rr; + + /* Collect all reference counts. */ + xagb_bitmap_init(&rr->old_refcountbt_blocks); + error = xrep_refc_find_refcounts(rr); + if (error) + goto out_bitmap; + + /* Rebuild the refcount information. */ + error = xrep_refc_build_new_tree(rr); + if (error) + goto out_bitmap; + + /* Kill the old tree. */ + error = xrep_refc_remove_old_tree(rr); + +out_bitmap: + xagb_bitmap_destroy(&rr->old_refcountbt_blocks); + xfarray_destroy(rr->refcount_records); +out_rr: + kfree(rr); + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index a9c3ca8e0e8b..bb8afee297cb 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -71,6 +71,7 @@ int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); +int xrep_refcountbt(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -123,6 +124,7 @@ xrep_setup_nothing( #define xrep_agi xrep_notsupported #define xrep_allocbt xrep_notsupported #define xrep_iallocbt xrep_notsupported +#define xrep_refcountbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 410387869338..e95734d0c0ad 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -275,7 +275,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .setup = xchk_setup_ag_refcountbt, .scrub = xchk_refcountbt, .has = xfs_has_reflink, - .repair = xrep_notsupported, + .repair = xrep_refcountbt, }, [XFS_SCRUB_TYPE_INODE] = { /* inode record */ 
.type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 961fbae3d3ca..69eb301fd81e 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1185,27 +1185,29 @@ TRACE_EVENT(xrep_ibt_found, __entry->freemask) ) -TRACE_EVENT(xrep_refcount_extent_fn, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - struct xfs_refcount_irec *irec), - TP_ARGS(mp, agno, irec), +TRACE_EVENT(xrep_refc_found, + TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *rec), + TP_ARGS(pag, rec), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) + __field(enum xfs_refc_domain, domain) __field(xfs_agblock_t, startblock) __field(xfs_extlen_t, blockcount) __field(xfs_nlink_t, refcount) ), TP_fast_assign( - __entry->dev = mp->m_super->s_dev; - __entry->agno = agno; - __entry->startblock = irec->rc_startblock; - __entry->blockcount = irec->rc_blockcount; - __entry->refcount = irec->rc_refcount; + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->domain = rec->rc_domain; + __entry->startblock = rec->rc_startblock; + __entry->blockcount = rec->rc_blockcount; + __entry->refcount = rec->rc_refcount; ), - TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u", + TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, + __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS), __entry->startblock, __entry->blockcount, __entry->refcount) ^ permalink raw reply related [flat|nested] 156+ messages in thread
* [PATCHSET v24.0 0/5] xfs: online repair of AG btrees @ 2022-12-30 22:12 Darrick J. Wong 2022-12-30 22:12 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw) To: djwong; +Cc: linux-xfs Hi all, Now that we've spent a lot of time reworking common code in online fsck, we're ready to start rebuilding the AG space btrees. This series implements repair functions for the free space, inode, and refcount btrees. Rebuilding the reverse mapping btree is much more intense and is left for a subsequent patchset. The fstests counterpart of this patchset implements stress testing of repair. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-ag-btrees --- fs/xfs/Makefile | 3 fs/xfs/libxfs/xfs_ag.h | 10 fs/xfs/libxfs/xfs_ag_resv.c | 2 fs/xfs/libxfs/xfs_alloc.c | 18 - fs/xfs/libxfs/xfs_alloc.h | 2 fs/xfs/libxfs/xfs_alloc_btree.c | 13 - fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_ialloc.c | 41 +- fs/xfs/libxfs/xfs_ialloc.h | 3 fs/xfs/libxfs/xfs_refcount.c | 18 - fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 - fs/xfs/libxfs/xfs_types.h | 7 fs/xfs/scrub/agheader_repair.c | 4 fs/xfs/scrub/alloc.c | 14 - fs/xfs/scrub/alloc_repair.c | 912 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/common.c | 1 fs/xfs/scrub/common.h | 13 + fs/xfs/scrub/ialloc_repair.c | 873 ++++++++++++++++++++++++++++++++++ fs/xfs/scrub/newbt.c | 13 + 
fs/xfs/scrub/newbt.h | 4 fs/xfs/scrub/reap.c | 17 + fs/xfs/scrub/refcount_repair.c | 791 +++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.c | 128 +++++ fs/xfs/scrub/repair.h | 43 ++ fs/xfs/scrub/scrub.c | 22 + fs/xfs/scrub/scrub.h | 9 fs/xfs/scrub/trace.h | 112 +++- fs/xfs/scrub/xfarray.h | 22 + fs/xfs/xfs_extent_busy.c | 13 + fs/xfs/xfs_extent_busy.h | 2 fs/xfs/xfs_icache.c | 122 ++++- fs/xfs/xfs_trace.h | 22 + 34 files changed, 3193 insertions(+), 104 deletions(-) create mode 100644 fs/xfs/scrub/alloc_repair.c create mode 100644 fs/xfs/scrub/ialloc_repair.c create mode 100644 fs/xfs/scrub/refcount_repair.c ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/5] xfs: repair refcount btrees 2022-12-30 22:12 [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong @ 2022-12-30 22:12 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw) To: djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Reconstruct the refcount data from the rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_ag.h | 1 fs/xfs/libxfs/xfs_btree.c | 26 + fs/xfs/libxfs/xfs_btree.h | 2 fs/xfs/libxfs/xfs_refcount.c | 18 + fs/xfs/libxfs/xfs_refcount.h | 2 fs/xfs/libxfs/xfs_refcount_btree.c | 13 + fs/xfs/scrub/refcount_repair.c | 791 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 22 + 11 files changed, 862 insertions(+), 18 deletions(-) create mode 100644 fs/xfs/scrub/refcount_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 48985e83ad4c..c448c2a4d691 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -181,6 +181,7 @@ xfs-y += $(addprefix scrub/, \ ialloc_repair.o \ newbt.o \ reap.o \ + refcount_repair.o \ repair.o \ ) endif diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index bb87f6677495..fd663d04bdff 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -89,6 +89,7 @@ struct xfs_perag { * verifiers while rebuilding the AG btrees. */ uint8_t pagf_alt_levels[XFS_BTNUM_AGF]; + uint8_t pagf_alt_refcount_level; #endif spinlock_t pag_state_lock; diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index 842a710e6c3b..b63650a5d690 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -5198,3 +5198,29 @@ xfs_btree_destroy_cur_caches(void) xfs_rmapbt_destroy_cur_cache(); xfs_refcountbt_destroy_cur_cache(); } + +/* Move the btree cursor before the first record. 
*/ +int +xfs_btree_goto_left_edge( + struct xfs_btree_cur *cur) +{ + int stat = 0; + int error; + + memset(&cur->bc_rec, 0, sizeof(cur->bc_rec)); + error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat); + if (error) + return error; + if (!stat) + return 0; + + error = xfs_btree_decrement(cur, 0, &stat); + if (error) + return error; + if (stat != 0) { + ASSERT(0); + return -EFSCORRUPTED; + } + + return 0; +} diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 2db03f0ae961..5525d3715d57 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -738,4 +738,6 @@ xfs_btree_alloc_cursor( int __init xfs_btree_init_cur_caches(void); void xfs_btree_destroy_cur_caches(void); +int xfs_btree_goto_left_edge(struct xfs_btree_cur *cur); + #endif /* __XFS_BTREE_H__ */ diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index c1c65774dcc2..8082bb7b953a 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -120,14 +120,11 @@ xfs_refcount_btrec_to_irec( irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount); } -/* Simple checks for refcount records. */ -xfs_failaddr_t -xfs_refcount_check_irec( - struct xfs_btree_cur *cur, +inline xfs_failaddr_t +xfs_refcount_check_perag_irec( + struct xfs_perag *pag, const struct xfs_refcount_irec *irec) { - struct xfs_perag *pag = cur->bc_ag.pag; - if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) return __this_address; @@ -144,6 +141,15 @@ xfs_refcount_check_irec( return NULL; } +/* Simple checks for refcount records. 
*/ +xfs_failaddr_t +xfs_refcount_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *irec) +{ + return xfs_refcount_check_perag_irec(cur->bc_ag.pag, irec); +} + static inline int xfs_refcount_complain_bad_rec( struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index 783cd89ca195..2d6fecb258bb 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -117,6 +117,8 @@ extern int xfs_refcount_has_records(struct xfs_btree_cur *cur, union xfs_btree_rec; extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_refcount_irec *irec); +xfs_failaddr_t xfs_refcount_check_perag_irec(struct xfs_perag *pag, + const struct xfs_refcount_irec *irec); xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur, const struct xfs_refcount_irec *irec); extern int xfs_refcount_insert(struct xfs_btree_cur *cur, diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index 2ec45e2ffbe1..1bf991bf452f 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -232,7 +232,18 @@ xfs_refcountbt_verify( level = be16_to_cpu(block->bb_level); if (pag && pag->pagf_init) { - if (level >= pag->pagf_refcount_level) + unsigned int maxlevel = pag->pagf_refcount_level; + +#ifdef CONFIG_XFS_ONLINE_REPAIR + /* + * Online repair could be rewriting the refcount btree, so + * we'll validate against the larger of either tree while this + * is going on. + */ + maxlevel = max_t(unsigned int, maxlevel, + pag->pagf_alt_refcount_level); +#endif + if (level >= maxlevel) return __this_address; } else if (level >= mp->m_refc_maxlevels) return __this_address; diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c new file mode 100644 index 000000000000..d3f0384d084d --- /dev/null +++ b/fs/xfs/scrub/refcount_repair.c @@ -0,0 +1,791 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022 Oracle. 
All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_btree_staging.h" +#include "xfs_inode.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_alloc.h" +#include "xfs_ialloc.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_refcount.h" +#include "xfs_refcount_btree.h" +#include "xfs_error.h" +#include "xfs_ag.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/newbt.h" +#include "scrub/reap.h" + +/* + * Rebuilding the Reference Count Btree + * ==================================== + * + * This algorithm is "borrowed" from xfs_repair. Imagine the rmap + * entries as rectangles representing extents of physical blocks, and + * that the rectangles can be laid down to allow them to overlap each + * other; then we know that we must emit a refcnt btree entry wherever + * the amount of overlap changes, i.e. the emission stimulus is + * level-triggered: + * + * - --- + * -- ----- ---- --- ------ + * -- ---- ----------- ---- --------- + * -------------------------------- ----------- + * ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + * 2 1 23 21 3 43 234 2123 1 01 2 3 0 + * + * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner). + * + * Note that in the actual refcnt btree we don't store the refcount < 2 + * cases because the bnobt tells us which blocks are free; single-use + * blocks aren't recorded in the bnobt or the refcntbt. 
If the rmapbt + * supports storing multiple entries covering a given block we could + * theoretically dispense with the refcntbt and simply count rmaps, but + * that's inefficient in the (hot) write path, so we'll take the cost of + * the extra tree to save time. Also there's no guarantee that rmap + * will be enabled. + * + * Given an array of rmaps sorted by physical block number, a starting + * physical block (sp), a bag to hold rmaps that cover sp, and the next + * physical block where the level changes (np), we can reconstruct the + * refcount btree as follows: + * + * While there are still unprocessed rmaps in the array, + * - Set sp to the physical block (pblk) of the next unprocessed rmap. + * - Add to the bag all rmaps in the array where startblock == sp. + * - Set np to the physical block where the bag size will change. This + * is the minimum of (the pblk of the next unprocessed rmap) and + * (startblock + len of each rmap in the bag). + * - Record the bag size as old_bag_size. + * + * - While the bag isn't empty, + * - Remove from the bag all rmaps where startblock + len == np. + * - Add to the bag all rmaps in the array where startblock == np. + * - If the bag size isn't old_bag_size, store the refcount entry + * (sp, np - sp, bag_size) in the refcnt btree. + * - If the bag is empty, break out of the inner loop. + * - Set old_bag_size to the bag size + * - Set sp = np. + * - Set np to the physical block where the bag size will change. + * This is the minimum of (the pblk of the next unprocessed rmap) + * and (startblock + len of each rmap in the bag). + * + * Like all the other repairers, we make a list of all the refcount + * records we need, then reinitialize the refcount btree root and + * insert all the records. + */ + +/* The only parts of the rmap that we care about for computing refcounts. 
*/ +struct xrep_refc_rmap { + xfs_agblock_t startblock; + xfs_extlen_t blockcount; +} __packed; + +struct xrep_refc { + /* refcount extents */ + struct xfarray *refcount_records; + + /* new refcountbt information */ + struct xrep_newbt new_btree; + + /* old refcountbt blocks */ + struct xagb_bitmap old_refcountbt_blocks; + + struct xfs_scrub *sc; + + /* get_records()'s position in the refcount record array. */ + xfarray_idx_t array_cur; + + /* # of refcountbt blocks */ + xfs_extlen_t btblocks; +}; + +/* Check for any obvious conflicts with this shared/CoW staging extent. */ +STATIC int +xrep_refc_check_ext( + struct xfs_scrub *sc, + const struct xfs_refcount_irec *rec) +{ + enum xbtree_recpacking outcome; + int error; + + if (xfs_refcount_check_perag_irec(sc->sa.pag, rec) != NULL) + return -EFSCORRUPTED; + + /* Make sure this isn't free space. */ + error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rc_startblock, + rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + /* Must not be an inode chunk. */ + error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur, + rec->rc_startblock, rec->rc_blockcount, &outcome); + if (error) + return error; + if (outcome != XBTREE_RECPACKING_EMPTY) + return -EFSCORRUPTED; + + return 0; +} + +/* Record a reference count extent. 
*/ +STATIC int +xrep_refc_stash( + struct xrep_refc *rr, + enum xfs_refc_domain domain, + xfs_agblock_t agbno, + xfs_extlen_t len, + uint64_t refcount) +{ + struct xfs_refcount_irec irec = { + .rc_startblock = agbno, + .rc_blockcount = len, + .rc_domain = domain, + }; + struct xfs_scrub *sc = rr->sc; + int error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + irec.rc_refcount = min_t(uint64_t, MAXREFCOUNT, refcount); + + error = xrep_refc_check_ext(rr->sc, &irec); + if (error) + return error; + + trace_xrep_refc_found(sc->sa.pag, &irec); + + return xfarray_append(rr->refcount_records, &irec); +} + +/* Record a CoW staging extent. */ +STATIC int +xrep_refc_stash_cow( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + return xrep_refc_stash(rr, XFS_REFC_DOMAIN_COW, agbno, len, 1); +} + +/* Decide if an rmap could describe a shared extent. */ +static inline bool +xrep_refc_rmap_shareable( + struct xfs_mount *mp, + const struct xfs_rmap_irec *rmap) +{ + /* AG metadata are never sharable */ + if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner)) + return false; + + /* Metadata in files are never shareable */ + if (xfs_internal_inum(mp, rmap->rm_owner)) + return false; + + /* Metadata and unwritten file blocks are not shareable. */ + if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK | + XFS_RMAP_UNWRITTEN)) + return false; + + return true; +} + +/* + * Walk along the reverse mapping records until we find one that could describe + * a shared extent. + */ +STATIC int +xrep_refc_walk_rmaps( + struct xrep_refc *rr, + struct xrep_refc_rmap *rrm, + bool *have_rec) +{ + struct xfs_rmap_irec rmap; + struct xfs_btree_cur *cur = rr->sc->sa.rmap_cur; + struct xfs_mount *mp = cur->bc_mp; + int have_gt; + int error = 0; + + *have_rec = false; + + /* + * Loop through the remaining rmaps. Remember CoW staging + * extents and the refcountbt blocks from the old tree for later + * disposal. 
We can only share written data fork extents, so + * keep looping until we find an rmap for one. + */ + do { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfs_btree_increment(cur, 0, &have_gt); + if (error) + return error; + if (!have_gt) + return 0; + + error = xfs_rmap_get_rec(cur, &rmap, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(mp, !have_gt)) + return -EFSCORRUPTED; + + if (rmap.rm_owner == XFS_RMAP_OWN_COW) { + error = xrep_refc_stash_cow(rr, rmap.rm_startblock, + rmap.rm_blockcount); + if (error) + return error; + } else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) { + /* refcountbt block, dump it when we're done. */ + rr->btblocks += rmap.rm_blockcount; + error = xagb_bitmap_set(&rr->old_refcountbt_blocks, + rmap.rm_startblock, rmap.rm_blockcount); + if (error) + return error; + } + } while (!xrep_refc_rmap_shareable(mp, &rmap)); + + rrm->startblock = rmap.rm_startblock; + rrm->blockcount = rmap.rm_blockcount; + *have_rec = true; + return 0; +} + +static inline uint32_t +xrep_refc_encode_startblock( + const struct xfs_refcount_irec *irec) +{ + uint32_t start; + + start = irec->rc_startblock & ~XFS_REFC_COWFLAG; + if (irec->rc_domain == XFS_REFC_DOMAIN_COW) + start |= XFS_REFC_COWFLAG; + + return start; +} + +/* Sort in the same order as the ondisk records. */ +static int +xrep_refc_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_refcount_irec *ap = a; + const struct xfs_refcount_irec *bp = b; + uint32_t sa, sb; + + sa = xrep_refc_encode_startblock(ap); + sb = xrep_refc_encode_startblock(bp); + + if (sa > sb) + return 1; + if (sa < sb) + return -1; + return 0; +} + +/* + * Sort the refcount extents by startblock or else the btree records will be in + * the wrong order. Make sure the records do not overlap in physical space. 
+ */ +STATIC int +xrep_refc_sort_records( + struct xrep_refc *rr) +{ + struct xfs_refcount_irec irec; + xfarray_idx_t cur; + enum xfs_refc_domain dom = XFS_REFC_DOMAIN_SHARED; + xfs_agblock_t next_agbno = 0; + int error; + + error = xfarray_sort(rr->refcount_records, xrep_refc_extent_cmp, + XFARRAY_SORT_KILLABLE); + if (error) + return error; + + foreach_xfarray_idx(rr->refcount_records, cur) { + if (xchk_should_terminate(rr->sc, &error)) + return error; + + error = xfarray_load(rr->refcount_records, cur, &irec); + if (error) + return error; + + if (dom == XFS_REFC_DOMAIN_SHARED && + irec.rc_domain == XFS_REFC_DOMAIN_COW) { + dom = irec.rc_domain; + next_agbno = 0; + } + + if (dom != irec.rc_domain) + return -EFSCORRUPTED; + if (irec.rc_startblock < next_agbno) + return -EFSCORRUPTED; + + next_agbno = irec.rc_startblock + irec.rc_blockcount; + } + + return error; +} + +#define RRM_NEXT(r) ((r).startblock + (r).blockcount) +/* + * Find the next block where the refcount changes, given the next rmap we + * looked at and the ones we're already tracking. + */ +static inline int +xrep_refc_next_edge( + struct xfarray *rmap_bag, + struct xrep_refc_rmap *next_rrm, + bool next_valid, + xfs_agblock_t *nbnop) +{ + struct xrep_refc_rmap rrm; + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + xfs_agblock_t nbno = NULLAGBLOCK; + int error; + + if (next_valid) + nbno = next_rrm->startblock; + + while ((error = xfarray_iter(rmap_bag, &array_cur, &rrm)) == 1) + nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm)); + + if (error) + return error; + + /* + * We should have found /something/ because either next_rrm is the next + * interesting rmap to look at after emitting this refcount extent, or + * there are other rmaps in rmap_bag contributing to the current + * sharing count. But if something is seriously wrong, bail out. 
+ */ + if (nbno == NULLAGBLOCK) + return -EFSCORRUPTED; + + *nbnop = nbno; + return 0; +} + +/* + * Walk forward through the rmap btree to collect all rmaps starting at + * @bno in @rmap_bag. These represent the file(s) that share ownership of + * the current block. Upon return, the rmap cursor points to the last record + * satisfying the startblock constraint. + */ +static int +xrep_refc_push_rmaps_at( + struct xrep_refc *rr, + struct xfarray *rmap_bag, + xfs_agblock_t bno, + struct xrep_refc_rmap *rrm, + bool *have, + uint64_t *stack_sz) +{ + struct xfs_scrub *sc = rr->sc; + int have_gt; + int error; + + while (*have && rrm->startblock == bno) { + error = xfarray_store_anywhere(rmap_bag, rrm); + if (error) + return error; + (*stack_sz)++; + error = xrep_refc_walk_rmaps(rr, rrm, have); + if (error) + return error; + } + + error = xfs_btree_decrement(sc->sa.rmap_cur, 0, &have_gt); + if (error) + return error; + if (XFS_IS_CORRUPT(sc->mp, !have_gt)) + return -EFSCORRUPTED; + + return 0; +} + +/* Iterate all the rmap records to generate reference count data. */ +STATIC int +xrep_refc_find_refcounts( + struct xrep_refc *rr) +{ + struct xrep_refc_rmap rrm; + struct xfs_scrub *sc = rr->sc; + struct xfarray *rmap_bag; + uint64_t old_stack_sz; + uint64_t stack_sz = 0; + xfs_agblock_t sbno; + xfs_agblock_t cbno; + xfs_agblock_t nbno; + bool have; + int error; + + xrep_ag_btcur_init(sc, &sc->sa); + + /* + * Set up a sparse array to store all the rmap records that we're + * tracking to generate a reference count record. If this exceeds + * MAXREFCOUNT, we clamp rc_refcount. + */ + error = xfarray_create(sc->mp, "rmap bag", 0, + sizeof(struct xrep_refc_rmap), &rmap_bag); + if (error) + goto out_cur; + + /* Start the rmapbt cursor to the left of all records. */ + error = xfs_btree_goto_left_edge(sc->sa.rmap_cur); + if (error) + goto out_bag; + + /* Process reverse mappings into refcount data. 
*/ + while (xfs_btree_has_more_records(sc->sa.rmap_cur)) { + /* Push all rmaps with pblk == sbno onto the stack */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (!have) + break; + sbno = cbno = rrm.startblock; + error = xrep_refc_push_rmaps_at(rr, rmap_bag, sbno, + &rrm, &have, &stack_sz); + if (error) + goto out_bag; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + old_stack_sz = stack_sz; + + /* While stack isn't empty... */ + while (stack_sz) { + xfarray_idx_t array_cur = XFARRAY_CURSOR_INIT; + + /* Pop all rmaps that end at nbno */ + while ((error = xfarray_iter(rmap_bag, &array_cur, + &rrm)) == 1) { + if (RRM_NEXT(rrm) != nbno) + continue; + error = xfarray_unset(rmap_bag, array_cur - 1); + if (error) + goto out_bag; + stack_sz--; + } + if (error) + goto out_bag; + + /* Push array items that start at nbno */ + error = xrep_refc_walk_rmaps(rr, &rrm, &have); + if (error) + goto out_bag; + if (have) { + error = xrep_refc_push_rmaps_at(rr, rmap_bag, + nbno, &rrm, &have, &stack_sz); + if (error) + goto out_bag; + } + + /* Emit refcount if necessary */ + ASSERT(nbno > cbno); + if (stack_sz != old_stack_sz) { + if (old_stack_sz > 1) { + error = xrep_refc_stash(rr, + XFS_REFC_DOMAIN_SHARED, + cbno, nbno - cbno, + old_stack_sz); + if (error) + goto out_bag; + } + cbno = nbno; + } + + /* Stack empty, go find the next rmap */ + if (stack_sz == 0) + break; + old_stack_sz = stack_sz; + sbno = nbno; + + /* Set nbno to the bno of the next refcount change */ + error = xrep_refc_next_edge(rmap_bag, &rrm, have, + &nbno); + if (error) + goto out_bag; + + ASSERT(nbno > sbno); + } + } + + ASSERT(stack_sz == 0); +out_bag: + xfarray_destroy(rmap_bag); +out_cur: + xchk_ag_btcur_free(&sc->sa); + return error; +} +#undef RRM_NEXT + +/* Retrieve refcountbt data for bulk load. 
*/ +STATIC int +xrep_refc_get_records( + struct xfs_btree_cur *cur, + unsigned int idx, + struct xfs_btree_block *block, + unsigned int nr_wanted, + void *priv) +{ + struct xfs_refcount_irec *irec = &cur->bc_rec.rc; + struct xrep_refc *rr = priv; + union xfs_btree_rec *block_rec; + unsigned int loaded; + int error; + + for (loaded = 0; loaded < nr_wanted; loaded++, idx++) { + error = xfarray_load(rr->refcount_records, rr->array_cur++, + irec); + if (error) + return error; + + block_rec = xfs_btree_rec_addr(cur, idx, block); + cur->bc_ops->init_rec_from_cur(cur, block_rec); + } + + return loaded; +} + +/* Feed one of the new btree blocks to the bulk loader. */ +STATIC int +xrep_refc_claim_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_refc *rr = priv; + int error; + + error = xrep_newbt_relog_autoreap(&rr->new_btree); + if (error) + return error; + + return xrep_newbt_claim_block(cur, &rr->new_btree, ptr); +} + +/* Update the AGF counters. */ +STATIC int +xrep_refc_reset_counters( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + + /* + * After we commit the new btree to disk, it is possible that the + * process to reap the old btree blocks will race with the AIL trying + * to checkpoint the old btree blocks into the filesystem. If the new + * tree is shorter than the old one, the refcountbt write verifier will + * fail and the AIL will shut down the filesystem. + * + * To avoid this, save the old incore btree height values as the alt + * height values before re-initializing the perag info from the updated + * AGF to capture all the new values. + */ + pag->pagf_alt_refcount_level = pag->pagf_refcount_level; + + /* Reinitialize with the values we just logged. */ + return xrep_reinit_pagf(sc); +} + +/* + * Use the collected refcount information to stage a new refcount btree. 
If + * this is successful we'll return with the new btree root information logged + * to the repair transaction but not yet committed. + */ +STATIC int +xrep_refc_build_new_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_btree_cur *refc_cur; + struct xfs_perag *pag = sc->sa.pag; + xfs_fsblock_t fsbno; + int error; + + error = xrep_refc_sort_records(rr); + if (error) + return error; + + /* + * Prepare to construct the new btree by reserving disk space for the + * new btree and setting up all the accounting information we'll need + * to root the new btree while it's under construction and before we + * attach it to the AG header. + */ + fsbno = XFS_AGB_TO_FSB(sc->mp, pag->pag_agno, xfs_refc_block(sc->mp)); + xrep_newbt_init_ag(&rr->new_btree, sc, &XFS_RMAP_OINFO_REFC, fsbno, + XFS_AG_RESV_METADATA); + rr->new_btree.bload.get_records = xrep_refc_get_records; + rr->new_btree.bload.claim_block = xrep_refc_claim_block; + + /* Compute how many blocks we'll need. */ + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, &rr->new_btree.afake, + pag); + error = xfs_btree_bload_compute_geometry(refc_cur, + &rr->new_btree.bload, + xfarray_length(rr->refcount_records)); + if (error) + goto err_cur; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + goto err_cur; + + /* Reserve the space we'll need for the new btree. */ + error = xrep_newbt_alloc_blocks(&rr->new_btree, + rr->new_btree.bload.nr_blocks); + if (error) + goto err_cur; + + /* + * Due to btree slack factors, it's possible for a new btree to be one + * level taller than the old btree. Update the incore btree height so + * that we don't trip the verifiers when writing the new btree blocks + * to disk. + */ + pag->pagf_alt_refcount_level = rr->new_btree.bload.btree_height; + + /* Add all observed refcount records. 
*/ + rr->array_cur = XFARRAY_CURSOR_INIT; + error = xfs_btree_bload(refc_cur, &rr->new_btree.bload, rr); + if (error) + goto err_level; + + /* + * Install the new btree in the AG header. After this point the old + * btree is no longer accessible and the new tree is live. + */ + xfs_refcountbt_commit_staged_btree(refc_cur, sc->tp, sc->sa.agf_bp); + xfs_btree_del_cursor(refc_cur, 0); + + /* Reset the AGF counters now that we've changed the btree shape. */ + error = xrep_refc_reset_counters(rr); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. */ + error = xrep_newbt_commit(&rr->new_btree); + if (error) + return error; + + return xrep_roll_ag_trans(sc); + +err_level: + pag->pagf_alt_refcount_level = 0; +err_cur: + xfs_btree_del_cursor(refc_cur, error); +err_newbt: + xrep_newbt_cancel(&rr->new_btree); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. + */ +STATIC int +xrep_refc_remove_old_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + int error; + + /* Free the old refcountbt blocks if they're not in use. */ + error = xrep_reap_agblocks(sc, &rr->old_refcountbt_blocks, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); + if (error) + return error; + + /* + * Now that we've zapped all the old refcountbt blocks we can turn off + * the alternate height mechanism and reset the per-AG space + * reservations. + */ + pag->pagf_alt_refcount_level = 0; + sc->flags |= XREP_RESET_PERAG_RESV; + return 0; +} + +/* Rebuild the refcount btree. */ +int +xrep_refcountbt( + struct xfs_scrub *sc) +{ + struct xrep_refc *rr; + struct xfs_mount *mp = sc->mp; + int error; + + /* We require the rmapbt to rebuild anything. 
*/ + if (!xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; + + rr = kzalloc(sizeof(struct xrep_refc), XCHK_GFP_FLAGS); + if (!rr) + return -ENOMEM; + rr->sc = sc; + + /* Set up enough storage to handle one refcount record per block. */ + error = xfarray_create(mp, "refcount records", + mp->m_sb.sb_agblocks, + sizeof(struct xfs_refcount_irec), + &rr->refcount_records); + if (error) + goto out_rr; + + /* Collect all reference counts. */ + xagb_bitmap_init(&rr->old_refcountbt_blocks); + error = xrep_refc_find_refcounts(rr); + if (error) + goto out_bitmap; + + /* Rebuild the refcount information. */ + error = xrep_refc_build_new_tree(rr); + if (error) + goto out_bitmap; + + /* Kill the old tree. */ + error = xrep_refc_remove_old_tree(rr); + +out_bitmap: + xagb_bitmap_destroy(&rr->old_refcountbt_blocks); + xfarray_destroy(rr->refcount_records); +out_rr: + kfree(rr); + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index b6e60362b7cb..e93cae73cf61 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -71,6 +71,7 @@ int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); +int xrep_refcountbt(struct xfs_scrub *sc); int xrep_reinit_pagf(struct xfs_scrub *sc); int xrep_reinit_pagi(struct xfs_scrub *sc); @@ -123,6 +124,7 @@ xrep_setup_nothing( #define xrep_agi xrep_notsupported #define xrep_allocbt xrep_notsupported #define xrep_iallocbt xrep_notsupported +#define xrep_refcountbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index aef30515c050..449c3e623c63 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -277,7 +277,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .setup = xchk_setup_ag_refcountbt, .scrub = xchk_refcountbt, .has = xfs_has_reflink, - .repair = xrep_notsupported, + .repair = xrep_refcountbt, }, [XFS_SCRUB_TYPE_INODE] = { /* inode record */ 
.type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 5e66be26055b..8532dcd16630 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1186,27 +1186,29 @@ TRACE_EVENT(xrep_ibt_found, __entry->freemask) ) -TRACE_EVENT(xrep_refcount_extent_fn, - TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - struct xfs_refcount_irec *irec), - TP_ARGS(mp, agno, irec), +TRACE_EVENT(xrep_refc_found, + TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *rec), + TP_ARGS(pag, rec), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) + __field(enum xfs_refc_domain, domain) __field(xfs_agblock_t, startblock) __field(xfs_extlen_t, blockcount) __field(xfs_nlink_t, refcount) ), TP_fast_assign( - __entry->dev = mp->m_super->s_dev; - __entry->agno = agno; - __entry->startblock = irec->rc_startblock; - __entry->blockcount = irec->rc_blockcount; - __entry->refcount = irec->rc_refcount; + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->domain = rec->rc_domain; + __entry->startblock = rec->rc_startblock; + __entry->blockcount = rec->rc_blockcount; + __entry->refcount = rec->rc_refcount; ), - TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u", + TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno, + __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS), __entry->startblock, __entry->blockcount, __entry->refcount) ^ permalink raw reply related [flat|nested] 156+ messages in thread
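[Editor's note] The record collection in xrep_refc_find_refcounts() above is the level-triggered sweep borrowed from xfs_repair: rmap extents arrive sorted by start block, a bag holds every mapping that covers the current sweep position, and a refcount record is emitted whenever the bag size changes while more than one mapping is held. The following standalone C sketch shows only that sweep logic; the fixed-size bag, the simplified types, and the function names are illustrative assumptions, not the kernel's xfarray-backed data structures:

```c
#include <assert.h>
#include <limits.h>

struct rmap { unsigned int start, len; };		/* simplified rmap extent */
struct refc { unsigned int start, len, refcount; };	/* emitted refcount record */

/*
 * Sweep rmap extents (sorted by start block) and emit a refcount record
 * wherever the number of overlapping mappings changes, skipping the
 * refcount < 2 case just as the refcount btree does.  Returns the number
 * of records written to @out.
 */
static unsigned int
build_refcounts(const struct rmap *rm, unsigned int nr,
		struct refc *out, unsigned int max_out)
{
	struct rmap bag[16];	/* the "rmap bag"; 16 is plenty for a sketch */
	unsigned int bag_sz = 0, old_bag_sz;
	unsigned int i = 0, nout = 0;
	unsigned int sp, np;

	while (i < nr) {
		/* Push all rmaps starting at the next unprocessed block. */
		sp = rm[i].start;
		while (i < nr && rm[i].start == sp) {
			assert(bag_sz < 16);
			bag[bag_sz++] = rm[i++];
		}
		old_bag_sz = bag_sz;

		while (bag_sz) {
			/* np: next block where the overlap level changes. */
			np = (i < nr) ? rm[i].start : UINT_MAX;
			for (unsigned int j = 0; j < bag_sz; j++)
				if (bag[j].start + bag[j].len < np)
					np = bag[j].start + bag[j].len;

			/* Pop every bag member that ends at np. */
			for (unsigned int j = 0; j < bag_sz; )
				if (bag[j].start + bag[j].len == np)
					bag[j] = bag[--bag_sz];
				else
					j++;

			/* Push every rmap that starts at np. */
			while (i < nr && rm[i].start == np) {
				assert(bag_sz < 16);
				bag[bag_sz++] = rm[i++];
			}

			/* Emit a record if the sharing level changed. */
			if (bag_sz != old_bag_sz) {
				if (old_bag_sz > 1 && nout < max_out) {
					out[nout].start = sp;
					out[nout].len = np - sp;
					out[nout].refcount = old_bag_sz;
					nout++;
				}
				sp = np;
			}
			old_bag_sz = bag_sz;
		}
	}
	return nout;
}
```

The kernel version streams the rmaps straight from the rmapbt cursor and keeps the bag in an xfarray so it can use pageable memory, but the emission logic is the same shape.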
* [PATCH v22 0/5] xfs: online repair of AG btrees @ 2020-01-01 1:02 Darrick J. Wong 2020-01-01 1:03 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2020-01-01 1:02 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs Hi all, This is the first part of the twenty-second revision of a patchset that adds to XFS kernel support for online metadata scrubbing and repair. There aren't any on-disk format changes. New for this version is a rebase against 5.5-rc4, bulk loading of btrees, integration with the health reporting subsystem, and the explicit revalidation of all metadata structures that were rebuilt. First, create a new data structure that provides an abstraction of a big memory array by using linked lists. This is where we store records for btree reconstruction. This first implementation is memory inefficient and consumes a /lot/ of kernel memory, but lays the groundwork for a later patch in the set to convert the implementation to use a (memfd) swap file, which enables us to use pageable memory without pounding the slab cache. The three patches after that implement reconstruction of the free space btrees, inode btrees, and reference count btree. The reverse mapping btree requires considerably more thought and will be covered later. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees ^ permalink raw reply [flat|nested] 156+ messages in thread
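[Editor's note] The cover letter above leans on the "big memory array" abstraction for staging records before bulk-loading a btree. As a rough userspace picture of that interface — the names, the realloc-backed storage, and the error handling here are all assumptions for illustration, not the kernel API — the repair code only needs append, sort, and indexed load:

```c
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for the big-array interface the repair code relies on. */
struct bigarray {
	char		*data;		/* contiguous here; lists/memfd in-kernel */
	size_t		obj_size;	/* size of one fixed-size record */
	size_t		nr;		/* records appended so far */
	size_t		cap;		/* records allocated */
};

static struct bigarray *bigarray_init(size_t obj_size)
{
	struct bigarray *ba = calloc(1, sizeof(*ba));

	if (ba)
		ba->obj_size = obj_size;
	return ba;
}

/* Append a copy of @obj, growing the backing store as needed. */
static int bigarray_append(struct bigarray *ba, const void *obj)
{
	if (ba->nr == ba->cap) {
		size_t ncap = ba->cap ? ba->cap * 2 : 16;
		char *ndata = realloc(ba->data, ncap * ba->obj_size);

		if (!ndata)
			return -1;
		ba->data = ndata;
		ba->cap = ncap;
	}
	memcpy(ba->data + ba->nr * ba->obj_size, obj, ba->obj_size);
	ba->nr++;
	return 0;
}

/* Sort all records in place before feeding them to the bulk loader. */
static void bigarray_sort(struct bigarray *ba,
			  int (*cmp)(const void *, const void *))
{
	qsort(ba->data, ba->nr, ba->obj_size, cmp);
}

/* Load record @idx, mirroring the cursor-driven load during bulk load. */
static void bigarray_load(const struct bigarray *ba, size_t idx, void *obj)
{
	memcpy(obj, ba->data + idx * ba->obj_size, ba->obj_size);
}

static void bigarray_destroy(struct bigarray *ba)
{
	free(ba->data);
	free(ba);
}

/* Example comparator: sort ints ascending. */
static int cmp_int(const void *a, const void *b)
{
	int ia = *(const int *)a, ib = *(const int *)b;

	return (ia > ib) - (ia < ib);
}
```

The kernel implementation swaps the contiguous buffer for linked lists and, as the cover letter says, later for a memfd-backed file, precisely so that repairing a huge filesystem does not pin that much kernel memory; the calling convention sketched above is what the refcount rebuild consumes.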
* [PATCH 5/5] xfs: repair refcount btrees 2020-01-01 1:02 [PATCH v22 0/5] xfs: online repair of AG btrees Darrick J. Wong @ 2020-01-01 1:03 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2020-01-01 1:03 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs From: Darrick J. Wong <darrick.wong@oracle.com> Reconstruct the refcount data from the rmap btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/Makefile | 1 fs/xfs/scrub/refcount_repair.c | 611 ++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 13 - 5 files changed, 622 insertions(+), 7 deletions(-) create mode 100644 fs/xfs/scrub/refcount_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 5e37417c6992..7506dc7092e2 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -163,6 +163,7 @@ xfs-y += $(addprefix scrub/, \ array.o \ bitmap.o \ ialloc_repair.o \ + refcount_repair.o \ repair.o \ xfile.o \ ) diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c new file mode 100644 index 000000000000..a4083c0f3feb --- /dev/null +++ b/fs/xfs/scrub/refcount_repair.c @@ -0,0 +1,611 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2019 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <darrick.wong@oracle.com> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_alloc.h" +#include "xfs_ialloc.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_refcount.h" +#include "xfs_refcount_btree.h" +#include "xfs_error.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/array.h" + +/* + * Rebuilding the Reference Count Btree + * ==================================== + * + * This algorithm is "borrowed" from xfs_repair. Imagine the rmap + * entries as rectangles representing extents of physical blocks, and + * that the rectangles can be laid down to allow them to overlap each + * other; then we know that we must emit a refcnt btree entry wherever + * the amount of overlap changes, i.e. the emission stimulus is + * level-triggered: + * + * - --- + * -- ----- ---- --- ------ + * -- ---- ----------- ---- --------- + * -------------------------------- ----------- + * ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + * 2 1 23 21 3 43 234 2123 1 01 2 3 0 + * + * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner). + * + * Note that in the actual refcnt btree we don't store the refcount < 2 + * cases because the bnobt tells us which blocks are free; single-use + * blocks aren't recorded in the bnobt or the refcntbt. If the rmapbt + * supports storing multiple entries covering a given block we could + * theoretically dispense with the refcntbt and simply count rmaps, but + * that's inefficient in the (hot) write path, so we'll take the cost of + * the extra tree to save time. Also there's no guarantee that rmap + * will be enabled. 
+ * + * Given an array of rmaps sorted by physical block number, a starting + * physical block (sp), a bag to hold rmaps that cover sp, and the next + * physical block where the level changes (np), we can reconstruct the + * refcount btree as follows: + * + * While there are still unprocessed rmaps in the array, + * - Set sp to the physical block (pblk) of the next unprocessed rmap. + * - Add to the bag all rmaps in the array where startblock == sp. + * - Set np to the physical block where the bag size will change. This + * is the minimum of (the pblk of the next unprocessed rmap) and + * (startblock + len of each rmap in the bag). + * - Record the bag size as old_bag_size. + * + * - While the bag isn't empty, + * - Remove from the bag all rmaps where startblock + len == np. + * - Add to the bag all rmaps in the array where startblock == np. + * - If the bag size isn't old_bag_size, store the refcount entry + * (sp, np - sp, bag_size) in the refcnt btree. + * - If the bag is empty, break out of the inner loop. + * - Set old_bag_size to the bag size + * - Set sp = np. + * - Set np to the physical block where the bag size will change. + * This is the minimum of (the pblk of the next unprocessed rmap) + * and (startblock + len of each rmap in the bag). + * + * Like all the other repairers, we make a list of all the refcount + * records we need, then reinitialize the refcount btree root and + * insert all the records. + */ + +/* The only parts of the rmap that we care about for computing refcounts. 
*/ +struct xrep_refc_rmap { + xfs_agblock_t startblock; + xfs_extlen_t blockcount; +} __packed; + +struct xrep_refc { + /* refcount extents */ + struct xfbma *refcount_records; + + /* new refcountbt information */ + struct xrep_newbt new_btree_info; + struct xfs_btree_bload refc_bload; + + /* old refcountbt blocks */ + struct xbitmap old_refcountbt_blocks; + + struct xfs_scrub *sc; + + /* # of refcountbt blocks */ + xfs_extlen_t btblocks; + + /* get_data()'s position in the free space record array. */ + uint64_t iter; +}; + +/* Record a reference count extent. */ +STATIC int +xrep_refc_stash( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len, + xfs_nlink_t refcount) +{ + struct xfs_refcount_irec irec = { + .rc_startblock = agbno, + .rc_blockcount = len, + .rc_refcount = refcount, + }; + int error = 0; + + trace_xrep_refc_found(rr->sc->mp, rr->sc->sa.agno, agbno, len, + refcount); + + if (xchk_should_terminate(rr->sc, &error)) + return error; + + return xfbma_append(rr->refcount_records, &irec); +} + +/* Record a CoW staging extent. */ +STATIC int +xrep_refc_stash_cow( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + return xrep_refc_stash(rr, agbno + XFS_REFC_COW_START, len, 1); +} + +/* Grab the next (abbreviated) rmap record from the rmapbt. */ +STATIC int +xrep_refc_next_rrm( + struct xfs_btree_cur *cur, + struct xrep_refc *rr, + struct xrep_refc_rmap *rrm, + bool *have_rec) +{ + struct xfs_rmap_irec rmap; + struct xfs_mount *mp = cur->bc_mp; + xfs_fsblock_t fsbno; + int have_gt; + int error = 0; + + *have_rec = false; + /* + * Loop through the remaining rmaps. Remember CoW staging + * extents and the refcountbt blocks from the old tree for later + * disposal. We can only share written data fork extents, so + * keep looping until we find an rmap for one. 
+ */ + do { + if (xchk_should_terminate(rr->sc, &error)) + goto out_error; + + error = xfs_btree_increment(cur, 0, &have_gt); + if (error) + goto out_error; + if (!have_gt) + return 0; + + error = xfs_rmap_get_rec(cur, &rmap, &have_gt); + if (error) + goto out_error; + if (XFS_IS_CORRUPT(mp, !have_gt)) { + error = -EFSCORRUPTED; + goto out_error; + } + + if (rmap.rm_owner == XFS_RMAP_OWN_COW) { + error = xrep_refc_stash_cow(rr, rmap.rm_startblock, + rmap.rm_blockcount); + if (error) + goto out_error; + } else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) { + /* refcountbt block, dump it when we're done. */ + rr->btblocks += rmap.rm_blockcount; + fsbno = XFS_AGB_TO_FSB(cur->bc_mp, + cur->bc_private.a.agno, + rmap.rm_startblock); + error = xbitmap_set(&rr->old_refcountbt_blocks, + fsbno, rmap.rm_blockcount); + if (error) + goto out_error; + } + } while (XFS_RMAP_NON_INODE_OWNER(rmap.rm_owner) || + xfs_internal_inum(mp, rmap.rm_owner) || + (rmap.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK | + XFS_RMAP_UNWRITTEN))); + + rrm->startblock = rmap.rm_startblock; + rrm->blockcount = rmap.rm_blockcount; + *have_rec = true; + return 0; + +out_error: + return error; +} + +/* Compare two btree extents. */ +static int +xrep_refc_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_refcount_irec *ap = a; + const struct xfs_refcount_irec *bp = b; + + if (ap->rc_startblock > bp->rc_startblock) + return 1; + else if (ap->rc_startblock < bp->rc_startblock) + return -1; + return 0; +} + +#define RRM_NEXT(r) ((r).startblock + (r).blockcount) +/* + * Find the next block where the refcount changes, given the next rmap we + * looked at and the ones we're already tracking. + */ +static inline xfs_agblock_t +xrep_refc_next_edge( + struct xfbma *rmap_bag, + struct xrep_refc_rmap *next_rrm, + bool next_valid) +{ + struct xrep_refc_rmap rrm; + uint64_t i; + xfs_agblock_t nbno; + + nbno = next_valid ? 
next_rrm->startblock : NULLAGBLOCK; + foreach_xfbma_item(rmap_bag, i, rrm) + nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm)); + return nbno; +} + +/* Iterate all the rmap records to generate reference count data. */ +STATIC int +xrep_refc_find_refcounts( + struct xrep_refc *rr) +{ + struct xrep_refc_rmap rrm; + struct xfs_scrub *sc = rr->sc; + struct xfbma *rmap_bag; + struct xfs_btree_cur *cur; + xfs_agblock_t sbno; + xfs_agblock_t cbno; + xfs_agblock_t nbno; + size_t old_stack_sz; + size_t stack_sz = 0; + bool have; + int have_gt; + int error; + + /* Start the rmapbt cursor to the left of all records. */ + cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, + sc->sa.agno); + error = xfs_rmap_lookup_le(cur, 0, 0, 0, 0, 0, &have_gt); + if (error) + goto out_cur; + ASSERT(have_gt == 0); + + /* Set up some storage */ + rmap_bag = xfbma_init(sizeof(struct xrep_refc_rmap)); + if (IS_ERR(rmap_bag)) { + error = PTR_ERR(rmap_bag); + goto out_cur; + } + + /* Process reverse mappings into refcount data. */ + while (xfs_btree_has_more_records(cur)) { + /* Push all rmaps with pblk == sbno onto the stack */ + error = xrep_refc_next_rrm(cur, rr, &rrm, &have); + if (error) + goto out; + if (!have) + break; + sbno = cbno = rrm.startblock; + while (have && rrm.startblock == sbno) { + error = xfbma_insert_anywhere(rmap_bag, &rrm); + if (error) + goto out; + stack_sz++; + error = xrep_refc_next_rrm(cur, rr, &rrm, &have); + if (error) + goto out; + } + error = xfs_btree_decrement(cur, 0, &have_gt); + if (error) + goto out; + if (XFS_IS_CORRUPT(sc->mp, !have_gt)) { + error = -EFSCORRUPTED; + goto out; + } + + /* Set nbno to the bno of the next refcount change */ + nbno = xrep_refc_next_edge(rmap_bag, &rrm, have); + if (nbno == NULLAGBLOCK) { + error = -EFSCORRUPTED; + goto out; + } + + ASSERT(nbno > sbno); + old_stack_sz = stack_sz; + + /* While stack isn't empty... 
*/ + while (stack_sz) { + uint64_t i; + + /* Pop all rmaps that end at nbno */ + foreach_xfbma_item(rmap_bag, i, rrm) { + if (RRM_NEXT(rrm) != nbno) + continue; + error = xfbma_nullify(rmap_bag, i); + if (error) + goto out; + stack_sz--; + } + + /* Push array items that start at nbno */ + error = xrep_refc_next_rrm(cur, rr, &rrm, &have); + if (error) + goto out; + while (have && rrm.startblock == nbno) { + error = xfbma_insert_anywhere(rmap_bag, + &rrm); + if (error) + goto out; + stack_sz++; + error = xrep_refc_next_rrm(cur, rr, &rrm, + &have); + if (error) + goto out; + } + error = xfs_btree_decrement(cur, 0, &have_gt); + if (error) + goto out; + if (XFS_IS_CORRUPT(sc->mp, !have_gt)) { + error = -EFSCORRUPTED; + goto out; + } + + /* Emit refcount if necessary */ + ASSERT(nbno > cbno); + if (stack_sz != old_stack_sz) { + if (old_stack_sz > 1) { + error = xrep_refc_stash(rr, cbno, + nbno - cbno, + old_stack_sz); + if (error) + goto out; + } + cbno = nbno; + } + + /* Stack empty, go find the next rmap */ + if (stack_sz == 0) + break; + old_stack_sz = stack_sz; + sbno = nbno; + + /* Set nbno to the bno of the next refcount change */ + nbno = xrep_refc_next_edge(rmap_bag, &rrm, have); + if (nbno == NULLAGBLOCK) { + error = -EFSCORRUPTED; + goto out; + } + + ASSERT(nbno > sbno); + } + } + + ASSERT(stack_sz == 0); +out: + xfbma_destroy(rmap_bag); +out_cur: + xfs_btree_del_cursor(cur, error); + return error; +} +#undef RRM_NEXT + +/* Retrieve refcountbt data for bulk load. */ +STATIC int +xrep_refc_get_data( + struct xfs_btree_cur *cur, + void *priv) +{ + struct xrep_refc *rr = priv; + + return xfbma_get_data(rr->refcount_records, &rr->iter, &cur->bc_rec.rc); +} + +/* Feed one of the new btree blocks to the bulk loader. */ +STATIC int +xrep_refc_alloc_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_refc *rr = priv; + + return xrep_newbt_claim_block(cur, &rr->new_btree_info, ptr); +} + +/* Update the AGF counters. 
*/ +STATIC int +xrep_refc_reset_counters( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + struct xfs_buf *bp; + + /* + * Mark the pagf information stale and use the accessor function to + * forcibly reload it from the values we just logged. We still own the + * AGF bp so we can safely ignore bp. + */ + ASSERT(pag->pagf_init); + pag->pagf_init = 0; + + return xfs_alloc_read_agf(sc->mp, sc->tp, sc->sa.agno, 0, &bp); +} + +/* + * Use the collected refcount information to stage a new refcount btree. If + * this is successful we'll return with the new btree root information logged + * to the repair transaction but not yet committed. + */ +STATIC int +xrep_refc_build_new_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_btree_cur *refc_cur; + int error; + + rr->refc_bload.get_data = xrep_refc_get_data; + rr->refc_bload.alloc_block = xrep_refc_alloc_block; + xrep_bload_estimate_slack(sc, &rr->refc_bload); + + /* + * Sort the refcount extents by startblock or else the btree records + * will be in the wrong order. + */ + error = xfbma_sort(rr->refcount_records, xrep_refc_extent_cmp); + if (error) + return error; + + /* + * Prepare to construct the new btree by reserving disk space for the + * new btree and setting up all the accounting information we'll need + * to root the new btree while it's under construction and before we + * attach it to the AG header. + */ + xrep_newbt_init_ag(&rr->new_btree_info, sc, &XFS_RMAP_OINFO_REFC, + XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, + xfs_refc_block(sc->mp)), + XFS_AG_RESV_METADATA); + + /* Compute how many blocks we'll need. */ + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, sc->tp, + &rr->new_btree_info.afake, sc->sa.agno); + error = xfs_btree_bload_compute_geometry(refc_cur, &rr->refc_bload, + xfbma_length(rr->refcount_records)); + if (error) + goto err_cur; + xfs_btree_del_cursor(refc_cur, error); + + /* + * Reserve the space we'll need for the new btree. 
Drop the cursor + * while we do this because that can roll the transaction and cursors + * can't handle that. + */ + error = xrep_newbt_alloc_blocks(&rr->new_btree_info, + rr->refc_bload.nr_blocks); + if (error) + goto err_newbt; + + /* Add all observed refcount records. */ + rr->iter = 0; + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, sc->tp, + &rr->new_btree_info.afake, sc->sa.agno); + error = xfs_btree_bload(refc_cur, &rr->refc_bload, rr); + if (error) + goto err_cur; + + /* + * Install the new btree in the AG header. After this point the old + * btree is no longer accessible and the new tree is live. + * + * Note: We re-read the AGF here to ensure the buffer type is set + * properly. Since we built a new tree without attaching to the AGF + * buffer, the buffer item may have fallen off the buffer. This ought + * to succeed since the AGF is held across transaction rolls. + */ + error = xfs_read_agf(sc->mp, sc->tp, sc->sa.agno, 0, &sc->sa.agf_bp); + if (error) + goto err_cur; + + /* Commit our new btree. */ + xfs_refcountbt_commit_staged_btree(refc_cur, sc->sa.agf_bp); + xfs_btree_del_cursor(refc_cur, 0); + + /* Reset the AGF counters now that we've changed the btree shape. */ + error = xrep_refc_reset_counters(rr); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. */ + xrep_newbt_destroy(&rr->new_btree_info, error); + + return xrep_roll_ag_trans(sc); +err_cur: + xfs_btree_del_cursor(refc_cur, error); +err_newbt: + xrep_newbt_destroy(&rr->new_btree_info, error); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. + */ +STATIC int +xrep_refc_remove_old_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + int error; + + /* Free the old refcountbt blocks if they're not in use. 
*/ + error = xrep_reap_extents(sc, &rr->old_refcountbt_blocks, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); + if (error) + return error; + + sc->flags |= XREP_RESET_PERAG_RESV; + return 0; +} + +/* Rebuild the refcount btree. */ +int +xrep_refcountbt( + struct xfs_scrub *sc) +{ + struct xrep_refc *rr; + struct xfs_mount *mp = sc->mp; + int error; + + /* We require the rmapbt to rebuild anything. */ + if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) + return -EOPNOTSUPP; + + rr = kmem_zalloc(sizeof(struct xrep_refc), KM_NOFS | KM_MAYFAIL); + if (!rr) + return -ENOMEM; + rr->sc = sc; + + xchk_perag_get(sc->mp, &sc->sa); + + /* Set up some storage */ + rr->refcount_records = xfbma_init(sizeof(struct xfs_refcount_irec)); + if (IS_ERR(rr->refcount_records)) { + error = PTR_ERR(rr->refcount_records); + goto out_rr; + } + + /* Collect all reference counts. */ + xbitmap_init(&rr->old_refcountbt_blocks); + error = xrep_refc_find_refcounts(rr); + if (error) + goto out_bitmap; + + /* Rebuild the refcount information. */ + error = xrep_refc_build_new_tree(rr); + if (error) + goto out_bitmap; + + /* Kill the old tree. */ + error = xrep_refc_remove_old_tree(rr); + +out_bitmap: + xbitmap_destroy(&rr->old_refcountbt_blocks); + xfbma_destroy(rr->refcount_records); +out_rr: + kmem_free(rr); + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 8b320e905e00..c0769aaae9a4 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -70,6 +70,7 @@ int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); +int xrep_refcountbt(struct xfs_scrub *sc); struct xrep_newbt_resv { /* Link to list of extents that we've reserved. 
*/ @@ -167,6 +168,7 @@ xrep_reset_perag_resv( #define xrep_agi xrep_notsupported #define xrep_allocbt xrep_notsupported #define xrep_iallocbt xrep_notsupported +#define xrep_refcountbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 5853b826c7f9..7a036a8e4189 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -255,7 +255,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .setup = xchk_setup_ag_refcountbt, .scrub = xchk_refcountbt, .has = xfs_sb_version_hasreflink, - .repair = xrep_notsupported, + .repair = xrep_refcountbt, }, [XFS_SCRUB_TYPE_INODE] = { /* inode record */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 9bf75c97fdd1..ed9484de80fe 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -782,10 +782,11 @@ TRACE_EVENT(xrep_ibt_found, __entry->freemask) ) -TRACE_EVENT(xrep_refcount_extent_fn, +TRACE_EVENT(xrep_refc_found, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - struct xfs_refcount_irec *irec), - TP_ARGS(mp, agno, irec), + xfs_agblock_t startblock, xfs_extlen_t blockcount, + xfs_nlink_t refcount), + TP_ARGS(mp, agno, startblock, blockcount, refcount), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) @@ -796,9 +797,9 @@ TRACE_EVENT(xrep_refcount_extent_fn, TP_fast_assign( __entry->dev = mp->m_super->s_dev; __entry->agno = agno; - __entry->startblock = irec->rc_startblock; - __entry->blockcount = irec->rc_blockcount; - __entry->refcount = irec->rc_refcount; + __entry->startblock = startblock; + __entry->blockcount = blockcount; + __entry->refcount = refcount; ), TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u", MAJOR(__entry->dev), MINOR(__entry->dev), ^ permalink raw reply related [flat|nested] 156+ messages in thread
* [PATCH v21 0/5] xfs: online repair of AG btrees @ 2019-10-29 23:31 Darrick J. Wong 2019-10-29 23:32 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 0 siblings, 1 reply; 156+ messages in thread From: Darrick J. Wong @ 2019-10-29 23:31 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs Hi all, This is the first part of the twenty-first revision of a patchset that adds to XFS kernel support for online metadata scrubbing and repair. There aren't any on-disk format changes. New for this version is a rebase against 5.4-rc5, bulk loading of btrees, integration with the health reporting subsystem, and the explicit revalidation of all metadata structures that were rebuilt. First, create a new data structure that provides an abstraction of a big memory array by using linked lists. This is where we store records for btree reconstruction. This first implementation is memory inefficient and consumes a /lot/ of kernel memory, but lays the groundwork for a later patch in the set to convert the implementation to use a (memfd) swap file, which enables us to use pageable memory without pounding the slab cache. The three patches after that implement reconstruction of the free space btrees, inode btrees, and reference count btree. The reverse mapping btree requires considerably more thought and will be covered later. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees ^ permalink raw reply [flat|nested] 156+ messages in thread
* [PATCH 5/5] xfs: repair refcount btrees 2019-10-29 23:31 [PATCH v21 0/5] xfs: online repair of AG btrees Darrick J. Wong @ 2019-10-29 23:32 ` Darrick J. Wong 0 siblings, 0 replies; 156+ messages in thread From: Darrick J. Wong @ 2019-10-29 23:32 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs From: Darrick J. Wong <darrick.wong@oracle.com> Reconstruct the refcount data from the rmap btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/Makefile | 1 fs/xfs/scrub/refcount_repair.c | 604 ++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 13 - 5 files changed, 615 insertions(+), 7 deletions(-) create mode 100644 fs/xfs/scrub/refcount_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 6ca9755f031e..5ece4a554c9f 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -164,6 +164,7 @@ xfs-y += $(addprefix scrub/, \ array.o \ bitmap.o \ ialloc_repair.o \ + refcount_repair.o \ repair.o \ xfile.o \ ) diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c new file mode 100644 index 000000000000..a95ee21251d3 --- /dev/null +++ b/fs/xfs/scrub/refcount_repair.c @@ -0,0 +1,604 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2019 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <darrick.wong@oracle.com> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_alloc.h" +#include "xfs_ialloc.h" +#include "xfs_rmap.h" +#include "xfs_rmap_btree.h" +#include "xfs_refcount.h" +#include "xfs_refcount_btree.h" +#include "xfs_error.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/btree.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/bitmap.h" +#include "scrub/array.h" + +/* + * Rebuilding the Reference Count Btree + * ==================================== + * + * This algorithm is "borrowed" from xfs_repair. Imagine the rmap + * entries as rectangles representing extents of physical blocks, and + * that the rectangles can be laid down to allow them to overlap each + * other; then we know that we must emit a refcnt btree entry wherever + * the amount of overlap changes, i.e. the emission stimulus is + * level-triggered: + * + * - --- + * -- ----- ---- --- ------ + * -- ---- ----------- ---- --------- + * -------------------------------- ----------- + * ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + * 2 1 23 21 3 43 234 2123 1 01 2 3 0 + * + * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner). + * + * Note that in the actual refcnt btree we don't store the refcount < 2 + * cases because the bnobt tells us which blocks are free; single-use + * blocks aren't recorded in the bnobt or the refcntbt. If the rmapbt + * supports storing multiple entries covering a given block we could + * theoretically dispense with the refcntbt and simply count rmaps, but + * that's inefficient in the (hot) write path, so we'll take the cost of + * the extra tree to save time. Also there's no guarantee that rmap + * will be enabled. 
+ * + * Given an array of rmaps sorted by physical block number, a starting + * physical block (sp), a bag to hold rmaps that cover sp, and the next + * physical block where the level changes (np), we can reconstruct the + * refcount btree as follows: + * + * While there are still unprocessed rmaps in the array, + * - Set sp to the physical block (pblk) of the next unprocessed rmap. + * - Add to the bag all rmaps in the array where startblock == sp. + * - Set np to the physical block where the bag size will change. This + * is the minimum of (the pblk of the next unprocessed rmap) and + * (startblock + len of each rmap in the bag). + * - Record the bag size as old_bag_size. + * + * - While the bag isn't empty, + * - Remove from the bag all rmaps where startblock + len == np. + * - Add to the bag all rmaps in the array where startblock == np. + * - If the bag size isn't old_bag_size, store the refcount entry + * (sp, np - sp, bag_size) in the refcnt btree. + * - If the bag is empty, break out of the inner loop. + * - Set old_bag_size to the bag size + * - Set sp = np. + * - Set np to the physical block where the bag size will change. + * This is the minimum of (the pblk of the next unprocessed rmap) + * and (startblock + len of each rmap in the bag). + * + * Like all the other repairers, we make a list of all the refcount + * records we need, then reinitialize the refcount btree root and + * insert all the records. + */ + +/* The only parts of the rmap that we care about for computing refcounts. 
*/ +struct xrep_refc_rmap { + xfs_agblock_t startblock; + xfs_extlen_t blockcount; +} __packed; + +struct xrep_refc { + /* refcount extents */ + struct xfbma *refcount_records; + + /* new refcountbt information */ + struct xrep_newbt new_btree_info; + struct xfs_btree_bload refc_bload; + + /* old refcountbt blocks */ + struct xbitmap old_refcountbt_blocks; + + struct xfs_scrub *sc; + + /* # of refcountbt blocks */ + xfs_extlen_t btblocks; + + /* get_data()'s position in the free space record array. */ + uint64_t iter; +}; + +/* Record a reference count extent. */ +STATIC int +xrep_refc_remember( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len, + xfs_nlink_t refcount) +{ + struct xfs_refcount_irec irec = { + .rc_startblock = agbno, + .rc_blockcount = len, + .rc_refcount = refcount, + }; + + trace_xrep_refc_found(rr->sc->mp, rr->sc->sa.agno, agbno, len, + refcount); + + return xfbma_append(rr->refcount_records, &irec); +} + +/* Record a CoW staging extent. */ +STATIC int +xrep_refc_remember_cow( + struct xrep_refc *rr, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + return xrep_refc_remember(rr, agbno + XFS_REFC_COW_START, len, 1); +} + +/* Grab the next (abbreviated) rmap record from the rmapbt. */ +STATIC int +xrep_refc_next_rrm( + struct xfs_btree_cur *cur, + struct xrep_refc *rr, + struct xrep_refc_rmap *rrm, + bool *have_rec) +{ + struct xfs_rmap_irec rmap; + struct xfs_mount *mp = cur->bc_mp; + xfs_fsblock_t fsbno; + int have_gt; + int error = 0; + + *have_rec = false; + /* + * Loop through the remaining rmaps. Remember CoW staging + * extents and the refcountbt blocks from the old tree for later + * disposal. We can only share written data fork extents, so + * keep looping until we find an rmap for one. 
+ */ + do { + if (xchk_should_terminate(rr->sc, &error)) + goto out_error; + + error = xfs_btree_increment(cur, 0, &have_gt); + if (error) + goto out_error; + if (!have_gt) + return 0; + + error = xfs_rmap_get_rec(cur, &rmap, &have_gt); + if (error) + goto out_error; + XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error); + + if (rmap.rm_owner == XFS_RMAP_OWN_COW) { + error = xrep_refc_remember_cow(rr, rmap.rm_startblock, + rmap.rm_blockcount); + if (error) + goto out_error; + } else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) { + /* refcountbt block, dump it when we're done. */ + rr->btblocks += rmap.rm_blockcount; + fsbno = XFS_AGB_TO_FSB(cur->bc_mp, + cur->bc_private.a.agno, + rmap.rm_startblock); + error = xbitmap_set(&rr->old_refcountbt_blocks, + fsbno, rmap.rm_blockcount); + if (error) + goto out_error; + } + } while (XFS_RMAP_NON_INODE_OWNER(rmap.rm_owner) || + xfs_internal_inum(mp, rmap.rm_owner) || + (rmap.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK | + XFS_RMAP_UNWRITTEN))); + + rrm->startblock = rmap.rm_startblock; + rrm->blockcount = rmap.rm_blockcount; + *have_rec = true; + return 0; + +out_error: + return error; +} + +/* Compare two btree extents. */ +static int +xrep_refc_extent_cmp( + const void *a, + const void *b) +{ + const struct xfs_refcount_irec *ap = a; + const struct xfs_refcount_irec *bp = b; + + if (ap->rc_startblock > bp->rc_startblock) + return 1; + else if (ap->rc_startblock < bp->rc_startblock) + return -1; + return 0; +} + +#define RRM_NEXT(r) ((r).startblock + (r).blockcount) +/* + * Find the next block where the refcount changes, given the next rmap we + * looked at and the ones we're already tracking. + */ +static inline xfs_agblock_t +xrep_refc_next_edge( + struct xfbma *rmap_bag, + struct xrep_refc_rmap *next_rrm, + bool next_valid) +{ + struct xrep_refc_rmap rrm; + uint64_t i; + xfs_agblock_t nbno; + + nbno = next_valid ? 
next_rrm->startblock : NULLAGBLOCK; + foreach_xfbma_item(rmap_bag, i, rrm) + nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm)); + return nbno; +} + +/* Iterate all the rmap records to generate reference count data. */ +STATIC int +xrep_refc_find_refcounts( + struct xrep_refc *rr) +{ + struct xrep_refc_rmap rrm; + struct xfs_scrub *sc = rr->sc; + struct xfbma *rmap_bag; + struct xfs_btree_cur *cur; + xfs_agblock_t sbno; + xfs_agblock_t cbno; + xfs_agblock_t nbno; + size_t old_stack_sz; + size_t stack_sz = 0; + bool have; + int have_gt; + int error; + + /* Start the rmapbt cursor to the left of all records. */ + cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, + sc->sa.agno); + error = xfs_rmap_lookup_le(cur, 0, 0, 0, 0, 0, &have_gt); + if (error) + goto out_cur; + ASSERT(have_gt == 0); + + /* Set up some storage */ + rmap_bag = xfbma_init(sizeof(struct xrep_refc_rmap)); + if (IS_ERR(rmap_bag)) { + error = PTR_ERR(rmap_bag); + goto out_cur; + } + + /* Process reverse mappings into refcount data. */ + while (xfs_btree_has_more_records(cur)) { + /* Push all rmaps with pblk == sbno onto the stack */ + error = xrep_refc_next_rrm(cur, rr, &rrm, &have); + if (error) + goto out; + if (!have) + break; + sbno = cbno = rrm.startblock; + while (have && rrm.startblock == sbno) { + error = xfbma_insert_anywhere(rmap_bag, &rrm); + if (error) + goto out; + stack_sz++; + error = xrep_refc_next_rrm(cur, rr, &rrm, &have); + if (error) + goto out; + } + error = xfs_btree_decrement(cur, 0, &have_gt); + if (error) + goto out; + XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out); + + /* Set nbno to the bno of the next refcount change */ + nbno = xrep_refc_next_edge(rmap_bag, &rrm, have); + if (nbno == NULLAGBLOCK) { + error = -EFSCORRUPTED; + goto out; + } + + ASSERT(nbno > sbno); + old_stack_sz = stack_sz; + + /* While stack isn't empty... 
*/ + while (stack_sz) { + uint64_t i; + + /* Pop all rmaps that end at nbno */ + foreach_xfbma_item(rmap_bag, i, rrm) { + if (RRM_NEXT(rrm) != nbno) + continue; + error = xfbma_nullify(rmap_bag, i); + if (error) + goto out; + stack_sz--; + } + + /* Push array items that start at nbno */ + error = xrep_refc_next_rrm(cur, rr, &rrm, &have); + if (error) + goto out; + while (have && rrm.startblock == nbno) { + error = xfbma_insert_anywhere(rmap_bag, + &rrm); + if (error) + goto out; + stack_sz++; + error = xrep_refc_next_rrm(cur, rr, &rrm, + &have); + if (error) + goto out; + } + error = xfs_btree_decrement(cur, 0, &have_gt); + if (error) + goto out; + XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out); + + /* Emit refcount if necessary */ + ASSERT(nbno > cbno); + if (stack_sz != old_stack_sz) { + if (old_stack_sz > 1) { + error = xrep_refc_remember(rr, cbno, + nbno - cbno, + old_stack_sz); + if (error) + goto out; + } + cbno = nbno; + } + + /* Stack empty, go find the next rmap */ + if (stack_sz == 0) + break; + old_stack_sz = stack_sz; + sbno = nbno; + + /* Set nbno to the bno of the next refcount change */ + nbno = xrep_refc_next_edge(rmap_bag, &rrm, have); + if (nbno == NULLAGBLOCK) { + error = -EFSCORRUPTED; + goto out; + } + + ASSERT(nbno > sbno); + } + } + + ASSERT(stack_sz == 0); +out: + xfbma_destroy(rmap_bag); +out_cur: + xfs_btree_del_cursor(cur, error); + return error; +} +#undef RRM_NEXT + +/* Retrieve refcountbt data for bulk load. */ +STATIC int +xrep_refc_get_data( + struct xfs_btree_cur *cur, + void *priv) +{ + struct xfs_refcount_irec *refc = &cur->bc_rec.rc; + struct xrep_refc *rr = priv; + int error; + + do { + error = xfbma_get(rr->refcount_records, rr->iter++, refc); + } while (error == 0 && xfbma_is_null(rr->refcount_records, refc)); + + return error; +} + +/* Feed one of the new btree blocks to the bulk loader. 
*/ +STATIC int +xrep_refc_alloc_block( + struct xfs_btree_cur *cur, + union xfs_btree_ptr *ptr, + void *priv) +{ + struct xrep_refc *rr = priv; + + return xrep_newbt_claim_block(cur, &rr->new_btree_info, ptr); +} + +/* Update the AGF counters. */ +STATIC int +xrep_refc_reset_counters( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_perag *pag = sc->sa.pag; + struct xfs_buf *bp; + + /* + * Mark the pagf information stale and use the accessor function to + * forcibly reload it from the values we just logged. We still own the + * AGF bp so we can safely ignore bp. + */ + ASSERT(pag->pagf_init); + pag->pagf_init = 0; + + return xfs_alloc_read_agf(sc->mp, sc->tp, sc->sa.agno, 0, &bp); +} + +/* + * Use the collected refcount information to stage a new refcount btree. If + * this is successful we'll return with the new btree root information logged + * to the repair transaction but not yet committed. + */ +STATIC int +xrep_refc_build_new_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + struct xfs_btree_cur *refc_cur; + int error; + + rr->refc_bload.get_data = xrep_refc_get_data; + rr->refc_bload.alloc_block = xrep_refc_alloc_block; + xrep_bload_estimate_slack(sc, &rr->refc_bload); + + /* + * Sort the refcount extents by startblock or else the btree records + * will be in the wrong order. + */ + error = xfbma_sort(rr->refcount_records, xrep_refc_extent_cmp); + if (error) + return error; + + /* + * Prepare to construct the new btree by reserving disk space for the + * new btree and setting up all the accounting information we'll need + * to root the new btree while it's under construction and before we + * attach it to the AG header. + */ + xrep_newbt_init_ag(&rr->new_btree_info, sc, &XFS_RMAP_OINFO_REFC, + XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, + xfs_refc_block(sc->mp)), + XFS_AG_RESV_METADATA); + + /* Compute how many blocks we'll need. 
*/ + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, sc->tp, + &rr->new_btree_info.afake, sc->sa.agno); + error = xfs_btree_bload_compute_geometry(refc_cur, &rr->refc_bload, + xfbma_length(rr->refcount_records)); + if (error) + goto err_cur; + xfs_btree_del_cursor(refc_cur, error); + + /* + * Reserve the space we'll need for the new btree. Drop the cursor + * while we do this because that can roll the transaction and cursors + * can't handle that. + */ + error = xrep_newbt_alloc_blocks(&rr->new_btree_info, + rr->refc_bload.nr_blocks); + if (error) + goto err_newbt; + + /* Add all observed refcount records. */ + rr->iter = 0; + refc_cur = xfs_refcountbt_stage_cursor(sc->mp, sc->tp, + &rr->new_btree_info.afake, sc->sa.agno); + error = xfs_btree_bload(refc_cur, &rr->refc_bload, rr); + if (error) + goto err_cur; + + /* + * Install the new btree in the AG header. After this point the old + * btree is no longer accessible and the new tree is live. + * + * Note: We re-read the AGF here to ensure the buffer type is set + * properly. Since we built a new tree without attaching to the AGF + * buffer, the buffer item may have fallen off the buffer. This ought + * to succeed since the AGF is held across transaction rolls. + */ + error = xfs_read_agf(sc->mp, sc->tp, sc->sa.agno, 0, &sc->sa.agf_bp); + if (error) + goto err_cur; + + /* Commit our new btree. */ + xfs_refcountbt_commit_staged_btree(refc_cur, sc->sa.agf_bp); + xfs_btree_del_cursor(refc_cur, 0); + + /* Reset the AGF counters now that we've changed the btree shape. */ + error = xrep_refc_reset_counters(rr); + if (error) + goto err_newbt; + + /* Dispose of any unused blocks and the accounting information. 
*/ + xrep_newbt_destroy(&rr->new_btree_info, error); + + return xrep_roll_ag_trans(sc); +err_cur: + xfs_btree_del_cursor(refc_cur, error); +err_newbt: + xrep_newbt_destroy(&rr->new_btree_info, error); + return error; +} + +/* + * Now that we've logged the roots of the new btrees, invalidate all of the + * old blocks and free them. + */ +STATIC int +xrep_refc_remove_old_tree( + struct xrep_refc *rr) +{ + struct xfs_scrub *sc = rr->sc; + int error; + + /* Free the old refcountbt blocks if they're not in use. */ + error = xrep_reap_extents(sc, &rr->old_refcountbt_blocks, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); + if (error) + return error; + + sc->flags |= XREP_RESET_PERAG_RESV; + return 0; +} + +/* Rebuild the refcount btree. */ +int +xrep_refcountbt( + struct xfs_scrub *sc) +{ + struct xrep_refc *rr; + struct xfs_mount *mp = sc->mp; + int error; + + /* We require the rmapbt to rebuild anything. */ + if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) + return -EOPNOTSUPP; + + rr = kmem_zalloc(sizeof(struct xrep_refc), KM_NOFS | KM_MAYFAIL); + if (!rr) + return -ENOMEM; + rr->sc = sc; + + xchk_perag_get(sc->mp, &sc->sa); + + /* Set up some storage */ + rr->refcount_records = xfbma_init(sizeof(struct xfs_refcount_irec)); + if (IS_ERR(rr->refcount_records)) { + error = PTR_ERR(rr->refcount_records); + goto out_rr; + } + + /* Collect all reference counts. */ + xbitmap_init(&rr->old_refcountbt_blocks); + error = xrep_refc_find_refcounts(rr); + if (error) + goto out_bitmap; + + /* Rebuild the refcount information. */ + error = xrep_refc_build_new_tree(rr); + if (error) + goto out_bitmap; + + /* Kill the old tree. 
*/ + error = xrep_refc_remove_old_tree(rr); + +out_bitmap: + xbitmap_destroy(&rr->old_refcountbt_blocks); + xfbma_destroy(rr->refcount_records); +out_rr: + kmem_free(rr); + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 8b320e905e00..c0769aaae9a4 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -70,6 +70,7 @@ int xrep_agfl(struct xfs_scrub *sc); int xrep_agi(struct xfs_scrub *sc); int xrep_allocbt(struct xfs_scrub *sc); int xrep_iallocbt(struct xfs_scrub *sc); +int xrep_refcountbt(struct xfs_scrub *sc); struct xrep_newbt_resv { /* Link to list of extents that we've reserved. */ @@ -167,6 +168,7 @@ xrep_reset_perag_resv( #define xrep_agi xrep_notsupported #define xrep_allocbt xrep_notsupported #define xrep_iallocbt xrep_notsupported +#define xrep_refcountbt xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 6011823d0d40..b104231af049 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -254,7 +254,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .setup = xchk_setup_ag_refcountbt, .scrub = xchk_refcountbt, .has = xfs_sb_version_hasreflink, - .repair = xrep_notsupported, + .repair = xrep_refcountbt, }, [XFS_SCRUB_TYPE_INODE] = { /* inode record */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 9bf75c97fdd1..ed9484de80fe 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -782,10 +782,11 @@ TRACE_EVENT(xrep_ibt_found, __entry->freemask) ) -TRACE_EVENT(xrep_refcount_extent_fn, +TRACE_EVENT(xrep_refc_found, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, - struct xfs_refcount_irec *irec), - TP_ARGS(mp, agno, irec), + xfs_agblock_t startblock, xfs_extlen_t blockcount, + xfs_nlink_t refcount), + TP_ARGS(mp, agno, startblock, blockcount, refcount), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_agnumber_t, agno) @@ -796,9 +797,9 @@ TRACE_EVENT(xrep_refcount_extent_fn, 
TP_fast_assign( __entry->dev = mp->m_super->s_dev; __entry->agno = agno; - __entry->startblock = irec->rc_startblock; - __entry->blockcount = irec->rc_blockcount; - __entry->refcount = irec->rc_refcount; + __entry->startblock = startblock; + __entry->blockcount = blockcount; + __entry->refcount = refcount; ), TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u", MAJOR(__entry->dev), MINOR(__entry->dev), ^ permalink raw reply related [flat|nested] 156+ messages in thread
end of thread, other threads:[~2023-12-06 5:16 UTC | newest] Thread overview: 156+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-11-24 23:39 [MEGAPATCHSET v28] xfs: online repair, second part of part 1 Darrick J. Wong 2023-11-24 23:44 ` [PATCHSET v28.0 0/1] xfs: prevent livelocks in xchk_iget Darrick J. Wong 2023-11-24 23:46 ` [PATCH 1/1] xfs: make xchk_iget safer in the presence of corrupt inode btrees Darrick J. Wong 2023-11-25 4:57 ` Christoph Hellwig 2023-11-27 21:55 ` Darrick J. Wong 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: reserve disk space for online repairs Darrick J. Wong 2023-11-24 23:47 ` [PATCH 1/7] xfs: don't append work items to logged xfs_defer_pending objects Darrick J. Wong 2023-11-25 5:04 ` Christoph Hellwig 2023-11-24 23:47 ` [PATCH 2/7] xfs: allow pausing of pending deferred work items Darrick J. Wong 2023-11-25 5:05 ` Christoph Hellwig 2023-11-24 23:47 ` [PATCH 3/7] xfs: remove __xfs_free_extent_later Darrick J. Wong 2023-11-25 5:05 ` Christoph Hellwig 2023-11-24 23:47 ` [PATCH 4/7] xfs: automatic freeing of freshly allocated unwritten space Darrick J. Wong 2023-11-25 5:06 ` Christoph Hellwig 2023-11-24 23:48 ` [PATCH 5/7] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong 2023-11-26 13:14 ` Christoph Hellwig 2023-11-27 22:34 ` Darrick J. Wong 2023-11-28 5:41 ` Christoph Hellwig 2023-11-28 17:02 ` Darrick J. Wong 2023-11-24 23:48 ` [PATCH 6/7] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong 2023-11-26 13:15 ` Christoph Hellwig 2023-11-24 23:48 ` [PATCH 7/7] xfs: force small EFIs for reaping btree extents Darrick J. Wong 2023-11-25 5:13 ` Christoph Hellwig 2023-11-27 22:46 ` Darrick J. Wong 2023-11-24 23:45 ` [PATCHSET v28.0 0/4] xfs: prepare repair for bulk loading Darrick J. Wong 2023-11-24 23:48 ` [PATCH 1/4] xfs: force all buffers to be written during btree bulk load Darrick J. 
Wong 2023-11-25 5:49 ` Christoph Hellwig 2023-11-28 1:50 ` Darrick J. Wong 2023-11-28 7:13 ` Christoph Hellwig 2023-11-28 15:18 ` Christoph Hellwig 2023-11-28 17:07 ` Darrick J. Wong 2023-11-30 4:33 ` Christoph Hellwig 2023-11-24 23:49 ` [PATCH 2/4] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong 2023-11-25 5:50 ` Christoph Hellwig 2023-11-28 1:44 ` Darrick J. Wong 2023-11-28 5:42 ` Christoph Hellwig 2023-11-28 17:07 ` Darrick J. Wong 2023-11-24 23:49 ` [PATCH 3/4] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong 2023-11-25 5:51 ` Christoph Hellwig 2023-11-24 23:49 ` [PATCH 4/4] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong 2023-11-25 5:53 ` Christoph Hellwig 2023-11-27 22:56 ` Darrick J. Wong 2023-11-28 20:11 ` Darrick J. Wong 2023-11-29 5:50 ` Christoph Hellwig 2023-11-29 5:57 ` Darrick J. Wong 2023-11-24 23:45 ` [PATCHSET v28.0 0/5] xfs: online repair of AG btrees Darrick J. Wong 2023-11-24 23:50 ` [PATCH 1/5] xfs: create separate structures and code for u32 bitmaps Darrick J. Wong 2023-11-25 5:57 ` Christoph Hellwig 2023-11-28 1:34 ` Darrick J. Wong 2023-11-28 5:43 ` Christoph Hellwig 2023-11-24 23:50 ` [PATCH 2/5] xfs: roll the scrub transaction after completing a repair Darrick J. Wong 2023-11-25 6:05 ` Christoph Hellwig 2023-11-28 1:29 ` Darrick J. Wong 2023-11-24 23:50 ` [PATCH 3/5] xfs: repair free space btrees Darrick J. Wong 2023-11-25 6:11 ` Christoph Hellwig 2023-11-28 1:05 ` Darrick J. Wong 2023-11-28 15:10 ` Christoph Hellwig 2023-11-28 21:13 ` Darrick J. Wong 2023-11-29 5:56 ` Christoph Hellwig 2023-11-29 6:18 ` Darrick J. Wong 2023-11-29 6:24 ` Christoph Hellwig 2023-11-29 6:26 ` Darrick J. Wong 2023-11-24 23:50 ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong 2023-11-25 6:12 ` Christoph Hellwig 2023-11-28 1:09 ` Darrick J. Wong 2023-11-28 15:57 ` Christoph Hellwig 2023-11-28 21:37 ` Darrick J. 
Wong 2023-11-24 23:51 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 2023-11-28 16:07 ` Christoph Hellwig 2023-11-24 23:45 ` [PATCHSET v28.0 0/7] xfs: online repair of inodes and forks Darrick J. Wong 2023-11-24 23:51 ` [PATCH 1/7] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong 2023-11-25 6:13 ` Christoph Hellwig 2023-11-24 23:51 ` [PATCH 2/7] xfs: try to attach dquots to files before repairing them Darrick J. Wong 2023-11-25 6:14 ` Christoph Hellwig 2023-11-24 23:51 ` [PATCH 3/7] xfs: repair inode records Darrick J. Wong 2023-11-28 17:08 ` Christoph Hellwig 2023-11-28 23:08 ` Darrick J. Wong 2023-11-29 6:02 ` Christoph Hellwig 2023-12-05 23:08 ` Darrick J. Wong 2023-12-06 5:16 ` Christoph Hellwig 2023-11-24 23:52 ` [PATCH 4/7] xfs: zap broken inode forks Darrick J. Wong 2023-11-30 4:44 ` Christoph Hellwig 2023-11-30 21:08 ` Darrick J. Wong 2023-12-04 4:39 ` Christoph Hellwig 2023-12-04 20:43 ` Darrick J. Wong 2023-12-05 4:28 ` Christoph Hellwig 2023-11-24 23:52 ` [PATCH 5/7] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong 2023-11-30 4:47 ` Christoph Hellwig 2023-11-30 21:37 ` Darrick J. Wong 2023-12-04 4:41 ` Christoph Hellwig 2023-12-04 20:44 ` Darrick J. Wong 2023-11-24 23:52 ` [PATCH 6/7] xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped Darrick J. Wong 2023-11-24 23:52 ` [PATCH 7/7] xfs: repair obviously broken inode modes Darrick J. Wong 2023-11-30 4:49 ` Christoph Hellwig 2023-11-30 21:18 ` Darrick J. Wong 2023-12-04 4:42 ` Christoph Hellwig 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong 2023-11-24 23:53 ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong 2023-11-30 4:53 ` Christoph Hellwig 2023-11-30 21:48 ` Darrick J. 
Wong 2023-12-04 4:42 ` Christoph Hellwig 2023-11-24 23:53 ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong 2023-11-30 5:07 ` Christoph Hellwig 2023-12-01 1:38 ` Darrick J. Wong 2023-11-24 23:53 ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong 2023-11-28 14:20 ` Christoph Hellwig 2023-11-29 5:42 ` Darrick J. Wong 2023-11-29 6:03 ` Christoph Hellwig 2023-11-24 23:53 ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong 2023-11-28 13:59 ` Christoph Hellwig 2023-11-24 23:54 ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong 2023-11-30 5:10 ` Christoph Hellwig 2023-11-24 23:46 ` [PATCHSET v28.0 0/6] xfs: online repair of rt bitmap file Darrick J. Wong 2023-11-24 23:54 ` [PATCH 1/6] xfs: check rt bitmap file geometry more thoroughly Darrick J. Wong 2023-11-28 14:04 ` Christoph Hellwig 2023-11-28 23:27 ` Darrick J. Wong 2023-11-29 6:05 ` Christoph Hellwig 2023-11-29 6:20 ` Darrick J. Wong 2023-11-24 23:54 ` [PATCH 2/6] xfs: check rt summary " Darrick J. Wong 2023-11-28 14:05 ` Christoph Hellwig 2023-11-28 23:30 ` Darrick J. Wong 2023-11-29 1:23 ` Darrick J. Wong 2023-11-29 6:05 ` Christoph Hellwig 2023-11-29 6:21 ` Darrick J. Wong 2023-11-29 6:23 ` Christoph Hellwig 2023-11-30 0:10 ` Darrick J. Wong 2023-11-24 23:54 ` [PATCH 3/6] xfs: always check the rtbitmap and rtsummary files Darrick J. Wong 2023-11-28 14:06 ` Christoph Hellwig 2023-11-24 23:55 ` [PATCH 4/6] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong 2023-11-30 5:12 ` Christoph Hellwig 2023-11-24 23:55 ` [PATCH 5/6] xfs: create a new inode fork block unmap helper Darrick J. Wong 2023-11-25 6:17 ` Christoph Hellwig 2023-11-24 23:55 ` [PATCH 6/6] xfs: online repair of realtime bitmaps Darrick J. Wong 2023-11-30 5:16 ` Christoph Hellwig 2023-11-24 23:46 ` [PATCHSET v28.0 0/5] xfs: online repair of quota and rt metadata files Darrick J. 
Wong 2023-11-24 23:56 ` [PATCH 1/5] xfs: check the ondisk space mapping behind a dquot Darrick J. Wong 2023-11-30 5:17 ` Christoph Hellwig 2023-11-24 23:56 ` [PATCH 2/5] xfs: check dquot resource timers Darrick J. Wong 2023-11-30 5:17 ` Christoph Hellwig 2023-11-24 23:56 ` [PATCH 3/5] xfs: pull xfs_qm_dqiterate back into scrub Darrick J. Wong 2023-11-30 5:22 ` Christoph Hellwig 2023-11-24 23:56 ` [PATCH 4/5] xfs: improve dquot iteration for scrub Darrick J. Wong 2023-11-30 5:25 ` Christoph Hellwig 2023-11-24 23:57 ` [PATCH 5/5] xfs: repair quotas Darrick J. Wong 2023-11-30 5:33 ` Christoph Hellwig 2023-11-30 22:10 ` Darrick J. Wong 2023-12-04 4:48 ` Christoph Hellwig 2023-12-04 20:52 ` Darrick J. Wong 2023-12-05 4:27 ` Christoph Hellwig 2023-12-05 5:20 ` Darrick J. Wong 2023-12-04 4:49 ` Christoph Hellwig -- strict thread matches above, loose matches on Subject: below -- 2023-07-27 22:20 [PATCHSET v26.0 0/5] xfs: online repair of AG btrees Darrick J. Wong 2023-07-27 22:31 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 2023-05-26 0:29 [PATCHSET v25.0 0/5] xfs: online repair of AG btrees Darrick J. Wong 2023-05-26 0:52 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 2022-12-30 22:12 [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong 2022-12-30 22:12 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 2020-01-01 1:02 [PATCH v22 0/5] xfs: online repair of AG btrees Darrick J. Wong 2020-01-01 1:03 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong 2019-10-29 23:31 [PATCH v21 0/5] xfs: online repair of AG btrees Darrick J. Wong 2019-10-29 23:32 ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong