* [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand @ 2018-03-07 19:24 Brian Foster 2018-03-08 14:03 ` Carlos Maiolino 0 siblings, 1 reply; 19+ messages in thread From: Brian Foster @ 2018-03-07 19:24 UTC (permalink / raw) To: linux-xfs Disliked-by: Brian Foster <bfoster@redhat.com> --- Sent as RFC for the time being. This tests Ok on a straight xfstests run and also seems to pass Darrick's agfl fixup tester (thanks) both on upstream and on a rhel7 kernel with some minor supporting hacks. I tried to tighten up the logic a bit to reduce the odds of mistaking actual corruption for a padding mismatch as much as possible. E.g., limit to cases where the agfl is wrapped, make sure we don't mistake a corruption that looks like an agfl with 120 entries on a packed kernel, etc. While I do prefer an on-demand fixup approach to a mount time scan, ISTM that in either case it's impossible to completely eliminate the risk of confusing corruption with a padding mismatch so long as we're doing a manual agfl fixup. The more I think about that the more I really dislike doing this. :( After some IRC discussion with djwong and sandeen, I'm wondering if the existence of 'xfs_repair -d' is a good enough last resort for those users who might be bit by unexpected padding issues on a typical upgrade. If so, we could fall back to a safer mount-time detection model that enforces a read-only mount and let the user run repair. The supposition is that those who aren't prepared to repair via a ramdisk or whatever should be able to 'xfs_repair -d' a rootfs that is mounted read-only provided agfl padding is the only inconsistency. Eric points out that we can still write an unmount record for a read-only mount, but I'm not sure that would be a problem if repair only needs to fix the agfl. xfs_repair shouldn't touch the log unless there's a recovery issue or it needs to be reformatted to update the LSN, both of which seem to qualify as "you have more problems than agfl padding and need to run repair anyways" to me. Thoughts? Brian fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_trace.h | 18 ++++++ 3 files changed, 164 insertions(+), 2 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index c02781a4c091..31330996e31c 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( } /* + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to + * padding is only significant if the agfl wraps around the end or refers to an + * invalid first/last value. + */ +static int +xfs_agfl_ondisk_size( + struct xfs_mount *mp, + int first, + int last, + int count) +{ + int active = count; + int agfl_size = XFS_AGFL_SIZE(mp); + bool wrapped = (first > last) ? true : false; + + if (count && last >= first) + active = last - first + 1; + else if (count) + active = agfl_size - first + last + 1; + + if (wrapped && active == count + 1) + agfl_size--; + else if ((wrapped && active == count - 1) || + first == agfl_size || last == agfl_size) + agfl_size++; + + /* + * We can't discern the packing problem from certain forms of corruption + * that may look exactly the same. To minimize the chance of mistaking + * corruption for a size mismatch, clamp the size to known valid values. + * A packed header agfl has 119 entries and the older unpacked format + * has one less. + */ + if (agfl_size < 118 || agfl_size > 119) + agfl_size = XFS_AGFL_SIZE(mp); + + return agfl_size; +} + +static bool +xfs_agfl_need_padfix( + struct xfs_mount *mp, + struct xfs_agf *agf) +{ + int f = be32_to_cpu(agf->agf_flfirst); + int l = be32_to_cpu(agf->agf_fllast); + int c = be32_to_cpu(agf->agf_flcount); + + if (!xfs_sb_version_hascrc(&mp->m_sb)) + return false; + + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); +} + +static int +xfs_agfl_check_padfix( + struct xfs_trans *tp, + struct xfs_buf *agbp, + struct xfs_buf *agflbp, + struct xfs_perag *pag) +{ + struct xfs_mount *mp = tp->t_mountp; + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); + int agfl_size = XFS_AGFL_SIZE(mp); + int ofirst, olast, osize; + int nfirst, nlast; + int logflags = 0; + int startoff = 0; + + if (!pag->pagf_needpadfix) + return 0; + + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); + olast = nlast = be32_to_cpu(agf->agf_fllast); + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); + + /* + * If the on-disk agfl is smaller than what the kernel expects, the + * last slot of the on-disk agfl is a gap with bogus data. Move the + * first valid block into the gap and bump the pointer. + */ + if (osize < agfl_size) { + ASSERT(pag->pagf_flcount != 0); + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; + nfirst++; + goto done; + } + + /* + * Otherwise, the on-disk agfl is larger than what the current kernel + * expects. If empty, just fix up the first and last pointers. If not, + * move the inaccessible block to the end of the valid range. + */ + nfirst = do_mod(nfirst, agfl_size); + if (pag->pagf_flcount == 0) { + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); + goto done; + } + if (nlast != agfl_size) + nlast++; + nlast = do_mod(nlast, agfl_size); + agfl_bno[nlast] = agfl_bno[osize - 1]; + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; + +done: + if (nfirst != ofirst) { + agf->agf_flfirst = cpu_to_be32(nfirst); + logflags |= XFS_AGF_FLFIRST; + } + if (nlast != olast) { + agf->agf_fllast = cpu_to_be32(nlast); + logflags |= XFS_AGF_FLLAST; + } + if (startoff) { + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); + xfs_trans_log_buf(tp, agflbp, startoff, + startoff + sizeof(xfs_agblock_t) - 1); + } + if (logflags) + xfs_alloc_log_agf(tp, agbp, logflags); + + trace_xfs_agfl_padfix(mp, osize, agfl_size); + pag->pagf_needpadfix = false; + + return 0; +} + +/* * Decide whether to use this allocation group for this allocation. * If so, fix up the btree freelist's size. */ @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( if (error) return error; + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); + if (error) { + xfs_perag_put(pag); + return error; + } /* * Get the block number and update the data structures. @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) agf->agf_flfirst = 0; - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); be32_add_cpu(&agf->agf_flcount, -1); xfs_trans_agflist_delta(tp, -1); pag->pagf_flcount--; @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, be32_to_cpu(agf->agf_seqno), &agflbp))) return error; + + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); + if (error) { + xfs_perag_put(pag); + return error; + } + be32_add_cpu(&agf->agf_fllast, 1); if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) agf->agf_fllast = 0; - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); be32_add_cpu(&agf->agf_flcount, 1); xfs_trans_agflist_delta(tp, 1); pag->pagf_flcount++; @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( pag->pagb_count = 0; pag->pagb_tree = RB_ROOT; pag->pagf_init = 1; + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); } #ifdef DEBUG else if (!XFS_FORCED_SHUTDOWN(mp)) { diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index e0792d036be2..78a6377a9b38 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -353,6 +353,7 @@ typedef struct xfs_perag { char pagi_inodeok; /* The agi is ok for inodes */ uint8_t pagf_levels[XFS_BTNUM_AGF]; /* # of levels in bno & cnt btree */ + bool pagf_needpadfix; uint32_t pagf_flcount; /* count of blocks in freelist */ xfs_extlen_t pagf_freeblks; /* total free blocks */ xfs_extlen_t pagf_longest; /* longest free space */ diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 945de08af7ba..c7a3bcd6cc4a 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, __entry->logflags) ); +TRACE_EVENT(xfs_agfl_padfix, + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), + TP_ARGS(mp, osize, nsize), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(int, osize) + __field(int, nsize) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->osize = osize; + __entry->nsize = nsize; + ), + TP_printk("dev %d:%d old size %d new size %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->osize, __entry->nsize) +); + #endif /* _TRACE_XFS_H */ #undef TRACE_INCLUDE_PATH -- 2.13.6 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-07 19:24 [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand Brian Foster @ 2018-03-08 14:03 ` Carlos Maiolino 2018-03-08 14:15 ` Carlos Maiolino 2018-03-09 13:16 ` Brian Foster 0 siblings, 2 replies; 19+ messages in thread From: Carlos Maiolino @ 2018-03-08 14:03 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs Hi, On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > Disliked-by: Brian Foster <bfoster@redhat.com> > --- > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > and also seems to pass Darrick's agfl fixup tester (thanks) both on > upstream and on a rhel7 kernel with some minor supporting hacks. > > I tried to tighten up the logic a bit to reduce the odds of mistaking > actual corruption for a padding mismatch as much as possible. E.g., > limit to cases where the agfl is wrapped, make sure we don't mistake a > corruption that looks like an agfl with 120 entries on a packed kernel, > etc. > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > that in either case it's impossible to completely eliminate the risk of > confusing corruption with a padding mismatch so long as we're doing a > manual agfl fixup. The more I think about that the more I really dislike > doing this. :( > > After some IRC discussion with djwong and sandeen, I'm wondering if the > existence of 'xfs_repair -d' is a good enough last resort for those > users who might be bit by unexpected padding issues on a typical > upgrade. If so, we could fall back to a safer mount-time detection model > that enforces a read-only mount and let the user run repair. The > supposition is that those who aren't prepared to repair via a ramdisk or > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > read-only provided agfl padding is the only inconsistency. > > Eric points out that we can still write an unmount record for a > read-only mount, but I'm not sure that would be a problem if repair only > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > a recovery issue or it needs to be reformatted to update the LSN, both > of which seem to qualify as "you have more problems than agfl padding > and need to run repair anyways" to me. Thoughts? > Sorry if this may sound stupid, but in the possibility this can help the issue, or at least me learning something new. ISTM this issue is all related to the way xfs_agfl packing. I read the commit log where packed attribute was added to xfs_agfl, and I was wondering... What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? Merge it together in a 64bit field when reading it from disk, or split it when writing to? It seems to me this would avoid the size difference we are seeing now in 32/64 bit systems, and avoid such risk of confusion when trying to discern between a corrupted agfl and a padding mismatch. But still, just as reinforcement, this is just a guess, and I have no idea if this is feasible or not, but in case I'm completely nuts, I'll learn something new :) Cheers. > Brian > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > fs/xfs/xfs_mount.h | 1 + > fs/xfs/xfs_trace.h | 18 ++++++ > 3 files changed, 164 insertions(+), 2 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index c02781a4c091..31330996e31c 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > } > > /* > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > + * padding is only significant if the agfl wraps around the end or refers to an > + * invalid first/last value. > + */ > +static int > +xfs_agfl_ondisk_size( > + struct xfs_mount *mp, > + int first, > + int last, > + int count) > +{ > + int active = count; > + int agfl_size = XFS_AGFL_SIZE(mp); > + bool wrapped = (first > last) ? true : false; > + > + if (count && last >= first) > + active = last - first + 1; > + else if (count) > + active = agfl_size - first + last + 1; > + > + if (wrapped && active == count + 1) > + agfl_size--; > + else if ((wrapped && active == count - 1) || > + first == agfl_size || last == agfl_size) > + agfl_size++; > + > + /* > + * We can't discern the packing problem from certain forms of corruption > + * that may look exactly the same. To minimize the chance of mistaking > + * corruption for a size mismatch, clamp the size to known valid values. > + * A packed header agfl has 119 entries and the older unpacked format > + * has one less. > + */ > + if (agfl_size < 118 || agfl_size > 119) > + agfl_size = XFS_AGFL_SIZE(mp); > + > + return agfl_size; > +} > + > +static bool > +xfs_agfl_need_padfix( > + struct xfs_mount *mp, > + struct xfs_agf *agf) > +{ > + int f = be32_to_cpu(agf->agf_flfirst); > + int l = be32_to_cpu(agf->agf_fllast); > + int c = be32_to_cpu(agf->agf_flcount); > + > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > + return false; > + > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > +} > + > +static int > +xfs_agfl_check_padfix( > + struct xfs_trans *tp, > + struct xfs_buf *agbp, > + struct xfs_buf *agflbp, > + struct xfs_perag *pag) > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > + int agfl_size = XFS_AGFL_SIZE(mp); > + int ofirst, olast, osize; > + int nfirst, nlast; > + int logflags = 0; > + int startoff = 0; > + > + if (!pag->pagf_needpadfix) > + return 0; > + > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > + olast = nlast = be32_to_cpu(agf->agf_fllast); > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > + > + /* > + * If the on-disk agfl is smaller than what the kernel expects, the > + * last slot of the on-disk agfl is a gap with bogus data. Move the > + * first valid block into the gap and bump the pointer. > + */ > + if (osize < agfl_size) { > + ASSERT(pag->pagf_flcount != 0); > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > + nfirst++; > + goto done; > + } > + > + /* > + * Otherwise, the on-disk agfl is larger than what the current kernel > + * expects. If empty, just fix up the first and last pointers. If not, > + * move the inaccessible block to the end of the valid range. > + */ > + nfirst = do_mod(nfirst, agfl_size); > + if (pag->pagf_flcount == 0) { > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > + goto done; > + } > + if (nlast != agfl_size) > + nlast++; > + nlast = do_mod(nlast, agfl_size); > + agfl_bno[nlast] = agfl_bno[osize - 1]; > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > + > +done: > + if (nfirst != ofirst) { > + agf->agf_flfirst = cpu_to_be32(nfirst); > + logflags |= XFS_AGF_FLFIRST; > + } > + if (nlast != olast) { > + agf->agf_fllast = cpu_to_be32(nlast); > + logflags |= XFS_AGF_FLLAST; > + } > + if (startoff) { > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > + xfs_trans_log_buf(tp, agflbp, startoff, > + startoff + sizeof(xfs_agblock_t) - 1); > + } > + if (logflags) > + xfs_alloc_log_agf(tp, agbp, logflags); > + > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > + pag->pagf_needpadfix = false; > + > + return 0; > +} > + > +/* > * Decide whether to use this allocation group for this allocation. > * If so, fix up the btree freelist's size. > */ > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > if (error) > return error; > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > + if (error) { > + xfs_perag_put(pag); > + return error; > + } > > /* > * Get the block number and update the data structures. > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > agf->agf_flfirst = 0; > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > be32_add_cpu(&agf->agf_flcount, -1); > xfs_trans_agflist_delta(tp, -1); > pag->pagf_flcount--; > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > be32_to_cpu(agf->agf_seqno), &agflbp))) > return error; > + > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > + if (error) { > + xfs_perag_put(pag); > + return error; > + } > + > be32_add_cpu(&agf->agf_fllast, 1); > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > agf->agf_fllast = 0; > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > be32_add_cpu(&agf->agf_flcount, 1); > xfs_trans_agflist_delta(tp, 1); > pag->pagf_flcount++; > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > pag->pagb_count = 0; > pag->pagb_tree = RB_ROOT; > pag->pagf_init = 1; > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > } > #ifdef DEBUG > else if (!XFS_FORCED_SHUTDOWN(mp)) { > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index e0792d036be2..78a6377a9b38 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > char pagi_inodeok; /* The agi is ok for inodes */ > uint8_t pagf_levels[XFS_BTNUM_AGF]; > /* # of levels in bno & cnt btree */ > + bool pagf_needpadfix; > uint32_t pagf_flcount; /* count of blocks in freelist */ > xfs_extlen_t pagf_freeblks; /* total free blocks */ > xfs_extlen_t pagf_longest; /* longest free space */ > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index 945de08af7ba..c7a3bcd6cc4a 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > __entry->logflags) > ); > > +TRACE_EVENT(xfs_agfl_padfix, > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > + TP_ARGS(mp, osize, nsize), > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(int, osize) > + __field(int, nsize) > + ), > + TP_fast_assign( > + __entry->dev = mp->m_super->s_dev; > + __entry->osize = osize; > + __entry->nsize = nsize; > + ), > + TP_printk("dev %d:%d old size %d new size %d", > + MAJOR(__entry->dev), MINOR(__entry->dev), > + __entry->osize, __entry->nsize) > +); > + > #endif /* _TRACE_XFS_H */ > > #undef TRACE_INCLUDE_PATH > -- > 2.13.6 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Carlos ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-08 14:03 ` Carlos Maiolino @ 2018-03-08 14:15 ` Carlos Maiolino 2018-03-09 13:16 ` Brian Foster 1 sibling, 0 replies; 19+ messages in thread From: Carlos Maiolino @ 2018-03-08 14:15 UTC (permalink / raw) To: Brian Foster, linux-xfs On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > Hi, > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > Disliked-by: Brian Foster <bfoster@redhat.com> > > --- > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > actual corruption for a padding mismatch as much as possible. E.g., > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > corruption that looks like an agfl with 120 entries on a packed kernel, > > etc. > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > that in either case it's impossible to completely eliminate the risk of > > confusing corruption with a padding mismatch so long as we're doing a > > manual agfl fixup. The more I think about that the more I really dislike > > doing this. :( > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > existence of 'xfs_repair -d' is a good enough last resort for those > > users who might be bit by unexpected padding issues on a typical > > upgrade. If so, we could fall back to a safer mount-time detection model > > that enforces a read-only mount and let the user run repair. The > > supposition is that those who aren't prepared to repair via a ramdisk or > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > read-only provided agfl padding is the only inconsistency. > > > > Eric points out that we can still write an unmount record for a > > read-only mount, but I'm not sure that would be a problem if repair only > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > a recovery issue or it needs to be reformatted to update the LSN, both > > of which seem to qualify as "you have more problems than agfl padding > > and need to run repair anyways" to me. Thoughts? > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > or at least me learning something new. > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > log where packed attribute was added to xfs_agfl, and I was wondering... > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > Merge it together in a 64bit field when reading it from disk, or split it when > writing to? > It seems to me this would avoid the size difference we are seeing now in 32/64 > bit systems, and avoid such risk of confusion when trying to discern between a > corrupted agfl and a padding mismatch. > > But still, just as reinforcement, this is just a guess, and I have no idea if > this is feasible or not, but in case I'm completely nuts, I'll learn something > new :) Btw, I thought about this because once the LSN is logically separated between cycle number and block number, it could possibly be recorded using two separated 32bit fields in xfs_agfl, but again, I don't know yet if this feasible, just my $0.02 > > Cheers. > > > Brian > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > fs/xfs/xfs_mount.h | 1 + > > fs/xfs/xfs_trace.h | 18 ++++++ > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > index c02781a4c091..31330996e31c 100644 > > --- a/fs/xfs/libxfs/xfs_alloc.c > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > } > > > > /* > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > + * padding is only significant if the agfl wraps around the end or refers to an > > + * invalid first/last value. > > + */ > > +static int > > +xfs_agfl_ondisk_size( > > + struct xfs_mount *mp, > > + int first, > > + int last, > > + int count) > > +{ > > + int active = count; > > + int agfl_size = XFS_AGFL_SIZE(mp); > > + bool wrapped = (first > last) ? true : false; > > + > > + if (count && last >= first) > > + active = last - first + 1; > > + else if (count) > > + active = agfl_size - first + last + 1; > > + > > + if (wrapped && active == count + 1) > > + agfl_size--; > > + else if ((wrapped && active == count - 1) || > > + first == agfl_size || last == agfl_size) > > + agfl_size++; > > + > > + /* > > + * We can't discern the packing problem from certain forms of corruption > > + * that may look exactly the same. To minimize the chance of mistaking > > + * corruption for a size mismatch, clamp the size to known valid values. > > + * A packed header agfl has 119 entries and the older unpacked format > > + * has one less. > > + */ > > + if (agfl_size < 118 || agfl_size > 119) > > + agfl_size = XFS_AGFL_SIZE(mp); > > + > > + return agfl_size; > > +} > > + > > +static bool > > +xfs_agfl_need_padfix( > > + struct xfs_mount *mp, > > + struct xfs_agf *agf) > > +{ > > + int f = be32_to_cpu(agf->agf_flfirst); > > + int l = be32_to_cpu(agf->agf_fllast); > > + int c = be32_to_cpu(agf->agf_flcount); > > + > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > + return false; > > + > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > +} > > + > > +static int > > +xfs_agfl_check_padfix( > > + struct xfs_trans *tp, > > + struct xfs_buf *agbp, > > + struct xfs_buf *agflbp, > > + struct xfs_perag *pag) > > +{ > > + struct xfs_mount *mp = tp->t_mountp; > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > + int agfl_size = XFS_AGFL_SIZE(mp); > > + int ofirst, olast, osize; > > + int nfirst, nlast; > > + int logflags = 0; > > + int startoff = 0; > > + > > + if (!pag->pagf_needpadfix) > > + return 0; > > + > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > + > > + /* > > + * If the on-disk agfl is smaller than what the kernel expects, the > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > + * first valid block into the gap and bump the pointer. > > + */ > > + if (osize < agfl_size) { > > + ASSERT(pag->pagf_flcount != 0); > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > + nfirst++; > > + goto done; > > + } > > + > > + /* > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > + * expects. If empty, just fix up the first and last pointers. If not, > > + * move the inaccessible block to the end of the valid range. > > + */ > > + nfirst = do_mod(nfirst, agfl_size); > > + if (pag->pagf_flcount == 0) { > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > + goto done; > > + } > > + if (nlast != agfl_size) > > + nlast++; > > + nlast = do_mod(nlast, agfl_size); > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > + > > +done: > > + if (nfirst != ofirst) { > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > + logflags |= XFS_AGF_FLFIRST; > > + } > > + if (nlast != olast) { > > + agf->agf_fllast = cpu_to_be32(nlast); > > + logflags |= XFS_AGF_FLLAST; > > + } > > + if (startoff) { > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > + xfs_trans_log_buf(tp, agflbp, startoff, > > + startoff + sizeof(xfs_agblock_t) - 1); > > + } > > + if (logflags) > > + xfs_alloc_log_agf(tp, agbp, logflags); > > + > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > + pag->pagf_needpadfix = false; > > + > > + return 0; > > +} > > + > > +/* > > * Decide whether to use this allocation group for this allocation. > > * If so, fix up the btree freelist's size. > > */ > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > if (error) > > return error; > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > + if (error) { > > + xfs_perag_put(pag); > > + return error; > > + } > > > > /* > > * Get the block number and update the data structures. > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > agf->agf_flfirst = 0; > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > be32_add_cpu(&agf->agf_flcount, -1); > > xfs_trans_agflist_delta(tp, -1); > > pag->pagf_flcount--; > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > return error; > > + > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > + if (error) { > > + xfs_perag_put(pag); > > + return error; > > + } > > + > > be32_add_cpu(&agf->agf_fllast, 1); > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > agf->agf_fllast = 0; > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > be32_add_cpu(&agf->agf_flcount, 1); > > xfs_trans_agflist_delta(tp, 1); > > pag->pagf_flcount++; > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > pag->pagb_count = 0; > > pag->pagb_tree = RB_ROOT; > > pag->pagf_init = 1; > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > } > > #ifdef DEBUG > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > index e0792d036be2..78a6377a9b38 100644 > > --- a/fs/xfs/xfs_mount.h > > +++ b/fs/xfs/xfs_mount.h > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > char pagi_inodeok; /* The agi is ok for inodes */ > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > /* # of levels in bno & cnt btree */ > > + bool pagf_needpadfix; > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > xfs_extlen_t pagf_longest; /* longest free space */ > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > --- a/fs/xfs/xfs_trace.h > > +++ b/fs/xfs/xfs_trace.h > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > __entry->logflags) > > ); > > > > +TRACE_EVENT(xfs_agfl_padfix, > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > + TP_ARGS(mp, osize, nsize), > > + TP_STRUCT__entry( > > + __field(dev_t, dev) > > + __field(int, osize) > > + __field(int, nsize) > > + ), > > + TP_fast_assign( > > + __entry->dev = mp->m_super->s_dev; > > + __entry->osize = osize; > > + __entry->nsize = nsize; > > + ), > > + TP_printk("dev %d:%d old size %d new size %d", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + __entry->osize, __entry->nsize) > > +); > > + > > #endif /* _TRACE_XFS_H */ > > > > #undef TRACE_INCLUDE_PATH > > -- > > 2.13.6 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Carlos -- Carlos ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-08 14:03 ` Carlos Maiolino 2018-03-08 14:15 ` Carlos Maiolino @ 2018-03-09 13:16 ` Brian Foster 2018-03-09 17:33 ` Darrick J. Wong 2018-03-12 13:29 ` Carlos Maiolino 1 sibling, 2 replies; 19+ messages in thread From: Brian Foster @ 2018-03-09 13:16 UTC (permalink / raw) To: linux-xfs On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > Hi, > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > Disliked-by: Brian Foster <bfoster@redhat.com> > > --- > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > actual corruption for a padding mismatch as much as possible. E.g., > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > corruption that looks like an agfl with 120 entries on a packed kernel, > > etc. > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > that in either case it's impossible to completely eliminate the risk of > > confusing corruption with a padding mismatch so long as we're doing a > > manual agfl fixup. The more I think about that the more I really dislike > > doing this. :( > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > existence of 'xfs_repair -d' is a good enough last resort for those > > users who might be bit by unexpected padding issues on a typical > > upgrade. If so, we could fall back to a safer mount-time detection model > > that enforces a read-only mount and let the user run repair. The > > supposition is that those who aren't prepared to repair via a ramdisk or > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > read-only provided agfl padding is the only inconsistency. > > > > Eric points out that we can still write an unmount record for a > > read-only mount, but I'm not sure that would be a problem if repair only > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > a recovery issue or it needs to be reformatted to update the LSN, both > > of which seem to qualify as "you have more problems than agfl padding > > and need to run repair anyways" to me. Thoughts? > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > or at least me learning something new. > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > log where packed attribute was added to xfs_agfl, and I was wondering... > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > Merge it together in a 64bit field when reading it from disk, or split it when > writing to? > It seems to me this would avoid the size difference we are seeing now in 32/64 > bit systems, and avoid such risk of confusion when trying to discern between a > corrupted agfl and a padding mismatch. > I'm not following how you'd accomplish the latter..? We already have the packed attribute in place, so the padding is fixed with that. This effort has to do with trying to fix up an agfl written by an older kernel without the padding fix. My understanding is that the xfs_agfl header looks exactly the same on-disk in either case, the issue is a broken size calculation that causes the older kernel to not see/use one last slot in the agfl. If the agfl has wrapped and a newer kernel loads the same on-disk structure, it has no means to know whether the content of the final slot is a valid block or a "gap" left by an older kernel other than to check whether flcount matches the active count from flfirst -> fllast (and that's where potential confusion over a padding issue vs other corruption comes into play). Brian > But still, just as reinforcement, this is just a guess, and I have no idea if > this is feasible or not, but in case I'm completely nuts, I'll learn something > new :) > > Cheers. > > > Brian > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > fs/xfs/xfs_mount.h | 1 + > > fs/xfs/xfs_trace.h | 18 ++++++ > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > index c02781a4c091..31330996e31c 100644 > > --- a/fs/xfs/libxfs/xfs_alloc.c > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > } > > > > /* > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > + * padding is only significant if the agfl wraps around the end or refers to an > > + * invalid first/last value. > > + */ > > +static int > > +xfs_agfl_ondisk_size( > > + struct xfs_mount *mp, > > + int first, > > + int last, > > + int count) > > +{ > > + int active = count; > > + int agfl_size = XFS_AGFL_SIZE(mp); > > + bool wrapped = (first > last) ? true : false; > > + > > + if (count && last >= first) > > + active = last - first + 1; > > + else if (count) > > + active = agfl_size - first + last + 1; > > + > > + if (wrapped && active == count + 1) > > + agfl_size--; > > + else if ((wrapped && active == count - 1) || > > + first == agfl_size || last == agfl_size) > > + agfl_size++; > > + > > + /* > > + * We can't discern the packing problem from certain forms of corruption > > + * that may look exactly the same. To minimize the chance of mistaking > > + * corruption for a size mismatch, clamp the size to known valid values. > > + * A packed header agfl has 119 entries and the older unpacked format > > + * has one less. > > + */ > > + if (agfl_size < 118 || agfl_size > 119) > > + agfl_size = XFS_AGFL_SIZE(mp); > > + > > + return agfl_size; > > +} > > + > > +static bool > > +xfs_agfl_need_padfix( > > + struct xfs_mount *mp, > > + struct xfs_agf *agf) > > +{ > > + int f = be32_to_cpu(agf->agf_flfirst); > > + int l = be32_to_cpu(agf->agf_fllast); > > + int c = be32_to_cpu(agf->agf_flcount); > > + > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > + return false; > > + > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > +} > > + > > +static int > > +xfs_agfl_check_padfix( > > + struct xfs_trans *tp, > > + struct xfs_buf *agbp, > > + struct xfs_buf *agflbp, > > + struct xfs_perag *pag) > > +{ > > + struct xfs_mount *mp = tp->t_mountp; > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > + int agfl_size = XFS_AGFL_SIZE(mp); > > + int ofirst, olast, osize; > > + int nfirst, nlast; > > + int logflags = 0; > > + int startoff = 0; > > + > > + if (!pag->pagf_needpadfix) > > + return 0; > > + > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > + > > + /* > > + * If the on-disk agfl is smaller than what the kernel expects, the > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > + * first valid block into the gap and bump the pointer. > > + */ > > + if (osize < agfl_size) { > > + ASSERT(pag->pagf_flcount != 0); > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > + nfirst++; > > + goto done; > > + } > > + > > + /* > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > + * expects. If empty, just fix up the first and last pointers. If not, > > + * move the inaccessible block to the end of the valid range. > > + */ > > + nfirst = do_mod(nfirst, agfl_size); > > + if (pag->pagf_flcount == 0) { > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > + goto done; > > + } > > + if (nlast != agfl_size) > > + nlast++; > > + nlast = do_mod(nlast, agfl_size); > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > + > > +done: > > + if (nfirst != ofirst) { > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > + logflags |= XFS_AGF_FLFIRST; > > + } > > + if (nlast != olast) { > > + agf->agf_fllast = cpu_to_be32(nlast); > > + logflags |= XFS_AGF_FLLAST; > > + } > > + if (startoff) { > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > + xfs_trans_log_buf(tp, agflbp, startoff, > > + startoff + sizeof(xfs_agblock_t) - 1); > > + } > > + if (logflags) > > + xfs_alloc_log_agf(tp, agbp, logflags); > > + > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > + pag->pagf_needpadfix = false; > > + > > + return 0; > > +} > > + > > +/* > > * Decide whether to use this allocation group for this allocation. > > * If so, fix up the btree freelist's size. > > */ > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > if (error) > > return error; > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > + if (error) { > > + xfs_perag_put(pag); > > + return error; > > + } > > > > /* > > * Get the block number and update the data structures. > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > agf->agf_flfirst = 0; > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > be32_add_cpu(&agf->agf_flcount, -1); > > xfs_trans_agflist_delta(tp, -1); > > pag->pagf_flcount--; > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > return error; > > + > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > + if (error) { > > + xfs_perag_put(pag); > > + return error; > > + } > > + > > be32_add_cpu(&agf->agf_fllast, 1); > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > agf->agf_fllast = 0; > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > be32_add_cpu(&agf->agf_flcount, 1); > > xfs_trans_agflist_delta(tp, 1); > > pag->pagf_flcount++; > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > pag->pagb_count = 0; > > pag->pagb_tree = RB_ROOT; > > pag->pagf_init = 1; > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > } > > #ifdef DEBUG > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > index e0792d036be2..78a6377a9b38 100644 > > --- a/fs/xfs/xfs_mount.h > > +++ b/fs/xfs/xfs_mount.h > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > char pagi_inodeok; /* The agi is ok for inodes */ > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > /* # of levels in bno & cnt btree */ > > + bool pagf_needpadfix; > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > xfs_extlen_t pagf_longest; /* longest free space */ > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > --- a/fs/xfs/xfs_trace.h > > +++ b/fs/xfs/xfs_trace.h > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > __entry->logflags) > > ); > > > > +TRACE_EVENT(xfs_agfl_padfix, > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > + TP_ARGS(mp, osize, nsize), > > + TP_STRUCT__entry( > > + __field(dev_t, dev) > > + __field(int, osize) > > + __field(int, nsize) > > + ), > > + TP_fast_assign( > > + __entry->dev = mp->m_super->s_dev; > > + __entry->osize = osize; > > + __entry->nsize = nsize; > > + ), > > + TP_printk("dev %d:%d old size %d new size %d", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + __entry->osize, __entry->nsize) > > +); > > + > > #endif /* _TRACE_XFS_H */ > > > > #undef TRACE_INCLUDE_PATH > > -- > > 2.13.6 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Carlos > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 13:16 ` Brian Foster @ 2018-03-09 17:33 ` Darrick J. Wong 2018-03-09 18:37 ` Brian Foster 2018-03-12 13:29 ` Carlos Maiolino 1 sibling, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2018-03-09 17:33 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Mar 09, 2018 at 08:16:28AM -0500, Brian Foster wrote: > On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > > Hi, > > > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > > Disliked-by: Brian Foster <bfoster@redhat.com> > > > --- > > > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > > actual corruption for a padding mismatch as much as possible. E.g., > > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > > corruption that looks like an agfl with 120 entries on a packed kernel, > > > etc. > > > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > > that in either case it's impossible to completely eliminate the risk of > > > confusing corruption with a padding mismatch so long as we're doing a > > > manual agfl fixup. The more I think about that the more I really dislike > > > doing this. :( > > > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > > existence of 'xfs_repair -d' is a good enough last resort for those > > > users who might be bit by unexpected padding issues on a typical > > > upgrade. If so, we could fall back to a safer mount-time detection model > > > that enforces a read-only mount and let the user run repair. The > > > supposition is that those who aren't prepared to repair via a ramdisk or > > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > > read-only provided agfl padding is the only inconsistency. > > > > > > Eric points out that we can still write an unmount record for a > > > read-only mount, but I'm not sure that would be a problem if repair only > > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > > a recovery issue or it needs to be reformatted to update the LSN, both > > > of which seem to qualify as "you have more problems than agfl padding > > > and need to run repair anyways" to me. Thoughts? > > > > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > or at least me learning something new. > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > Merge it together in a 64bit field when reading it from disk, or split it when > > writing to? > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > bit systems, and avoid such risk of confusion when trying to discern between a > > corrupted agfl and a padding mismatch. > > > > I'm not following how you'd accomplish the latter..? We already have the > packed attribute in place, so the padding is fixed with that. This > effort has to do with trying to fix up an agfl written by an older > kernel without the padding fix. My understanding is that the xfs_agfl > header looks exactly the same on-disk in either case, the issue is a > broken size calculation that causes the older kernel to not see/use one > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > the same on-disk structure, it has no means to know whether the content > of the final slot is a valid block or a "gap" left by an older kernel > other than to check whether flcount matches the active count from > flfirst -> fllast (and that's where potential confusion over a padding > issue vs other corruption comes into play). Your understanding is correct. Sez me, who is watching the fsck.xfs -f discussion just in case that can be turned into a viable option quickly. ..and wondering what if we /did/ just implement Dave's suggestion from years ago where if the flcount doesn't match we just reset the agfl and call fix_freelist to refill it with fresh blocks... it would suck to leak blocks, though. Obviously, if the fs has rmap enabled then we can just rebuild it on the spot via xfs_repair_agfl() (see patch on mailing list). --D > Brian > > > But still, just as reinforcement, this is just a guess, and I have no idea if > > this is feasible or not, but in case I'm completely nuts, I'll learn something > > new :) > > > > Cheers. > > > > > Brian > > > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > > fs/xfs/xfs_mount.h | 1 + > > > fs/xfs/xfs_trace.h | 18 ++++++ > > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > index c02781a4c091..31330996e31c 100644 > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > > } > > > > > > /* > > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > > + * padding is only significant if the agfl wraps around the end or refers to an > > > + * invalid first/last value. > > > + */ > > > +static int > > > +xfs_agfl_ondisk_size( > > > + struct xfs_mount *mp, > > > + int first, > > > + int last, > > > + int count) > > > +{ > > > + int active = count; > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > + bool wrapped = (first > last) ? true : false; > > > + > > > + if (count && last >= first) > > > + active = last - first + 1; > > > + else if (count) > > > + active = agfl_size - first + last + 1; > > > + > > > + if (wrapped && active == count + 1) > > > + agfl_size--; > > > + else if ((wrapped && active == count - 1) || > > > + first == agfl_size || last == agfl_size) > > > + agfl_size++; > > > + > > > + /* > > > + * We can't discern the packing problem from certain forms of corruption > > > + * that may look exactly the same. To minimize the chance of mistaking > > > + * corruption for a size mismatch, clamp the size to known valid values. > > > + * A packed header agfl has 119 entries and the older unpacked format > > > + * has one less. > > > + */ > > > + if (agfl_size < 118 || agfl_size > 119) > > > + agfl_size = XFS_AGFL_SIZE(mp); > > > + > > > + return agfl_size; > > > +} > > > + > > > +static bool > > > +xfs_agfl_need_padfix( > > > + struct xfs_mount *mp, > > > + struct xfs_agf *agf) > > > +{ > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > + int l = be32_to_cpu(agf->agf_fllast); > > > + int c = be32_to_cpu(agf->agf_flcount); > > > + > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > + return false; > > > + > > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > > +} > > > + > > > +static int > > > +xfs_agfl_check_padfix( > > > + struct xfs_trans *tp, > > > + struct xfs_buf *agbp, > > > + struct xfs_buf *agflbp, > > > + struct xfs_perag *pag) > > > +{ > > > + struct xfs_mount *mp = tp->t_mountp; > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > + int ofirst, olast, osize; > > > + int nfirst, nlast; > > > + int logflags = 0; > > > + int startoff = 0; > > > + > > > + if (!pag->pagf_needpadfix) > > > + return 0; > > > + > > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > > + > > > + /* > > > + * If the on-disk agfl is smaller than what the kernel expects, the > > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > > + * first valid block into the gap and bump the pointer. > > > + */ > > > + if (osize < agfl_size) { > > > + ASSERT(pag->pagf_flcount != 0); > > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > > + nfirst++; > > > + goto done; > > > + } > > > + > > > + /* > > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > > + * expects. If empty, just fix up the first and last pointers. If not, > > > + * move the inaccessible block to the end of the valid range. > > > + */ > > > + nfirst = do_mod(nfirst, agfl_size); > > > + if (pag->pagf_flcount == 0) { > > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > > + goto done; > > > + } > > > + if (nlast != agfl_size) > > > + nlast++; > > > + nlast = do_mod(nlast, agfl_size); > > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > > + > > > +done: > > > + if (nfirst != ofirst) { > > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > > + logflags |= XFS_AGF_FLFIRST; > > > + } > > > + if (nlast != olast) { > > > + agf->agf_fllast = cpu_to_be32(nlast); > > > + logflags |= XFS_AGF_FLLAST; > > > + } > > > + if (startoff) { > > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > > + xfs_trans_log_buf(tp, agflbp, startoff, > > > + startoff + sizeof(xfs_agblock_t) - 1); > > > + } > > > + if (logflags) > > > + xfs_alloc_log_agf(tp, agbp, logflags); > > > + > > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > > + pag->pagf_needpadfix = false; > > > + > > > + return 0; > > > +} > > > + > > > +/* > > > * Decide whether to use this allocation group for this allocation. > > > * If so, fix up the btree freelist's size. > > > */ > > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > > if (error) > > > return error; > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > + if (error) { > > > + xfs_perag_put(pag); > > > + return error; > > > + } > > > > > > /* > > > * Get the block number and update the data structures. > > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > > agf->agf_flfirst = 0; > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > be32_add_cpu(&agf->agf_flcount, -1); > > > xfs_trans_agflist_delta(tp, -1); > > > pag->pagf_flcount--; > > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > > return error; > > > + > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > + if (error) { > > > + xfs_perag_put(pag); > > > + return error; > > > + } > > > + > > > be32_add_cpu(&agf->agf_fllast, 1); > > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > > agf->agf_fllast = 0; > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > be32_add_cpu(&agf->agf_flcount, 1); > > > xfs_trans_agflist_delta(tp, 1); > > > pag->pagf_flcount++; > > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > > pag->pagb_count = 0; > > > pag->pagb_tree = RB_ROOT; > > > pag->pagf_init = 1; > > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > > } > > > #ifdef DEBUG > > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > > index e0792d036be2..78a6377a9b38 100644 > > > --- a/fs/xfs/xfs_mount.h > > > +++ b/fs/xfs/xfs_mount.h > > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > > char pagi_inodeok; /* The agi is ok for inodes */ > > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > > /* # of levels in bno & cnt btree */ > > > + bool pagf_needpadfix; > > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > > xfs_extlen_t pagf_longest; /* longest free space */ > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > > --- a/fs/xfs/xfs_trace.h > > > +++ b/fs/xfs/xfs_trace.h > > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > > __entry->logflags) > > > ); > > > > > > +TRACE_EVENT(xfs_agfl_padfix, > > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > > + TP_ARGS(mp, osize, nsize), > > > + TP_STRUCT__entry( > > > + __field(dev_t, dev) > > > + __field(int, osize) > > > + __field(int, nsize) > > > + ), > > > + TP_fast_assign( > > > + __entry->dev = mp->m_super->s_dev; > > > + __entry->osize = osize; > > > + __entry->nsize = nsize; > > > + ), > > > + TP_printk("dev %d:%d old size %d new size %d", > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > + __entry->osize, __entry->nsize) > > > +); > > > + > > > #endif /* _TRACE_XFS_H */ > > > > > > #undef TRACE_INCLUDE_PATH > > > -- > > > 2.13.6 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > Carlos > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 17:33 ` Darrick J. Wong @ 2018-03-09 18:37 ` Brian Foster 2018-03-09 19:08 ` Darrick J. Wong 2018-03-09 22:10 ` Dave Chinner 0 siblings, 2 replies; 19+ messages in thread From: Brian Foster @ 2018-03-09 18:37 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > On Fri, Mar 09, 2018 at 08:16:28AM -0500, Brian Foster wrote: > > On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > > > Hi, > > > > > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > > > Disliked-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > > > actual corruption for a padding mismatch as much as possible. E.g., > > > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > > > corruption that looks like an agfl with 120 entries on a packed kernel, > > > > etc. > > > > > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > > > that in either case it's impossible to completely eliminate the risk of > > > > confusing corruption with a padding mismatch so long as we're doing a > > > > manual agfl fixup. The more I think about that the more I really dislike > > > > doing this. :( > > > > > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > > > existence of 'xfs_repair -d' is a good enough last resort for those > > > > users who might be bit by unexpected padding issues on a typical > > > > upgrade. If so, we could fall back to a safer mount-time detection model > > > > that enforces a read-only mount and let the user run repair. The > > > > supposition is that those who aren't prepared to repair via a ramdisk or > > > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > > > read-only provided agfl padding is the only inconsistency. > > > > > > > > Eric points out that we can still write an unmount record for a > > > > read-only mount, but I'm not sure that would be a problem if repair only > > > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > > > a recovery issue or it needs to be reformatted to update the LSN, both > > > > of which seem to qualify as "you have more problems than agfl padding > > > > and need to run repair anyways" to me. Thoughts? > > > > > > > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > > or at least me learning something new. > > > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > > Merge it together in a 64bit field when reading it from disk, or split it when > > > writing to? > > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > > bit systems, and avoid such risk of confusion when trying to discern between a > > > corrupted agfl and a padding mismatch. > > > > > > > I'm not following how you'd accomplish the latter..? We already have the > > packed attribute in place, so the padding is fixed with that. This > > effort has to do with trying to fix up an agfl written by an older > > kernel without the padding fix. My understanding is that the xfs_agfl > > header looks exactly the same on-disk in either case, the issue is a > > broken size calculation that causes the older kernel to not see/use one > > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > > the same on-disk structure, it has no means to know whether the content > > of the final slot is a valid block or a "gap" left by an older kernel > > other than to check whether flcount matches the active count from > > flfirst -> fllast (and that's where potential confusion over a padding > > issue vs other corruption comes into play). > > Your understanding is correct. > > Sez me, who is watching the fsck.xfs -f discussion just in case that can > be turned into a viable option quickly. > Thanks. > ..and wondering what if we /did/ just implement Dave's suggestion from > years ago where if the flcount doesn't match we just reset the agfl and > call fix_freelist to refill it with fresh blocks... it would suck to > leak blocks, though. Obviously, if the fs has rmap enabled then we can > just rebuild it on the spot via xfs_repair_agfl() (see patch on mailing > list). > I wasn't initially too fond of this idea, but thinking about it a bit more, there is definitely some value in terms of determinism. We know we'll just leak some blocks vs. riskily swizzling around a corrupted agfl. We've had the whole lingering unlinked inode thing in the past without that being such a huge problem, as a point of comparison to something like leaked blocks. Most users are likely to hit ENOSPC before utilizing every last block anyways. It also sounds like there was some general performance related concern over mount time scans. My initial distaste for that was more from a code/overkill perspective, I hadn't really considered the performance aspect. Given that, perhaps an on-demand reset with a "Corrupted AGFL, tossed $flcount blocks, unmount and run xfs_repair if you ever want to see them again" warning might not be so bad. It works the same whether the kernel is packed, unpacked or the agfl is just borked, and therefore the test matrix is much simpler as well. Hmmmm... Brian > --D > > > Brian > > > > > But still, just as reinforcement, this is just a guess, and I have no idea if > > > this is feasible or not, but in case I'm completely nuts, I'll learn something > > > new :) > > > > > > Cheers. > > > > > > > Brian > > > > > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > > > fs/xfs/xfs_mount.h | 1 + > > > > fs/xfs/xfs_trace.h | 18 ++++++ > > > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > > index c02781a4c091..31330996e31c 100644 > > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > > > } > > > > > > > > /* > > > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > > > + * padding is only significant if the agfl wraps around the end or refers to an > > > > + * invalid first/last value. > > > > + */ > > > > +static int > > > > +xfs_agfl_ondisk_size( > > > > + struct xfs_mount *mp, > > > > + int first, > > > > + int last, > > > > + int count) > > > > +{ > > > > + int active = count; > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > + bool wrapped = (first > last) ? true : false; > > > > + > > > > + if (count && last >= first) > > > > + active = last - first + 1; > > > > + else if (count) > > > > + active = agfl_size - first + last + 1; > > > > + > > > > + if (wrapped && active == count + 1) > > > > + agfl_size--; > > > > + else if ((wrapped && active == count - 1) || > > > > + first == agfl_size || last == agfl_size) > > > > + agfl_size++; > > > > + > > > > + /* > > > > + * We can't discern the packing problem from certain forms of corruption > > > > + * that may look exactly the same. To minimize the chance of mistaking > > > > + * corruption for a size mismatch, clamp the size to known valid values. > > > > + * A packed header agfl has 119 entries and the older unpacked format > > > > + * has one less. > > > > + */ > > > > + if (agfl_size < 118 || agfl_size > 119) > > > > + agfl_size = XFS_AGFL_SIZE(mp); > > > > + > > > > + return agfl_size; > > > > +} > > > > + > > > > +static bool > > > > +xfs_agfl_need_padfix( > > > > + struct xfs_mount *mp, > > > > + struct xfs_agf *agf) > > > > +{ > > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > > + int l = be32_to_cpu(agf->agf_fllast); > > > > + int c = be32_to_cpu(agf->agf_flcount); > > > > + > > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > > + return false; > > > > + > > > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > > > +} > > > > + > > > > +static int > > > > +xfs_agfl_check_padfix( > > > > + struct xfs_trans *tp, > > > > + struct xfs_buf *agbp, > > > > + struct xfs_buf *agflbp, > > > > + struct xfs_perag *pag) > > > > +{ > > > > + struct xfs_mount *mp = tp->t_mountp; > > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > + int ofirst, olast, osize; > > > > + int nfirst, nlast; > > > > + int logflags = 0; > > > > + int startoff = 0; > > > > + > > > > + if (!pag->pagf_needpadfix) > > > > + return 0; > > > > + > > > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > > > + > > > > + /* > > > > + * If the on-disk agfl is smaller than what the kernel expects, the > > > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > > > + * first valid block into the gap and bump the pointer. > > > > + */ > > > > + if (osize < agfl_size) { > > > > + ASSERT(pag->pagf_flcount != 0); > > > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > > > + nfirst++; > > > > + goto done; > > > > + } > > > > + > > > > + /* > > > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > > > + * expects. If empty, just fix up the first and last pointers. If not, > > > > + * move the inaccessible block to the end of the valid range. > > > > + */ > > > > + nfirst = do_mod(nfirst, agfl_size); > > > > + if (pag->pagf_flcount == 0) { > > > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > > > + goto done; > > > > + } > > > > + if (nlast != agfl_size) > > > > + nlast++; > > > > + nlast = do_mod(nlast, agfl_size); > > > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > > > + > > > > +done: > > > > + if (nfirst != ofirst) { > > > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > > > + logflags |= XFS_AGF_FLFIRST; > > > > + } > > > > + if (nlast != olast) { > > > > + agf->agf_fllast = cpu_to_be32(nlast); > > > > + logflags |= XFS_AGF_FLLAST; > > > > + } > > > > + if (startoff) { > > > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > > > + xfs_trans_log_buf(tp, agflbp, startoff, > > > > + startoff + sizeof(xfs_agblock_t) - 1); > > > > + } > > > > + if (logflags) > > > > + xfs_alloc_log_agf(tp, agbp, logflags); > > > > + > > > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > > > + pag->pagf_needpadfix = false; > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/* > > > > * Decide whether to use this allocation group for this allocation. > > > > * If so, fix up the btree freelist's size. > > > > */ > > > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > > > if (error) > > > > return error; > > > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > + if (error) { > > > > + xfs_perag_put(pag); > > > > + return error; > > > > + } > > > > > > > > /* > > > > * Get the block number and update the data structures. > > > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > > > agf->agf_flfirst = 0; > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > be32_add_cpu(&agf->agf_flcount, -1); > > > > xfs_trans_agflist_delta(tp, -1); > > > > pag->pagf_flcount--; > > > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > > > return error; > > > > + > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > + if (error) { > > > > + xfs_perag_put(pag); > > > > + return error; > > > > + } > > > > + > > > > be32_add_cpu(&agf->agf_fllast, 1); > > > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > > > agf->agf_fllast = 0; > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > be32_add_cpu(&agf->agf_flcount, 1); > > > > xfs_trans_agflist_delta(tp, 1); > > > > pag->pagf_flcount++; > > > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > > > pag->pagb_count = 0; > > > > pag->pagb_tree = RB_ROOT; > > > > pag->pagf_init = 1; > > > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > > > } > > > > #ifdef DEBUG > > > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > > > index e0792d036be2..78a6377a9b38 100644 > > > > --- a/fs/xfs/xfs_mount.h > > > > +++ b/fs/xfs/xfs_mount.h > > > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > > > char pagi_inodeok; /* The agi is ok for inodes */ > > > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > > > /* # of levels in bno & cnt btree */ > > > > + bool pagf_needpadfix; > > > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > > > xfs_extlen_t pagf_longest; /* longest free space */ > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > > > --- a/fs/xfs/xfs_trace.h > > > > +++ b/fs/xfs/xfs_trace.h > > > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > > > __entry->logflags) > > > > ); > > > > > > > > +TRACE_EVENT(xfs_agfl_padfix, > > > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > > > + TP_ARGS(mp, osize, nsize), > > > > + TP_STRUCT__entry( > > > > + __field(dev_t, dev) > > > > + __field(int, osize) > > > > + __field(int, nsize) > > > > + ), > > > > + TP_fast_assign( > > > > + __entry->dev = mp->m_super->s_dev; > > > > + __entry->osize = osize; > > > > + __entry->nsize = nsize; > > > > + ), > > > > + TP_printk("dev %d:%d old size %d new size %d", > > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > > + __entry->osize, __entry->nsize) > > > > +); > > > > + > > > > #endif /* _TRACE_XFS_H */ > > > > > > > > #undef TRACE_INCLUDE_PATH > > > > -- > > > > 2.13.6 > > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > > > Carlos > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 18:37 ` Brian Foster @ 2018-03-09 19:08 ` Darrick J. Wong 2018-03-09 21:20 ` Brian Foster 2018-03-09 22:10 ` Dave Chinner 1 sibling, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2018-03-09 19:08 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Eric Sandeen On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > > On Fri, Mar 09, 2018 at 08:16:28AM -0500, Brian Foster wrote: > > > On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > > > > Hi, > > > > > > > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > > > > Disliked-by: Brian Foster <bfoster@redhat.com> > > > > > --- > > > > > > > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > > > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > > > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > > > > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > > > > actual corruption for a padding mismatch as much as possible. E.g., > > > > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > > > > corruption that looks like an agfl with 120 entries on a packed kernel, > > > > > etc. > > > > > > > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > > > > that in either case it's impossible to completely eliminate the risk of > > > > > confusing corruption with a padding mismatch so long as we're doing a > > > > > manual agfl fixup. The more I think about that the more I really dislike > > > > > doing this. :( > > > > > > > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > > > > existence of 'xfs_repair -d' is a good enough last resort for those > > > > > users who might be bit by unexpected padding issues on a typical > > > > > upgrade. If so, we could fall back to a safer mount-time detection model > > > > > that enforces a read-only mount and let the user run repair. The > > > > > supposition is that those who aren't prepared to repair via a ramdisk or > > > > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > > > > read-only provided agfl padding is the only inconsistency. > > > > > > > > > > Eric points out that we can still write an unmount record for a > > > > > read-only mount, but I'm not sure that would be a problem if repair only > > > > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > > > > a recovery issue or it needs to be reformatted to update the LSN, both > > > > > of which seem to qualify as "you have more problems than agfl padding > > > > > and need to run repair anyways" to me. Thoughts? > > > > > > > > > > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > > > or at least me learning something new. > > > > > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > > > Merge it together in a 64bit field when reading it from disk, or split it when > > > > writing to? > > > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > > > bit systems, and avoid such risk of confusion when trying to discern between a > > > > corrupted agfl and a padding mismatch. > > > > > > > > > > I'm not following how you'd accomplish the latter..? We already have the > > > packed attribute in place, so the padding is fixed with that. This > > > effort has to do with trying to fix up an agfl written by an older > > > kernel without the padding fix. My understanding is that the xfs_agfl > > > header looks exactly the same on-disk in either case, the issue is a > > > broken size calculation that causes the older kernel to not see/use one > > > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > > > the same on-disk structure, it has no means to know whether the content > > > of the final slot is a valid block or a "gap" left by an older kernel > > > other than to check whether flcount matches the active count from > > > flfirst -> fllast (and that's where potential confusion over a padding > > > issue vs other corruption comes into play). > > > > Your understanding is correct. > > > > Sez me, who is watching the fsck.xfs -f discussion just in case that can > > be turned into a viable option quickly. > > > > Thanks. > > > ..and wondering what if we /did/ just implement Dave's suggestion from > > years ago where if the flcount doesn't match we just reset the agfl and > > call fix_freelist to refill it with fresh blocks... it would suck to > > leak blocks, though. Obviously, if the fs has rmap enabled then we can > > just rebuild it on the spot via xfs_repair_agfl() (see patch on mailing > > list). > > > > I wasn't initially too fond of this idea, but thinking about it a bit > more, there is definitely some value in terms of determinism. We know > we'll just leak some blocks vs. riskily swizzling around a corrupted > agfl. We've had the whole lingering unlinked inode thing in the past > without that being such a huge problem, as a point of comparison to > something like leaked blocks. Most users are likely to hit ENOSPC before > utilizing every last block anyways. I don't think the unlinked inode leftovers is a good comparison, those can leave gigabytes of space unreclaimed, and I thought Eric's patch to make it unconditional was a reaction to someone actually hitting ENOSPC. Also those unlinked inodes should all have valid metadata. However, the borked AGFLs have questionable metadata (is it corrupt, or is it just improperly padded?) and the number of blocks is probably small, so leaking ~10 blocks per AG probably isn't going to kill us. Even if the agfl is totally full on a 4k sector disk, 1000 blocks is still only 4MB. > It also sounds like there was some general performance related concern > over mount time scans. My initial distaste for that was more from a > code/overkill perspective, I hadn't really considered the performance > aspect. > > Given that, perhaps an on-demand reset with a "Corrupted AGFL, tossed > $flcount blocks, unmount and run xfs_repair if you ever want to see them > again" warning might not be so bad. It works the same whether the kernel > is packed, unpacked or the agfl is just borked, and therefore the test > matrix is much simpler as well. Hmmmm... Something occurred to me late yesterday while talking to Eric about his patch to do the unlinked inode recovery. For a reflinked filesystem, we already iterate every AG at mount time to clean out all the stale CoW staging extents. Given that we already take the performance hit to walk all the AGs, perhaps it makes sense to combine the stale cow, unlinked inode, and bad-pad agf scans into a single pass? Granted you do make the point that we've been leaking unlinked inodes on rootfses for ages with little trouble, but in the long run I think we'd prefer to clean up the filesystem leftovers. --D > Brian > > > --D > > > > > Brian > > > > > > > But still, just as reinforcement, this is just a guess, and I have no idea if > > > > this is feasible or not, but in case I'm completely nuts, I'll learn something > > > > new :) > > > > > > > > Cheers. > > > > > > > > > Brian > > > > > > > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > > > > fs/xfs/xfs_mount.h | 1 + > > > > > fs/xfs/xfs_trace.h | 18 ++++++ > > > > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > > > index c02781a4c091..31330996e31c 100644 > > > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > > > > } > > > > > > > > > > /* > > > > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > > > > + * padding is only significant if the agfl wraps around the end or refers to an > > > > > + * invalid first/last value. > > > > > + */ > > > > > +static int > > > > > +xfs_agfl_ondisk_size( > > > > > + struct xfs_mount *mp, > > > > > + int first, > > > > > + int last, > > > > > + int count) > > > > > +{ > > > > > + int active = count; > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > + bool wrapped = (first > last) ? true : false; > > > > > + > > > > > + if (count && last >= first) > > > > > + active = last - first + 1; > > > > > + else if (count) > > > > > + active = agfl_size - first + last + 1; > > > > > + > > > > > + if (wrapped && active == count + 1) > > > > > + agfl_size--; > > > > > + else if ((wrapped && active == count - 1) || > > > > > + first == agfl_size || last == agfl_size) > > > > > + agfl_size++; > > > > > + > > > > > + /* > > > > > + * We can't discern the packing problem from certain forms of corruption > > > > > + * that may look exactly the same. To minimize the chance of mistaking > > > > > + * corruption for a size mismatch, clamp the size to known valid values. > > > > > + * A packed header agfl has 119 entries and the older unpacked format > > > > > + * has one less. > > > > > + */ > > > > > + if (agfl_size < 118 || agfl_size > 119) > > > > > + agfl_size = XFS_AGFL_SIZE(mp); > > > > > + > > > > > + return agfl_size; > > > > > +} > > > > > + > > > > > +static bool > > > > > +xfs_agfl_need_padfix( > > > > > + struct xfs_mount *mp, > > > > > + struct xfs_agf *agf) > > > > > +{ > > > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > > > + int l = be32_to_cpu(agf->agf_fllast); > > > > > + int c = be32_to_cpu(agf->agf_flcount); > > > > > + > > > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > > > + return false; > > > > > + > > > > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > > > > +} > > > > > + > > > > > +static int > > > > > +xfs_agfl_check_padfix( > > > > > + struct xfs_trans *tp, > > > > > + struct xfs_buf *agbp, > > > > > + struct xfs_buf *agflbp, > > > > > + struct xfs_perag *pag) > > > > > +{ > > > > > + struct xfs_mount *mp = tp->t_mountp; > > > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > + int ofirst, olast, osize; > > > > > + int nfirst, nlast; > > > > > + int logflags = 0; > > > > > + int startoff = 0; > > > > > + > > > > > + if (!pag->pagf_needpadfix) > > > > > + return 0; > > > > > + > > > > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > > > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > > > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > > > > + > > > > > + /* > > > > > + * If the on-disk agfl is smaller than what the kernel expects, the > > > > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > > > > + * first valid block into the gap and bump the pointer. > > > > > + */ > > > > > + if (osize < agfl_size) { > > > > > + ASSERT(pag->pagf_flcount != 0); > > > > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > > > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > > > > + nfirst++; > > > > > + goto done; > > > > > + } > > > > > + > > > > > + /* > > > > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > > > > + * expects. If empty, just fix up the first and last pointers. If not, > > > > > + * move the inaccessible block to the end of the valid range. > > > > > + */ > > > > > + nfirst = do_mod(nfirst, agfl_size); > > > > > + if (pag->pagf_flcount == 0) { > > > > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > > > > + goto done; > > > > > + } > > > > > + if (nlast != agfl_size) > > > > > + nlast++; > > > > > + nlast = do_mod(nlast, agfl_size); > > > > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > > > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > > > > + > > > > > +done: > > > > > + if (nfirst != ofirst) { > > > > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > > > > + logflags |= XFS_AGF_FLFIRST; > > > > > + } > > > > > + if (nlast != olast) { > > > > > + agf->agf_fllast = cpu_to_be32(nlast); > > > > > + logflags |= XFS_AGF_FLLAST; > > > > > + } > > > > > + if (startoff) { > > > > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > > > > + xfs_trans_log_buf(tp, agflbp, startoff, > > > > > + startoff + sizeof(xfs_agblock_t) - 1); > > > > > + } > > > > > + if (logflags) > > > > > + xfs_alloc_log_agf(tp, agbp, logflags); > > > > > + > > > > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > > > > + pag->pagf_needpadfix = false; > > > > > + > > > > > + return 0; > > > > > +} > > > > > + > > > > > +/* > > > > > * Decide whether to use this allocation group for this allocation. > > > > > * If so, fix up the btree freelist's size. > > > > > */ > > > > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > > > > if (error) > > > > > return error; > > > > > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > + if (error) { > > > > > + xfs_perag_put(pag); > > > > > + return error; > > > > > + } > > > > > > > > > > /* > > > > > * Get the block number and update the data structures. > > > > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > > > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > > > > agf->agf_flfirst = 0; > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > be32_add_cpu(&agf->agf_flcount, -1); > > > > > xfs_trans_agflist_delta(tp, -1); > > > > > pag->pagf_flcount--; > > > > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > > > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > > > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > > > > return error; > > > > > + > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > + if (error) { > > > > > + xfs_perag_put(pag); > > > > > + return error; > > > > > + } > > > > > + > > > > > be32_add_cpu(&agf->agf_fllast, 1); > > > > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > > > > agf->agf_fllast = 0; > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > be32_add_cpu(&agf->agf_flcount, 1); > > > > > xfs_trans_agflist_delta(tp, 1); > > > > > pag->pagf_flcount++; > > > > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > > > > pag->pagb_count = 0; > > > > > pag->pagb_tree = RB_ROOT; > > > > > pag->pagf_init = 1; > > > > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > > > > } > > > > > #ifdef DEBUG > > > > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > > > > index e0792d036be2..78a6377a9b38 100644 > > > > > --- a/fs/xfs/xfs_mount.h > > > > > +++ b/fs/xfs/xfs_mount.h > > > > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > > > > char pagi_inodeok; /* The agi is ok for inodes */ > > > > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > > > > /* # of levels in bno & cnt btree */ > > > > > + bool pagf_needpadfix; > > > > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > > > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > > > > xfs_extlen_t pagf_longest; /* longest free space */ > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > > > > --- a/fs/xfs/xfs_trace.h > > > > > +++ b/fs/xfs/xfs_trace.h > > > > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > > > > __entry->logflags) > > > > > ); > > > > > > > > > > +TRACE_EVENT(xfs_agfl_padfix, > > > > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > > > > + TP_ARGS(mp, osize, nsize), > > > > > + TP_STRUCT__entry( > > > > > + __field(dev_t, dev) > > > > > + __field(int, osize) > > > > > + __field(int, nsize) > > > > > + ), > > > > > + TP_fast_assign( > > > > > + __entry->dev = mp->m_super->s_dev; > > > > > + __entry->osize = osize; > > > > > + __entry->nsize = nsize; > > > > > + ), > > > > > + TP_printk("dev %d:%d old size %d new size %d", > > > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > > > + __entry->osize, __entry->nsize) > > > > > +); > > > > > + > > > > > #endif /* _TRACE_XFS_H */ > > > > > > > > > > #undef TRACE_INCLUDE_PATH > > > > > -- > > > > > 2.13.6 > > > > > > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > the body of a message to majordomo@vger.kernel.org > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- > > > > Carlos > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 19:08 ` Darrick J. Wong @ 2018-03-09 21:20 ` Brian Foster 2018-03-09 22:37 ` Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Brian Foster @ 2018-03-09 21:20 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, Eric Sandeen On Fri, Mar 09, 2018 at 11:08:32AM -0800, Darrick J. Wong wrote: > On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > > > On Fri, Mar 09, 2018 at 08:16:28AM -0500, Brian Foster wrote: > > > > On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > > > > > Hi, > > > > > > > > > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > > > > > Disliked-by: Brian Foster <bfoster@redhat.com> > > > > > > --- > > > > > > > > > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > > > > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > > > > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > > > > > > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > > > > > actual corruption for a padding mismatch as much as possible. E.g., > > > > > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > > > > > corruption that looks like an agfl with 120 entries on a packed kernel, > > > > > > etc. > > > > > > > > > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > > > > > that in either case it's impossible to completely eliminate the risk of > > > > > > confusing corruption with a padding mismatch so long as we're doing a > > > > > > manual agfl fixup. The more I think about that the more I really dislike > > > > > > doing this. :( > > > > > > > > > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > > > > > existence of 'xfs_repair -d' is a good enough last resort for those > > > > > > users who might be bit by unexpected padding issues on a typical > > > > > > upgrade. If so, we could fall back to a safer mount-time detection model > > > > > > that enforces a read-only mount and let the user run repair. The > > > > > > supposition is that those who aren't prepared to repair via a ramdisk or > > > > > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > > > > > read-only provided agfl padding is the only inconsistency. > > > > > > > > > > > > Eric points out that we can still write an unmount record for a > > > > > > read-only mount, but I'm not sure that would be a problem if repair only > > > > > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > > > > > a recovery issue or it needs to be reformatted to update the LSN, both > > > > > > of which seem to qualify as "you have more problems than agfl padding > > > > > > and need to run repair anyways" to me. Thoughts? > > > > > > > > > > > > > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > > > > or at least me learning something new. > > > > > > > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > > > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > > > > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > > > > Merge it together in a 64bit field when reading it from disk, or split it when > > > > > writing to? > > > > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > > > > bit systems, and avoid such risk of confusion when trying to discern between a > > > > > corrupted agfl and a padding mismatch. > > > > > > > > > > > > > I'm not following how you'd accomplish the latter..? We already have the > > > > packed attribute in place, so the padding is fixed with that. This > > > > effort has to do with trying to fix up an agfl written by an older > > > > kernel without the padding fix. My understanding is that the xfs_agfl > > > > header looks exactly the same on-disk in either case, the issue is a > > > > broken size calculation that causes the older kernel to not see/use one > > > > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > > > > the same on-disk structure, it has no means to know whether the content > > > > of the final slot is a valid block or a "gap" left by an older kernel > > > > other than to check whether flcount matches the active count from > > > > flfirst -> fllast (and that's where potential confusion over a padding > > > > issue vs other corruption comes into play). > > > > > > Your understanding is correct. > > > > > > Sez me, who is watching the fsck.xfs -f discussion just in case that can > > > be turned into a viable option quickly. > > > > > > > Thanks. > > > > > ..and wondering what if we /did/ just implement Dave's suggestion from > > > years ago where if the flcount doesn't match we just reset the agfl and > > > call fix_freelist to refill it with fresh blocks... it would suck to > > > leak blocks, though. Obviously, if the fs has rmap enabled then we can > > > just rebuild it on the spot via xfs_repair_agfl() (see patch on mailing > > > list). > > > > > > > I wasn't initially too fond of this idea, but thinking about it a bit > > more, there is definitely some value in terms of determinism. We know > > we'll just leak some blocks vs. riskily swizzling around a corrupted > > agfl. We've had the whole lingering unlinked inode thing in the past > > without that being such a huge problem, as a point of comparison to > > something like leaked blocks. Most users are likely to hit ENOSPC before > > utilizing every last block anyways. > > I don't think the unlinked inode leftovers is a good comparison, those > can leave gigabytes of space unreclaimed, and I thought Eric's patch to > make it unconditional was a reaction to someone actually hitting ENOSPC. > Also those unlinked inodes should all have valid metadata. > Good point, the inode leaks can definitely have a larger impact in that regard. I was thinking more from the perspective of leaking the physical inodes moreso than the blocks they track. > However, the borked AGFLs have questionable metadata (is it corrupt, or > is it just improperly padded?) and the number of blocks is probably > small, so leaking ~10 blocks per AG probably isn't going to kill us. > Even if the agfl is totally full on a 4k sector disk, 1000 blocks is > still only 4MB. > Yep. More of an issue if users go back and forth causing multiple such resets, but honestly, if we warn loudly enough about it and users don't run repair in that situation you get what you get. > > It also sounds like there was some general performance related concern > > over mount time scans. My initial distaste for that was more from a > > code/overkill perspective, I hadn't really considered the performance > > aspect. > > > > Given that, perhaps an on-demand reset with a "Corrupted AGFL, tossed > > $flcount blocks, unmount and run xfs_repair if you ever want to see them > > again" warning might not be so bad. It works the same whether the kernel > > is packed, unpacked or the agfl is just borked, and therefore the test > > matrix is much simpler as well. Hmmmm... > > Something occurred to me late yesterday while talking to Eric about his > patch to do the unlinked inode recovery. For a reflinked filesystem, we > already iterate every AG at mount time to clean out all the stale CoW > staging extents. Given that we already take the performance hit to walk > all the AGs, perhaps it makes sense to combine the stale cow, unlinked > inode, and bad-pad agf scans into a single pass? > Is that necessary on every refcountbt=1 mount or only after an unclean shutdown? (If the former, I'm wondering if the better question is how could we limit that going forward..? ;). With regard to unlinked inodes, I've been thinking about turning that into a background operation for a little while, I just haven't got around to it yet. While the initial patches makes sense, I'm wondering if that's something we could ultimately kick off at mount time that doesn't necessarily have to be synchronous with the mount itself (but this is all not really paged in my head atm). In my mind, it seems a bit of a shame to do an explicit mount time scan only to use a risky hack (like the reset, or either of our previous patches, etc.) when we could do something clean like enforce a read-only mount. The latter would obviously require serialization with the mount. With an on-demand approach, we probably don't want to randomly transition to read-only or shutdown, so the reset + warn thing seems more amenable to me. I'm not against the mount scan for this particular issue as much as turning it into this infrastructural thing where we seemingly keep adding crap to it over time. The potential for abuse of that over time makes me cringe a bit, but tbh I haven't gone and actually tested mount time scans on large AG count systems to see if it's really that bad. > Granted you do make the point that we've been leaking unlinked inodes on > rootfses for ages with little trouble, but in the long run I think we'd > prefer to clean up the filesystem leftovers. > ISTM that the long run solution for the AGFL thing is online repair..? Brian > --D > > > Brian > > > > > --D > > > > > > > Brian > > > > > > > > > But still, just as reinforcement, this is just a guess, and I have no idea if > > > > > this is feasible or not, but in case I'm completely nuts, I'll learn something > > > > > new :) > > > > > > > > > > Cheers. > > > > > > > > > > > Brian > > > > > > > > > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > > > > > fs/xfs/xfs_mount.h | 1 + > > > > > > fs/xfs/xfs_trace.h | 18 ++++++ > > > > > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > > > > index c02781a4c091..31330996e31c 100644 > > > > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > > > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > > > > > } > > > > > > > > > > > > /* > > > > > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > > > > > + * padding is only significant if the agfl wraps around the end or refers to an > > > > > > + * invalid first/last value. > > > > > > + */ > > > > > > +static int > > > > > > +xfs_agfl_ondisk_size( > > > > > > + struct xfs_mount *mp, > > > > > > + int first, > > > > > > + int last, > > > > > > + int count) > > > > > > +{ > > > > > > + int active = count; > > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > > + bool wrapped = (first > last) ? true : false; > > > > > > + > > > > > > + if (count && last >= first) > > > > > > + active = last - first + 1; > > > > > > + else if (count) > > > > > > + active = agfl_size - first + last + 1; > > > > > > + > > > > > > + if (wrapped && active == count + 1) > > > > > > + agfl_size--; > > > > > > + else if ((wrapped && active == count - 1) || > > > > > > + first == agfl_size || last == agfl_size) > > > > > > + agfl_size++; > > > > > > + > > > > > > + /* > > > > > > + * We can't discern the packing problem from certain forms of corruption > > > > > > + * that may look exactly the same. To minimize the chance of mistaking > > > > > > + * corruption for a size mismatch, clamp the size to known valid values. > > > > > > + * A packed header agfl has 119 entries and the older unpacked format > > > > > > + * has one less. > > > > > > + */ > > > > > > + if (agfl_size < 118 || agfl_size > 119) > > > > > > + agfl_size = XFS_AGFL_SIZE(mp); > > > > > > + > > > > > > + return agfl_size; > > > > > > +} > > > > > > + > > > > > > +static bool > > > > > > +xfs_agfl_need_padfix( > > > > > > + struct xfs_mount *mp, > > > > > > + struct xfs_agf *agf) > > > > > > +{ > > > > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > > > > + int l = be32_to_cpu(agf->agf_fllast); > > > > > > + int c = be32_to_cpu(agf->agf_flcount); > > > > > > + > > > > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > > > > + return false; > > > > > > + > > > > > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > > > > > +} > > > > > > + > > > > > > +static int > > > > > > +xfs_agfl_check_padfix( > > > > > > + struct xfs_trans *tp, > > > > > > + struct xfs_buf *agbp, > > > > > > + struct xfs_buf *agflbp, > > > > > > + struct xfs_perag *pag) > > > > > > +{ > > > > > > + struct xfs_mount *mp = tp->t_mountp; > > > > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > > > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > > + int ofirst, olast, osize; > > > > > > + int nfirst, nlast; > > > > > > + int logflags = 0; > > > > > > + int startoff = 0; > > > > > > + > > > > > > + if (!pag->pagf_needpadfix) > > > > > > + return 0; > > > > > > + > > > > > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > > > > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > > > > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > > > > > + > > > > > > + /* > > > > > > + * If the on-disk agfl is smaller than what the kernel expects, the > > > > > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > > > > > + * first valid block into the gap and bump the pointer. > > > > > > + */ > > > > > > + if (osize < agfl_size) { > > > > > > + ASSERT(pag->pagf_flcount != 0); > > > > > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > > > > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > > > > > + nfirst++; > > > > > > + goto done; > > > > > > + } > > > > > > + > > > > > > + /* > > > > > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > > > > > + * expects. If empty, just fix up the first and last pointers. If not, > > > > > > + * move the inaccessible block to the end of the valid range. > > > > > > + */ > > > > > > + nfirst = do_mod(nfirst, agfl_size); > > > > > > + if (pag->pagf_flcount == 0) { > > > > > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > > > > > + goto done; > > > > > > + } > > > > > > + if (nlast != agfl_size) > > > > > > + nlast++; > > > > > > + nlast = do_mod(nlast, agfl_size); > > > > > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > > > > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > > > > > + > > > > > > +done: > > > > > > + if (nfirst != ofirst) { > > > > > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > > > > > + logflags |= XFS_AGF_FLFIRST; > > > > > > + } > > > > > > + if (nlast != olast) { > > > > > > + agf->agf_fllast = cpu_to_be32(nlast); > > > > > > + logflags |= XFS_AGF_FLLAST; > > > > > > + } > > > > > > + if (startoff) { > > > > > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > > > > > + xfs_trans_log_buf(tp, agflbp, startoff, > > > > > > + startoff + sizeof(xfs_agblock_t) - 1); > > > > > > + } > > > > > > + if (logflags) > > > > > > + xfs_alloc_log_agf(tp, agbp, logflags); > > > > > > + > > > > > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > > > > > + pag->pagf_needpadfix = false; > > > > > > + > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > +/* > > > > > > * Decide whether to use this allocation group for this allocation. > > > > > > * If so, fix up the btree freelist's size. > > > > > > */ > > > > > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > > > > > if (error) > > > > > > return error; > > > > > > > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > > + if (error) { > > > > > > + xfs_perag_put(pag); > > > > > > + return error; > > > > > > + } > > > > > > > > > > > > /* > > > > > > * Get the block number and update the data structures. > > > > > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > > > > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > > > > > agf->agf_flfirst = 0; > > > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > be32_add_cpu(&agf->agf_flcount, -1); > > > > > > xfs_trans_agflist_delta(tp, -1); > > > > > > pag->pagf_flcount--; > > > > > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > > > > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > > > > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > > > > > return error; > > > > > > + > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > > + if (error) { > > > > > > + xfs_perag_put(pag); > > > > > > + return error; > > > > > > + } > > > > > > + > > > > > > be32_add_cpu(&agf->agf_fllast, 1); > > > > > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > > > > > agf->agf_fllast = 0; > > > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > be32_add_cpu(&agf->agf_flcount, 1); > > > > > > xfs_trans_agflist_delta(tp, 1); > > > > > > pag->pagf_flcount++; > > > > > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > > > > > pag->pagb_count = 0; > > > > > > pag->pagb_tree = RB_ROOT; > > > > > > pag->pagf_init = 1; > > > > > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > > > > > } > > > > > > #ifdef DEBUG > > > > > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > > > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > > > > > index e0792d036be2..78a6377a9b38 100644 > > > > > > --- a/fs/xfs/xfs_mount.h > > > > > > +++ b/fs/xfs/xfs_mount.h > > > > > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > > > > > char pagi_inodeok; /* The agi is ok for inodes */ > > > > > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > > > > > /* # of levels in bno & cnt btree */ > > > > > > + bool pagf_needpadfix; > > > > > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > > > > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > > > > > xfs_extlen_t pagf_longest; /* longest free space */ > > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > > > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > > > > > --- a/fs/xfs/xfs_trace.h > > > > > > +++ b/fs/xfs/xfs_trace.h > > > > > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > > > > > __entry->logflags) > > > > > > ); > > > > > > > > > > > > +TRACE_EVENT(xfs_agfl_padfix, > > > > > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > > > > > + TP_ARGS(mp, osize, nsize), > > > > > > + TP_STRUCT__entry( > > > > > > + __field(dev_t, dev) > > > > > > + __field(int, osize) > > > > > > + __field(int, nsize) > > > > > > + ), > > > > > > + TP_fast_assign( > > > > > > + __entry->dev = mp->m_super->s_dev; > > > > > > + __entry->osize = osize; > > > > > > + __entry->nsize = nsize; > > > > > > + ), > > > > > > + TP_printk("dev %d:%d old size %d new size %d", > > > > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > > > > + __entry->osize, __entry->nsize) > > > > > > +); > > > > > > + > > > > > > #endif /* _TRACE_XFS_H */ > > > > > > > > > > > > #undef TRACE_INCLUDE_PATH > > > > > > -- > > > > > > 2.13.6 > > > > > > > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > the body of a message to majordomo@vger.kernel.org > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > -- > > > > > Carlos > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > the body of a message to majordomo@vger.kernel.org > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 21:20 ` Brian Foster @ 2018-03-09 22:37 ` Darrick J. Wong 2018-03-12 13:11 ` Brian Foster 0 siblings, 1 reply; 19+ messages in thread From: Darrick J. Wong @ 2018-03-09 22:37 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Eric Sandeen On Fri, Mar 09, 2018 at 04:20:31PM -0500, Brian Foster wrote: > On Fri, Mar 09, 2018 at 11:08:32AM -0800, Darrick J. Wong wrote: > > On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > > > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > > > > On Fri, Mar 09, 2018 at 08:16:28AM -0500, Brian Foster wrote: > > > > > On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > > > > > > Hi, > > > > > > > > > > > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > > > > > > Disliked-by: Brian Foster <bfoster@redhat.com> > > > > > > > --- > > > > > > > > > > > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > > > > > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > > > > > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > > > > > > > > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > > > > > > actual corruption for a padding mismatch as much as possible. E.g., > > > > > > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > > > > > > corruption that looks like an agfl with 120 entries on a packed kernel, > > > > > > > etc. > > > > > > > > > > > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > > > > > > that in either case it's impossible to completely eliminate the risk of > > > > > > > confusing corruption with a padding mismatch so long as we're doing a > > > > > > > manual agfl fixup. The more I think about that the more I really dislike > > > > > > > doing this. :( > > > > > > > > > > > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > > > > > > existence of 'xfs_repair -d' is a good enough last resort for those > > > > > > > users who might be bit by unexpected padding issues on a typical > > > > > > > upgrade. If so, we could fall back to a safer mount-time detection model > > > > > > > that enforces a read-only mount and let the user run repair. The > > > > > > > supposition is that those who aren't prepared to repair via a ramdisk or > > > > > > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > > > > > > read-only provided agfl padding is the only inconsistency. > > > > > > > > > > > > > > Eric points out that we can still write an unmount record for a > > > > > > > read-only mount, but I'm not sure that would be a problem if repair only > > > > > > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > > > > > > a recovery issue or it needs to be reformatted to update the LSN, both > > > > > > > of which seem to qualify as "you have more problems than agfl padding > > > > > > > and need to run repair anyways" to me. Thoughts? > > > > > > > > > > > > > > > > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > > > > > or at least me learning something new. > > > > > > > > > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > > > > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > > > > > > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > > > > > Merge it together in a 64bit field when reading it from disk, or split it when > > > > > > writing to? > > > > > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > > > > > bit systems, and avoid such risk of confusion when trying to discern between a > > > > > > corrupted agfl and a padding mismatch. > > > > > > > > > > > > > > > > I'm not following how you'd accomplish the latter..? We already have the > > > > > packed attribute in place, so the padding is fixed with that. This > > > > > effort has to do with trying to fix up an agfl written by an older > > > > > kernel without the padding fix. My understanding is that the xfs_agfl > > > > > header looks exactly the same on-disk in either case, the issue is a > > > > > broken size calculation that causes the older kernel to not see/use one > > > > > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > > > > > the same on-disk structure, it has no means to know whether the content > > > > > of the final slot is a valid block or a "gap" left by an older kernel > > > > > other than to check whether flcount matches the active count from > > > > > flfirst -> fllast (and that's where potential confusion over a padding > > > > > issue vs other corruption comes into play). > > > > > > > > Your understanding is correct. > > > > > > > > Sez me, who is watching the fsck.xfs -f discussion just in case that can > > > > be turned into a viable option quickly. > > > > > > > > > > Thanks. > > > > > > > ..and wondering what if we /did/ just implement Dave's suggestion from > > > > years ago where if the flcount doesn't match we just reset the agfl and > > > > call fix_freelist to refill it with fresh blocks... it would suck to > > > > leak blocks, though. Obviously, if the fs has rmap enabled then we can > > > > just rebuild it on the spot via xfs_repair_agfl() (see patch on mailing > > > > list). > > > > > > > > > > I wasn't initially too fond of this idea, but thinking about it a bit > > > more, there is definitely some value in terms of determinism. We know > > > we'll just leak some blocks vs. riskily swizzling around a corrupted > > > agfl. We've had the whole lingering unlinked inode thing in the past > > > without that being such a huge problem, as a point of comparison to > > > something like leaked blocks. Most users are likely to hit ENOSPC before > > > utilizing every last block anyways. > > > > I don't think the unlinked inode leftovers is a good comparison, those > > can leave gigabytes of space unreclaimed, and I thought Eric's patch to > > make it unconditional was a reaction to someone actually hitting ENOSPC. > > Also those unlinked inodes should all have valid metadata. > > > > Good point, the inode leaks can definitely have a larger impact in that > regard. I was thinking more from the perspective of leaking the physical > inodes moreso than the blocks they track. > > > However, the borked AGFLs have questionable metadata (is it corrupt, or > > is it just improperly padded?) and the number of blocks is probably > > small, so leaking ~10 blocks per AG probably isn't going to kill us. > > Even if the agfl is totally full on a 4k sector disk, 1000 blocks is > > still only 4MB. > > > > Yep. More of an issue if users go back and forth causing multiple such > resets, but honestly, if we warn loudly enough about it and users don't > run repair in that situation you get what you get. I bet that'll happen more than we want it to. :/ > > > It also sounds like there was some general performance related concern > > > over mount time scans. My initial distaste for that was more from a > > > code/overkill perspective, I hadn't really considered the performance > > > aspect. > > > > > > Given that, perhaps an on-demand reset with a "Corrupted AGFL, tossed > > > $flcount blocks, unmount and run xfs_repair if you ever want to see them > > > again" warning might not be so bad. It works the same whether the kernel > > > is packed, unpacked or the agfl is just borked, and therefore the test > > > matrix is much simpler as well. Hmmmm... > > > > Something occurred to me late yesterday while talking to Eric about his > > patch to do the unlinked inode recovery. For a reflinked filesystem, we > > already iterate every AG at mount time to clean out all the stale CoW > > staging extents. Given that we already take the performance hit to walk > > all the AGs, perhaps it makes sense to combine the stale cow, unlinked > > inode, and bad-pad agf scans into a single pass? > > > > Is that necessary on every refcountbt=1 mount or only after an unclean > shutdown? (If the former, I'm wondering if the better question is how > could we limit that going forward..? ;). It should only be necessary to go scanning for stale cows if we had to recover the log, but for the initial merge I figured it better to be conservative and always scan for crud rather than risk letting it accumulate like the unlinked inodes. :) That said, we don't clean out cow mappings when we freeze the fs, so on the assumption that freezing leaves us with a clean log, I thought it necessary to scan for stale cows at any rw-mount/remount-rw opportunity. That assumption turned out not to be true, but otoh Eric is trying to push us towards not having to recover snapshots, I think. I'm also unenthusiastic about flushing what could be gigabytes worth of cow preallocations on every fs freeze just to reduce mount time if we happen to crash while the fs is frozen. The increased fragmentation that can happen due to the flush hits us every day, whereas crashes are infrequent. > With regard to unlinked inodes, I've been thinking about turning that > into a background operation for a little while, I just haven't got > around to it yet. While the initial patches makes sense, I'm wondering > if that's something we could ultimately kick off at mount time that > doesn't necessarily have to be synchronous with the mount itself (but > this is all not really paged in my head atm). This doesn't seem too difficult, since we can walk the unlinked list and zap the inodes that have zero usage count, I think. Though there's always going to be people who complain about weird slowdowns after mount if we make it asynchronous -- look at the ext4 lazy initialization stuff. :) CoW is harder since we don't track whether some cow fork extent maps to the block or if it's stale and can go away now. > In my mind, it seems a bit of a shame to do an explicit mount time scan > only to use a risky hack (like the reset, or either of our previous > patches, etc.) when we could do something clean like enforce a read-only > mount. The latter would obviously require serialization with the mount. > With an on-demand approach, we probably don't want to randomly > transition to read-only or shutdown, so the reset + warn thing seems > more amenable to me. <nod> I've been mulling over rewriting your previous rfc patch that put all the scanning/lifting in {get,put}_freelist but having it reset the agfl instead of doing the swizzling stuff. > I'm not against the mount scan for this particular issue as much as > turning it into this infrastructural thing where we seemingly keep > adding crap to it over time. The potential for abuse of that over time > makes me cringe a bit, but tbh I haven't gone and actually tested mount > time scans on large AG count systems to see if it's really that bad. It's not too bad if you aren't growfs happy but it is noticeable on my slow disky laptops. Admittedly I haven't tried it on these monster 10000 AG filesystems I keep hearing about. :P > > Granted you do make the point that we've been leaking unlinked inodes on > > rootfses for ages with little trouble, but in the long run I think we'd > > prefer to clean up the filesystem leftovers. > > > > ISTM that the long run solution for the AGFL thing is online repair..? That is the goal. :) --D > Brian > > > --D > > > > > Brian > > > > > > > --D > > > > > > > > > Brian > > > > > > > > > > > But still, just as reinforcement, this is just a guess, and I have no idea if > > > > > > this is feasible or not, but in case I'm completely nuts, I'll learn something > > > > > > new :) > > > > > > > > > > > > Cheers. > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > > > > > > fs/xfs/xfs_mount.h | 1 + > > > > > > > fs/xfs/xfs_trace.h | 18 ++++++ > > > > > > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > > > > > index c02781a4c091..31330996e31c 100644 > > > > > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > > > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > > > > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > > > > > > } > > > > > > > > > > > > > > /* > > > > > > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > > > > > > + * padding is only significant if the agfl wraps around the end or refers to an > > > > > > > + * invalid first/last value. > > > > > > > + */ > > > > > > > +static int > > > > > > > +xfs_agfl_ondisk_size( > > > > > > > + struct xfs_mount *mp, > > > > > > > + int first, > > > > > > > + int last, > > > > > > > + int count) > > > > > > > +{ > > > > > > > + int active = count; > > > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > > > + bool wrapped = (first > last) ? true : false; > > > > > > > + > > > > > > > + if (count && last >= first) > > > > > > > + active = last - first + 1; > > > > > > > + else if (count) > > > > > > > + active = agfl_size - first + last + 1; > > > > > > > + > > > > > > > + if (wrapped && active == count + 1) > > > > > > > + agfl_size--; > > > > > > > + else if ((wrapped && active == count - 1) || > > > > > > > + first == agfl_size || last == agfl_size) > > > > > > > + agfl_size++; > > > > > > > + > > > > > > > + /* > > > > > > > + * We can't discern the packing problem from certain forms of corruption > > > > > > > + * that may look exactly the same. To minimize the chance of mistaking > > > > > > > + * corruption for a size mismatch, clamp the size to known valid values. > > > > > > > + * A packed header agfl has 119 entries and the older unpacked format > > > > > > > + * has one less. > > > > > > > + */ > > > > > > > + if (agfl_size < 118 || agfl_size > 119) > > > > > > > + agfl_size = XFS_AGFL_SIZE(mp); > > > > > > > + > > > > > > > + return agfl_size; > > > > > > > +} > > > > > > > + > > > > > > > +static bool > > > > > > > +xfs_agfl_need_padfix( > > > > > > > + struct xfs_mount *mp, > > > > > > > + struct xfs_agf *agf) > > > > > > > +{ > > > > > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > > > > > + int l = be32_to_cpu(agf->agf_fllast); > > > > > > > + int c = be32_to_cpu(agf->agf_flcount); > > > > > > > + > > > > > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > > > > > + return false; > > > > > > > + > > > > > > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > > > > > > +} > > > > > > > + > > > > > > > +static int > > > > > > > +xfs_agfl_check_padfix( > > > > > > > + struct xfs_trans *tp, > > > > > > > + struct xfs_buf *agbp, > > > > > > > + struct xfs_buf *agflbp, > > > > > > > + struct xfs_perag *pag) > > > > > > > +{ > > > > > > > + struct xfs_mount *mp = tp->t_mountp; > > > > > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > > > > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > > > + int ofirst, olast, osize; > > > > > > > + int nfirst, nlast; > > > > > > > + int logflags = 0; > > > > > > > + int startoff = 0; > > > > > > > + > > > > > > > + if (!pag->pagf_needpadfix) > > > > > > > + return 0; > > > > > > > + > > > > > > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > > > > > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > > > > > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > > > > > > + > > > > > > > + /* > > > > > > > + * If the on-disk agfl is smaller than what the kernel expects, the > > > > > > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > > > > > > + * first valid block into the gap and bump the pointer. > > > > > > > + */ > > > > > > > + if (osize < agfl_size) { > > > > > > > + ASSERT(pag->pagf_flcount != 0); > > > > > > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > > > > > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > > > > > > + nfirst++; > > > > > > > + goto done; > > > > > > > + } > > > > > > > + > > > > > > > + /* > > > > > > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > > > > > > + * expects. If empty, just fix up the first and last pointers. If not, > > > > > > > + * move the inaccessible block to the end of the valid range. > > > > > > > + */ > > > > > > > + nfirst = do_mod(nfirst, agfl_size); > > > > > > > + if (pag->pagf_flcount == 0) { > > > > > > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > > > > > > + goto done; > > > > > > > + } > > > > > > > + if (nlast != agfl_size) > > > > > > > + nlast++; > > > > > > > + nlast = do_mod(nlast, agfl_size); > > > > > > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > > > > > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > > > > > > + > > > > > > > +done: > > > > > > > + if (nfirst != ofirst) { > > > > > > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > > > > > > + logflags |= XFS_AGF_FLFIRST; > > > > > > > + } > > > > > > > + if (nlast != olast) { > > > > > > > + agf->agf_fllast = cpu_to_be32(nlast); > > > > > > > + logflags |= XFS_AGF_FLLAST; > > > > > > > + } > > > > > > > + if (startoff) { > > > > > > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > > > > > > + xfs_trans_log_buf(tp, agflbp, startoff, > > > > > > > + startoff + sizeof(xfs_agblock_t) - 1); > > > > > > > + } > > > > > > > + if (logflags) > > > > > > > + xfs_alloc_log_agf(tp, agbp, logflags); > > > > > > > + > > > > > > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > > > > > > + pag->pagf_needpadfix = false; > > > > > > > + > > > > > > > + return 0; > > > > > > > +} > > > > > > > + > > > > > > > +/* > > > > > > > * Decide whether to use this allocation group for this allocation. > > > > > > > * If so, fix up the btree freelist's size. > > > > > > > */ > > > > > > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > > > > > > if (error) > > > > > > > return error; > > > > > > > > > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > > > + if (error) { > > > > > > > + xfs_perag_put(pag); > > > > > > > + return error; > > > > > > > + } > > > > > > > > > > > > > > /* > > > > > > > * Get the block number and update the data structures. > > > > > > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > > > > > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > > > > > > agf->agf_flfirst = 0; > > > > > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > be32_add_cpu(&agf->agf_flcount, -1); > > > > > > > xfs_trans_agflist_delta(tp, -1); > > > > > > > pag->pagf_flcount--; > > > > > > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > > > > > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > > > > > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > > > > > > return error; > > > > > > > + > > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > > > + if (error) { > > > > > > > + xfs_perag_put(pag); > > > > > > > + return error; > > > > > > > + } > > > > > > > + > > > > > > > be32_add_cpu(&agf->agf_fllast, 1); > > > > > > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > > > > > > agf->agf_fllast = 0; > > > > > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > be32_add_cpu(&agf->agf_flcount, 1); > > > > > > > xfs_trans_agflist_delta(tp, 1); > > > > > > > pag->pagf_flcount++; > > > > > > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > > > > > > pag->pagb_count = 0; > > > > > > > pag->pagb_tree = RB_ROOT; > > > > > > > pag->pagf_init = 1; > > > > > > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > > > > > > } > > > > > > > #ifdef DEBUG > > > > > > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > > > > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > > > > > > index e0792d036be2..78a6377a9b38 100644 > > > > > > > --- a/fs/xfs/xfs_mount.h > > > > > > > +++ b/fs/xfs/xfs_mount.h > > > > > > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > > > > > > char pagi_inodeok; /* The agi is ok for inodes */ > > > > > > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > > > > > > /* # of levels in bno & cnt btree */ > > > > > > > + bool pagf_needpadfix; > > > > > > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > > > > > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > > > > > > xfs_extlen_t pagf_longest; /* longest free space */ > > > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > > > > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > > > > > > --- a/fs/xfs/xfs_trace.h > > > > > > > +++ b/fs/xfs/xfs_trace.h > > > > > > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > > > > > > __entry->logflags) > > > > > > > ); > > > > > > > > > > > > > > +TRACE_EVENT(xfs_agfl_padfix, > > > > > > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > > > > > > + TP_ARGS(mp, osize, nsize), > > > > > > > + TP_STRUCT__entry( > > > > > > > + __field(dev_t, dev) > > > > > > > + __field(int, osize) > > > > > > > + __field(int, nsize) > > > > > > > + ), > > > > > > > + TP_fast_assign( > > > > > > > + __entry->dev = mp->m_super->s_dev; > > > > > > > + __entry->osize = osize; > > > > > > > + __entry->nsize = nsize; > > > > > > > + ), > > > > > > > + TP_printk("dev %d:%d old size %d new size %d", > > > > > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > > > > > + __entry->osize, __entry->nsize) > > > > > > > +); > > > > > > > + > > > > > > > #endif /* _TRACE_XFS_H */ > > > > > > > > > > > > > > #undef TRACE_INCLUDE_PATH > > > > > > > -- > > > > > > > 2.13.6 > > > > > > > > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > the body of a message to majordomo@vger.kernel.org > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > -- > > > > > > Carlos > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > the body of a message to majordomo@vger.kernel.org > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > the body of a message to majordomo@vger.kernel.org > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 22:37 ` Darrick J. Wong @ 2018-03-12 13:11 ` Brian Foster 2018-03-12 17:35 ` Brian Foster 0 siblings, 1 reply; 19+ messages in thread From: Brian Foster @ 2018-03-12 13:11 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, Eric Sandeen On Fri, Mar 09, 2018 at 02:37:57PM -0800, Darrick J. Wong wrote: > On Fri, Mar 09, 2018 at 04:20:31PM -0500, Brian Foster wrote: > > On Fri, Mar 09, 2018 at 11:08:32AM -0800, Darrick J. Wong wrote: > > > On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > > > > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: ... > > > > > However, the borked AGFLs have questionable metadata (is it corrupt, or > > > is it just improperly padded?) and the number of blocks is probably > > > small, so leaking ~10 blocks per AG probably isn't going to kill us. > > > Even if the agfl is totally full on a 4k sector disk, 1000 blocks is > > > still only 4MB. > > > > > > > Yep. More of an issue if users go back and forth causing multiple such > > resets, but honestly, if we warn loudly enough about it and users don't > > run repair in that situation you get what you get. > > I bet that'll happen more than we want it to. :/ > Yeah, but that argument can be applied no matter what we do. How many users are going to continue to bounce back and forth between differently padded kernels that don't have any of this magic we're currently designing? Probably more than we want. ;) All this work isn't really fixing a bug IMO. It's more of a mitigation strategy to turn a known problem with nasty side effects into something more deterministic/predictable. Given that, I think there's a line for each potential solution that's just not worth crossing. For example, I don't think it's a big deal if the best available solution makes it difficult/impossible to handle the case on a filesystem that was moved with a dirty log or if we can't cover the case where the fs doesn't have enough blocks to reset/refill an agfl. I don't think it's worth changing high level behavior for the sake of this problem and I don't think it's worth doing something that's overly (subjective) complex (which isn't to say the problem is easy to solve, clearly it is not). > > > > It also sounds like there was some general performance related concern > > > > over mount time scans. My initial distaste for that was more from a > > > > code/overkill perspective, I hadn't really considered the performance > > > > aspect. > > > > > > > > Given that, perhaps an on-demand reset with a "Corrupted AGFL, tossed > > > > $flcount blocks, unmount and run xfs_repair if you ever want to see them > > > > again" warning might not be so bad. It works the same whether the kernel > > > > is packed, unpacked or the agfl is just borked, and therefore the test > > > > matrix is much simpler as well. Hmmmm... > > > > > > Something occurred to me late yesterday while talking to Eric about his > > > patch to do the unlinked inode recovery. For a reflinked filesystem, we > > > already iterate every AG at mount time to clean out all the stale CoW > > > staging extents. Given that we already take the performance hit to walk > > > all the AGs, perhaps it makes sense to combine the stale cow, unlinked > > > inode, and bad-pad agf scans into a single pass? > > > > > > > Is that necessary on every refcountbt=1 mount or only after an unclean > > shutdown? (If the former, I'm wondering if the better question is how > > could we limit that going forward..? ;). > > It should only be necessary to go scanning for stale cows if we had to > recover the log, but for the initial merge I figured it better to be > conservative and always scan for crud rather than risk letting it > accumulate like the unlinked inodes. :) > Ok. Regardless of whether we'd continue to do that long term or not, that kind of speaks to the point of not adding more mount scan stuff purely based on the fact that we already do it for something else. > That said, we don't clean out cow mappings when we freeze the fs, so > on the assumption that freezing leaves us with a clean log, I thought it > necessary to scan for stale cows at any rw-mount/remount-rw opportunity. > That assumption turned out not to be true, but otoh Eric is trying to > push us towards not having to recover snapshots, I think. > Yeah.. sounds familiar. IIRC, the historical reason for dirtying the log on freeze was to ensure unlinked inode cleanup after a crash.. > I'm also unenthusiastic about flushing what could be gigabytes worth of > cow preallocations on every fs freeze just to reduce mount time if we > happen to crash while the fs is frozen. The increased fragmentation > that can happen due to the flush hits us every day, whereas crashes are > infrequent. > Indeed, makes sense. Though on the flipside, we'll already reclaim cow blocks for any inode that isn't dirty on the 5 minute background scanner. An fs freeze is likely less frequent than that, no? I guess it depends on how active the fs is with cow operations. One simple option might be to conditionally dirty the log if cowblocks are present and perhaps consider a tunable for users who want to run a cowblocks scan on freeze. Anyways, that's probably a separate discussion. > > With regard to unlinked inodes, I've been thinking about turning that > > into a background operation for a little while, I just haven't got > > around to it yet. While the initial patches makes sense, I'm wondering > > if that's something we could ultimately kick off at mount time that > > doesn't necessarily have to be synchronous with the mount itself (but > > this is all not really paged in my head atm). > > This doesn't seem too difficult, since we can walk the unlinked list and > zap the inodes that have zero usage count, I think. Though there's > always going to be people who complain about weird slowdowns after > mount if we make it asynchronous -- look at the ext4 lazy initialization > stuff. :) > That's not quite how I would envision it working. I'd expect it to work something like the background eofblocks reclaim and I'd also expect everything to still be cleaned up properly on a clean unmount. Alas, I haven't looked into the details yet so it's probably not worth getting too much into the weeds for something that doesn't exist yet. > CoW is harder since we don't track whether some cow fork extent maps to > the block or if it's stale and can go away now. > > > In my mind, it seems a bit of a shame to do an explicit mount time scan > > only to use a risky hack (like the reset, or either of our previous > > patches, etc.) when we could do something clean like enforce a read-only > > mount. The latter would obviously require serialization with the mount. > > With an on-demand approach, we probably don't want to randomly > > transition to read-only or shutdown, so the reset + warn thing seems > > more amenable to me. > > <nod> I've been mulling over rewriting your previous rfc patch that put > all the scanning/lifting in {get,put}_freelist but having it reset the > agfl instead of doing the swizzling stuff. > Something to be careful of... emptying the agfl obviously means the subsequent fixup is a btree block allocation. That may limit the context of where the fixup can be performed. IOW, deferring it to _get_freelist() might no longer be safe. Instead, I think we'd have to implement it such that the on-disk flcount is completely untrusted when the mismatch is detected. Part of me is wondering whether we could actually do a tx-less reset (not refill) on agfl read. Would a subsequent fix_freelist() log all the right bits to maintain crash consistency? We could just repeat the reset if nothing changes before a crash. Otherwise, perhaps we'd have to run it in fix_freelist() context. > > I'm not against the mount scan for this particular issue as much as > > turning it into this infrastructural thing where we seemingly keep > > adding crap to it over time. The potential for abuse of that over time > > makes me cringe a bit, but tbh I haven't gone and actually tested mount > > time scans on large AG count systems to see if it's really that bad. > > It's not too bad if you aren't growfs happy but it is noticeable on my > slow disky laptops. Admittedly I haven't tried it on these monster > 10000 AG filesystems I keep hearing about. :P > I think those setups tend to come from configurations where users want to construct small base/installation images and grow them on deployment for whatever reason. That tends to come with its own set of problems. Brian > > > Granted you do make the point that we've been leaking unlinked inodes on > > > rootfses for ages with little trouble, but in the long run I think we'd > > > prefer to clean up the filesystem leftovers. > > > > > > > ISTM that the long run solution for the AGFL thing is online repair..? > > That is the goal. :) > > --D > > > Brian > > > > > --D > > > > > > > Brian > > > > > > > > > --D > > > > > > > > > > > Brian > > > > > > > > > > > > > But still, just as reinforcement, this is just a guess, and I have no idea if > > > > > > > this is feasible or not, but in case I'm completely nuts, I'll learn something > > > > > > > new :) > > > > > > > > > > > > > > Cheers. > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > > > > > > > fs/xfs/xfs_mount.h | 1 + > > > > > > > > fs/xfs/xfs_trace.h | 18 ++++++ > > > > > > > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > > > > > > index c02781a4c091..31330996e31c 100644 > > > > > > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > > > > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > > > > > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > > > > > > > } > > > > > > > > > > > > > > > > /* > > > > > > > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > > > > > > > + * padding is only significant if the agfl wraps around the end or refers to an > > > > > > > > + * invalid first/last value. > > > > > > > > + */ > > > > > > > > +static int > > > > > > > > +xfs_agfl_ondisk_size( > > > > > > > > + struct xfs_mount *mp, > > > > > > > > + int first, > > > > > > > > + int last, > > > > > > > > + int count) > > > > > > > > +{ > > > > > > > > + int active = count; > > > > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > > > > + bool wrapped = (first > last) ? true : false; > > > > > > > > + > > > > > > > > + if (count && last >= first) > > > > > > > > + active = last - first + 1; > > > > > > > > + else if (count) > > > > > > > > + active = agfl_size - first + last + 1; > > > > > > > > + > > > > > > > > + if (wrapped && active == count + 1) > > > > > > > > + agfl_size--; > > > > > > > > + else if ((wrapped && active == count - 1) || > > > > > > > > + first == agfl_size || last == agfl_size) > > > > > > > > + agfl_size++; > > > > > > > > + > > > > > > > > + /* > > > > > > > > + * We can't discern the packing problem from certain forms of corruption > > > > > > > > + * that may look exactly the same. To minimize the chance of mistaking > > > > > > > > + * corruption for a size mismatch, clamp the size to known valid values. > > > > > > > > + * A packed header agfl has 119 entries and the older unpacked format > > > > > > > > + * has one less. > > > > > > > > + */ > > > > > > > > + if (agfl_size < 118 || agfl_size > 119) > > > > > > > > + agfl_size = XFS_AGFL_SIZE(mp); > > > > > > > > + > > > > > > > > + return agfl_size; > > > > > > > > +} > > > > > > > > + > > > > > > > > +static bool > > > > > > > > +xfs_agfl_need_padfix( > > > > > > > > + struct xfs_mount *mp, > > > > > > > > + struct xfs_agf *agf) > > > > > > > > +{ > > > > > > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > > > > > > + int l = be32_to_cpu(agf->agf_fllast); > > > > > > > > + int c = be32_to_cpu(agf->agf_flcount); > > > > > > > > + > > > > > > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > > > > > > + return false; > > > > > > > > + > > > > > > > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > > > > > > > +} > > > > > > > > + > > > > > > > > +static int > > > > > > > > +xfs_agfl_check_padfix( > > > > > > > > + struct xfs_trans *tp, > > > > > > > > + struct xfs_buf *agbp, > > > > > > > > + struct xfs_buf *agflbp, > > > > > > > > + struct xfs_perag *pag) > > > > > > > > +{ > > > > > > > > + struct xfs_mount *mp = tp->t_mountp; > > > > > > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > > > > > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > > > > > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > > > > > > + int ofirst, olast, osize; > > > > > > > > + int nfirst, nlast; > > > > > > > > + int logflags = 0; > > > > > > > > + int startoff = 0; > > > > > > > > + > > > > > > > > + if (!pag->pagf_needpadfix) > > > > > > > > + return 0; > > > > > > > > + > > > > > > > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > > > > > > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > > > > > > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > > > > > > > + > > > > > > > > + /* > > > > > > > > + * If the on-disk agfl is smaller than what the kernel expects, the > > > > > > > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > > > > > > > + * first valid block into the gap and bump the pointer. > > > > > > > > + */ > > > > > > > > + if (osize < agfl_size) { > > > > > > > > + ASSERT(pag->pagf_flcount != 0); > > > > > > > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > > > > > > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > > > > > > > + nfirst++; > > > > > > > > + goto done; > > > > > > > > + } > > > > > > > > + > > > > > > > > + /* > > > > > > > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > > > > > > > + * expects. If empty, just fix up the first and last pointers. If not, > > > > > > > > + * move the inaccessible block to the end of the valid range. > > > > > > > > + */ > > > > > > > > + nfirst = do_mod(nfirst, agfl_size); > > > > > > > > + if (pag->pagf_flcount == 0) { > > > > > > > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > > > > > > > + goto done; > > > > > > > > + } > > > > > > > > + if (nlast != agfl_size) > > > > > > > > + nlast++; > > > > > > > > + nlast = do_mod(nlast, agfl_size); > > > > > > > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > > > > > > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > > > > > > > + > > > > > > > > +done: > > > > > > > > + if (nfirst != ofirst) { > > > > > > > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > > > > > > > + logflags |= XFS_AGF_FLFIRST; > > > > > > > > + } > > > > > > > > + if (nlast != olast) { > > > > > > > > + agf->agf_fllast = cpu_to_be32(nlast); > > > > > > > > + logflags |= XFS_AGF_FLLAST; > > > > > > > > + } > > > > > > > > + if (startoff) { > > > > > > > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > > > > > > > + xfs_trans_log_buf(tp, agflbp, startoff, > > > > > > > > + startoff + sizeof(xfs_agblock_t) - 1); > > > > > > > > + } > > > > > > > > + if (logflags) > > > > > > > > + xfs_alloc_log_agf(tp, agbp, logflags); > > > > > > > > + > > > > > > > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > > > > > > > + pag->pagf_needpadfix = false; > > > > > > > > + > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > +/* > > > > > > > > * Decide whether to use this allocation group for this allocation. > > > > > > > > * If so, fix up the btree freelist's size. > > > > > > > > */ > > > > > > > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > > > > > > > if (error) > > > > > > > > return error; > > > > > > > > > > > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > > > > + if (error) { > > > > > > > > + xfs_perag_put(pag); > > > > > > > > + return error; > > > > > > > > + } > > > > > > > > > > > > > > > > /* > > > > > > > > * Get the block number and update the data structures. > > > > > > > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > > > > > > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > > > > > > > agf->agf_flfirst = 0; > > > > > > > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > > be32_add_cpu(&agf->agf_flcount, -1); > > > > > > > > xfs_trans_agflist_delta(tp, -1); > > > > > > > > pag->pagf_flcount--; > > > > > > > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > > > > > > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > > > > > > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > > > > > > > return error; > > > > > > > > + > > > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > > > > > > + if (error) { > > > > > > > > + xfs_perag_put(pag); > > > > > > > > + return error; > > > > > > > > + } > > > > > > > > + > > > > > > > > be32_add_cpu(&agf->agf_fllast, 1); > > > > > > > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > > > > > > > agf->agf_fllast = 0; > > > > > > > > > > > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > > > > > > be32_add_cpu(&agf->agf_flcount, 1); > > > > > > > > xfs_trans_agflist_delta(tp, 1); > > > > > > > > pag->pagf_flcount++; > > > > > > > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > > > > > > > pag->pagb_count = 0; > > > > > > > > pag->pagb_tree = RB_ROOT; > > > > > > > > pag->pagf_init = 1; > > > > > > > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > > > > > > > } > > > > > > > > #ifdef DEBUG > > > > > > > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > > > > > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > > > > > > > index e0792d036be2..78a6377a9b38 100644 > > > > > > > > --- a/fs/xfs/xfs_mount.h > > > > > > > > +++ b/fs/xfs/xfs_mount.h > > > > > > > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > > > > > > > char pagi_inodeok; /* The agi is ok for inodes */ > > > > > > > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > > > > > > > /* # of levels in bno & cnt btree */ > > > > > > > > + bool pagf_needpadfix; > > > > > > > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > > > > > > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > > > > > > > xfs_extlen_t pagf_longest; /* longest free space */ > > > > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > > > > > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > > > > > > > --- a/fs/xfs/xfs_trace.h > > > > > > > > +++ b/fs/xfs/xfs_trace.h > > > > > > > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > > > > > > > __entry->logflags) > > > > > > > > ); > > > > > > > > > > > > > > > > +TRACE_EVENT(xfs_agfl_padfix, > > > > > > > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > > > > > > > + TP_ARGS(mp, osize, nsize), > > > > > > > > + TP_STRUCT__entry( > > > > > > > > + __field(dev_t, dev) > > > > > > > > + __field(int, osize) > > > > > > > > + __field(int, nsize) > > > > > > > > + ), > > > > > > > > + TP_fast_assign( > > > > > > > > + __entry->dev = mp->m_super->s_dev; > > > > > > > > + __entry->osize = osize; > > > > > > > > + __entry->nsize = nsize; > > > > > > > > + ), > > > > > > > > + TP_printk("dev %d:%d old size %d new size %d", > > > > > > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > > > > > > + __entry->osize, __entry->nsize) > > > > > > > > +); > > > > > > > > + > > > > > > > > #endif /* _TRACE_XFS_H */ > > > > > > > > > > > > > > > > #undef TRACE_INCLUDE_PATH > > > > > > > > -- > > > > > > > > 2.13.6 > > > > > > > > > > > > > > > > -- > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > the body of a message to majordomo@vger.kernel.org > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > -- > > > > > > > Carlos > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > the body of a message to majordomo@vger.kernel.org > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > the body of a message to majordomo@vger.kernel.org > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-12 13:11 ` Brian Foster @ 2018-03-12 17:35 ` Brian Foster 2018-03-12 21:14 ` Dave Chinner 2018-03-14 3:07 ` Dave Chiluk 0 siblings, 2 replies; 19+ messages in thread From: Brian Foster @ 2018-03-12 17:35 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, Eric Sandeen On Mon, Mar 12, 2018 at 09:11:58AM -0400, Brian Foster wrote: > On Fri, Mar 09, 2018 at 02:37:57PM -0800, Darrick J. Wong wrote: > > On Fri, Mar 09, 2018 at 04:20:31PM -0500, Brian Foster wrote: > > > On Fri, Mar 09, 2018 at 11:08:32AM -0800, Darrick J. Wong wrote: > > > > On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > > > > > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: ... > > <nod> I've been mulling over rewriting your previous rfc patch that put > > all the scanning/lifting in {get,put}_freelist but having it reset the > > agfl instead of doing the swizzling stuff. > > > > Something to be careful of... emptying the agfl obviously means the > subsequent fixup is a btree block allocation. That may limit the context > of where the fixup can be performed. IOW, deferring it to > _get_freelist() might no longer be safe. Instead, I think we'd have to > implement it such that the on-disk flcount is completely untrusted when > the mismatch is detected. > ... Here's a variant of that patch that does a reset. It's definitely much simpler. Thoughts? Brian --- 8< --- diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index c02781a4c091..7d313bb4677d 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2053,6 +2053,59 @@ xfs_alloc_space_available( return true; } +static bool +xfs_agf_verify_flcount( + struct xfs_mount *mp, + struct xfs_agf *agf) +{ + int f = be32_to_cpu(agf->agf_flfirst); + int l = be32_to_cpu(agf->agf_fllast); + int c = be32_to_cpu(agf->agf_flcount); + int active = c; + int agfl_size = XFS_AGFL_SIZE(mp); + + if (!xfs_sb_version_hascrc(&mp->m_sb)) + return true; + + if (c && l >= f) + active = l - f + 1; + else if (c) + active = agfl_size - f + l + 1; + + if (active != c) + return false; + if (f >= agfl_size || l >= agfl_size) + return false; + + return true; +} + +static void +xfs_agfl_reset( + struct xfs_trans *tp, + struct xfs_buf *agbp, + struct xfs_perag *pag) +{ + struct xfs_mount *mp = tp->t_mountp; + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); + + if (!pag->pagf_needreset) + return; + + trace_xfs_agfl_reset(pag); + xfs_warn(mp, "agfl reset agno %u flcount %d", pag->pag_agno, + pag->pagf_flcount); + + agf->agf_flfirst = 0; + agf->agf_fllast = cpu_to_be32(XFS_AGFL_SIZE(mp) - 1); + agf->agf_flcount = 0; + xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLLAST | + XFS_AGF_FLCOUNT); + + pag->pagf_flcount = 0; + pag->pagf_needreset = false; +} + /* * Decide whether to use this allocation group for this allocation. * If so, fix up the btree freelist's size. @@ -2119,6 +2172,9 @@ xfs_alloc_fix_freelist( if (!xfs_alloc_space_available(args, need, flags)) goto out_agbp_relse; + if (pag->pagf_needreset) + xfs_agfl_reset(tp, agbp, pag); + /* * Make the freelist shorter if it's too long. * @@ -2588,6 +2644,7 @@ xfs_alloc_read_agf( pag->pagb_count = 0; pag->pagb_tree = RB_ROOT; pag->pagf_init = 1; + pag->pagf_needreset = !xfs_agf_verify_flcount(mp, agf); } #ifdef DEBUG else if (!XFS_FORCED_SHUTDOWN(mp)) { diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index e0792d036be2..5dd36920b7d6 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -353,6 +353,7 @@ typedef struct xfs_perag { char pagi_inodeok; /* The agi is ok for inodes */ uint8_t pagf_levels[XFS_BTNUM_AGF]; /* # of levels in bno & cnt btree */ + bool pagf_needreset; uint32_t pagf_flcount; /* count of blocks in freelist */ xfs_extlen_t pagf_freeblks; /* total free blocks */ xfs_extlen_t pagf_longest; /* longest free space */ diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 945de08af7ba..678d602dc400 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, __entry->logflags) ); +TRACE_EVENT(xfs_agfl_reset, + TP_PROTO(struct xfs_perag *pag), + TP_ARGS(pag), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(int, flcount) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->flcount = pag->pagf_flcount; + ), + TP_printk("dev %d:%d agno %u flcount %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, __entry->flcount) +); + #endif /* _TRACE_XFS_H */ #undef TRACE_INCLUDE_PATH ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-12 17:35 ` Brian Foster @ 2018-03-12 21:14 ` Dave Chinner 2018-03-13 11:27 ` Brian Foster 2018-03-14 3:07 ` Dave Chiluk 1 sibling, 1 reply; 19+ messages in thread From: Dave Chinner @ 2018-03-12 21:14 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Eric Sandeen On Mon, Mar 12, 2018 at 01:35:53PM -0400, Brian Foster wrote: > On Mon, Mar 12, 2018 at 09:11:58AM -0400, Brian Foster wrote: > > On Fri, Mar 09, 2018 at 02:37:57PM -0800, Darrick J. Wong wrote: > > > On Fri, Mar 09, 2018 at 04:20:31PM -0500, Brian Foster wrote: > > > > On Fri, Mar 09, 2018 at 11:08:32AM -0800, Darrick J. Wong wrote: > > > > > On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > > > > > > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > ... > > > <nod> I've been mulling over rewriting your previous rfc patch that put > > > all the scanning/lifting in {get,put}_freelist but having it reset the > > > agfl instead of doing the swizzling stuff. > > > > > > > Something to be careful of... emptying the agfl obviously means the > > subsequent fixup is a btree block allocation. That may limit the context > > of where the fixup can be performed. IOW, deferring it to > > _get_freelist() might no longer be safe. Instead, I think we'd have to > > implement it such that the on-disk flcount is completely untrusted when > > the mismatch is detected. > > > ... > > Here's a variant of that patch that does a reset. It's definitely much > simpler. Thoughts? I like it - it is so much simpler than the other proposals, and it's done on demand rather than by a brute-force mount/unmount scan. > Brian > > --- 8< --- > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index c02781a4c091..7d313bb4677d 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -2053,6 +2053,59 @@ xfs_alloc_space_available( > return true; > } > > +static bool > +xfs_agf_verify_flcount( > + struct xfs_mount *mp, > + struct xfs_agf *agf) Needs a comment explaining what it's checking and the return value. Might actually read better as "xfs_agf_need_flcount_reset()" returning true if fixup is required, especially at the caller site... > +{ > + int f = be32_to_cpu(agf->agf_flfirst); > + int l = be32_to_cpu(agf->agf_fllast); > + int c = be32_to_cpu(agf->agf_flcount); These should probably be uint32_t - they are unsigned on disk. > + int active = c; > + int agfl_size = XFS_AGFL_SIZE(mp); > + > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > + return true; > + > + if (c && l >= f) > + active = l - f + 1; > + else if (c) > + active = agfl_size - f + l + 1; else active = 0; To move all the initialisation of active into the one logic statement and not make it dependent on the original value in the AGF? > + if (active != c) > + return false; > + if (f >= agfl_size || l >= agfl_size) > + return false; Shouldn't we be range checking these first? Also should probably checking c the same way. > + > + return true; > +} > + > +static void > +xfs_agfl_reset( > + struct xfs_trans *tp, > + struct xfs_buf *agbp, > + struct xfs_perag *pag) > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > + > + if (!pag->pagf_needreset) > + return; > + > + trace_xfs_agfl_reset(pag); > + xfs_warn(mp, "agfl reset agno %u flcount %d", pag->pag_agno, > + pag->pagf_flcount); > + > + agf->agf_flfirst = 0; > + agf->agf_fllast = cpu_to_be32(XFS_AGFL_SIZE(mp) - 1); > + agf->agf_flcount = 0; > + xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLLAST | > + XFS_AGF_FLCOUNT); > + > + pag->pagf_flcount = 0; > + pag->pagf_needreset = false; > +} > + > /* > * Decide whether to use this allocation group for this allocation. > * If so, fix up the btree freelist's size. > @@ -2119,6 +2172,9 @@ xfs_alloc_fix_freelist( > if (!xfs_alloc_space_available(args, need, flags)) > goto out_agbp_relse; > > + if (pag->pagf_needreset) > + xfs_agfl_reset(tp, agbp, pag); > + > /* > * Make the freelist shorter if it's too long. > * > @@ -2588,6 +2644,7 @@ xfs_alloc_read_agf( > pag->pagb_count = 0; > pag->pagb_tree = RB_ROOT; > pag->pagf_init = 1; > + pag->pagf_needreset = !xfs_agf_verify_flcount(mp, agf); > } > #ifdef DEBUG > else if (!XFS_FORCED_SHUTDOWN(mp)) { > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index e0792d036be2..5dd36920b7d6 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > char pagi_inodeok; /* The agi is ok for inodes */ > uint8_t pagf_levels[XFS_BTNUM_AGF]; > /* # of levels in bno & cnt btree */ > + bool pagf_needreset; > uint32_t pagf_flcount; /* count of blocks in freelist */ > xfs_extlen_t pagf_freeblks; /* total free blocks */ > xfs_extlen_t pagf_longest; /* longest free space */ > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index 945de08af7ba..678d602dc400 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > __entry->logflags) > ); > > +TRACE_EVENT(xfs_agfl_reset, > + TP_PROTO(struct xfs_perag *pag), > + TP_ARGS(pag), > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(xfs_agnumber_t, agno) > + __field(int, flcount) > + ), > + TP_fast_assign( > + __entry->dev = pag->pag_mount->m_super->s_dev; > + __entry->agno = pag->pag_agno; > + __entry->flcount = pag->pagf_flcount; > + ), > + TP_printk("dev %d:%d agno %u flcount %d", > + MAJOR(__entry->dev), MINOR(__entry->dev), > + __entry->agno, __entry->flcount) > +); Do we need a new tracepoint? Wouldn't it be better to just call trace_xfs_agf() which will dump all the information in the AGF just before we do the reset? We'll know it's a reset from the caller information that is dumped by that tracepoint.... > + > #endif /* _TRACE_XFS_H */ > > #undef TRACE_INCLUDE_PATH > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-12 21:14 ` Dave Chinner @ 2018-03-13 11:27 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2018-03-13 11:27 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Eric Sandeen On Tue, Mar 13, 2018 at 08:14:55AM +1100, Dave Chinner wrote: > On Mon, Mar 12, 2018 at 01:35:53PM -0400, Brian Foster wrote: > > On Mon, Mar 12, 2018 at 09:11:58AM -0400, Brian Foster wrote: > > > On Fri, Mar 09, 2018 at 02:37:57PM -0800, Darrick J. Wong wrote: > > > > On Fri, Mar 09, 2018 at 04:20:31PM -0500, Brian Foster wrote: > > > > > On Fri, Mar 09, 2018 at 11:08:32AM -0800, Darrick J. Wong wrote: > > > > > > On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > > > > > > > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > > ... > > > > <nod> I've been mulling over rewriting your previous rfc patch that put > > > > all the scanning/lifting in {get,put}_freelist but having it reset the > > > > agfl instead of doing the swizzling stuff. > > > > > > > > > > Something to be careful of... emptying the agfl obviously means the > > > subsequent fixup is a btree block allocation. That may limit the context > > > of where the fixup can be performed. IOW, deferring it to > > > _get_freelist() might no longer be safe. Instead, I think we'd have to > > > implement it such that the on-disk flcount is completely untrusted when > > > the mismatch is detected. > > > > > ... > > > > Here's a variant of that patch that does a reset. It's definitely much > > simpler. Thoughts? > > I like it - it is so much simpler than the other proposals, and it's > done on demand rather than by a brute-force mount/unmount scan. > > > Brian > > > > --- 8< --- > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > index c02781a4c091..7d313bb4677d 100644 > > --- a/fs/xfs/libxfs/xfs_alloc.c > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > @@ -2053,6 +2053,59 @@ xfs_alloc_space_available( > > return true; > > } > > > > +static bool > > +xfs_agf_verify_flcount( > > + struct xfs_mount *mp, > > + struct xfs_agf *agf) > > Needs a comment explaining what it's checking and the return value. > Might actually read better as "xfs_agf_need_flcount_reset()" > returning true if fixup is required, especially at the caller > site... > I changed it from something like that (see the original patch), but more just because I expected the simpler logic would now cover more general things than just the padding issue rather than any technical reason. The _need_reset() name probably accomplishes the same thing so I'll go back to the earlier logic. I plan to add some actual comments and fix up the warning message into something more usable. > > +{ > > + int f = be32_to_cpu(agf->agf_flfirst); > > + int l = be32_to_cpu(agf->agf_fllast); > > + int c = be32_to_cpu(agf->agf_flcount); > > These should probably be uint32_t - they are unsigned on disk. > Ok. > > + int active = c; > > + int agfl_size = XFS_AGFL_SIZE(mp); > > + > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > + return true; > > + > > + if (c && l >= f) > > + active = l - f + 1; > > + else if (c) > > + active = agfl_size - f + l + 1; > > else > active = 0; > > To move all the initialisation of active into the one logic > statement and not make it dependent on the original value in the > AGF? > I think that should work. > > + if (active != c) > > + return false; > > + if (f >= agfl_size || l >= agfl_size) > > + return false; > > Shouldn't we be range checking these first? Also should probably > checking c the same way. > I'll move the l/f checks to before the active calculation. We can also add a check for c > agfl_size but note that technically these are superfluous checks because they are checked in the read verifier. The f/l range checks were there to cover the case of a packed -> unpacked conversion. That would also require a downstream read verifier tweak to handle a +1 sized valid agfl range. The verifier tweak is not included here because going backwards as such is not something we want to support upstream. IOW, the read verifier dictates a subset of general agfl corruption that this fixup is allowed to handle. All that said, the extra checks don't hurt and it's probably smart to consistently check the fields we're going to feed into calculations. I'll add a count check with a comment for added context. > > + > > + return true; > > +} > > + ... > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > index 945de08af7ba..678d602dc400 100644 > > --- a/fs/xfs/xfs_trace.h > > +++ b/fs/xfs/xfs_trace.h > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > __entry->logflags) > > ); > > > > +TRACE_EVENT(xfs_agfl_reset, > > + TP_PROTO(struct xfs_perag *pag), > > + TP_ARGS(pag), > > + TP_STRUCT__entry( > > + __field(dev_t, dev) > > + __field(xfs_agnumber_t, agno) > > + __field(int, flcount) > > + ), > > + TP_fast_assign( > > + __entry->dev = pag->pag_mount->m_super->s_dev; > > + __entry->agno = pag->pag_agno; > > + __entry->flcount = pag->pagf_flcount; > > + ), > > + TP_printk("dev %d:%d agno %u flcount %d", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + __entry->agno, __entry->flcount) > > +); > > Do we need a new tracepoint? Wouldn't it be better to just > call trace_xfs_agf() which will dump all the information in the AGF > just before we do the reset? We'll know it's a reset from the caller > information that is dumped by that tracepoint.... > Yeah, I missed that tracepoint. I may see if I can turn that into an event class so we can still use a unique name. trace_xfs_agf() looks like it refers to logging the agf which is already triggered by this path after we reset the agfl pointers. Otherwise I'll just add a pre-reset call so we have the original field values. Thanks for the feedback.. Brian > > + > > #endif /* _TRACE_XFS_H */ > > > > #undef TRACE_INCLUDE_PATH > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-12 17:35 ` Brian Foster 2018-03-12 21:14 ` Dave Chinner @ 2018-03-14 3:07 ` Dave Chiluk 2018-03-14 11:02 ` Brian Foster 1 sibling, 1 reply; 19+ messages in thread From: Dave Chiluk @ 2018-03-14 3:07 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Eric Sandeen On Mon, Mar 12, 2018 at 12:35 PM, Brian Foster <bfoster@redhat.com> wrote: > > On Mon, Mar 12, 2018 at 09:11:58AM -0400, Brian Foster wrote: > Here's a variant of that patch that does a reset. It's definitely much > simpler. Thoughts? > > Brian > > --- 8< --- > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index c02781a4c091..7d313bb4677d 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -2053,6 +2053,59 @@ xfs_alloc_space_available( > return true; > } > > +static bool > +xfs_agf_verify_flcount( > + struct xfs_mount *mp, > + struct xfs_agf *agf) > +{ > + int f = be32_to_cpu(agf->agf_flfirst); > + int l = be32_to_cpu(agf->agf_fllast); > + int c = be32_to_cpu(agf->agf_flcount); > + int active = c; > + int agfl_size = XFS_AGFL_SIZE(mp); > + > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > + return true; > + > + if (c && l >= f) > + active = l - f + 1; > + else if (c) > + active = agfl_size - f + l + 1; > + > + if (active != c) > + return false; > + if (f >= agfl_size || l >= agfl_size) > + return false; > + > + return true; > +} > + > +static void > +xfs_agfl_reset( > + struct xfs_trans *tp, > + struct xfs_buf *agbp, > + struct xfs_perag *pag) > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > + > + if (!pag->pagf_needreset) > + return; > + > + trace_xfs_agfl_reset(pag); > + xfs_warn(mp, "agfl reset agno %u flcount %d", pag->pag_agno, > + pag->pagf_flcount); > + Before completely leaking the entirety of the agfl couldn't we nicely release and recover all blocks but the 119th first? That way we'd only be leaking the possibly problematic 119th item? I understand we would lose the benefit of being able to recover from otherwise corrupt AGFLs. If we are going to blindly leak blocks wouldn't an xfs_repair recover these leaked blocks? I think it would be perfectly fine to leak these blocks if it means not crashing and then recover them at one's convenience with an xfs_repair. > + agf->agf_flfirst = 0; > + agf->agf_fllast = cpu_to_be32(XFS_AGFL_SIZE(mp) - 1); > + agf->agf_flcount = 0; Also I was under the impression that we should pre-allocate blocks in the agfl for fast allocation of free b+tree nodes. Wouldn't we want to pre-allocate some blocks as would be done by xfs_repair (I have a feeling someone is going to tell me where this happens elsewhere in the codebase or can be handled at block run time with little ill effect)? If I'm correct in either case I'd appreciate a Reviewed by: Dave Chiluk <chiluk+linuxxfs@indeed.com> Thanks, Dave > + xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLLAST | > + XFS_AGF_FLCOUNT); > + > + pag->pagf_flcount = 0; > + pag->pagf_needreset = false; > +} > + > /* ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-14 3:07 ` Dave Chiluk @ 2018-03-14 11:02 ` Brian Foster 2018-03-14 15:28 ` Darrick J. Wong 0 siblings, 1 reply; 19+ messages in thread From: Brian Foster @ 2018-03-14 11:02 UTC (permalink / raw) To: Dave Chiluk; +Cc: Darrick J. Wong, linux-xfs, Eric Sandeen On Tue, Mar 13, 2018 at 10:07:27PM -0500, Dave Chiluk wrote: > On Mon, Mar 12, 2018 at 12:35 PM, Brian Foster <bfoster@redhat.com> wrote: > > > > On Mon, Mar 12, 2018 at 09:11:58AM -0400, Brian Foster wrote: > > Here's a variant of that patch that does a reset. It's definitely much > > simpler. Thoughts? > > > > Brian > > > > --- 8< --- > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > index c02781a4c091..7d313bb4677d 100644 > > --- a/fs/xfs/libxfs/xfs_alloc.c > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > @@ -2053,6 +2053,59 @@ xfs_alloc_space_available( > > return true; > > } > > > > +static bool > > +xfs_agf_verify_flcount( > > + struct xfs_mount *mp, > > + struct xfs_agf *agf) > > +{ > > + int f = be32_to_cpu(agf->agf_flfirst); > > + int l = be32_to_cpu(agf->agf_fllast); > > + int c = be32_to_cpu(agf->agf_flcount); > > + int active = c; > > + int agfl_size = XFS_AGFL_SIZE(mp); > > + > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > + return true; > > + > > + if (c && l >= f) > > + active = l - f + 1; > > + else if (c) > > + active = agfl_size - f + l + 1; > > + > > + if (active != c) > > + return false; > > + if (f >= agfl_size || l >= agfl_size) > > + return false; > > + > > + return true; > > +} > > + > > +static void > > +xfs_agfl_reset( > > + struct xfs_trans *tp, > > + struct xfs_buf *agbp, > > + struct xfs_perag *pag) > > +{ > > + struct xfs_mount *mp = tp->t_mountp; > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > + > > + if (!pag->pagf_needreset) > > + return; > > + > > + trace_xfs_agfl_reset(pag); > > + xfs_warn(mp, "agfl reset agno %u flcount %d", pag->pag_agno, > > + pag->pagf_flcount); > > + > > Before completely leaking the entirety of the agfl couldn't we nicely > release and recover all blocks but the 119th first? That way we'd > only be leaking the possibly problematic 119th item? I understand we > would lose the benefit of being able to recover from otherwise corrupt > AGFLs. > This is mostly covered in the discussions over the previously explored methods. The synopsis is that yes, we could try to do something like that, but the point of this approach is that we don't have to trust the agfl content at all. This simplifies and genericizes the logic because every kernel already knows how to populate a sane agfl from an empty one. > If we are going to blindly leak blocks wouldn't an xfs_repair recover > these leaked blocks? I think it would be perfectly fine to leak these > blocks if it means not crashing and then recover them at one's > convenience with an xfs_repair. > Yes, an xfs_repair is necessary. An xfs_repair was already necessary to fix the padding mismatch or whatever else might have been wrong. This changes the side effect of the problem from a crash into a free space accounting inconsistency. > > + agf->agf_flfirst = 0; > > + agf->agf_fllast = cpu_to_be32(XFS_AGFL_SIZE(mp) - 1); > > + agf->agf_flcount = 0; > > Also I was under the impression that we should pre-allocate blocks in > the agfl for fast allocation of free b+tree nodes. Wouldn't we want > to pre-allocate some blocks as would be done by xfs_repair (I have a > feeling someone is going to tell me where this happens elsewhere in > the codebase or can be handled at block run time with little ill > effect)? > The function that calls the reset (xfs_alloc_fix_freelist()) will repopulate it once it sees that it is empty. > If I'm correct in either case I'd appreciate a > Reviewed by: Dave Chiluk <chiluk+linuxxfs@indeed.com> > I'm going to defer this until posting a legitimate patch because it has changed a bit (though not fundamentally). This post was more of a first pass to sanity check the idea. I'd appreciate another look once a legitimate v1 is posted.. thanks! Brian > Thanks, > Dave > > > > + xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLLAST | > > + XFS_AGF_FLCOUNT); > > + > > + pag->pagf_flcount = 0; > > + pag->pagf_needreset = false; > > +} > > + > > /* > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-14 11:02 ` Brian Foster @ 2018-03-14 15:28 ` Darrick J. Wong 0 siblings, 0 replies; 19+ messages in thread From: Darrick J. Wong @ 2018-03-14 15:28 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chiluk, linux-xfs, Eric Sandeen On Wed, Mar 14, 2018 at 07:02:33AM -0400, Brian Foster wrote: > On Tue, Mar 13, 2018 at 10:07:27PM -0500, Dave Chiluk wrote: > > On Mon, Mar 12, 2018 at 12:35 PM, Brian Foster <bfoster@redhat.com> wrote: > > > > > > On Mon, Mar 12, 2018 at 09:11:58AM -0400, Brian Foster wrote: > > > Here's a variant of that patch that does a reset. It's definitely much > > > simpler. Thoughts? > > > > > > Brian > > > > > > --- 8< --- > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > index c02781a4c091..7d313bb4677d 100644 > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > @@ -2053,6 +2053,59 @@ xfs_alloc_space_available( > > > return true; > > > } > > > > > > +static bool > > > +xfs_agf_verify_flcount( > > > + struct xfs_mount *mp, > > > + struct xfs_agf *agf) > > > +{ > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > + int l = be32_to_cpu(agf->agf_fllast); > > > + int c = be32_to_cpu(agf->agf_flcount); > > > + int active = c; > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > + > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > + return true; > > > + > > > + if (c && l >= f) > > > + active = l - f + 1; > > > + else if (c) > > > + active = agfl_size - f + l + 1; > > > + > > > + if (active != c) > > > + return false; > > > + if (f >= agfl_size || l >= agfl_size) > > > + return false; > > > + > > > + return true; > > > +} > > > + > > > +static void > > > +xfs_agfl_reset( > > > + struct xfs_trans *tp, > > > + struct xfs_buf *agbp, > > > + struct xfs_perag *pag) > > > +{ > > > + struct xfs_mount *mp = tp->t_mountp; > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > + > > > + if (!pag->pagf_needreset) > > > + return; > > > + > > > + trace_xfs_agfl_reset(pag); > > > + xfs_warn(mp, "agfl reset agno %u flcount %d", pag->pag_agno, > > > + pag->pagf_flcount); > > > + > > > > Before completely leaking the entirety of the agfl couldn't we nicely > > release and recover all blocks but the 119th first? That way we'd > > only be leaking the possibly problematic 119th item? I understand we > > would lose the benefit of being able to recover from otherwise corrupt > > AGFLs. > > > > This is mostly covered in the discussions over the previously explored > methods. The synopsis is that yes, we could try to do something like > that, but the point of this approach is that we don't have to trust the > agfl content at all. This simplifies and genericizes the logic because > every kernel already knows how to populate a sane agfl from an empty > one. > > > If we are going to blindly leak blocks wouldn't an xfs_repair recover > > these leaked blocks? I think it would be perfectly fine to leak these > > blocks if it means not crashing and then recover them at one's > > convenience with an xfs_repair. > > > > Yes, an xfs_repair is necessary. An xfs_repair was already necessary to > fix the padding mismatch or whatever else might have been wrong. This > changes the side effect of the problem from a crash into a free space > accounting inconsistency. > > > > + agf->agf_flfirst = 0; > > > + agf->agf_fllast = cpu_to_be32(XFS_AGFL_SIZE(mp) - 1); > > > + agf->agf_flcount = 0; > > > > Also I was under the impression that we should pre-allocate blocks in > > the agfl for fast allocation of free b+tree nodes. Wouldn't we want > > to pre-allocate some blocks as would be done by xfs_repair (I have a > > feeling someone is going to tell me where this happens elsewhere in > > the codebase or can be handled at block run time with little ill > > effect)? > > > > The function that calls the reset (xfs_alloc_fix_freelist()) will > repopulate it once it sees that it is empty. > > > If I'm correct in either case I'd appreciate a > > Reviewed by: Dave Chiluk <chiluk+linuxxfs@indeed.com> > > > > I'm going to defer this until posting a legitimate patch because it has > changed a bit (though not fundamentally). This post was more of a first > pass to sanity check the idea. I'd appreciate another look once a > legitimate v1 is posted.. thanks! While you're at it, I already applied the xfs_agfl_size decapitalization to for-next, so please do that here too. --D > Brian > > > Thanks, > > Dave > > > > > > > + xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLLAST | > > > + XFS_AGF_FLCOUNT); > > > + > > > + pag->pagf_flcount = 0; > > > + pag->pagf_needreset = false; > > > +} > > > + > > > /* > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 18:37 ` Brian Foster 2018-03-09 19:08 ` Darrick J. Wong @ 2018-03-09 22:10 ` Dave Chinner 2018-03-12 13:26 ` Brian Foster 1 sibling, 1 reply; 19+ messages in thread From: Dave Chinner @ 2018-03-09 22:10 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > > On Fri, Mar 09, 2018 at 08:16:28AM -0500, Brian Foster wrote: > > > On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > > > > Hi, > > > > > > > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > > > > Disliked-by: Brian Foster <bfoster@redhat.com> > > > > > --- > > > > > > > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > > > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > > > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > > > > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > > > > actual corruption for a padding mismatch as much as possible. E.g., > > > > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > > > > corruption that looks like an agfl with 120 entries on a packed kernel, > > > > > etc. > > > > > > > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > > > > that in either case it's impossible to completely eliminate the risk of > > > > > confusing corruption with a padding mismatch so long as we're doing a > > > > > manual agfl fixup. The more I think about that the more I really dislike > > > > > doing this. :( > > > > > > > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > > > > existence of 'xfs_repair -d' is a good enough last resort for those > > > > > users who might be bit by unexpected padding issues on a typical > > > > > upgrade. If so, we could fall back to a safer mount-time detection model > > > > > that enforces a read-only mount and let the user run repair. The > > > > > supposition is that those who aren't prepared to repair via a ramdisk or > > > > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > > > > read-only provided agfl padding is the only inconsistency. > > > > > > > > > > Eric points out that we can still write an unmount record for a > > > > > read-only mount, but I'm not sure that would be a problem if repair only > > > > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > > > > a recovery issue or it needs to be reformatted to update the LSN, both > > > > > of which seem to qualify as "you have more problems than agfl padding > > > > > and need to run repair anyways" to me. Thoughts? > > > > > > > > > > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > > > or at least me learning something new. > > > > > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > > > Merge it together in a 64bit field when reading it from disk, or split it when > > > > writing to? > > > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > > > bit systems, and avoid such risk of confusion when trying to discern between a > > > > corrupted agfl and a padding mismatch. > > > > > > > > > > I'm not following how you'd accomplish the latter..? We already have the > > > packed attribute in place, so the padding is fixed with that. This > > > effort has to do with trying to fix up an agfl written by an older > > > kernel without the padding fix. My understanding is that the xfs_agfl > > > header looks exactly the same on-disk in either case, the issue is a > > > broken size calculation that causes the older kernel to not see/use one > > > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > > > the same on-disk structure, it has no means to know whether the content > > > of the final slot is a valid block or a "gap" left by an older kernel > > > other than to check whether flcount matches the active count from > > > flfirst -> fllast (and that's where potential confusion over a padding > > > issue vs other corruption comes into play). > > > > Your understanding is correct. > > > > Sez me, who is watching the fsck.xfs -f discussion just in case that can > > be turned into a viable option quickly. > > > > Thanks. > > > ..and wondering what if we /did/ just implement Dave's suggestion from > > years ago where if the flcount doesn't match we just reset the agfl and > > call fix_freelist to refill it with fresh blocks... it would suck to > > leak blocks, though. Obviously, if the fs has rmap enabled then we can > > just rebuild it on the spot via xfs_repair_agfl() (see patch on mailing > > list). > > > > I wasn't initially too fond of this idea, but thinking about it a bit > more, there is definitely some value in terms of determinism. We know > we'll just leak some blocks vs. riskily swizzling around a corrupted > agfl. And, most importantly: it's trivial to backport to other kernels. The we simply don't have to worry where the filesystem has been mounted - if we detect a suspect situation for the running kernel, we just let go of the free list and rebuild it. > Given that, perhaps an on-demand reset with a "Corrupted AGFL, tossed > $flcount blocks, unmount and run xfs_repair if you ever want to see them > again" warning might not be so bad. It works the same whether the kernel > is packed, unpacked or the agfl is just borked, and therefore the test > matrix is much simpler as well. Hmmmm... Yup, exactly my thoughts. The more I read about the hoops we're considering jumping through to work around this problem, the more I like this solution.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 22:10 ` Dave Chinner @ 2018-03-12 13:26 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2018-03-12 13:26 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs On Sat, Mar 10, 2018 at 09:10:45AM +1100, Dave Chinner wrote: > On Fri, Mar 09, 2018 at 01:37:28PM -0500, Brian Foster wrote: > > On Fri, Mar 09, 2018 at 09:33:18AM -0800, Darrick J. Wong wrote: > > > On Fri, Mar 09, 2018 at 08:16:28AM -0500, Brian Foster wrote: > > > > On Thu, Mar 08, 2018 at 03:03:54PM +0100, Carlos Maiolino wrote: > > > > > Hi, > > > > > > > > > > On Wed, Mar 07, 2018 at 02:24:51PM -0500, Brian Foster wrote: > > > > > > Disliked-by: Brian Foster <bfoster@redhat.com> > > > > > > --- > > > > > > > > > > > > Sent as RFC for the time being. This tests Ok on a straight xfstests run > > > > > > and also seems to pass Darrick's agfl fixup tester (thanks) both on > > > > > > upstream and on a rhel7 kernel with some minor supporting hacks. > > > > > > > > > > > > I tried to tighten up the logic a bit to reduce the odds of mistaking > > > > > > actual corruption for a padding mismatch as much as possible. E.g., > > > > > > limit to cases where the agfl is wrapped, make sure we don't mistake a > > > > > > corruption that looks like an agfl with 120 entries on a packed kernel, > > > > > > etc. > > > > > > > > > > > > While I do prefer an on-demand fixup approach to a mount time scan, ISTM > > > > > > that in either case it's impossible to completely eliminate the risk of > > > > > > confusing corruption with a padding mismatch so long as we're doing a > > > > > > manual agfl fixup. The more I think about that the more I really dislike > > > > > > doing this. :( > > > > > > > > > > > > After some IRC discussion with djwong and sandeen, I'm wondering if the > > > > > > existence of 'xfs_repair -d' is a good enough last resort for those > > > > > > users who might be bit by unexpected padding issues on a typical > > > > > > upgrade. If so, we could fall back to a safer mount-time detection model > > > > > > that enforces a read-only mount and let the user run repair. The > > > > > > supposition is that those who aren't prepared to repair via a ramdisk or > > > > > > whatever should be able to 'xfs_repair -d' a rootfs that is mounted > > > > > > read-only provided agfl padding is the only inconsistency. > > > > > > > > > > > > Eric points out that we can still write an unmount record for a > > > > > > read-only mount, but I'm not sure that would be a problem if repair only > > > > > > needs to fix the agfl. xfs_repair shouldn't touch the log unless there's > > > > > > a recovery issue or it needs to be reformatted to update the LSN, both > > > > > > of which seem to qualify as "you have more problems than agfl padding > > > > > > and need to run repair anyways" to me. Thoughts? > > > > > > > > > > > > > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > > > > or at least me learning something new. > > > > > > > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > > > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > > > > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > > > > Merge it together in a 64bit field when reading it from disk, or split it when > > > > > writing to? > > > > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > > > > bit systems, and avoid such risk of confusion when trying to discern between a > > > > > corrupted agfl and a padding mismatch. > > > > > > > > > > > > > I'm not following how you'd accomplish the latter..? We already have the > > > > packed attribute in place, so the padding is fixed with that. This > > > > effort has to do with trying to fix up an agfl written by an older > > > > kernel without the padding fix. My understanding is that the xfs_agfl > > > > header looks exactly the same on-disk in either case, the issue is a > > > > broken size calculation that causes the older kernel to not see/use one > > > > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > > > > the same on-disk structure, it has no means to know whether the content > > > > of the final slot is a valid block or a "gap" left by an older kernel > > > > other than to check whether flcount matches the active count from > > > > flfirst -> fllast (and that's where potential confusion over a padding > > > > issue vs other corruption comes into play). > > > > > > Your understanding is correct. > > > > > > Sez me, who is watching the fsck.xfs -f discussion just in case that can > > > be turned into a viable option quickly. > > > > > > > Thanks. > > > > > ..and wondering what if we /did/ just implement Dave's suggestion from > > > years ago where if the flcount doesn't match we just reset the agfl and > > > call fix_freelist to refill it with fresh blocks... it would suck to > > > leak blocks, though. Obviously, if the fs has rmap enabled then we can > > > just rebuild it on the spot via xfs_repair_agfl() (see patch on mailing > > > list). > > > > > > > I wasn't initially too fond of this idea, but thinking about it a bit > > more, there is definitely some value in terms of determinism. We know > > we'll just leak some blocks vs. riskily swizzling around a corrupted > > agfl. > > And, most importantly: it's trivial to backport to other kernels. > The we simply don't have to worry where the filesystem has been > mounted - if we detect a suspect situation for the running kernel, > we just let go of the free list and rebuild it. > Yeah.. an "agfl is corrupted" condition is much more simple to reason about in the various situations and the fixup is generic. > > Given that, perhaps an on-demand reset with a "Corrupted AGFL, tossed > > $flcount blocks, unmount and run xfs_repair if you ever want to see them > > again" warning might not be so bad. It works the same whether the kernel > > is packed, unpacked or the agfl is just borked, and therefore the test > > matrix is much simpler as well. Hmmmm... > > Yup, exactly my thoughts. The more I read about the hoops we're > considering jumping through to work around this problem, the more I > like this solution.... > I think I agree. We're essentially broadening the scope of the fixup mechanism to cover runtime abandonment of a corrupted agfl. The padding issue is just a particular variant of corruption that we happened to cause ourselves and thus motivated a runtime fixup. In all such cases, a corrupted agfl leads to fs corruption, an eventual crash or leaking blocks, so throwing away some number of blocks intentionally isn't much worse in the best case (and is distinctly better in the common case). Thanks for the idea and feedback. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand 2018-03-09 13:16 ` Brian Foster 2018-03-09 17:33 ` Darrick J. Wong @ 2018-03-12 13:29 ` Carlos Maiolino 1 sibling, 0 replies; 19+ messages in thread From: Carlos Maiolino @ 2018-03-12 13:29 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs > > > > Sorry if this may sound stupid, but in the possibility this can help the issue, > > or at least me learning something new. > > > > ISTM this issue is all related to the way xfs_agfl packing. I read the commit > > log where packed attribute was added to xfs_agfl, and I was wondering... > > > > What are the implications of breaking up the lsn field in xfs_agfl, in 2 __be32? > > Merge it together in a 64bit field when reading it from disk, or split it when > > writing to? > > It seems to me this would avoid the size difference we are seeing now in 32/64 > > bit systems, and avoid such risk of confusion when trying to discern between a > > corrupted agfl and a padding mismatch. > > > > I'm not following how you'd accomplish the latter..? We already have the > packed attribute in place, so the padding is fixed with that. This > effort has to do with trying to fix up an agfl written by an older > kernel without the padding fix. My understanding is that the xfs_agfl > header looks exactly the same on-disk in either case, the issue is a > broken size calculation that causes the older kernel to not see/use one > last slot in the agfl. If the agfl has wrapped and a newer kernel loads > the same on-disk structure, it has no means to know whether the content > of the final slot is a valid block or a "gap" left by an older kernel > other than to check whether flcount matches the active count from > flfirst -> fllast (and that's where potential confusion over a padding > issue vs other corruption comes into play). > Nevermind, I missed some context and was misinterpreting the overall issue. Sorry the noise. > Brian > > > But still, just as reinforcement, this is just a guess, and I have no idea if > > this is feasible or not, but in case I'm completely nuts, I'll learn something > > new :) > > > > Cheers. > > > > > Brian > > > > > > fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++- > > > fs/xfs/xfs_mount.h | 1 + > > > fs/xfs/xfs_trace.h | 18 ++++++ > > > 3 files changed, 164 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > index c02781a4c091..31330996e31c 100644 > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > @@ -2054,6 +2054,136 @@ xfs_alloc_space_available( > > > } > > > > > > /* > > > + * Estimate the on-disk agfl size based on the agf state. A size mismatch due to > > > + * padding is only significant if the agfl wraps around the end or refers to an > > > + * invalid first/last value. > > > + */ > > > +static int > > > +xfs_agfl_ondisk_size( > > > + struct xfs_mount *mp, > > > + int first, > > > + int last, > > > + int count) > > > +{ > > > + int active = count; > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > + bool wrapped = (first > last) ? true : false; > > > + > > > + if (count && last >= first) > > > + active = last - first + 1; > > > + else if (count) > > > + active = agfl_size - first + last + 1; > > > + > > > + if (wrapped && active == count + 1) > > > + agfl_size--; > > > + else if ((wrapped && active == count - 1) || > > > + first == agfl_size || last == agfl_size) > > > + agfl_size++; > > > + > > > + /* > > > + * We can't discern the packing problem from certain forms of corruption > > > + * that may look exactly the same. To minimize the chance of mistaking > > > + * corruption for a size mismatch, clamp the size to known valid values. > > > + * A packed header agfl has 119 entries and the older unpacked format > > > + * has one less. > > > + */ > > > + if (agfl_size < 118 || agfl_size > 119) > > > + agfl_size = XFS_AGFL_SIZE(mp); > > > + > > > + return agfl_size; > > > +} > > > + > > > +static bool > > > +xfs_agfl_need_padfix( > > > + struct xfs_mount *mp, > > > + struct xfs_agf *agf) > > > +{ > > > + int f = be32_to_cpu(agf->agf_flfirst); > > > + int l = be32_to_cpu(agf->agf_fllast); > > > + int c = be32_to_cpu(agf->agf_flcount); > > > + > > > + if (!xfs_sb_version_hascrc(&mp->m_sb)) > > > + return false; > > > + > > > + return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp); > > > +} > > > + > > > +static int > > > +xfs_agfl_check_padfix( > > > + struct xfs_trans *tp, > > > + struct xfs_buf *agbp, > > > + struct xfs_buf *agflbp, > > > + struct xfs_perag *pag) > > > +{ > > > + struct xfs_mount *mp = tp->t_mountp; > > > + struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp); > > > + __be32 *agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp); > > > + int agfl_size = XFS_AGFL_SIZE(mp); > > > + int ofirst, olast, osize; > > > + int nfirst, nlast; > > > + int logflags = 0; > > > + int startoff = 0; > > > + > > > + if (!pag->pagf_needpadfix) > > > + return 0; > > > + > > > + ofirst = nfirst = be32_to_cpu(agf->agf_flfirst); > > > + olast = nlast = be32_to_cpu(agf->agf_fllast); > > > + osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount); > > > + > > > + /* > > > + * If the on-disk agfl is smaller than what the kernel expects, the > > > + * last slot of the on-disk agfl is a gap with bogus data. Move the > > > + * first valid block into the gap and bump the pointer. > > > + */ > > > + if (osize < agfl_size) { > > > + ASSERT(pag->pagf_flcount != 0); > > > + agfl_bno[agfl_size - 1] = agfl_bno[ofirst]; > > > + startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr; > > > + nfirst++; > > > + goto done; > > > + } > > > + > > > + /* > > > + * Otherwise, the on-disk agfl is larger than what the current kernel > > > + * expects. If empty, just fix up the first and last pointers. If not, > > > + * move the inaccessible block to the end of the valid range. > > > + */ > > > + nfirst = do_mod(nfirst, agfl_size); > > > + if (pag->pagf_flcount == 0) { > > > + nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1); > > > + goto done; > > > + } > > > + if (nlast != agfl_size) > > > + nlast++; > > > + nlast = do_mod(nlast, agfl_size); > > > + agfl_bno[nlast] = agfl_bno[osize - 1]; > > > + startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr; > > > + > > > +done: > > > + if (nfirst != ofirst) { > > > + agf->agf_flfirst = cpu_to_be32(nfirst); > > > + logflags |= XFS_AGF_FLFIRST; > > > + } > > > + if (nlast != olast) { > > > + agf->agf_fllast = cpu_to_be32(nlast); > > > + logflags |= XFS_AGF_FLLAST; > > > + } > > > + if (startoff) { > > > + xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF); > > > + xfs_trans_log_buf(tp, agflbp, startoff, > > > + startoff + sizeof(xfs_agblock_t) - 1); > > > + } > > > + if (logflags) > > > + xfs_alloc_log_agf(tp, agbp, logflags); > > > + > > > + trace_xfs_agfl_padfix(mp, osize, agfl_size); > > > + pag->pagf_needpadfix = false; > > > + > > > + return 0; > > > +} > > > + > > > +/* > > > * Decide whether to use this allocation group for this allocation. > > > * If so, fix up the btree freelist's size. > > > */ > > > @@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist( > > > if (error) > > > return error; > > > > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > + if (error) { > > > + xfs_perag_put(pag); > > > + return error; > > > + } > > > > > > /* > > > * Get the block number and update the data structures. > > > @@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist( > > > if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp)) > > > agf->agf_flfirst = 0; > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > be32_add_cpu(&agf->agf_flcount, -1); > > > xfs_trans_agflist_delta(tp, -1); > > > pag->pagf_flcount--; > > > @@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist( > > > if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp, > > > be32_to_cpu(agf->agf_seqno), &agflbp))) > > > return error; > > > + > > > + pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > + error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag); > > > + if (error) { > > > + xfs_perag_put(pag); > > > + return error; > > > + } > > > + > > > be32_add_cpu(&agf->agf_fllast, 1); > > > if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp)) > > > agf->agf_fllast = 0; > > > > > > - pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno)); > > > be32_add_cpu(&agf->agf_flcount, 1); > > > xfs_trans_agflist_delta(tp, 1); > > > pag->pagf_flcount++; > > > @@ -2588,6 +2730,7 @@ xfs_alloc_read_agf( > > > pag->pagb_count = 0; > > > pag->pagb_tree = RB_ROOT; > > > pag->pagf_init = 1; > > > + pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf); > > > } > > > #ifdef DEBUG > > > else if (!XFS_FORCED_SHUTDOWN(mp)) { > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > > index e0792d036be2..78a6377a9b38 100644 > > > --- a/fs/xfs/xfs_mount.h > > > +++ b/fs/xfs/xfs_mount.h > > > @@ -353,6 +353,7 @@ typedef struct xfs_perag { > > > char pagi_inodeok; /* The agi is ok for inodes */ > > > uint8_t pagf_levels[XFS_BTNUM_AGF]; > > > /* # of levels in bno & cnt btree */ > > > + bool pagf_needpadfix; > > > uint32_t pagf_flcount; /* count of blocks in freelist */ > > > xfs_extlen_t pagf_freeblks; /* total free blocks */ > > > xfs_extlen_t pagf_longest; /* longest free space */ > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > index 945de08af7ba..c7a3bcd6cc4a 100644 > > > --- a/fs/xfs/xfs_trace.h > > > +++ b/fs/xfs/xfs_trace.h > > > @@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc, > > > __entry->logflags) > > > ); > > > > > > +TRACE_EVENT(xfs_agfl_padfix, > > > + TP_PROTO(struct xfs_mount *mp, int osize, int nsize), > > > + TP_ARGS(mp, osize, nsize), > > > + TP_STRUCT__entry( > > > + __field(dev_t, dev) > > > + __field(int, osize) > > > + __field(int, nsize) > > > + ), > > > + TP_fast_assign( > > > + __entry->dev = mp->m_super->s_dev; > > > + __entry->osize = osize; > > > + __entry->nsize = nsize; > > > + ), > > > + TP_printk("dev %d:%d old size %d new size %d", > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > + __entry->osize, __entry->nsize) > > > +); > > > + > > > #endif /* _TRACE_XFS_H */ > > > > > > #undef TRACE_INCLUDE_PATH > > > -- > > > 2.13.6 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > Carlos > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Carlos ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2018-03-14 15:28 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-03-07 19:24 [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand Brian Foster 2018-03-08 14:03 ` Carlos Maiolino 2018-03-08 14:15 ` Carlos Maiolino 2018-03-09 13:16 ` Brian Foster 2018-03-09 17:33 ` Darrick J. Wong 2018-03-09 18:37 ` Brian Foster 2018-03-09 19:08 ` Darrick J. Wong 2018-03-09 21:20 ` Brian Foster 2018-03-09 22:37 ` Darrick J. Wong 2018-03-12 13:11 ` Brian Foster 2018-03-12 17:35 ` Brian Foster 2018-03-12 21:14 ` Dave Chinner 2018-03-13 11:27 ` Brian Foster 2018-03-14 3:07 ` Dave Chiluk 2018-03-14 11:02 ` Brian Foster 2018-03-14 15:28 ` Darrick J. Wong 2018-03-09 22:10 ` Dave Chinner 2018-03-12 13:26 ` Brian Foster 2018-03-12 13:29 ` Carlos Maiolino
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.