* v5 filesystem corruption due to log recovery lsn ordering
From: Brian Foster @ 2015-08-19 18:39 UTC
  To: xfs

Hi all,

Here's another issue I've run into from recent log recovery testing...

Many on-disk data structures for v5 filesystems have the LSN from the
last modification stamped in the associated header. As of the following
commit, log recovery compares the recovery item LSN against the LSN of
the on-disk structure to avoid restoration of stale contents:

  50d5c8d xfs: check LSN ordering for v5 superblocks during recovery

This presumably addresses some problems where recovery of the stale
contents leads to CRC failure. The problem here is that xfs_repair
clears the log (even when the fs is clean) and resets the current LSN on
the next mount. This creates a situation where logging is ineffective
for any structure that has not yet been modified since the current LSN
was reset.

I'm not quite sure how pervasive this is in practice, but the following
is a corruption reproducer for directory buffers:

- mkfs (-m crc=1), mount and fsstress a filesystem for a bit such that
  the LSN increases a decent amount (e.g., several log cycles or so).

	# cat /sys/fs/xfs/dm-3/log/log_*
	3:9378
	3:9376

- Kill fsstress, create a new directory and populate with some files:

	# mkdir /mnt/dir
	# for i in $(seq 0 999); do touch /mnt/dir/$i; done

- Unmount the fs, run xfs_repair, mount the fs and verify the LSN has
  been reset:

	# cat /sys/fs/xfs/dm-3/log/log_*
	1:2
	1:2

- Remove a file from the previously created directory and immediately
  shutdown the fs, flushing the log:

	# rm -f /mnt/dir/0; ~/xfstests-dev/src/godown -f /mnt/
	# umount /mnt

- Remount the fs to replay the log. Unmount and repair once more:

	# mount <dev> /mnt; umount /mnt
	# xfs_repair -n <dev>
	...
	imap claims in-use inode 3082 is free, would correct imap
	...

... and the filesystem is inconsistent. This occurs because the log
recovery records are tagged with an LSN based on the reset value of
(1:2) and the buffers to be recovered that hadn't yet been rewritten
before the shutdown have an LSN from around the time the fsstress was
stopped. The target buffer is incorrectly seen as "newer" than the
recovery item, and thus recovery of this buffer is skipped.
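
For reference, the check that commit adds boils down to an LSN
comparison along these lines (paraphrased from memory, not the exact
kernel code):

	/*
	 * disk_lsn is the LSN stamped in the on-disk buffer, current_lsn
	 * the LSN of the transaction being replayed. If the on-disk copy
	 * already looks at least as new, replay of the item is skipped.
	 */
	if (disk_lsn != NULLCOMMITLSN &&
	    XFS_LSN_CMP(disk_lsn, current_lsn) >= 0)
		goto out_release;	/* skip replay of this buffer */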

Note that the resulting behavior is not always consistent. I have seen
log recovery ignore the file removal such that the fs is consistent and
the modification is simply lost. The original instance I hit on a
separate machine caused repair to complain about and fix the directory
rather than the imap, but that could have been a repair thing.

The larger question is how to resolve this problem? I don't think this
is something that is ultimately addressed in xfs_repair. Even if we
stopped clearing the log, that doesn't help users who might have had to
forcibly zero the log to recover a filesystem. Another option in theory
might be to unconditionally reset the LSN of everything on disk, but
that sounds like overkill just to preserve the current kernel
workaround.

It sounds more to me that we have to adjust this behavior on the kernel
side. That said, the original commit presumably addresses some log
recovery shutdown problems that we do not want to reintroduce. I haven't
yet wrapped my head around what that original problem was, but I wanted
to get this reported. If the issue was early buffer I/O submission,
perhaps we need a new mechanism to defer this I/O submission until a
point that CRC verification is expected to pass (or otherwise generate a
filesystem error)? Or perhaps do something similar with CRC
verification? Any other thoughts, issues or things I might have missed
here?

Brian


* Re: v5 filesystem corruption due to log recovery lsn ordering
From: Dave Chinner @ 2015-08-19 22:44 UTC
  To: Brian Foster; +Cc: xfs

On Wed, Aug 19, 2015 at 02:39:05PM -0400, Brian Foster wrote:
> Hi all,
> 
> Here's another issue I've run into from recent log recovery testing...
> 
> Many on-disk data structures for v5 filesystems have the LSN from the
> last modification stamped in the associated header. As of the following
> commit, log recovery compares the recovery item LSN against the LSN of
> the on-disk structure to avoid restoration of stale contents:
> 
>   50d5c8d xfs: check LSN ordering for v5 superblocks during recovery
> 
> This presumably addresses some problems where recovery of the stale
> contents leads to CRC failure. The problem here is that xfs_repair
> clears the log (even when the fs is clean) and resets the current LSN on
> the next mount. This creates a situation where logging is ineffective
> for any structure that has not yet been modified since the current LSN
> was reset.

Well, that was a bit of an oversight...

....

> 
> The larger question is how to resolve this problem? I don't think this
> is something that is ultimately addressed in xfs_repair. Even if we
> stopped clearing the log, that doesn't help users who might have had to
> forcibly zero the log to recover a filesystem. Another option in theory
> might be to unconditionally reset the LSN of everything on disk, but
> that sounds like overkill just to preserve the current kernel
> workaround.

Well, it's relatively easy to detect a log that has been zeroed if
the cycle count is more than a cycle or two lower than the LSN in
important metadata, but I'm not sure we can reliably detect that.

> It sounds more to me that we have to adjust this behavior on the kernel
> side. That said, the original commit presumably addresses some log
> recovery shutdown problems that we do not want to reintroduce. I haven't
> yet wrapped my head around what that original problem was, but I wanted
> to get this reported. If the issue was early buffer I/O submission,
> perhaps we need a new mechanism to defer this I/O submission until a
> point that CRC verification is expected to pass (or otherwise generate a
> filesystem error)? Or perhaps do something similar with CRC
> verification? Any other thoughts, issues or things I might have missed
> here?

The issue that the LSN ordering fixes is that of unsynchronised
recovery of different log records that contain the same objects.
e.g. ordering of inode chunk allocation (in buffers) vs inode object
modification (in inode items). v4 filesystems have a serious problem
where inode chunk allocation can be run after the inode item
modifications, resulting in recovery "losing" file size updates that
sync flushed to the log.

i.e. create just the right number of small files, sync, crash and
recovery gives a number of zero length files in certain inode chunks
because the ordering of item recovery was wrong.

Another problem with inode logging is the flushiter field, which is
used to try to avoid replaying changes in the log that have already
been flushed to disk. This could also lead to lost inode
modifications after a sync because the flushiter is reset to zero
after each time the inode item is recovered. This was mostly avoided
by logging all inode modifications and using delayed logging, but
could still occur...

There was a long history of these sorts of problems occurring (I
first analysed the inode allocation/inode item update failure mode
back in 2006), and I found several other possible issues like this
to do with the inode flushiter at the same time. I also suspected
that there were problems with directory recovery due to the same
inode item vs buffer item ordering issues, but could never pin them
down.

So the solution was to record the LSN of the last modification in
every item as it is written to disk, thereby ensuring we knew
exactly what transaction the item was last modified in. This means
we can skip modifications in transaction recovery that are already
on disk.

----

The first thing we need to do is not zero the log in xfs_repair when
the log is clean to minimise future exposure to this issue on
existing systems.

Then, on the kernel side, we need a reliable way to detect that
the log head/tail pointers have been reset in the kernel. This means
we can - at minimum - issue a warning during log recovery that this
has been detected.

Finally, we need to work out how to handle recovery in the situation
that the log has been zeroed and the filesystem has a mix of new and
old, stale LSNs. I think the simplest way around this is not to
handle it in log recovery at all, but to avoid it altogether.

That is, when we find the log head/tail point to a zeroed log, we
pull the current LSN from, say, the superblock (and maybe other
metadata such as AG headers) and initialise the log head/tail to the
cycle number + some offset so that every new transaction is
guaranteed to have a cycle number more recent than any other LSN in
the filesystem and ordering is preserved, even if the log has been
zeroed.

This means dirty log recovery requires no changes at all, we only
need to change xlog_recover() to detect the empty, clean log and
set:

	l_curr_cycle
	l_curr_block
	l_last_sync_lsn
	l_tail_lsn
	reserve_head
	write_head

appropriately for the new cycle number we've given the log. This is
pretty much how it is already done in xlog_find_tail() with the
initialisation information coming from the log record found at the
head of the log - we're just making it up from a different source. ;)
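
In rough terms, something like the following when we detect the
zeroed, clean log (helper and field names from memory, and
max_metadata_cycle is whatever we decide to scrape from the sb/AG
headers):

	/*
	 * Don't restart at cycle 1. Pick a cycle beyond anything found
	 * stamped in the metadata we can cheaply read so that every new
	 * transaction sorts after existing on-disk LSNs.
	 */
	new_cycle = max_metadata_cycle + 3;	/* + a few cycles of fuzz */

	log->l_curr_cycle = new_cycle;
	log->l_curr_block = 0;
	xlog_assign_atomic_lsn(&log->l_tail_lsn, new_cycle, 0);
	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, new_cycle, 0);
	xlog_assign_grant_head(&log->l_reserve_head.grant, new_cycle, 0);
	xlog_assign_grant_head(&log->l_write_head.grant, new_cycle, 0);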

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: v5 filesystem corruption due to log recovery lsn ordering
From: Brian Foster @ 2015-08-20 16:25 UTC
  To: Dave Chinner; +Cc: xfs

On Thu, Aug 20, 2015 at 08:44:53AM +1000, Dave Chinner wrote:
> On Wed, Aug 19, 2015 at 02:39:05PM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > Here's another issue I've run into from recent log recovery testing...
> > 
> > Many on-disk data structures for v5 filesystems have the LSN from the
> > last modification stamped in the associated header. As of the following
> > commit, log recovery compares the recovery item LSN against the LSN of
> > the on-disk structure to avoid restoration of stale contents:
> > 
> >   50d5c8d xfs: check LSN ordering for v5 superblocks during recovery
> > 
> > This presumably addresses some problems where recovery of the stale
> > contents leads to CRC failure. The problem here is that xfs_repair
> > clears the log (even when the fs is clean) and resets the current LSN on
> > the next mount. This creates a situation where logging is ineffective
> > for any structure that has not yet been modified since the current LSN
> > was reset.
> 
> Well, that was a bit of an oversight...
> 
> ....
> 
> > 
> > The larger question is how to resolve this problem? I don't think this
> > is something that is ultimately addressed in xfs_repair. Even if we
> > stopped clearing the log, that doesn't help users who might have had to
> > forcibly zero the log to recover a filesystem. Another option in theory
> > might be to unconditionally reset the LSN of everything on disk, but
> > that sounds like overkill just to preserve the current kernel
> > workaround.
> 
> Well, it's relatively easy to detect a log that has been zeroed if
> the cycle count is more than a cycle or two lower than the LSN in
> important metadata, but I'm not sure we can reliably detect that.
> 
> > It sounds more to me that we have to adjust this behavior on the kernel
> > side. That said, the original commit presumably addresses some log
> > recovery shutdown problems that we do not want to reintroduce. I haven't
> > yet wrapped my head around what that original problem was, but I wanted
> > to get this reported. If the issue was early buffer I/O submission,
> > perhaps we need a new mechanism to defer this I/O submission until a
> > point that CRC verification is expected to pass (or otherwise generate a
> > filesystem error)? Or perhaps do something similar with CRC
> > verification? Any other thoughts, issues or things I might have missed
> > here?
> 
> The issue that the LSN ordering fixes is that of unsynchronised
> recovery of different log records that contain the same objects.
> e.g. ordering of inode chunk allocation (in buffers) vs inode object
> modification (in inode items). v4 filesystems have a serious problem
> where inode chunk allocation can be run after the inode item
> modifications, resulting in recovery "losing" file size updates that
> sync flushed to the log.
> 

Hmm, so I would have expected these kinds of operations to generally
occur in order. I'm clearly still missing some context on the overall
log item lifecycle to understand how this might occur.

Using the inode chunk allocation and inode modification example...
clearly the transactions have to commit in order because the inodes must
be allocated before they can be used/modified. At what point after that
is reordering possible? Are we talking about reordering of the items
from the cil/log buffers to the on-disk log, or reordering of the items
during recovery (or both)?

> i.e. create just the right number of small files, sync, crash and
> recovery gives a number of zero length files in certain inode chunks
> because the ordering of item recovery was wrong.
> 
> Another problem with inode logging is the flushiter field, which is
> used to try to avoid replaying changes in the log that have already
> been flushed to disk. This could also lead to lost inode
> modifications after a sync because the flushiter is reset to zero
> after each time the inode item is recovered. This was mostly avoided
> by logging all inode modifications and using delayed logging, but
> could still occur...
> 
> There was a long history of these sorts of problems occurring (I
> first analysed the inode allocation/inode item update failure mode
> back in 2006), and I found several other possible issues like this
> to do with the inode flushiter at the same time. I also suspected
> that there were problems with directory recovery due to the same
> inode item vs buffer item ordering issues, but could never pin them
> down.
> 
> So the solution was to record the LSN of the last modification in
> every item as it is written to disk, thereby ensuring we knew
> exactly what transaction the item was last modified in. This means
> we can skip modifications in transaction recovery that are already
> on disk.
> 

Given the reordering is possible (despite my lingering questions wrt
exactly how, above) this makes sense as a mechanism to address that
problem.

> ----
> 
> The first thing we need to do is not zero the log in xfs_repair when
> the log is clean to minimise future exposure to this issue on
> existing systems.
> 

Eh, I'm not really sure this helps us that much. We still have to deal
with the issue so long as current xfsprogs versions are out there. We
also have no real way of knowing whether a filesystem could have been
affected by the problem or not. FWIW, xfs_metadump also results in
similar behavior, which means that technically a restored metadump can
behave differently from the original fs. That's
less critical, but clearly not ideal and something to be aware of.

That said, I think it is somewhat strange for xfs_repair to zero the log
unconditionally, this issue aside. So I'm not really against that change
in general. I just think we need a kernel fix first and foremost.

> Then, on the kernel side, we need a reliable way to detect that
> the log head/tail pointers have been reset in the kernel. This means
> we can - at minimum - issue a warning during log recovery that this
> has been detected.
> 
> Finally, we need to work out how to handle recovery in the situation
> that the log has been zeroed and the filesystem has a mix of new and
> old, stale LSNs. I think the simplest way around this is not to
> handle it in log recovery at all, but to avoid it altogether.
> 
> That is, when we find the log head/tail point to a zeroed log, we
> pull the current LSN from, say, the superblock (and maybe other
> metadata such as AG headers) and initialise the log head/tail to the
> cycle number + some offset so that every new transaction is
> guaranteed to have a cycle number more recent than any other LSN in
> the filesystem and ordering is preserved, even if the log has been
> zeroed.
> 

That's an interesting idea, but I wonder if it truly fixes the problem
or just makes it potentially more difficult to reproduce. One concern is
that even with this in place, all it takes to reintroduce the problem is
for a filesystem to run a bit on an older kernel in a situation where
the LSN is reset and enough modification has occurred to update whatever
key metadata we determine as important with the reset LSNs. A mount
alone is enough to pollute the superblock in this manner. Further
modification is probably necessary to affect the agi/agf headers,
however. It might not be the most likely scenario, but what is more
concerning is that if it does occur, it's completely invisible to our
detection on updated kernels. Would we want to consider a new
ro-incompat feature bit for this mechanism to prevent that?

Another concern is that we're assuming that the key metadata will always
have the most recent LSN. I think the closest thing to a guarantee we
have of that is the superblock being updated on umount and every so
often by xfs_log_worker() to cover the log. After a bit of playing
around, I'm not so sure that is enough. Not all workloads result in
superblock or AG header updates. Further, such an adverse workload that
runs in a constant manner seems to defer covering the log indefinitely
(not to mention that the frequency is also user-configurable via /proc).

Consider the following example along with some observations:

- mkfs, mount
- create a directory d1, populated with files
- create a directory d2, populated with files and hard links
- run a constant non-sb, non-ag header updating workload (e.g., repeated
  file truncates that do not involve allocation) on d1
- wait a bit...

The superblock and/or AG headers are updated to LSN X at some point here
due to the inode allocations for the directory creations and whatnot.
The constant truncate workload continuously pushes the log forward
without ever updating the LSN of any AGI/AGF or the superblock.

After some time passes, the current LSN pushes forward to some increased
value Y. At that point:

- unlink a hard link from directory d2
- wait a bit once more...

After some more time passes, the d2 directory blocks are written back
and the LSN of those blocks is updated to Y. Note that the superblock
and AGI/AGF headers are still at LSN X. From here:

- shutdown the fs, umount
- repair and force zero the log
- mount the fs

So now we have reset the LSN according to the log. Presumably the mount
now inspects the superblock and each AG header and inherits the largest
LSN plus some offset as the current LSN: LSN X+N. Directory d2 still has
LSN Y, however, and we have no guarantee that N > Y-X. In other words, I
think we can end up right back where we started. Make a modification to
directory d2 at this point, crash the fs, and recovery might or might
not replay a log record with LSN X+N against a target directory buffer
with LSN Y.
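
To put made-up numbers on it:

	sb/AGI/AGF last stamped at:          X = 3:1024
	d2 dir buffers written back at:      Y = 7:512   (truncates kept pushing the log)
	mount after repair inherits:         X + N = 5:2 (say N is a couple of cycles)

5:2 still compares as older than 7:512, so a log item tagged with the
new current LSN loses against the LSN sitting in d2's directory blocks.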

Again, that's a contrived and probably unlikely scenario, but could be
extremely difficult to detect or root cause if it ever does occur. :/
Thoughts? Is there something further we can do to mitigate this, or am I
missing something else?

Brian

P.S.,

After running through the above, I noticed that xfs_repair zeroes the
LSN of the agi/agf and associated btrees and whatnot. I also noticed
that lazy sb counters can avoid superblock updates for many allocation
operations (xfs_trans_mod_sb()). While this probably mitigates ag
corruption due to this issue, I suspect this means that the above
workload might actually be able to get away with some allocations
without inhibiting the ability to reproduce.

Also, a random thought: I wonder if an update to the log zeroing
mechanism to ensure that a subsequent mount picked up the LSN where it
left off would be enough to get around much of this. That could mean
stamping the log appropriately in repair, or adding something like a new
continue_lsn field in the superblock to be used by anybody who zeroes
the log and picked up on the next mount, etc...

> This means dirty log recovery requires no changes at all, we only
> need to change xlog_recover() to detect the empty, clean log and
> set:
> 
> 	l_curr_cycle
> 	l_curr_block
> 	l_last_sync_lsn
> 	l_tail_lsn
> 	reserve_head
> 	write_head
> 
> appropriately for the new cycle number we've given the log. This is
> pretty much how it is already done in xlog_find_tail() with the
> initialisation information coming from the log record found at the
> head of the log - we're just making it up from a different source. ;)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: v5 filesystem corruption due to log recovery lsn ordering
From: Dave Chinner @ 2015-08-21  0:34 UTC
  To: Brian Foster; +Cc: xfs

On Thu, Aug 20, 2015 at 12:25:29PM -0400, Brian Foster wrote:
> On Thu, Aug 20, 2015 at 08:44:53AM +1000, Dave Chinner wrote:
> > On Wed, Aug 19, 2015 at 02:39:05PM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > Here's another issue I've run into from recent log recovery testing...
> > > 
> > > Many on-disk data structures for v5 filesystems have the LSN from the
> > > last modification stamped in the associated header. As of the following
> > > commit, log recovery compares the recovery item LSN against the LSN of
> > > the on-disk structure to avoid restoration of stale contents:
> > > 
> > >   50d5c8d xfs: check LSN ordering for v5 superblocks during recovery
> > > 
> > > This presumably addresses some problems where recovery of the stale
> > > contents leads to CRC failure. The problem here is that xfs_repair
> > > clears the log (even when the fs is clean) and resets the current LSN on
> > > the next mount. This creates a situation where logging is ineffective
> > > for any structure that has not yet been modified since the current LSN
> > > was reset.
> > 
> > Well, that was a bit of an oversight...
> > 
> > ....
> > 
> > > 
> > > The larger question is how to resolve this problem? I don't think this
> > > is something that is ultimately addressed in xfs_repair. Even if we
> > > stopped clearing the log, that doesn't help users who might have had to
> > > forcibly zero the log to recover a filesystem. Another option in theory
> > > might be to unconditionally reset the LSN of everything on disk, but
> > > that sounds like overkill just to preserve the current kernel
> > > workaround.
> > 
> > Well, it's relatively easy to detect a log that has been zeroed if
> > the cycle count is more than a cycle or two lower than the LSN in
> > important metadata, but I'm not sure we can reliably detect that.
> > 
> > > It sounds more to me that we have to adjust this behavior on the kernel
> > > side. That said, the original commit presumably addresses some log
> > > recovery shutdown problems that we do not want to reintroduce. I haven't
> > > yet wrapped my head around what that original problem was, but I wanted
> > > to get this reported. If the issue was early buffer I/O submission,
> > > perhaps we need a new mechanism to defer this I/O submission until a
> > > point that CRC verification is expected to pass (or otherwise generate a
> > > filesystem error)? Or perhaps do something similar with CRC
> > > verification? Any other thoughts, issues or things I might have missed
> > > here?
> > 
> > The issue that the LSN ordering fixes is that of unsynchronised
> > recovery of different log records that contain the same objects.
> > e.g. ordering of inode chunk allocation (in buffers) vs inode object
> > modification (in inode items). v4 filesystems have a serious problem
> > where inode chunk allocation can be run after the inode item
> > modifications, resulting in recovery "losing" file size updates that
> > sync flushed to the log.
> > 
> 
> Hmm, so I would have expected these kinds of operations to generally
> occur in order. I'm clearly still missing some context on the overall
> log item lifecycle to understand how this might occur.

They do now for v5 filesystems - all the changes that led up to
where we are now got us to this point...

> Using the inode chunk allocation and inode modification example...
> clearly the transactions have to commit in order because the inodes must
> be allocated before they can be used/modified. At what point after that
> is reordering possible? Are we talking about reordering of the items
> from the cil/log buffers to the on-disk log, or reordering of the items
> during recovery (or both)?

Both. Look at inode allocation buffers, which are pinned in the AIL
so they can't be moved forward by subsequent modifications (e.g.
inode unlinks) until the newly initialised buffers have been written
to disk. Look at them again in log recovery, where they are
re-ordered to the start of a transaction so they are replayed before
inode log items and inode unlink buffer operations.
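
Conceptually the recovery side reordering looks something like this
(very rough sketch - the real xlog_recover_reorder_trans() has more
item classes to deal with, and is_buffer_item() is a stand-in for the
real type checks):

	LIST_HEAD(buffer_list);
	LIST_HEAD(item_list);

	/*
	 * Move buffer items (including inode allocation buffers) ahead of
	 * everything else so they are replayed before the inode items
	 * that live in them.
	 */
	list_for_each_entry_safe(item, n, &trans->r_itemq, ri_list) {
		if (is_buffer_item(item))
			list_move_tail(&item->ri_list, &buffer_list);
		else
			list_move_tail(&item->ri_list, &item_list);
	}
	list_splice_init(&buffer_list, &trans->r_itemq);
	list_splice_tail(&item_list, &trans->r_itemq);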

> > i.e. create just the right number of small files, sync, crash and
> > recovery gives a number of zero length files in certain inode chunks
> > because the ordering of item recovery was wrong.
> > 
> > Another problem with inode logging is the flushiter field, which is
> > used to try to avoid replaying changes in the log that have already
> > been flushed to disk. This could also lead to lost inode
> > modifications after a sync because the flushiter is reset to zero
> > after each time the inode item is recovered. This was mostly avoided
> > by logging all inode modifications and using delayed logging, but
> > could still occur...
> > 
> > There was a long history of these sorts of problems occurring (I
> > first analysed the inode allocation/inode item update failure mode
> > back in 2006), and I found several other possible issues like this
> > to do with the inode flushiter at the same time. I also suspected
> > that there were problems with directory recovery due to the same
> > inode item vs buffer item ordering issues, but could never pin them
> > down.
> > 
> > So the solution was to record the LSN of the last modification in
> > every item as it is written to disk, thereby ensuring we knew
> > exactly what transaction the item was last modified in. This means
> > we can skip modifications in transaction recovery that are already
> > on disk.
> > 
> 
> Given the reordering is possible (despite my lingering questions wrt
> exactly how, above) this makes sense as a mechanism to address that
> problem.

Yup, that's what we have done in the past with hacks and algorithm
fixes. What the LSN stamping also gives us is the freedom to move
beyond logging every change we make in full to the log. e.g. intent
based logging using ordered buffers that move through the AIL and
hence pin the log until they are written (e.g. swap extents BMBT
owner change).

> > ----
> > 
> > The first thing we need to do is not zero the log in xfs_repair when
> > the log is clean to minimise future exposure to this issue on
> > existing systems.
> > 
> 
> Eh, I'm not really sure this helps us that much. We still have to deal
> with the issue so long as current xfsprogs versions are out there. We
> also have no real way of knowing whether a filesystem could have been
> affected by the problem or not.

Yup, so let's take steps to ensure that the future xfsprogs don't
cause us problems, and then we can also tell users to upgrade
to the latest xfsprogs to solve the issue.

> FWIW, xfs_metadump also results in
> similar behavior, which means that technically a restored metadump can
> behave differently from the original fs. That's
> less critical, but clearly not ideal and something to be aware of.

So we fix metadump, too.

> That said, I think it is somewhat strange for xfs_repair to zero the log
> unconditionally, this issue aside. So I'm not really against that change
> in general. I just think we need a kernel fix first and foremost.

Most users won't/can't upgrade to a kernel that will solve the
problem - it's often much easier to get them a userspace package
that minimises their exposure to a problem caused by their current
userspace package....

> > Then, on the kernel side, we need a reliable way to detect that
> > the log head/tail pointers have been reset in the kernel. This means
> > we can - at minimum - issue a warning during log recovery that this
> > has been detected.
> > 
> > Finally, we need to work out how to handle recovery in the situation
> > that the log has been zeroed and the filesystem has a mix of new and
> > old, stale LSNs. I think the simplest way around this is not to
> > handle it in log recovery at all, but to avoid it altogether.
> > 
> > That is, when we find the log head/tail point to a zeroed log, we
> > pull the current LSN from, say, the superblock (and maybe other
> > metadata such as AG headers) and initialise the log head/tail to the
> > cycle number + some offset so that every new transaction is
> > guaranteed to have a cycle number more recent than any other LSN in
> > the filesystem and ordering is preserved, even if the log has been
> > zeroed.
> > 
> 
> That's an interesting idea, but I wonder if it truly fixes the problem
> or just makes it potentially more difficult to reproduce. One concern is
> that even with this in place, all it takes to reintroduce the problem is
> for a filesystem to run a bit on an older kernel in a situation where
> the LSN is reset and enough modification has occurred to update whatever
> key metadata we determine as important with the reset LSNs.

We can't prevent that - that horse has already bolted.

> A mount
> alone is enough to pollute the superblock in this manner. Further
> modification is probably necessary to affect the agi/agf headers,
> however. It might not be the most likely scenario, but what is more
> concerning is that if it does occur, it's completely invisible to our
> detection on updated kernels. Would we want to consider a new
> ro-incompat feature bit for this mechanism to prevent that?

We could, but preventing people from downgrading from new kernels
because of a problem 99.9% of current users are never going to see?
And with a fixed and upgraded userspace, it will be a problem that
they'll never see, either?

> Another concern is that we're assuming that the key metadata will always
> have the most recent LSN. I think the closest thing to a guarantee we
> have of that is the superblock being updated on umount and every so
> often by xfs_log_worker() to cover the log. After a bit of playing
> around, I'm not so sure that is enough.

Sure, I just threw it out as a way to get a more recent LSN. The
only way to reliably get the highest LSN is to walk all the metadata
in the filesystem. We can't do that at mount time, so perhaps it
is best to just refuse to mount read-write and issue a warning to
upgrade xfsprogs and re-run xfs_repair.

> So now we have reset the LSN according to the log. Presumably the mount
> now inspects the superblock and each AG header and inherits the largest
> LSN plus some offset as the current LSN: LSN X+N. Directory d2 still has
> LSN Y, however, and we have no guarantee that N > Y-X. In other words, I
> think we can end up right back where we started. Make a modification to
> directory d2 at this point, crash the fs, and recovery might or might
> not replay a log record with LSN X+N against a target directory buffer
> with LSN Y.

These scenarios all result from running xfs_repair. It's the source
of the problem, so let's treat it primarily as a bug in xfs_repair.
The number of users that run xfs_repair is very small compared to
the wider user-base, so having the kernel say "upgrade xfsprogs
and repair" is probably the most reliable way to fix the problem.

So, to repair: have it record LSNs as it walks the metadata and
at the start of phase 5 have it select a new cycle number higher
than the highest LSN found in the filesystem. Stamp the log with
that cycle number as well as all metadata that is being rebuilt
in phase 5, and all the superblocks.
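
In pseudo-code (names invented here, the real plumbing will differ):

	/* phases 2-4: track the highest LSN seen while walking metadata */
	if (XFS_LSN_CMP(hdr_lsn, max_lsn_seen) > 0)
		max_lsn_seen = hdr_lsn;

	/* start of phase 5: pick a cycle safely beyond anything on disk */
	new_cycle = CYCLE_LSN(max_lsn_seen) + 3;
	new_lsn = xlog_assign_lsn(new_cycle, 0);

	/*
	 * Stamp the clean log, every structure rebuilt in phase 5 and all
	 * the superblocks with LSNs from the new cycle.
	 */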

Problem solved.

> Again, that's a contrived and probably unlikely scenario, but could be
> extremely difficult to detect or root cause if it ever does occur. :/
> Thoughts? Is there something further we can do to mitigate this, or am I
> missing something else?

IMO, you're over-complicating the situation. We can't go back and
change history, nor make people upgrade to systems that won't suffer
from this problem. Hence we're just going to have to live with it,
just like all the other log recovery bugs we've lived with since
finding and fixing them.

If we do the simple thing and just detect zeroed logs in the kernel
and refuse read-write mounts, that's something we can easily
backport to all stable kernels that we support CRCs on (i.e. 3.16 to
present). And with a fixed xfs_repair, the problem goes away as
users and distros pull in the new package, even if they are
running old kernels.

> Also, a random thought: I wonder if an update to the log zeroing
> mechanism to ensure that a subsequent mount picked up the LSN where it
> left off would be enough to get around much of this. That could mean
> stamping the log appropriately in repair, or adding something like a new
> continue_lsn field in the superblock to be used by anybody who zeroes
> the log and picked up on the next mount, etc...

Sure - that's just a variant of my first suggestion (of don't zero the
log in xfs_repair). Remember that you said "I don't think that helps
us very much". :)

And, FWIW, my "read the LSN from metadata and use the highest" idea
above is just a further refinement of the two, so we can fix
zeroed logs from existing xfs_repair binaries without needing to do
anything extra in the kernel other than detect a zeroed log...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: v5 filesystem corruption due to log recovery lsn ordering
From: Brian Foster @ 2015-08-21 14:39 UTC
  To: Dave Chinner; +Cc: xfs

On Fri, Aug 21, 2015 at 10:34:49AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2015 at 12:25:29PM -0400, Brian Foster wrote:
> > On Thu, Aug 20, 2015 at 08:44:53AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 19, 2015 at 02:39:05PM -0400, Brian Foster wrote:
...
> > Using the inode chunk allocation and inode modification example...
> > clearly the transactions have to commit in order because the inodes must
> > be allocated before they can be used/modified. At what point after that
> > is reordering possible? Are we talking about reordering of the items
> > from the cil/log buffers to the on-disk log, or reordering of the items
> > during recovery (or both)?
> 
> Both. Look at inode allocation buffers, which are pinned in the AIL
> so they can't be moved forward by subsequent modifications (e.g.
> inode unlinks) until the newly initialised buffers have been written
> to disk. Look at them again in log recovery, where they are
> re-ordered to the start of a transaction so they are replayed before
> inode log items and inode unlink buffer operations.
> 

Thanks, I'll dig into this...

... 
> 
> Most users won't/can't upgrade to a kernel that will solve the
> problem - it's often much easier to get them a userspace package
> that minimises their exposure to a problem caused by their current
> userspace package....
> 

Perhaps... I think it's debatable that it's caused by the userspace
package, but I don't care so much about where it's fixed as that the
overall fix is comprehensive.

I wasn't initially expecting a userspace fix because the recovery
reordering mechanism seemed like the problem. A userspace change seems
reasonable with the approach of maintaining the current log recovery
behavior and fixing up the current LSN such that it is always correct.

> > > Then, on the kernel side, we need a reliable way to detect that
> > > the log head/tail pointers have been reset in the kernel. This means
> > > we can - at minimum - issue a warning during log recovery that this
> > > has been detected.
> > > 
> > > Finally, we need to work out how to handle recovery in the situation
> > > that the log has been zeroed and the filesystem has a mix of new and
> > > old, stale LSNs. I think the simplest way around this is not to
> > > handle it in log recovery at all, but to avoid it altogether.
> > > 
> > > That is, when we find the log head/tail point to a zeroed log, we
> > > pull the current LSN from, say, the superblock (and maybe other
> > > metadata such as AG headers) and initialise the log head/tail to the
> > > cycle number + some offset so that every new transaction is
> > > guaranteed to have a cycle number more recent than any other LSN in
> > > the filesystem and ordering is preserved, even if the log has been
> > > zeroed.
> > > 
> > 
> > That's an interesting idea, but I wonder if it truly fixes the problem
> > or just makes it potentially more difficult to reproduce. One concern is
> > that even with this in place, all it takes to reintroduce the problem is
> > for a filesystem to run a bit on an older kernel in a situation where
> > the LSN is reset and enough modification has occurred to update whatever
> > key metadata we determine as important with the reset LSNs.
> 
> We can't prevent that - that horse has already bolted.
> 

Well, the feature bit can potentially do that. I'm not terribly fond of
it (overkill) and the goal wasn't to fix older kernels so much as to
prevent the problem from being reintroduced on newer kernels going
forward (provided newer kernels implemented this mount-time LSN
inheritance scheme).

In other words, the proposed solution depends on trust of the metadata
LSNs. One way we can trust those metadata LSNs is to prevent older
kernels/xfsprogs from messing with them.

> > A mount
> > alone is enough to pollute the superblock in this manner. Further
> > modification is probably necessary to affect the agi/agf headers,
> > however. It might not be the most likely scenario, but what is more
> > concerning is that if it does occur, it's completely invisible to our
> > detection on updated kernels. Would we want to consider a new
> > ro-incompat feature bit for this mechanism to prevent that?
> 
> We could, but preventing people from downgrading from new kernels
> because of a problem 99.9% of current users are never going to see?
> And with a fixed and upgraded userspace, it will be a problem that
> they'll never see, either?
> 

I'm just trying to explore what might be necessary for a comprehensive
fix, even if we didn't ultimately implement it all the way through (e.g.,
if it wasn't worth the usability/compatibility tradeoffs, etc.).

That said, I don't really agree with the assertion that upgraded
userspace alone sufficiently alleviates the problem. A filesystem that's
been around a while and has more recently been repaired is susceptible
to this problem for the foreseeable future, upgraded userspace or not,
until either all metadata with stale LSNs is updated or the problem is
identified and repaired explicitly.

> > Another concern is that we're assuming that the key metadata will always
> > have the most recent LSN. I think the closest thing to a guarantee we
> > have of that is the superblock being updated on umount and every so
> > often by xfs_log_worker() to cover the log. After a bit of playing
> > around, I'm not so sure that is enough.
> 
> Sure, I just threw it out as a way to get a more recent LSN. The
> only way to reliably get the highest LSN is to walk all the metadata
> in the filesystem. We can't do that at mount time, so perhaps it
> is best to just refuse to mount read-write and issue a warning to
> upgrade xfsprogs and re-run xfs_repair.
> 

Indeed, that sounds like a nicer approach to me. We're really only
considering the superblock at mount time though (note that repair resets
AG header LSNs), so we're not going to be guaranteed the mount fails for
all affected fs'. What do you think about also firing a (one-time,
probably) warning if we ever update a bit of metadata with a smaller
LSN?

That bit of metadata is now "fixed" once written out, but the point is
that the log item for this metadata was probably not effective, and we
want to notify the user that the fs' LSNs are potentially out of whack
and a repair is in order.
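
Roughly what I have in mind in the write verifier path (sketch only):

	/*
	 * If the LSN we're about to stamp into the buffer is older than
	 * the LSN already on disk, the log item covering this change
	 * almost certainly would not have replayed after a crash. Warn
	 * (once, in the real thing) that a repair is in order.
	 */
	if (XFS_LSN_CMP(new_lsn, ondisk_lsn) < 0)
		xfs_warn(mp,
	"metadata LSN went backwards - log may not be replayable, run xfs_repair");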

> > So now we have reset the LSN according to the log. Presumably the mount
> > now inspects the superblock and each AG header and inherits the largest
> > LSN plus some offset as the current LSN: LSN X+N. Directory d2 still has
> > LSN Y, however, and we have no guarantee that N > Y-X. In other words, I
> > think we can end up right back where we started. Make a modification to
> > directory d2 at this point, crash the fs, and recovery might or might
> > not replay a log record with LSN X+N against a target directory buffer
> > with LSN Y.
> 
> These scenarios all result from running xfs_repair. It's the source
> of the problem, so let's treat it primarily as a bug in xfs_repair.
> The number of users that run xfs_repair is very small compared to
> the wider user-base, so having the kernel say "upgrade xfsprogs
> and repair" is probably the most reliable way to fix the problem.
> 

That sounds fine to me.

> So, to repair: have it record LSNs as it walks the metadata and
> at the start of phase 5 have it select a new cycle number higher
> than the highest LSN found in the filesystem. Stamp the log with
> that cycle number as well as all metadata that is being rebuilt
> in phase 5, and all the superblocks.
> 
> Problem solved.
> 

Yeah, I was thinking of just using the current head of the log, as
described in my random note. Using the LSN from the metadata is probably
necessary in the cases where the log might be corrupted, however.

> > Again, that's a contrived and probably unlikely scenario, but could be
> > extremely difficult to detect or root cause if it ever does occur. :/
> > Thoughts? Is there something further we can do to mitigate this, or am I
> > missing something else?
> 
> IMO, you're over-complicating the situation. We can't go back and
> change history, nor make people upgrade to systems that won't suffer
> from this problem. Hence we're just going to have to live with it,
> just like all the other log recovery bugs we've lived with since
> finding and fixing them.
> 
> If we do the simple thing and just detect zeroed logs in the kernel
> and refuse read-write mounts, that's something we can easily
> backport to all stable kernels that we support CRCs on (i.e. 3.16 to
> present). And with a fixed xfs_repair, the problem goes away as
> users and distros pull in the new package, even if they are
> running old kernels.
> 

The situation is what it is, I'm just trying to cover all the angles. :)
I'm not claiming to have the right fix, so don't read too much into my
initial ideas. I'm throwing those out there hoping you'll come up with
improvements. ;) As noted above, the goal was not so much to prevent
downgrading as to prevent the problem going
forward (e.g., if an fs can't be mounted with an old kernel, then it
can't be broken in this manner for newer kernels).

The solution of detection in the kernel and a fix in repair sounds
pretty good to me provided the detection is robust (e.g., the point
above wrt a warning). Then we have something in current kernels that can
effectively handle the problem regardless of how it might have been
(re)introduced and it's fairly simple as well.

> > Also, a random thought: I wonder if an update to the log zeroing
> > mechanism to ensure that a subsequent mount picked up the LSN where it
> > left off would be enough to get around much of this. That could mean
> > stamping the log appropriately in repair, or adding something like a new
> > continue_lsn field in the superblock to be used by anybody who zeroes
> > the log and picked up on the next mount, etc...
> 
> Sure - that's just a variant of my first suggestion (of don't zero the
> log in xfs_repair). Remember that you said "I don't think that helps
> us very much". :)
> 

The first suggestion was to not zero the log by default as a means of
problem mitigation. I still don't think that helps us very much. :)
(IMO, neither does doing the above without the kernel detection side of
things.)

In fact, I think we've reached a prospective solution that technically
could preserve the current repair log clearing behavior. At least, I
view that as a separate change at this point. The only difference now is
that "clearing the log" means to stamp in log records that effectively
bookmark the current LSN for v5 filesystems.

As usual, thanks for the feedback.

Brian

> And, FWIW, my "read the LSN from metadata and use the highest" idea
> above is just a further refinement of the two, so we can fix
> zeroed logs from existing xfs_repair binaries without needing to do
> anything extra in the kernel other than detect a zeroed log...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

* Re: v5 filesystem corruption due to log recovery lsn ordering
From: Dave Chinner @ 2015-08-24  0:10 UTC
  To: Brian Foster; +Cc: xfs

On Fri, Aug 21, 2015 at 10:39:23AM -0400, Brian Foster wrote:
> On Fri, Aug 21, 2015 at 10:34:49AM +1000, Dave Chinner wrote:
> > On Thu, Aug 20, 2015 at 12:25:29PM -0400, Brian Foster wrote:
> > > On Thu, Aug 20, 2015 at 08:44:53AM +1000, Dave Chinner wrote:
> > > That's an interesting idea, but I wonder if it truly fixes the problem
> > > or just makes it potentially more difficult to reproduce. One concern is
> > > that even with this in place, all it takes to reintroduce the problem is
> > > for a filesystem to run a bit on an older kernel in a situation where
> > > the LSN is reset and enough modification has occurred to update whatever
> > > key metadata we determine as important with the reset LSNs.
> > 
> > We can't prevent that - that horse has already bolted.
> 
> Well, the feature bit can potentially do that. I'm not terribly fond of
> it (overkill) and the goal was't to fix older kernels so much as to
> prevent the problem from being reintroduced on newer kernels going
> forward (provided newer kernels implemented this mount-time LSN
> inheritance scheme).

We can prevent the problem of old userspace on newer kernels by
detecting a zeroed log - we don't need a feature bit for that.

> In other words, the proposed solution depends on trust of the metadata
> LSNs. One way we can trust those metadata LSNs is to prevent older
> kernels/xfsprogs from messing with them.

Well, it's old userspace we need to worry about here - kernels don't
zero the log...

> That said, I don't really agree with the assertion that upgraded
> userspace alone sufficiently alleviates the problem. A filesystem that's
> been around a while and has more recently been repaired is susceptible
> to this problem for the foreseeable future, upgraded userspace or not,
> until either all metadata with stale LSNs is updated or the problem is
> identified and repaired explicitly.

But there's nothing we can do in userspace to fix that. More on that
below.

> > > Another concern is that we're assuming that the key metadata will always
> > > have the most recent LSN. I think the closest thing to a guarantee we
> > > have of that is the superblock being updated on umount and every so
> > > often by xfs_log_worker() to cover the log. After a bit of playing
> > > around, I'm not so sure that is enough.
> > 
> > Sure, I just threw it out as a way to get a more recent LSN. The
> > only way to reliably get the highest LSN is to walk all the metadata
> > in the filesystem. We can't do that at mount time, so perhaps it
> > is best to just refuse to mount read-write and issue a warning to
> > upgrade xfsprogs and re-run xfs_repair.
> > 
> 
> Indeed, that sounds like a nicer approach to me. We're really only
> considering the superblock at mount time though (note that repair resets
> AG header LSNs), so we're not going to be guaranteed the mount fails for
> all affected fs'. What do you think about also firing a (one-time,
> probably) warning if we ever update a bit of metadata with a smaller
> LSN?

Larger LSN, you mean? ;)

But, yes, that is what I was thinking for solving the "horse has
already bolted" condition you described above. We can detect LSNs
beyond the current log in the metadata we read from disk and issue a
warning, e.g. in each metadata verifier we add a:

	if (xfs_verify_lsn(mp, hdr->lsn))
		return false;

then we can warn immediately that this is a problem. If we discover
it during log recovery (i.e. lsn beyond the current head in the log
in a block that passes all the other metadata self checks) then we
can shut down or force the filesystem into read-only mode so that
the user has to update xfsprogs and run repair to fix it because we
can't safely run log recovery at this point.
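
The guts of xfs_verify_lsn() would be something like this (hand-waving
how the verifier gets at the log head, and the warn-once/read-only
policy above):

	/*
	 * True if the LSN stamped in the metadata is ahead of the current
	 * head of the log, i.e. it can't have been written by anything
	 * this log knows about; the caller then fails the verifier.
	 * (Unset/null LSNs need special casing.)
	 */
	return XFS_LSN_CMP(hdr_lsn,
		xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)) > 0;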

> That bit of metadata is now "fixed" once written out, but the point is
> that the log item for this metadata was probably not effective, and we
> want to notify the user that the fs' LSNs are potentially out of whack
> and a repair is in order.

That's true - if we don't hit a bad lsn in recovery, we just issue a
one-time warning from the verifiers and continue onwards. If the
user is lucky, the problem will fix itself before it is ever a
problem for recovery.

> > So, to repair: have it record LSNs as it walks the metadata and
> > at the start of phase 5 have it select a new cycle number higher
> > than the highest LSN found in the filesystem. Stamp the log with
> > that cycle number as well as all metadata that is being rebuilt
> > in phase 5, and all the superblocks.
> > 
> > Problem solved.
> > 
> 
> Yeah, I was thinking of just using the current head of the log, as
> described in my random note. Using the LSN from the metadata is probably
> necessary in the cases where the log might be corrupted, however.

And as a minor detail, I'd take the cycle number from the highest
LSN and bump it a few cycles just to make sure that the new LSN is
well beyond anything in the filesystem...

> The solution of detection in the kernel and a fix in repair sounds
> pretty good to me provided the detection is robust (e.g., the point
> above wrt a warning). Then we have something in current kernels that can
> effectively handle the problem regardless of how it might have been
> (re)introduced and it's fairly simple as well.

*nod*

I think we have a plan :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: v5 filesystem corruption due to log recovery lsn ordering
From: Brian Foster @ 2015-08-24 12:34 UTC
  To: Dave Chinner; +Cc: xfs

On Mon, Aug 24, 2015 at 10:10:35AM +1000, Dave Chinner wrote:
> On Fri, Aug 21, 2015 at 10:39:23AM -0400, Brian Foster wrote:
> > On Fri, Aug 21, 2015 at 10:34:49AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 20, 2015 at 12:25:29PM -0400, Brian Foster wrote:
> > > > On Thu, Aug 20, 2015 at 08:44:53AM +1000, Dave Chinner wrote:
> > > > That's an interesting idea, but I wonder if it truly fixes the problem
> > > > or just makes it potentially more difficult to reproduce. One concern is
> > > > that even with this in place, all it takes to reintroduce the problem is
> > > > for a filesystem to run a bit on an older kernel in a situation where
> > > > the LSN is reset and enough modification has occurred to update whatever
> > > > key metadata we determine as important with the reset LSNs.
> > > 
> > > We can't prevent that - that horse has already bolted.
> > 
> > Well, the feature bit can potentially do that. I'm not terribly fond of
> it (overkill) and the goal wasn't to fix older kernels so much as to
> > prevent the problem from being reintroduced on newer kernels going
> > forward (provided newer kernels implemented this mount-time LSN
> > inheritance scheme).
> 
> We can prevent the problem of old userspace on newer kernels by
> detecting a zeroed log - we don't need a feature bit for that.
> 
> > In other words, the proposed solution depends on trust of the metadata
> > LSNs. One way we can trust those metadata LSNs is to prevent older
> > kernels/xfsprogs from messing with them.
> 
> Well, it's old userspace we need to worry about here - kernels don't
> zero the log...
> 

The concern was just that older kernels could reset the LSNs of the key
metadata inspected at mount time and thus defeat the mechanism on new
kernels. This doesn't matter so much now that we'll have a more broad
warning mechanism in place.

> > That said, I don't really agree with the assertion that upgraded
> > userspace alone sufficiently alleviates the problem. A filesystem that's
> > been around a while and has more recently been repaired is susceptible
> > to this problem for the foreseeable future, upgraded userspace or not,
> > until either all metadata with stale LSNs is updated or the problem is
> > identified and repaired explicitly.
> 
> But there's nothing we can do in userspace to fix that. More on that
> below.
> 

Right...

> > > > Another concern is that we're assuming that the key metadata will always
> > > > have the most recent LSN. I think the closest thing to a guarantee we
> > > > have of that is the superblock being updated on umount and every so
> > > > often by xfs_log_worker() to cover the log. After a bit of playing
> > > > around, I'm not so sure that is enough.
> > > 
> > > Sure, I just threw it out as a way to get a more recent LSN. The
> > > only way to reliably get the highest LSN is to walk all the metadata
> > > in the filesystem. We can't do that at mount time, so perhaps it
> > > is best to just refuse to mount read-write and issue a warning to
> > > upgrade xfsprogs and re-run xfs_repair.
> > > 
> > 
> > Indeed, that sounds like a nicer approach to me. We're really only
> > considering the superblock at mount time though (note that repair resets
> > AG header LSNs), so we're not going to be guaranteed the mount fails for
> > all affected fs'. What do you think about also firing a (one-time,
> > probably) warning if we ever update a bit of metadata with a smaller
> > LSN?
> 
> Larger LSN, you mean? ;)
> 

Well, aren't incremental LSN updates the normal course of things? E.g.,
metadata is updated, the log pushes forward, the metadata is updated
again with a larger LSN.

Oh, I see. We're just looking at the logic differently. My thought above
was to fire a warning on the action of updating a piece of metadata with
a smaller current LSN. E.g., detect in the write verifier that the
metadata LSN has gone backwards...

> But, yes, that is what I was thinking for solving the "horse has
> already bolted" condition you described above. We can detect LSNs
> beyond the current log in the metadata we read from disk and issue a
> warning, e.g. in each metadata verifier we add a:
> 
> 	if (xfs_verify_lsn(mp, hdr->lsn))
> 		return false;
> 

... whereas you're suggesting to update read verifiers to detect whether
the metadata LSN is valid according to the log, which means the warning
fires if the metadata LSN is larger than the current LSN.

So the validation logic sounds the same, but it sounds like a better
idea to check in the read verifier. That provides far more reliable
detection in that modifications aren't necessary to detect the problem.

> then we can warn immediately that this is a problem. If we discover
> it during log recovery (i.e. lsn beyond the current head in the log
> in a block that passes all the other metadata self checks) then we
> can shut down or force the filesystem into read-only mode so that
> the user has to update xfsprogs and run repair to fix it because we
> can't safely run log recovery at this point.
> 

Good idea.

> > That bit of metadata is now "fixed" once written out, but the point is
> > that the log item for this metadata was probably not effective, and we
> > want to notify the user that the fs' LSNs are potentially out of whack
> > and a repair is in order.
> 
> That's true - if we don't hit a bad lsn in recovery, we just issue a
> one-time warning from the verifiers and continue onwards. If the
> user is lucky, the problem will fix itself before it is ever a
> problem for recovery.
> 
> > > So, to repair: have it record LSNs as it walks the metadata and
> > > at the start of phase 5 have it select a new cycle number higher
> > > than the highest LSN found in the filesystem. Stamp the log with
> > > that cycle number as well as all metadata that is being rebuilt
> > > in phase 5, and all the superblocks.
> > > 
> > > Problem solved.
> > > 
> > 
> > Yeah, I was thinking of just using the current head of the log, as
> > described in my random note. Using the LSN from the metadata is probably
> > necessary in the cases where the log might be corrupted, however.
> 
> And as a minor detail, I'd take the cycle number from the highest
> LSN and bump it a few cycles just to make sure that the new LSN is
> well beyond anything in the filesystem...
> 

Indeed, my current thought was to just bump the log up to the start of
the next cycle. We can get more into that when I have some code...

> > The solution of detection in the kernel and a fix in repair sounds
> > pretty good to me provided the detection is robust (e.g., the point
> > above wrt a warning). Then we have something in current kernels that can
> > effectively handle the problem regardless of how it might have been
> > (re)introduced and it's fairly simple as well.
> 
> *nod*
> 
> I think we have a plan :)
> 

Thanks!

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 