linux-xfs.vger.kernel.org archive mirror
* Re: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
       [not found] <MWHPR10MB1486754F03696347F4E7FEE5A3209@MWHPR10MB1486.namprd10.prod.outlook.com>
@ 2022-10-10 15:40 ` Darrick Wong
  2022-10-10 23:29   ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick Wong @ 2022-10-10 15:40 UTC (permalink / raw)
  To: Srikanth C S, linux-xfs; +Cc: Rajesh Sivaramasubramaniom, Junxiao Bi

LGTM, want to send this to the upstream list to start that discussion?

--D

________________________________________
From: Srikanth C S <srikanth.c.s@oracle.com>
Sent: Monday, October 10, 2022 08:24
To: linux-xfs@vger.kernel.org; Darrick Wong
Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair

fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
possible that when the machine crashes, the fs is in inconsistent
state with the journal log not yet replayed. This can put the
machine into rescue shell. To address this problem, mount and
umount the fs before running xfs_repair.

Run xfs_repair -e when fsck.mode=force and fsck.repair=auto or yes.
If fsck.mode=force and fsck.repair=no, run xfs_repair -n without
replaying the logs.

Signed-off-by: Srikanth C S <srikanth.c.s@oracle.com>
---
 fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
index 6af0f22..21a8c19 100755
--- a/fsck/xfs_fsck.sh
+++ b/fsck/xfs_fsck.sh
@@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
 fi

 if $FORCE; then
-       xfs_repair -e $DEV
-       repair2fsck_code $?
+       if $AUTO; then
+               xfs_repair -e $DEV
+                error=$?
+                if [ $error -eq 2 ]; then
+                        echo "Replaying log for $DEV"
+                        mkdir -p /tmp/tmp_mnt
+                        mount $DEV /tmp/tmp_mnt
+                        umount /tmp/tmp_mnt
+                        xfs_repair -e $DEV
+                        error=$?
+                        rmdir /tmp/tmp_mnt
+                fi
+        else
+                #fsck.mode=force is set but fsck.repair=no
+                xfs_repair -n $DEV
+                error=$?
+        fi
+       repair2fsck_code $error
         exit $?
 fi

--
1.8.3.1


* Re: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
  2022-10-10 15:40 ` [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair Darrick Wong
@ 2022-10-10 23:29   ` Dave Chinner
  2022-10-11  0:30     ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2022-10-10 23:29 UTC (permalink / raw)
  To: Darrick Wong
  Cc: Srikanth C S, linux-xfs, Rajesh Sivaramasubramaniom, Junxiao Bi

On Mon, Oct 10, 2022 at 03:40:51PM +0000, Darrick Wong wrote:
> LGTM, want to send this to the upstream list to start that discussion?
> 
> --D
> 
> ________________________________________
> From: Srikanth C S <srikanth.c.s@oracle.com>
> Sent: Monday, October 10, 2022 08:24
> To: linux-xfs@vger.kernel.org; Darrick Wong
> Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
> Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
> 
> fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> possible that when the machine crashes, the fs is in inconsistent
> state with the journal log not yet replayed. This can put the
> machine into rescue shell. To address this problem, mount and
> umount the fs before running xfs_repair.

What's the purpose of forcing xfs_repair to be run on every boot?
The whole point of having a journalling filesystem is to avoid
needing to run fsck on every boot.

I get why one might want to occasionally force a repair check on
boot (e.g. to repair a problem with the root filesystem), but this
is a -rescue operation- and really shouldn't be occurring
automatically on every boot or after a kernel crash.

If it is only occurring during rescue operations, then why is it a problem
dumping out to a shell for the admin performing rescue
operations to deal with this directly? e.g. if the fs has a
corrupted journal, then a mount cycle will not fix the problem and
the admin will still get dumped into a rescue shell to fix the
problem manually.

Hence I don't really know why anyone would be configuring their
systems like this:

> Run xfs_repair -e when fsck.mode=force and fsck.repair=auto or yes.

as it makes no sense at all for a journalling filesystem.

> If fsck.mode=force and fsck.repair=no, run xfs_repair -n without
> replaying the logs.

Nor is it clear why anyone would want to force a boot time fsck and
then not repair the damage that might be found....

More explanation, please!

> Signed-off-by: Srikanth C S <srikanth.c.s@oracle.com>
> ---
>  fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> index 6af0f22..21a8c19 100755
> --- a/fsck/xfs_fsck.sh
> +++ b/fsck/xfs_fsck.sh
> @@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
>  fi
> 
>  if $FORCE; then
> -       xfs_repair -e $DEV
> -       repair2fsck_code $?
> +       if $AUTO; then
> +               xfs_repair -e $DEV
> +                error=$?
> +                if [ $error -eq 2 ]; then
> +                        echo "Replaying log for $DEV"
> +                        mkdir -p /tmp/tmp_mnt
> +                        mount $DEV /tmp/tmp_mnt
> +                        umount /tmp/tmp_mnt
> +                        xfs_repair -e $DEV
> +                        error=$?
> +                        rmdir /tmp/tmp_mnt
> +                fi
> +        else
> +                #fsck.mode=force is set but fsck.repair=no
> +                xfs_repair -n $DEV
> +                error=$?
> +        fi
> +       repair2fsck_code $error
>          exit $?
>  fi

As a side note, the patch has damaged whitespace....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
  2022-10-10 23:29   ` Dave Chinner
@ 2022-10-11  0:30     ` Darrick J. Wong
  2022-10-11  4:26       ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2022-10-11  0:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick Wong, Srikanth C S, linux-xfs,
	Rajesh Sivaramasubramaniom, Junxiao Bi

On Tue, Oct 11, 2022 at 10:29:32AM +1100, Dave Chinner wrote:
> On Mon, Oct 10, 2022 at 03:40:51PM +0000, Darrick Wong wrote:
> > LGTM, want to send this to the upstream list to start that discussion?

UGH, so I thought this was an internal thread, but it turns out that
linux-xfs has been cc'd for a while but none of the messages made it to
lore.

I'll fill in some missing context below.

> > --D
> > 
> > ________________________________________
> > From: Srikanth C S <srikanth.c.s@oracle.com>
> > Sent: Monday, October 10, 2022 08:24
> > To: linux-xfs@vger.kernel.org; Darrick Wong
> > Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
> > Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
> > 
> > fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> > possible that when the machine crashes, the fs is in inconsistent
> > state with the journal log not yet replayed. This can put the
> > machine into rescue shell. To address this problem, mount and
> > umount the fs before running xfs_repair.
> 
> What's the purpose of forcing xfs_repair to be run on every boot?
> The whole point of having a journalling filesystem is to avoid
> needing to run fsck on every boot.

I don't think repair-at-every-boot here is the overall goal for our
customer base.

We've had some <cough> major support events over the last few months.
There are two things going on here: some of the problems are due to our
datacenters abending, and the rest are the pile of data corruption
problems that you and I and Chandan have been working through for months
now.

This means that customer support has 30,000 VMs to reboot.  90% of the
systems involved seem to have survived more or less intact, but that
leaves 10% of them with latent errors, unmountable root filesystems,
etc.  They probably have even more than that, but I don't recommend
inviting the G-men for a visit to find out the real sum.

Since these machines are remotely manageable, support /can/ inject the
kernel command line with 'fsck.mode=force' to kick off xfs_repair if the
machine won't come up or if they suspect there might be deeper issues
with latent errors in the fs metadata, which is what they did to try to
get everyone running ASAP while anticipating any future problems...

> I get why one might want to occasionally force a repair check on
> boot (e.g. to repair a problem with the root filesystem), but this
> is a -rescue operation- and really shouldn't be occurring
> automatically on every boot or after a kernel crash.
>
> If it is only occurring during rescue operations, then why is it a problem
> dumping out to a shell for the admin performing rescue
> operations to deal with this directly? e.g. if the fs has a
> corrupted journal, then a mount cycle will not fix the problem and
> the admin will still get dumped into a rescue shell to fix the
> problem manually.

...however, most of those filesystems in the abended datacenter had
dirty logs, so repair failed and dumped all those machines to the
emergency shell.  3000 machines * 15 minutes per ticket is a lot of
downtime and a lot of manual labor.

Support would really like a means to automate as much of this as they
can.  They had assumed that fsck.repair=yes would DTRT to try to recover
and/or repair the fs, and were surprised to discover that this script
would not even try to recover the log.

> Hence I don't really know why anyone would be configuring their
> systems like this:

Some of the machines are VM hypervisors where crashes due to the
rootfs going offline unexpectedly create unplanned downtime for a
lot of machines.  It would be less disruptive to our cloud fleet overall
for a hypervisor to take a little longer to boot if it's running fsck,
since the rootfs is the bare minimum needed to get the smartnic started.
For this case it might be preferable to configure this permanently,
since there aren't /that/ many VM hypervisors.  Nobody else gets
fsck.mode=force in the grub config.

> > Run xfs_repair -e when fsck.mode=force and fsck.repair=auto or yes.

(If it were me writing the patch, I'd have made repair=auto detect a
dirty log and continue the boot without running repair at all...)
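
Something like this untested sketch, relying on the same thing the
patch does -- xfs_repair -e exiting with 2 when the log is dirty:

        if $AUTO; then
                xfs_repair -e $DEV
                error=$?
                if [ $error -eq 2 ]; then
                        # dirty log only -- let the kernel replay it at
                        # the real mount instead of repairing here
                        echo "Dirty log on $DEV, deferring replay to mount time"
                        error=0
                fi
                repair2fsck_code $error
                exit $?
        fi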

> as it makes no sense at all for a journalling filesystem.
> 
> > If fsck.mode=force and fsck.repair=no, run xfs_repair -n without
> > replaying the logs.
> 
> Nor is it clear why anyone would want to force a boot time fsck and
> then not repair the damage that might be found....

That part's my fault, I suggested that we should fix the script so that
repair=no selects dry run mode like you might expect.

> More explanation, please!

Frankly, I /don't/ want to expend a lot of time wringing our hands over
how exactly do we hammer 2022 XFS tools into 1994 ext2 behavioral
semantics.

What they really want is online fsck to detect and fix problems in real
time, but I can't seem to engage the community on how exactly do we land
this thing now that I've finished writing it.

--D

> > Signed-off-by: Srikanth C S <srikanth.c.s@oracle.com>
> > ---
> >  fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> > index 6af0f22..21a8c19 100755
> > --- a/fsck/xfs_fsck.sh
> > +++ b/fsck/xfs_fsck.sh
> > @@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
> >  fi
> > 
> >  if $FORCE; then
> > -       xfs_repair -e $DEV
> > -       repair2fsck_code $?
> > +       if $AUTO; then
> > +               xfs_repair -e $DEV
> > +                error=$?
> > +                if [ $error -eq 2 ]; then
> > +                        echo "Replaying log for $DEV"
> > +                        mkdir -p /tmp/tmp_mnt
> > +                        mount $DEV /tmp/tmp_mnt
> > +                        umount /tmp/tmp_mnt
> > +                        xfs_repair -e $DEV
> > +                        error=$?
> > +                        rmdir /tmp/tmp_mnt
> > +                fi
> > +        else
> > +                #fsck.mode=force is set but fsck.repair=no
> > +                xfs_repair -n $DEV
> > +                error=$?
> > +        fi
> > +       repair2fsck_code $error
> >          exit $?
> >  fi
> 
> As a side note, the patch has damaged whitespace....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
  2022-10-11  0:30     ` Darrick J. Wong
@ 2022-10-11  4:26       ` Dave Chinner
  2022-10-11 16:08         ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2022-10-11  4:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Darrick Wong, Srikanth C S, linux-xfs,
	Rajesh Sivaramasubramaniom, Junxiao Bi

On Mon, Oct 10, 2022 at 05:30:48PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 11, 2022 at 10:29:32AM +1100, Dave Chinner wrote:
> > On Mon, Oct 10, 2022 at 03:40:51PM +0000, Darrick Wong wrote:
> > > LGTM, want to send this to the upstream list to start that discussion?
> 
> UGH, so I thought this was an internal thread, but it turns out that
> linux-xfs has been cc'd for a while but none of the messages made it to
> lore.
> 
> I'll fill in some missing context below.
> 
> > > --D
> > > 
> > > ________________________________________
> > > From: Srikanth C S <srikanth.c.s@oracle.com>
> > > Sent: Monday, October 10, 2022 08:24
> > > To: linux-xfs@vger.kernel.org; Darrick Wong
> > > Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
> > > Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
> > > 
> > > fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> > > possible that when the machine crashes, the fs is in inconsistent
> > > state with the journal log not yet replayed. This can put the
> > > machine into rescue shell. To address this problem, mount and
> > > umount the fs before running xfs_repair.
> > 
> > What's the purpose of forcing xfs_repair to be run on every boot?
> > The whole point of having a journalling filesystem is to avoid
> > needing to run fsck on every boot.
> 
> I don't think repair-at-every-boot here is the overall goal for our
> customer base.
> 
> We've had some <cough> major support events over the last few months.
> There are two things going on here: some of the problems are due to our
> datacenters abending, and the rest are the pile of data corruption

/me has to look up what "abending" means in this context, because I
don't think you mean that the data center was "Absent By Enforced
Net Deprivation"...

Ah, it's an ancient abbreviation from the early days of IBM
mainframes meaning "abnormal end of task". i.e. something crashed.

> problems that you and I and Chandan have been working through for months
> now.
> 
> This means that customer support has 30,000 VMs to reboot.  90% of the
> systems involved seem to have survived more or less intact, but that
> leaves 10% of them with latent errors, unmountable root filesystems,
> etc.  They probably have even more than that, but I don't recommend
> inviting the G-men for a visit to find out the real sum.
> 
> Since these machines are remotely manageable, support /can/ inject the
> kernel command line with 'fsck.mode=force' to kick off xfs_repair if the
> machine won't come up or if they suspect there might be deeper issues
> with latent errors in the fs metadata, which is what they did to try to
> get everyone running ASAP while anticipating any future problems...

This context is kinda important :)

[....]

> > More explanation, please!
> 
> Frankly, I /don't/ want to expend a lot of time wringing our hands over
> how exactly do we hammer 2022 XFS tools into 1994 ext2 behavioral
> semantics.

Neither do I - I just want to understand the problem that the
change is trying to solve. The commit message gave no indication of
why the change was being proposed; commit messages need to tell the
whole story, not just describe the code change being made.  If it started
like:

"After a recent data center crash, we recently had to recover root
filesystems on several thousand VMs via a boot time fsck. Many of
them failed unnecessarily because.... "

then it becomes self evident that replaying the log in these
situations is necessary.

> What they really want is online fsck to detect and fix problems in real
> time, but I can't seem to engage the community on how exactly do we land
> this thing now that I've finished writing it.

I only want to look at the interfacing with the runtime structures
and locking to ensure we don't back ourselves into a nasty corner.
This is really only a small part of the bigger online repair
patchset.  Once we've got that sorted, I think we should just commit
the rest of it as it stands (i.e. tested but unreviewed) and fix as
needed, just like we do with the userspace xfs_repair code...

> --D
> 
> > > Signed-off-by: Srikanth C S <srikanth.c.s@oracle.com>
> > > ---
> > >  fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
> > >  1 file changed, 18 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> > > index 6af0f22..21a8c19 100755
> > > --- a/fsck/xfs_fsck.sh
> > > +++ b/fsck/xfs_fsck.sh
> > > @@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
> > >  fi
> > > 
> > >  if $FORCE; then
> > > -       xfs_repair -e $DEV
> > > -       repair2fsck_code $?
> > > +       if $AUTO; then
> > > +               xfs_repair -e $DEV
> > > +                error=$?
> > > +                if [ $error -eq 2 ]; then
> > > +                        echo "Replaying log for $DEV"
> > > +                        mkdir -p /tmp/tmp_mnt
> > > +                        mount $DEV /tmp/tmp_mnt

Need to handle mount failure here - this will need to drop to
a rescue shell for manual recovery, I think.
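
i.e. something like this (untested; exit code 4 is the standard fsck
"errors left uncorrected" status, which should be enough to bounce
the boot to the rescue shell):

        mkdir -p /tmp/tmp_mnt
        if ! mount $DEV /tmp/tmp_mnt; then
                echo "error: log replay failed, $DEV needs manual repair" >&2
                rmdir /tmp/tmp_mnt
                exit 4
        fi
        umount /tmp/tmp_mnt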

Also, what happens if there are mount options set like quota?
Doesn't this make more work for the recovery process if we elide
things like that here?

> > > +                        umount /tmp/tmp_mnt
> > > +                        xfs_repair -e $DEV
> > > +                        error=$?
> > > +                        rmdir /tmp/tmp_mnt
> > > +                fi
> > > +        else
> > > +                #fsck.mode=force is set but fsck.repair=no
> > > +                xfs_repair -n $DEV
> > > +                error=$?
> > > +        fi

I think that adding a "check only" mode should be a separate patch.
We have to consider different things here, such as it will run on an
unrecovered filesystem and potentially report it damaged when, in
fact, all that is needed is log recovery to run....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
  2022-10-11  4:26       ` Dave Chinner
@ 2022-10-11 16:08         ` Darrick J. Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Darrick J. Wong @ 2022-10-11 16:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick Wong, Srikanth C S, linux-xfs,
	Rajesh Sivaramasubramaniom, Junxiao Bi

On Tue, Oct 11, 2022 at 03:26:02PM +1100, Dave Chinner wrote:
> On Mon, Oct 10, 2022 at 05:30:48PM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 11, 2022 at 10:29:32AM +1100, Dave Chinner wrote:
> > > On Mon, Oct 10, 2022 at 03:40:51PM +0000, Darrick Wong wrote:
> > > > LGTM, want to send this to the upstream list to start that discussion?
> > 
> > UGH, so I thought this was an internal thread, but it turns out that
> > linux-xfs has been cc'd for a while but none of the messages made it to
> > lore.
> > 
> > I'll fill in some missing context below.
> > 
> > > > --D
> > > > 
> > > > ________________________________________
> > > > From: Srikanth C S <srikanth.c.s@oracle.com>
> > > > Sent: Monday, October 10, 2022 08:24
> > > > To: linux-xfs@vger.kernel.org; Darrick Wong
> > > > Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
> > > > Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
> > > > 
> > > > fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> > > > possible that when the machine crashes, the fs is in inconsistent
> > > > state with the journal log not yet replayed. This can put the
> > > > machine into rescue shell. To address this problem, mount and
> > > > umount the fs before running xfs_repair.
> > > 
> > > What's the purpose of forcing xfs_repair to be run on every boot?
> > > The whole point of having a journalling filesystem is to avoid
> > > needing to run fsck on every boot.
> > 
> > I don't think repair-at-every-boot here is the overall goal for our
> > customer base.
> > 
> > We've had some <cough> major support events over the last few months.
> > There are two things going on here: some of the problems are due to our
> > datacenters abending, and the rest are the pile of data corruption
> 
> /me has to look up what "abending" means in this context, because I
> don't think you mean that the data center was "Absent By Enforced
> Net Deprivation"...
> 
> Ah, it's an ancient abbreviation from the early days of IBM
> mainframes meaning "abnormal end of task". i.e. something crashed.

Yes.

> > problems that you and I and Chandan have been working through for months
> > now.
> > 
> > This means that customer support has 30,000 VMs to reboot.  90% of the
> > systems involved seem to have survived more or less intact, but that
> > leaves 10% of them with latent errors, unmountable root filesystems,
> > etc.  They probably have even more than that, but I don't recommend
> > inviting the G-men for a visit to find out the real sum.
> > 
> > Since these machines are remotely manageable, support /can/ inject the
> > kernel command line with 'fsck.mode=force' to kick off xfs_repair if the
> > machine won't come up or if they suspect there might be deeper issues
> > with latent errors in the fs metadata, which is what they did to try to
> > get everyone running ASAP while anticipating any future problems...
> 
> This context is kinda important :)
> 
> [....]
> 
> > > More explanation, please!
> > 
> > Frankly, I /don't/ want to expend a lot of time wringing our hands over
> > how exactly do we hammer 2022 XFS tools into 1994 ext2 behavioral
> > semantics.
> 
> Neither do I - I just want to understand the problem that the
> change is trying to solve. The commit message gave no indication of
> why the change was being proposed; commit messages need to tell the
> whole story, not just describe the code change being made.  If it started
> like:
> 
> "After a recent data center crash, we recently had to recover root
> filesystems on several thousand VMs via a boot time fsck. Many of
> them failed unnecessarily because.... "
> 
> then it becomes self evident that replaying the log in these
> situations is necessary.

<nod> Srikanth, would you reword the patch message to include some of
the details about why we want this patch?

> > What they really want is online fsck to detect and fix problems in real
> > time, but I can't seem to engage the community on how exactly do we land
> > this thing now that I've finished writing it.
> 
> I only want to look at the interfacing with the runtime structures
> and locking to ensure we don't back ourselves into a nasty corner.
> This is really only a small part of the bigger online repair
> patchset.  Once we've got that sorted, I think we should just commit
> the rest of it as it stands (i.e. tested but unreviewed) and fix as
> needed, just like we do with the userspace xfs_repair code...

Ok.

> > --D
> > 
> > > > Signed-off-by: Srikanth C S <srikanth.c.s@oracle.com>
> > > > ---
> > > >  fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
> > > >  1 file changed, 18 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> > > > index 6af0f22..21a8c19 100755
> > > > --- a/fsck/xfs_fsck.sh
> > > > +++ b/fsck/xfs_fsck.sh
> > > > @@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
> > > >  fi
> > > > 
> > > >  if $FORCE; then
> > > > -       xfs_repair -e $DEV
> > > > -       repair2fsck_code $?
> > > > +       if $AUTO; then
> > > > +               xfs_repair -e $DEV
> > > > +                error=$?
> > > > +                if [ $error -eq 2 ]; then

To reiterate something earlier in the thread: I would only do the
mount/umount if the caller passes -fy.  I would drop to the emergency
shell for -fa or -f.
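
Untested sketch of that, assuming we split a $YES flag out of the
option parsing (IIRC -y and -a both just set $AUTO today), with
replay_and_repair as a hypothetical helper wrapping the mount/umount
dance from this patch:

        if $FORCE; then
                if $YES; then
                        # -fy: ok to mount/umount to replay a dirty log
                        # before re-running repair
                        replay_and_repair $DEV
                else
                        # -f or -fa: as before; a dirty log makes repair
                        # exit with 2, which becomes a fsck failure and
                        # drops us to the emergency shell
                        xfs_repair -e $DEV
                        repair2fsck_code $?
                fi
                exit $?
        fi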

(Also: whitespace damage)

> > > > +                        echo "Replaying log for $DEV"
> > > > +                        mkdir -p /tmp/tmp_mnt
> > > > +                        mount $DEV /tmp/tmp_mnt
> 
> Need to handle mount failure here - this will need to drop to
> a rescue shell for manual recovery, I think.

Yep.

> Also, what happens if there are mount options set like quota?
> Doesn't this make more work for the recovery process if we elide
> things like that here?

Agreed, I think we need to extract rootflags= from /proc/cmdline.
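
Maybe something like this (untested, and assumes $DEV is the root
device so that rootflags= actually applies to it):

        mnt_opts=""
        for arg in $(cat /proc/cmdline); do
                case $arg in
                rootflags=*)
                        mnt_opts="-o ${arg#rootflags=}"
                        ;;
                esac
        done
        mount $mnt_opts $DEV /tmp/tmp_mnt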

> > > > +                        umount /tmp/tmp_mnt
> > > > +                        xfs_repair -e $DEV
> > > > +                        error=$?
> > > > +                        rmdir /tmp/tmp_mnt
> > > > +                fi
> > > > +        else
> > > > +                #fsck.mode=force is set but fsck.repair=no
> > > > +                xfs_repair -n $DEV
> > > > +                error=$?
> > > > +        fi
> 
> I think that adding a "check only" mode should be a separate patch.
> We have to consider different things here, such as it will run on an
> unrecovered filesystem and potentially report it damaged when, in
> fact, all that is needed is log recovery to run....

<nod> I'm not sure what we should return in that case.  Maybe 8 for
operational error?
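
Something like this, perhaps (sketch only; assumes xfs_logprint -t's
"state: <DIRTY>" output is stable enough to grep for):

        xfs_repair -n $DEV
        error=$?
        if [ $error -ne 0 ] && xfs_logprint -t $DEV 2>/dev/null | grep -q '<DIRTY>'; then
                # what looks like damage may be nothing more than the
                # unreplayed log, so don't claim the fs is corrupt
                echo "$DEV has a dirty log, mount it to recover" >&2
                exit 8
        fi
        repair2fsck_code $error
        exit $?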

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
