* [PATCH] btrfs: fix mount failure due to past and transient device flush error
@ 2021-09-06  9:09 fdmanana
  2021-09-07 11:26 ` David Sterba
  2021-09-07 15:15 ` [PATCH 0/2] btrfs: fix mount/remount failure due to past device flush errors fdmanana
  0 siblings, 2 replies; 15+ messages in thread
From: fdmanana @ 2021-09-06  9:09 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equal
to the maximum tolerated device failures for a raid profile.

However, if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device to be missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, where a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.

When this happens we get an error like this:

  $ mount /dev/sdc /mnt
  mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

  $ dmesg
  (...)
  [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
  [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
  [12918.890853] BTRFS error (device sdc): open_ctree failed

So fix this by making sure btrfs_check_rw_degradable() does not consider
flush errors from past mounts when it is being called at open_ctree() for
the purpose of checking if devices are missing, and clears the record of
such past errors from the devices. Any path during the mount that can
trigger a super block commit (replaying log trees, conversion from free
space cache v1 to v2) must do the usual checks for device flush errors,
just like before.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/disk-io.c | 13 ++++++++++---
 fs/btrfs/volumes.c | 25 +++++++++++++++++++++----
 3 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 38870ae46cbb..6871c43c2881 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -598,6 +598,9 @@ enum {
 	BTRFS_FS_32BIT_ERROR,
 	BTRFS_FS_32BIT_WARN,
 #endif
+
+	/* Used to signal we are checking if we can mount a fs in rw mode. */
+	BTRFS_FS_MOUNT_RW_CHECK,
 };
 
 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d80e5b22d32..2d668a070ec2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3564,10 +3564,17 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_sysfs;
 	}
 
-	if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
-		btrfs_warn(fs_info,
+	if (!sb_rdonly(sb)) {
+		bool can_mount;
+
+		set_bit(BTRFS_FS_MOUNT_RW_CHECK, &fs_info->flags);
+		can_mount = btrfs_check_rw_degradable(fs_info, NULL);
+		clear_bit(BTRFS_FS_MOUNT_RW_CHECK, &fs_info->flags);
+		if (!can_mount) {
+			btrfs_warn(fs_info,
 		"writable mount is not allowed due to too many missing devices");
-		goto fail_sysfs;
+			goto fail_sysfs;
+		}
 	}
 
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b81f25eed298..53da6735ef1b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7395,12 +7395,29 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 		for (i = 0; i < map->num_stripes; i++) {
 			struct btrfs_device *dev = map->stripes[i].dev;
 
-			if (!dev || !dev->bdev ||
-			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
-			    dev->last_flush_error)
+			if (dev && dev->last_flush_error) {
+				/*
+				 * If we had a flush error from a previous mount,
+				 * don't treat it as an error and clear the error
+				 * status. Such an error may be transient, and
+				 * just because it happened in a previous mount,
+				 * it does not mean it will happen again if we
+				 * mount the fs again. If it turns out the error
+				 * happens again after mounting, then we will
+				 * deal with it, abort the running transaction
+				 * and set the fs state to BTRFS_FS_STATE_ERROR.
+				 */
+				if (test_bit(BTRFS_FS_MOUNT_RW_CHECK,
+					     &fs_info->flags))
+					dev->last_flush_error = 0;
+				else
+					missing++;
+			} else if (!dev || !dev->bdev ||
+			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
 				missing++;
-			else if (failing_dev && failing_dev == dev)
+			} else if (failing_dev && failing_dev == dev) {
 				missing++;
+			}
 		}
 		if (missing > max_tolerated) {
 			if (!failing_dev)
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: fix mount failure due to past and transient device flush error
  2021-09-06  9:09 [PATCH] btrfs: fix mount failure due to past and transient device flush error fdmanana
@ 2021-09-07 11:26 ` David Sterba
  2021-09-07 15:15   ` Filipe Manana
  2021-09-07 15:15 ` [PATCH 0/2] btrfs: fix mount/remount failure due to past device flush errors fdmanana
  1 sibling, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-09-07 11:26 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, Sep 06, 2021 at 10:09:53AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we get an error flushing one device, during a super block commit, we
> record the error in the device structure, in the field 'last_flush_error'.
> This is used to later check if we should error out the super block commit,
> depending on whether the number of flush errors is greater than or equal
> to the maximum tolerated device failures for a raid profile.
> 
> However, if we get a transient device flush error, unmount the filesystem
> and later try to mount it, we can fail the mount because we treat that
> past error as critical and consider the device to be missing. Even if it's
> very likely that the error will happen again, as it's probably due to a
> hardware related problem, there may be cases where the error might not
> happen again. One example is during testing, where a test case like the
> new generic/648 from fstests always triggers this. The test cases
> generic/019 and generic/475 also trigger this scenario, but very
> sporadically.
> 
> When this happens we get an error like this:
> 
>   $ mount /dev/sdc /mnt
>   mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
> 
>   $ dmesg
>   (...)
>   [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
>   [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
>   [12918.890853] BTRFS error (device sdc): open_ctree failed
> 
> So fix this by making sure btrfs_check_rw_degradable() does not consider
> flush errors from past mounts when it is being called at open_ctree() for
> the purpose of checking if devices are missing, and clears the record of
> such past errors from the devices. Any path during the mount that can
> trigger a super block commit (replaying log trees, conversion from free
> space cache v1 to v2) must do the usual checks for device flush errors,
> just like before.

Why did you choose to set the global bit instead of passing a parameter?
AFAICS it's only checked inside btrfs_check_rw_degradable, so there are
no deep call stacks where passing that would be cumbersome.

I've also noticed that all callers of btrfs_check_rw_degradable pass
NULL to failing_dev, so that could be cleaned up before adding another
parameter.


* Re: [PATCH] btrfs: fix mount failure due to past and transient device flush error
  2021-09-07 11:26 ` David Sterba
@ 2021-09-07 15:15   ` Filipe Manana
  0 siblings, 0 replies; 15+ messages in thread
From: Filipe Manana @ 2021-09-07 15:15 UTC (permalink / raw)
  To: David Sterba, Filipe Manana, linux-btrfs

On Tue, Sep 7, 2021 at 12:27 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Sep 06, 2021 at 10:09:53AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When we get an error flushing one device, during a super block commit, we
> > record the error in the device structure, in the field 'last_flush_error'.
> > This is used to later check if we should error out the super block commit,
> > depending on whether the number of flush errors is greater than or equal
> > to the maximum tolerated device failures for a raid profile.
> >
> > However, if we get a transient device flush error, unmount the filesystem
> > and later try to mount it, we can fail the mount because we treat that
> > past error as critical and consider the device to be missing. Even if it's
> > very likely that the error will happen again, as it's probably due to a
> > hardware related problem, there may be cases where the error might not
> > happen again. One example is during testing, where a test case like the
> > new generic/648 from fstests always triggers this. The test cases
> > generic/019 and generic/475 also trigger this scenario, but very
> > sporadically.
> >
> > When this happens we get an error like this:
> >
> >   $ mount /dev/sdc /mnt
> >   mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
> >
> >   $ dmesg
> >   (...)
> >   [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
> >   [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
> >   [12918.890853] BTRFS error (device sdc): open_ctree failed
> >
> > So fix this by making sure btrfs_check_rw_degradable() does not consider
> > flush errors from past mounts when it is being called at open_ctree() for
> > the purpose of checking if devices are missing, and clears the record of
> > such past errors from the devices. Any path during the mount that can
> > trigger a super block commit (replaying log trees, conversion from free
> > space cache v1 to v2) must do the usual checks for device flush errors,
> > just like before.
>
> Why did you choose to set the global bit instead of passing a parameter?
> AFAICS it's only checked inside btrfs_check_rw_degradable, so there are
> no deep call stacks where passing that would be cumbersome.
>
> I've also noticed that all callers of btrfs_check_rw_degradable pass
> NULL to failing_dev, so that could be cleaned up before adding another
> parameter.

Yes, I didn't notice that before; my head was busy multitasking with other
problems. I'll send a v2.


* [PATCH 0/2] btrfs: fix mount/remount failure due to past device flush errors
  2021-09-06  9:09 [PATCH] btrfs: fix mount failure due to past and transient device flush error fdmanana
  2021-09-07 11:26 ` David Sterba
@ 2021-09-07 15:15 ` fdmanana
  2021-09-07 15:15   ` [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error fdmanana
  2021-09-07 15:15   ` [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable() fdmanana
  1 sibling, 2 replies; 15+ messages in thread
From: fdmanana @ 2021-09-07 15:15 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We currently fail mounting a fs, and remounting it from RO to RW, if we
had a device flush error in a past mount. This causes mount/remount
failures when such errors were transient and don't happen anymore. One
example is during tests that explicitly trigger IO failures, such as the
new generic/648 or, much less frequently, generic/019 and generic/475.
The first patch has more details in its changelog.

V2: Reworked to instead use an additional argument to btrfs_check_rw_degradable()
    and solve the same problem for remounts from RO to RW mode. Also added the
    second patch.

Filipe Manana (2):
  btrfs: fix mount failure due to past and transient device flush error
  btrfs: remove the failing device argument from btrfs_check_rw_degradable()

 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 32 +++++++++++++++++++++-----------
 fs/btrfs/volumes.h |  2 +-
 4 files changed, 25 insertions(+), 15 deletions(-)

-- 
2.33.0



* [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error
  2021-09-07 15:15 ` [PATCH 0/2] btrfs: fix mount/remount failure due to past device flush errors fdmanana
@ 2021-09-07 15:15   ` fdmanana
  2021-09-08 14:19     ` Anand Jain
  2021-09-08 18:05     ` [PATCH v3] " fdmanana
  2021-09-07 15:15   ` [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable() fdmanana
  1 sibling, 2 replies; 15+ messages in thread
From: fdmanana @ 2021-09-07 15:15 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equal
to the maximum tolerated device failures for a raid profile.

However, if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device to be missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, where a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.

When this happens we get an error like this:

  $ mount /dev/sdc /mnt
  mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

  $ dmesg
  (...)
  [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
  [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
  [12918.890853] BTRFS error (device sdc): open_ctree failed

So fix this by making sure btrfs_check_rw_degradable() does not consider
flush errors from past mounts when it's being called either in a mount
context or in a RO to RW remount context, and clears the flush errors
from the devices. Any path that triggers a super block commit during
mount/remount must still check for any flush errors and lead to a
mount/remount failure if any are found - these paths (replaying log
trees, converting the space cache from v1 to v2, etc) all happen after
the first call to btrfs_check_rw_degradable(), which is the only call
that should ignore and reset past flush errors from the devices.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 26 +++++++++++++++++++++-----
 fs/btrfs/volumes.h |  3 ++-
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d80e5b22d32..6d7d6288f80a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3564,7 +3564,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_sysfs;
 	}
 
-	if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
+	if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL, true)) {
 		btrfs_warn(fs_info,
 		"writable mount is not allowed due to too many missing devices");
 		goto fail_sysfs;
@@ -4013,7 +4013,7 @@ static blk_status_t wait_dev_flush(struct btrfs_device *device)
 
 static int check_barrier_error(struct btrfs_fs_info *fs_info)
 {
-	if (!btrfs_check_rw_degradable(fs_info, NULL))
+	if (!btrfs_check_rw_degradable(fs_info, NULL, false))
 		return -EIO;
 	return 0;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a927009f02a2..51519141b12f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2017,7 +2017,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			goto restore;
 		}
 
-		if (!btrfs_check_rw_degradable(fs_info, NULL)) {
+		if (!btrfs_check_rw_degradable(fs_info, NULL, true)) {
 			btrfs_warn(fs_info,
 		"too many missing devices, writable remount is not allowed");
 			ret = -EACCES;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b81f25eed298..2a5beba98273 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7367,7 +7367,7 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
  * Return false if any chunk doesn't meet the minimal RW mount requirements.
  */
 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
-					struct btrfs_device *failing_dev)
+			       struct btrfs_device *failing_dev, bool mounting_fs)
 {
 	struct extent_map_tree *map_tree = &fs_info->mapping_tree;
 	struct extent_map *em;
@@ -7395,12 +7395,28 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 		for (i = 0; i < map->num_stripes; i++) {
 			struct btrfs_device *dev = map->stripes[i].dev;
 
-			if (!dev || !dev->bdev ||
-			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
-			    dev->last_flush_error)
+			if (dev && dev->last_flush_error) {
+				/*
+				 * If we had a flush error from a previous mount,
+				 * don't treat it as an error and clear the error
+				 * status. Such an error may be transient, and
+				 * just because it happened in a previous mount,
+				 * it does not mean it will happen again if we
+				 * mount the fs again. If it turns out the error
+				 * happens again after mounting, then we will
+				 * deal with it, abort the running transaction
+				 * and set the fs state to BTRFS_FS_STATE_ERROR.
+				 */
+				if (mounting_fs)
+					dev->last_flush_error = 0;
+				else
+					missing++;
+			} else if (!dev || !dev->bdev ||
+			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
 				missing++;
-			else if (failing_dev && failing_dev == dev)
+			} else if (failing_dev && failing_dev == dev) {
 				missing++;
+			}
 		}
 		if (missing > max_tolerated) {
 			if (!failing_dev)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7e8f205978d9..7299aa36f41f 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -575,7 +575,8 @@ void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
 
 struct list_head * __attribute_const__ btrfs_get_fs_uuids(void);
 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
-					struct btrfs_device *failing_dev);
+			       struct btrfs_device *failing_dev,
+			       bool mounting_fs);
 void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 			       struct block_device *bdev,
 			       const char *device_path);
-- 
2.33.0



* [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable()
  2021-09-07 15:15 ` [PATCH 0/2] btrfs: fix mount/remount failure due to past device flush errors fdmanana
  2021-09-07 15:15   ` [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error fdmanana
@ 2021-09-07 15:15   ` fdmanana
  2021-09-07 16:05     ` David Sterba
  1 sibling, 1 reply; 15+ messages in thread
From: fdmanana @ 2021-09-07 15:15 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently all callers of btrfs_check_rw_degradable() pass a NULL value for
its 'failing_dev' argument, therefore making it useless. So just remove
that argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 10 ++--------
 fs/btrfs/volumes.h |  1 -
 4 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6d7d6288f80a..f5133833bba8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3564,7 +3564,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_sysfs;
 	}
 
-	if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL, true)) {
+	if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, true)) {
 		btrfs_warn(fs_info,
 		"writable mount is not allowed due to too many missing devices");
 		goto fail_sysfs;
@@ -4013,7 +4013,7 @@ static blk_status_t wait_dev_flush(struct btrfs_device *device)
 
 static int check_barrier_error(struct btrfs_fs_info *fs_info)
 {
-	if (!btrfs_check_rw_degradable(fs_info, NULL, false))
+	if (!btrfs_check_rw_degradable(fs_info, false))
 		return -EIO;
 	return 0;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 51519141b12f..da0395512cc2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2017,7 +2017,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			goto restore;
 		}
 
-		if (!btrfs_check_rw_degradable(fs_info, NULL, true)) {
+		if (!btrfs_check_rw_degradable(fs_info, true)) {
 			btrfs_warn(fs_info,
 		"too many missing devices, writable remount is not allowed");
 			ret = -EACCES;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2a5beba98273..a8352965e24c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7361,13 +7361,10 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
 /*
  * Check if all chunks in the fs are OK for read-write degraded mount
  *
- * If the @failing_dev is specified, it's accounted as missing.
- *
  * Return true if all chunks meet the minimal RW mount requirements.
  * Return false if any chunk doesn't meet the minimal RW mount requirements.
  */
-bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
-			       struct btrfs_device *failing_dev, bool mounting_fs)
+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info, bool mounting_fs)
 {
 	struct extent_map_tree *map_tree = &fs_info->mapping_tree;
 	struct extent_map *em;
@@ -7414,13 +7411,10 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 			} else if (!dev || !dev->bdev ||
 			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
 				missing++;
-			} else if (failing_dev && failing_dev == dev) {
-				missing++;
 			}
 		}
 		if (missing > max_tolerated) {
-			if (!failing_dev)
-				btrfs_warn(fs_info,
+			btrfs_warn(fs_info,
 	"chunk %llu missing %d devices, max tolerance is %d for writable mount",
 				   em->start, missing, max_tolerated);
 			free_extent_map(em);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7299aa36f41f..f880d5404b25 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -575,7 +575,6 @@ void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
 
 struct list_head * __attribute_const__ btrfs_get_fs_uuids(void);
 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
-			       struct btrfs_device *failing_dev,
 			       bool mounting_fs);
 void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 			       struct block_device *bdev,
-- 
2.33.0



* Re: [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable()
  2021-09-07 15:15   ` [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable() fdmanana
@ 2021-09-07 16:05     ` David Sterba
  2021-09-08 14:25       ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-09-07 16:05 UTC (permalink / raw)
  To: anand.jain; +Cc: fdmanana, linux-btrfs

On Tue, Sep 07, 2021 at 04:15:50PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Currently all callers of btrfs_check_rw_degradable() pass a NULL value for
> its 'failing_dev' argument, therefore making it useless. So just remove
> that argument.

Anand, did you have plans for using the argument? It's been NULL since
the initial patch

https://lore.kernel.org/linux-btrfs/00433e34-a56e-3898-80b9-32a304fe32e2@gmx.com/t/#u

as commit 6528b99d3d20 ("btrfs: factor btrfs_check_rw_degradable() to
check given device") and it was not part of a series.


* Re: [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error
  2021-09-07 15:15   ` [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error fdmanana
@ 2021-09-08 14:19     ` Anand Jain
  2021-09-08 14:26       ` Filipe Manana
  2021-09-08 18:05     ` [PATCH v3] " fdmanana
  1 sibling, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-09-08 14:19 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On 07/09/2021 23:15, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we get an error flushing one device, during a super block commit, we
> record the error in the device structure, in the field 'last_flush_error'.
> This is used to later check if we should error out the super block commit,
> depending on whether the number of flush errors is greater than or equal
> to the maximum tolerated device failures for a raid profile.


> However, if we get a transient device flush error, unmount the filesystem
> and later try to mount it, we can fail the mount because we treat that
> past error as critical and consider the device to be missing.

> Even if it's
> very likely that the error will happen again, as it's probably due to a
> hardware related problem, there may be cases where the error might not
> happen again. 

  But is there an impact due to the flush error, like the storage cache
having lost a few blocks? If so, then the current design is correct, no?

Thanks, Anand

> One example is during testing, and a test case like the
> new generic/648 from fstests always triggers this. The test cases
> generic/019 and generic/475 also trigger this scenario, but very
> sporadically.
> 
> When this happens we get an error like this:
> 
>    $ mount /dev/sdc /mnt
>    mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
> 
>    $ dmesg
>    (...)
>    [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
>    [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
>    [12918.890853] BTRFS error (device sdc): open_ctree failed
> 
> So fix this by making sure btrfs_check_rw_degradable() does not consider
> flush errors from past mounts when it's being called either on a mount
> context or on a RO to RW remount context, and clears the flush errors
> from the devices. Any path that triggers a super block commit during
> mount/remount must still check for any flush errors and lead to a
> mount/remount failure if any are found - all these paths (replaying log
> trees, convert space cache v1 to v2, etc) all happen after the first
> call to btrfs_check_rw_degradable(), which is the only call that should
> ignore and reset past flush errors from the devices.
> 

What if the flush error is real, in that the storage cache dropped a few
blocks and they did not make it to permanent storage?

Thanks, Anand


> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/disk-io.c |  4 ++--
>   fs/btrfs/super.c   |  2 +-
>   fs/btrfs/volumes.c | 26 +++++++++++++++++++++-----
>   fs/btrfs/volumes.h |  3 ++-
>   4 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7d80e5b22d32..6d7d6288f80a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3564,7 +3564,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>   		goto fail_sysfs;
>   	}
>   
> -	if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
> +	if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL, true)) {
>   		btrfs_warn(fs_info,
>   		"writable mount is not allowed due to too many missing devices");
>   		goto fail_sysfs;
> @@ -4013,7 +4013,7 @@ static blk_status_t wait_dev_flush(struct btrfs_device *device)
>   
>   static int check_barrier_error(struct btrfs_fs_info *fs_info)
>   {
> -	if (!btrfs_check_rw_degradable(fs_info, NULL))
> +	if (!btrfs_check_rw_degradable(fs_info, NULL, false))
>   		return -EIO;
>   	return 0;
>   }
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index a927009f02a2..51519141b12f 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2017,7 +2017,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>   			goto restore;
>   		}
>   
> -		if (!btrfs_check_rw_degradable(fs_info, NULL)) {
> +		if (!btrfs_check_rw_degradable(fs_info, NULL, true)) {
>   			btrfs_warn(fs_info,
>   		"too many missing devices, writable remount is not allowed");
>   			ret = -EACCES;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index b81f25eed298..2a5beba98273 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7367,7 +7367,7 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
>    * Return false if any chunk doesn't meet the minimal RW mount requirements.
>    */
>   bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
> -					struct btrfs_device *failing_dev)
> +			       struct btrfs_device *failing_dev, bool mounting_fs)
>   {
>   	struct extent_map_tree *map_tree = &fs_info->mapping_tree;
>   	struct extent_map *em;
> @@ -7395,12 +7395,28 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
>   		for (i = 0; i < map->num_stripes; i++) {
>   			struct btrfs_device *dev = map->stripes[i].dev;
>   
> -			if (!dev || !dev->bdev ||
> -			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
> -			    dev->last_flush_error)
> +			if (dev && dev->last_flush_error) {
> +				/*
> +				 * If we had a flush error from a previous mount,
> +				 * don't treat it as an error and clear the error
> +				 * status. Such an error may be transient, and
> +				 * just because it happened in a previous mount,
> +				 * it does not mean it will happen again if we
> +				 * mount the fs again. If it turns out the error
> +				 * happens again after mounting, then we will
> +				 * deal with it, abort the running transaction
> +				 * and set the fs state to BTRFS_FS_STATE_ERROR.
> +				 */
> +				if (mounting_fs)
> +					dev->last_flush_error = 0;
> +				else
> +					missing++;
> +			} else if (!dev || !dev->bdev ||
> +			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
>   				missing++;
> -			else if (failing_dev && failing_dev == dev)
> +			} else if (failing_dev && failing_dev == dev) {
>   				missing++;
> +			}
>   		}
>   		if (missing > max_tolerated) {
>   			if (!failing_dev)
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 7e8f205978d9..7299aa36f41f 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -575,7 +575,8 @@ void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
>   
>   struct list_head * __attribute_const__ btrfs_get_fs_uuids(void);
>   bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
> -					struct btrfs_device *failing_dev);
> +			       struct btrfs_device *failing_dev,
> +			       bool mounting_fs);
>   void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
>   			       struct block_device *bdev,
>   			       const char *device_path);
> 



* Re: [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable()
  2021-09-07 16:05     ` David Sterba
@ 2021-09-08 14:25       ` Anand Jain
  2021-09-09 14:19         ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-09-08 14:25 UTC (permalink / raw)
  To: dsterba; +Cc: fdmanana, linux-btrfs



On 08/09/2021 00:05, David Sterba wrote:
> On Tue, Sep 07, 2021 at 04:15:50PM +0100, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> Currently all callers of btrfs_check_rw_degradable() pass a NULL value for
>> its 'failing_dev' argument, therefore making it useless. So just remove
>> that argument.
> 
> Anand, did you have plans with using the argument? It's been NULL since
> the initial patch
> 
> https://lore.kernel.org/linux-btrfs/00433e34-a56e-3898-80b9-32a304fe32e2@gmx.com/t/#u
> 
> as commit 6528b99d3d20 ("btrfs: factor btrfs_check_rw_degradable() to
> check given device") and it was not part of a series.

David,

I have a local patch which is V9 of this [1] using the 2nd argument of 
btrfs_check_rw_degradable(), essentially to check whether the mounted fs 
should fail or can continue to write when a disk fails.

[1]
btrfs: introduce device dynamic state transition to failed

https://patchwork.kernel.org/project/linux-btrfs/patch/20171003155920.24925-2-anand.jain@oracle.com/

We need further discussion on this design, I think.

Thanks, Anand


* Re: [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error
  2021-09-08 14:19     ` Anand Jain
@ 2021-09-08 14:26       ` Filipe Manana
  2021-09-08 16:30         ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Filipe Manana @ 2021-09-08 14:26 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Wed, Sep 8, 2021 at 3:20 PM Anand Jain <anand.jain@oracle.com> wrote:
>
> On 07/09/2021 23:15, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When we get an error flushing one device, during a super block commit, we
> > record the error in the device structure, in the field 'last_flush_error'.
> > This is used to later check if we should error out the super block commit,
> > depending on whether the number of flush errors is greater than or equal
> > to the maximum tolerated device failures for a raid profile.
>
>
> > However if we get a transient device flush error, unmount the filesystem
> > and later try to mount it, we can fail the mount because we treat that
> > past error as critical and consider the device missing.
>
> > Even if it's
> > very likely that the error will happen again, as it's probably due to a
> > hardware related problem, there may be cases where the error might not
> > happen again.
>
>   But is there an impact due to the flush error, like the storage cache
> losing a few blocks? If so, then the current design is correct. No?

If there was a flush error, then we aborted the current transaction
and set the filesystem to error state.
We only write the super block if we are able to do the device flushes.

>
> Thanks, Anand
>
> > One example is during testing, and a test case like the
> > new generic/648 from fstests always triggers this. The test cases
> > generic/019 and generic/475 also trigger this scenario, but very
> > sporadically.
> >
> > When this happens we get an error like this:
> >
> >    $ mount /dev/sdc /mnt
> >    mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
> >
> >    $ dmesg
> >    (...)
> >    [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
> >    [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
> >    [12918.890853] BTRFS error (device sdc): open_ctree failed
> >
> > So fix this by making sure btrfs_check_rw_degradable() does not consider
> > flush errors from past mounts when it's being called either on a mount
> > context or on a RO to RW remount context, and clears the flush errors
> > from the devices. Any path that triggers a super block commit during
> > mount/remount must still check for any flush errors and lead to a
> > mount/remount failure if any are found - all these paths (replaying log
> > trees, convert space cache v1 to v2, etc) all happen after the first
> > call to btrfs_check_rw_degradable(), which is the only call that should
> > ignore and reset past flush errors from the devices.
> >
>
> What if the flush error is real, and the storage cache dropped a few
> blocks that did not make it to permanent storage?

The previous comment answers this as well.

The comment in the code also says the same.

Thanks.

>
> Thanks, Anand
>
>
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/disk-io.c |  4 ++--
> >   fs/btrfs/super.c   |  2 +-
> >   fs/btrfs/volumes.c | 26 +++++++++++++++++++++-----
> >   fs/btrfs/volumes.h |  3 ++-
> >   4 files changed, 26 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 7d80e5b22d32..6d7d6288f80a 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3564,7 +3564,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> >               goto fail_sysfs;
> >       }
> >
> > -     if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
> > +     if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL, true)) {
> >               btrfs_warn(fs_info,
> >               "writable mount is not allowed due to too many missing devices");
> >               goto fail_sysfs;
> > @@ -4013,7 +4013,7 @@ static blk_status_t wait_dev_flush(struct btrfs_device *device)
> >
> >   static int check_barrier_error(struct btrfs_fs_info *fs_info)
> >   {
> > -     if (!btrfs_check_rw_degradable(fs_info, NULL))
> > +     if (!btrfs_check_rw_degradable(fs_info, NULL, false))
> >               return -EIO;
> >       return 0;
> >   }
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index a927009f02a2..51519141b12f 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -2017,7 +2017,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
> >                       goto restore;
> >               }
> >
> > -             if (!btrfs_check_rw_degradable(fs_info, NULL)) {
> > +             if (!btrfs_check_rw_degradable(fs_info, NULL, true)) {
> >                       btrfs_warn(fs_info,
> >               "too many missing devices, writable remount is not allowed");
> >                       ret = -EACCES;
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index b81f25eed298..2a5beba98273 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -7367,7 +7367,7 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
> >    * Return false if any chunk doesn't meet the minimal RW mount requirements.
> >    */
> >   bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
> > -                                     struct btrfs_device *failing_dev)
> > +                            struct btrfs_device *failing_dev, bool mounting_fs)
> >   {
> >       struct extent_map_tree *map_tree = &fs_info->mapping_tree;
> >       struct extent_map *em;
> > @@ -7395,12 +7395,28 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
> >               for (i = 0; i < map->num_stripes; i++) {
> >                       struct btrfs_device *dev = map->stripes[i].dev;
> >
> > -                     if (!dev || !dev->bdev ||
> > -                         test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
> > -                         dev->last_flush_error)
> > +                     if (dev && dev->last_flush_error) {
> > +                             /*
> > +                              * If we had a flush error from a previous mount,
> > +                              * don't treat it as an error and clear the error
> > +                              * status. Such an error may be transient, and
> > +                              * just because it happened in a previous mount,
> > +                              * it does not mean it will happen again if we
> > +                              * mount the fs again. If it turns out the error
> > +                              * happens again after mounting, then we will
> > +                              * deal with it, abort the running transaction
> > +                              * and set the fs state to BTRFS_FS_STATE_ERROR.
> > +                              */
> > +                             if (mounting_fs)
> > +                                     dev->last_flush_error = 0;
> > +                             else
> > +                                     missing++;
> > +                     } else if (!dev || !dev->bdev ||
> > +                         test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
> >                               missing++;
> > -                     else if (failing_dev && failing_dev == dev)
> > +                     } else if (failing_dev && failing_dev == dev) {
> >                               missing++;
> > +                     }
> >               }
> >               if (missing > max_tolerated) {
> >                       if (!failing_dev)
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index 7e8f205978d9..7299aa36f41f 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -575,7 +575,8 @@ void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
> >
> >   struct list_head * __attribute_const__ btrfs_get_fs_uuids(void);
> >   bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
> > -                                     struct btrfs_device *failing_dev);
> > +                            struct btrfs_device *failing_dev,
> > +                            bool mounting_fs);
> >   void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
> >                              struct block_device *bdev,
> >                              const char *device_path);
> >
>


* Re: [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error
  2021-09-08 14:26       ` Filipe Manana
@ 2021-09-08 16:30         ` David Sterba
  2021-09-08 17:32           ` Filipe Manana
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-09-08 16:30 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Anand Jain, linux-btrfs

On Wed, Sep 08, 2021 at 03:26:55PM +0100, Filipe Manana wrote:
> On Wed, Sep 8, 2021 at 3:20 PM Anand Jain <anand.jain@oracle.com> wrote:
> >
> > On 07/09/2021 23:15, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When we get an error flushing one device, during a super block commit, we
> > > record the error in the device structure, in the field 'last_flush_error'.
> > > This is used to later check if we should error out the super block commit,
> > > depending on whether the number of flush errors is greater than or equal
> > > to the maximum tolerated device failures for a raid profile.
> >
> >
> > > However if we get a transient device flush error, unmount the filesystem
> > > and later try to mount it, we can fail the mount because we treat that
> > > past error as critical and consider the device missing.
> >
> > > Even if it's
> > > very likely that the error will happen again, as it's probably due to a
> > > hardware related problem, there may be cases where the error might not
> > > happen again.
> >
> >   But is there an impact due to the flush error, like the storage cache
> > losing a few blocks? If so, then the current design is correct. No?
> 
> If there was a flush error, then we aborted the current transaction
> and set the filesystem to error state.
> We only write the super block if we are able to do the device flushes.

Should the last_flush_error be reset in btrfs_close_one_device once the
filesystem is unmounted? I wonder if we should allow the status to leak
past the unmount and have to care about it in the next mount.


* Re: [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error
  2021-09-08 16:30         ` David Sterba
@ 2021-09-08 17:32           ` Filipe Manana
  0 siblings, 0 replies; 15+ messages in thread
From: Filipe Manana @ 2021-09-08 17:32 UTC (permalink / raw)
  To: David Sterba, Filipe Manana, Anand Jain, linux-btrfs

On Wed, Sep 8, 2021 at 5:30 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Sep 08, 2021 at 03:26:55PM +0100, Filipe Manana wrote:
> > On Wed, Sep 8, 2021 at 3:20 PM Anand Jain <anand.jain@oracle.com> wrote:
> > >
> > > On 07/09/2021 23:15, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > When we get an error flushing one device, during a super block commit, we
> > > > record the error in the device structure, in the field 'last_flush_error'.
> > > > This is used to later check if we should error out the super block commit,
> > > > depending on whether the number of flush errors is greater than or equal
> > > > to the maximum tolerated device failures for a raid profile.
> > >
> > >
> > > > However if we get a transient device flush error, unmount the filesystem
> > > > and later try to mount it, we can fail the mount because we treat that
> > > > past error as critical and consider the device missing.
> > >
> > > > Even if it's
> > > > very likely that the error will happen again, as it's probably due to a
> > > > hardware related problem, there may be cases where the error might not
> > > > happen again.
> > >
> > >   But is there an impact due to the flush error, like the storage cache
> > > losing a few blocks? If so, then the current design is correct. No?
> >
> > If there was a flush error, then we aborted the current transaction
> > and set the filesystem to error state.
> > We only write the super block if we are able to do the device flushes.
>
> Should the last_flush_error be reset in btrfs_close_one_device once the
> filesystem is unmounted? I wonder if we should allow the status to leak
> past the unmount and have to care about it in the next mount.

Yes, that's a simpler and more logical approach.
Will send a new version doing it that way. Thanks.


* [PATCH v3] btrfs: fix mount failure due to past and transient device flush error
  2021-09-07 15:15   ` [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error fdmanana
  2021-09-08 14:19     ` Anand Jain
@ 2021-09-08 18:05     ` fdmanana
  2021-09-09 15:43       ` David Sterba
  1 sibling, 1 reply; 15+ messages in thread
From: fdmanana @ 2021-09-08 18:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equal
to the maximum tolerated device failures for a raid profile.

However if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, and a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.

When this happens we get an error like this:

  $ mount /dev/sdc /mnt
  mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

  $ dmesg
  (...)
  [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
  [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
  [12918.890853] BTRFS error (device sdc): open_ctree failed

The failure happens because when btrfs_check_rw_degradable() is called at
mount time, or at remount from RO to RW time, it sees a non-zero value in
a device's ->last_flush_error attribute, and therefore considers that the
device is 'missing'.

Fix this by setting a device's ->last_flush_error to zero when we close a
device, making sure the error is not seen on the next mount attempt. We
only need to track flush errors during the current mount, so that we never
commit a super block if such errors happened.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V3: Use a different and cleaner approach, by resetting the flush error
    from a device when we close it, so that it's not seen on the next
    mount attempt.

 fs/btrfs/volumes.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b81f25eed298..2101a5bd4eba 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1137,6 +1137,19 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 	atomic_set(&device->dev_stats_ccnt, 0);
 	extent_io_tree_release(&device->alloc_state);
 
+	/*
+	 * Reset the flush error record. We might have a transient flush error
+	 * in this mount, and if so we aborted the current transaction and set
+	 * the fs to an error state, guaranteeing no super blocks can be further
+	 * committed. However that error might be transient and if we unmount the
+	 * filesystem and mount it again, we should allow the mount to succeed
+	 * (btrfs_check_rw_degradable() should not fail) - if after mounting the
+	 * filesystem again we still get flush errors, then we will again abort
+	 * any transaction and set the error state, guaranteeing no commits of
+	 * unsafe super blocks.
+	 */
+	device->last_flush_error = 0;
+
 	/* Verify the device is back in a pristine state  */
 	ASSERT(!test_bit(BTRFS_DEV_STATE_FLUSH_SENT, &device->dev_state));
 	ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state));
-- 
2.33.0



* Re: [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable()
  2021-09-08 14:25       ` Anand Jain
@ 2021-09-09 14:19         ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-09-09 14:19 UTC (permalink / raw)
  To: Anand Jain; +Cc: fdmanana, linux-btrfs

On Wed, Sep 08, 2021 at 10:25:52PM +0800, Anand Jain wrote:
> On 08/09/2021 00:05, David Sterba wrote:
> > On Tue, Sep 07, 2021 at 04:15:50PM +0100, fdmanana@kernel.org wrote:
> >> From: Filipe Manana <fdmanana@suse.com>
> >>
> >> Currently all callers of btrfs_check_rw_degradable() pass a NULL value for
> >> its 'failing_dev' argument, therefore making it useless. So just remove
> >> that argument.
> > 
> > Anand, did you have plans with using the argument? It's been NULL since
> > the initial patch
> > 
> > https://lore.kernel.org/linux-btrfs/00433e34-a56e-3898-80b9-32a304fe32e2@gmx.com/t/#u
> > 
> > as commit 6528b99d3d20 ("btrfs: factor btrfs_check_rw_degradable() to
> > check given device") and it was not part of a series.
> 
> I have a local patch which is V9 of this [1] using the 2nd argument of 
> btrfs_check_rw_degradable(), essentially to check whether the mounted fs 
> should fail or can continue to write when a disk fails.
> 
> [1]
> btrfs: introduce device dynamic state transition to failed
> 
> https://patchwork.kernel.org/project/linux-btrfs/patch/20171003155920.24925-2-anand.jain@oracle.com/
> 
> We need further discussion on this design, I think.

Yeah, this does not look like a simple feature. I think we can remove the
parameter for now and add it back eventually.


* Re: [PATCH v3] btrfs: fix mount failure due to past and transient device flush error
  2021-09-08 18:05     ` [PATCH v3] " fdmanana
@ 2021-09-09 15:43       ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-09-09 15:43 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Sep 08, 2021 at 07:05:44PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we get an error flushing one device, during a super block commit, we
> record the error in the device structure, in the field 'last_flush_error'.
> This is used to later check if we should error out the super block commit,
> depending on whether the number of flush errors is greater than or equal
> to the maximum tolerated device failures for a raid profile.
> 
> However if we get a transient device flush error, unmount the filesystem
> and later try to mount it, we can fail the mount because we treat that
> past error as critical and consider the device missing. Even if it's
> very likely that the error will happen again, as it's probably due to a
> hardware related problem, there may be cases where the error might not
> happen again. One example is during testing, and a test case like the
> new generic/648 from fstests always triggers this. The test cases
> generic/019 and generic/475 also trigger this scenario, but very
> sporadically.
> 
> When this happens we get an error like this:
> 
>   $ mount /dev/sdc /mnt
>   mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
> 
>   $ dmesg
>   (...)
>   [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
>   [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
>   [12918.890853] BTRFS error (device sdc): open_ctree failed
> 
> The failure happens because when btrfs_check_rw_degradable() is called at
> mount time, or at remount from RO to RW time, it sees a non-zero value in
> a device's ->last_flush_error attribute, and therefore considers that the
> device is 'missing'.
> 
> Fix this by setting a device's ->last_flush_error to zero when we close a
> device, making sure the error is not seen on the next mount attempt. We
> only need to track flush errors during the current mount, so that we never
> commit a super block if such errors happened.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V3: Use a different and cleaner approach, by resetting the flush error
>     from a device when we close it, so that it's not seen on the next
>     mount attempt.

Added to misc-next, thanks.


end of thread, other threads:[~2021-09-09 15:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-06  9:09 [PATCH] btrfs: fix mount failure due to past and transient device flush error fdmanana
2021-09-07 11:26 ` David Sterba
2021-09-07 15:15   ` Filipe Manana
2021-09-07 15:15 ` [PATCH 0/2] btrfs: fix mount/remount failure due to past device flush errors fdmanana
2021-09-07 15:15   ` [PATCH 1/2] btrfs: fix mount failure due to past and transient device flush error fdmanana
2021-09-08 14:19     ` Anand Jain
2021-09-08 14:26       ` Filipe Manana
2021-09-08 16:30         ` David Sterba
2021-09-08 17:32           ` Filipe Manana
2021-09-08 18:05     ` [PATCH v3] " fdmanana
2021-09-09 15:43       ` David Sterba
2021-09-07 15:15   ` [PATCH 2/2] btrfs: remove the failing device argument from btrfs_check_rw_degradable() fdmanana
2021-09-07 16:05     ` David Sterba
2021-09-08 14:25       ` Anand Jain
2021-09-09 14:19         ` David Sterba
