[PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling
@ 2023-09-22  2:55 Qu Wenruo
  2023-09-22  2:55 ` [PATCH 1/3] btrfs: explicitly mark BTRFS_MOUNT_ enum as 64bit Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Qu Wenruo @ 2023-09-22  2:55 UTC (permalink / raw)
  To: linux-btrfs

During a very interesting (and weird) debugging session, it turns out
that btrfs will ignore a lot of write errors until we hit some critical
location, then btrfs started reacting, normally by aborting the
transaction.

This can be problematic for two types of people:

- Developers
  Sometimes we want to catch the earlies sign, continuing without any
  obvious errors (other than kernel error messages) can make debugging
  much harder.

- Sysadmins who wants to catch problems early
  Dmesg is not really something users would check frequently, and even
  they check it, it may already be too late.
  Meanwhile if the fs flips read-only early it's not only noisy but also
  saves the context much better (more relevant dmesgs etc).

On the other hand, I totally understand if just a single sector failed
to be write and we mark the whole fs read-only, it can be super
frustrating for regular end users, thus we can not make it the default
behavior.

So here we introduce a mount option group "abort=", and make the
following errors more noisy and abort early if specified by the user.

- Any super block write back failure
  Currently we're very loose on the super block writeback failure.
  The failure has to meet both conditions below:
  * The primary super block writeback failed
  * Total failed devices go beyond tolerance
    The tolerance is super high, num_devices - 1. To me this is
    too high, but I don't have a better idea yet.

  This new "rescue=super" may be more frequently used considering how
  loose our existing tolerance is.

- Any data writeback failure
  This is only for the data writeback at btrfs bio layer.
  This means, if a data sector is going to be written to a RAID1 chunk,
  and one mirror failed, we still consider the writeback succeeded.

There would be another one for btrfs bio layer, but I have found
something weird in the code, thus it would only be introduced after I
solved the problem there, meanwhile we can discuss on the usefulness of
this patchset.

Qu Wenruo (3):
  btrfs: explicitly mark BTRFS_MOUNT_ enum as 64bit
  btrfs: introduce "abort=super" mount option
  btrfs: introduce "abort=data" mount option

 fs/btrfs/disk-io.c   | 12 ++++++
 fs/btrfs/extent_io.c |  8 +++-
 fs/btrfs/fs.h        | 64 +++++++++++++++---------------
 fs/btrfs/inode.c     |  9 ++++-
 fs/btrfs/super.c     | 94 +++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 149 insertions(+), 38 deletions(-)

-- 
2.42.0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] btrfs: explicitly mark BTRFS_MOUNT_ enum as 64bit
  2023-09-22  2:55 [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling Qu Wenruo
@ 2023-09-22  2:55 ` Qu Wenruo
  2023-09-22  2:55 ` [PATCH 2/3] btrfs: introduce "abort=super" mount option Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2023-09-22  2:55 UTC (permalink / raw)
  To: linux-btrfs

Currently we already have 29 BTRFS_MOUNT_* enums, and with several new
incoming mount options, we will go beyond 32bits.

For now we use "unsigned long" for fs_info->mount_opt, which is 64bit
for x86_64, but it's not always the case for all architects.

So here we explicitly mark BTRFS_MOUNT_* enum and
btrfs_fs_info::mount_opt as 64bit to avoid overflow.

For common architects like x86_64 and aarch64, this change makes no
difference, but for older 32bit systems it would cause a slight memory
usage increase per-fs (4 bytes increase + possible 4 bytes hole).

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/fs.h | 62 +++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 19f9a444bcd8..cec28d6b20bc 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -158,36 +158,36 @@ enum {
  * Note: don't forget to add new options to btrfs_show_options()
  */
 enum {
-	BTRFS_MOUNT_NODATASUM			= (1UL << 0),
-	BTRFS_MOUNT_NODATACOW			= (1UL << 1),
-	BTRFS_MOUNT_NOBARRIER			= (1UL << 2),
-	BTRFS_MOUNT_SSD				= (1UL << 3),
-	BTRFS_MOUNT_DEGRADED			= (1UL << 4),
-	BTRFS_MOUNT_COMPRESS			= (1UL << 5),
-	BTRFS_MOUNT_NOTREELOG   		= (1UL << 6),
-	BTRFS_MOUNT_FLUSHONCOMMIT		= (1UL << 7),
-	BTRFS_MOUNT_SSD_SPREAD			= (1UL << 8),
-	BTRFS_MOUNT_NOSSD			= (1UL << 9),
-	BTRFS_MOUNT_DISCARD_SYNC		= (1UL << 10),
-	BTRFS_MOUNT_FORCE_COMPRESS      	= (1UL << 11),
-	BTRFS_MOUNT_SPACE_CACHE			= (1UL << 12),
-	BTRFS_MOUNT_CLEAR_CACHE			= (1UL << 13),
-	BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED	= (1UL << 14),
-	BTRFS_MOUNT_ENOSPC_DEBUG		= (1UL << 15),
-	BTRFS_MOUNT_AUTO_DEFRAG			= (1UL << 16),
-	BTRFS_MOUNT_USEBACKUPROOT		= (1UL << 17),
-	BTRFS_MOUNT_SKIP_BALANCE		= (1UL << 18),
-	BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	= (1UL << 19),
-	BTRFS_MOUNT_RESCAN_UUID_TREE		= (1UL << 20),
-	BTRFS_MOUNT_FRAGMENT_DATA		= (1UL << 21),
-	BTRFS_MOUNT_FRAGMENT_METADATA		= (1UL << 22),
-	BTRFS_MOUNT_FREE_SPACE_TREE		= (1UL << 23),
-	BTRFS_MOUNT_NOLOGREPLAY			= (1UL << 24),
-	BTRFS_MOUNT_REF_VERIFY			= (1UL << 25),
-	BTRFS_MOUNT_DISCARD_ASYNC		= (1UL << 26),
-	BTRFS_MOUNT_IGNOREBADROOTS		= (1UL << 27),
-	BTRFS_MOUNT_IGNOREDATACSUMS		= (1UL << 28),
-	BTRFS_MOUNT_NODISCARD			= (1UL << 29),
+	BTRFS_MOUNT_NODATASUM			= (1ULL << 0),
+	BTRFS_MOUNT_NODATACOW			= (1ULL << 1),
+	BTRFS_MOUNT_NOBARRIER			= (1ULL << 2),
+	BTRFS_MOUNT_SSD				= (1ULL << 3),
+	BTRFS_MOUNT_DEGRADED			= (1ULL << 4),
+	BTRFS_MOUNT_COMPRESS			= (1ULL << 5),
+	BTRFS_MOUNT_NOTREELOG			= (1ULL << 6),
+	BTRFS_MOUNT_FLUSHONCOMMIT		= (1ULL << 7),
+	BTRFS_MOUNT_SSD_SPREAD			= (1ULL << 8),
+	BTRFS_MOUNT_NOSSD			= (1ULL << 9),
+	BTRFS_MOUNT_DISCARD_SYNC		= (1ULL << 10),
+	BTRFS_MOUNT_FORCE_COMPRESS		= (1ULL << 11),
+	BTRFS_MOUNT_SPACE_CACHE			= (1ULL << 12),
+	BTRFS_MOUNT_CLEAR_CACHE			= (1ULL << 13),
+	BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED	= (1ULL << 14),
+	BTRFS_MOUNT_ENOSPC_DEBUG		= (1ULL << 15),
+	BTRFS_MOUNT_AUTO_DEFRAG			= (1ULL << 16),
+	BTRFS_MOUNT_USEBACKUPROOT		= (1ULL << 17),
+	BTRFS_MOUNT_SKIP_BALANCE		= (1ULL << 18),
+	BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	= (1ULL << 19),
+	BTRFS_MOUNT_RESCAN_UUID_TREE		= (1ULL << 20),
+	BTRFS_MOUNT_FRAGMENT_DATA		= (1ULL << 21),
+	BTRFS_MOUNT_FRAGMENT_METADATA		= (1ULL << 22),
+	BTRFS_MOUNT_FREE_SPACE_TREE		= (1ULL << 23),
+	BTRFS_MOUNT_NOLOGREPLAY			= (1ULL << 24),
+	BTRFS_MOUNT_REF_VERIFY			= (1ULL << 25),
+	BTRFS_MOUNT_DISCARD_ASYNC		= (1ULL << 26),
+	BTRFS_MOUNT_IGNOREBADROOTS		= (1ULL << 27),
+	BTRFS_MOUNT_IGNOREDATACSUMS		= (1ULL << 28),
+	BTRFS_MOUNT_NODISCARD			= (1ULL << 29),
 };
 
 /*
@@ -429,7 +429,7 @@ struct btrfs_fs_info {
 	 * required instead of the faster short fsync log commits
 	 */
 	u64 last_trans_log_full_commit;
-	unsigned long mount_opt;
+	u64 mount_opt;
 
 	unsigned long compress_type:4;
 	unsigned int compress_level;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] btrfs: introduce "abort=super" mount option
  2023-09-22  2:55 [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling Qu Wenruo
  2023-09-22  2:55 ` [PATCH 1/3] btrfs: explicitly mark BTRFS_MOUNT_ enum as 64bit Qu Wenruo
@ 2023-09-22  2:55 ` Qu Wenruo
  2023-09-22  2:55 ` [PATCH 3/3] btrfs: introduce "abort=data" " Qu Wenruo
  2023-09-22 14:55 ` [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling David Sterba
  3 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2023-09-22  2:55 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs would only report error when the primary super block
writeback failed.
If the backup ones failed to be written, we still continue and just
output a warning.

This is fine but sometimes such warning is an early indication of bigger
problems.
For developers and other critical missions, we may want the filesystem
to be more noisy by abort the current transaction for those situations.

This patch would introduce the mount option group "abort=", and
introduce the first sub option: "super".
This new option would make btrfs to abort and mark the fs read-only if
any super block failed to be written back.

This is different from the existing code by:

- We do not ignore backup super blocks write failure

- We do not accept any super writeback failure for any device
  Currently we allow "num_devices - 1" devices to have super block
  writeback failure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 12 +++++++
 fs/btrfs/fs.h      |  1 +
 fs/btrfs/super.c   | 84 +++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dc577b3c53f6..5a85b517a031 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3758,6 +3758,8 @@ static int write_dev_supers(struct btrfs_device *device,
 
 	if (max_mirrors == 0)
 		max_mirrors = BTRFS_SUPER_MIRROR_MAX;
+	if (btrfs_test_opt(fs_info, ABORT_SUPER))
+		max_mirrors = 1;
 
 	shash->tfm = fs_info->csum_shash;
 
@@ -3849,6 +3851,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
 
 	if (max_mirrors == 0)
 		max_mirrors = BTRFS_SUPER_MIRROR_MAX;
+	if (btrfs_test_opt(device->fs_info, ABORT_SUPER))
+		max_mirrors = 1;
 
 	for (i = 0; i < max_mirrors; i++) {
 		struct page *page;
@@ -3895,6 +3899,12 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
 			  device->devid);
 		return -1;
 	}
+	if (errors >= i) {
+		btrfs_err(device->fs_info,
+			  "error writing super blocks to device %llu",
+			  device->devid);
+		return -1;
+	}
 
 	return errors < i ? 0 : -1;
 }
@@ -4058,6 +4068,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	head = &fs_info->fs_devices->devices;
 	max_errors = btrfs_super_num_devices(fs_info->super_copy) - 1;
+	if (btrfs_test_opt(fs_info, ABORT_SUPER))
+		max_errors = 0;
 
 	if (do_barriers) {
 		ret = barrier_all_devices(fs_info);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index cec28d6b20bc..4fc0afcf1ef2 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -188,6 +188,7 @@ enum {
 	BTRFS_MOUNT_IGNOREBADROOTS		= (1ULL << 27),
 	BTRFS_MOUNT_IGNOREDATACSUMS		= (1ULL << 28),
 	BTRFS_MOUNT_NODISCARD			= (1ULL << 29),
+	BTRFS_MOUNT_ABORT_SUPER			= (1ULL << 30),
 };
 
 /*
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5c6054f0552b..41ab8c6e3fab 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -125,6 +125,11 @@ enum {
 	Opt_ignoredatacsums,
 	Opt_rescue_all,
 
+	/* Extra abort options. */
+	Opt_abort,
+	Opt_abort_super,
+	Opt_abort_all,
+
 	/* Deprecated options */
 	Opt_recovery,
 	Opt_inode_cache, Opt_noinode_cache,
@@ -187,8 +192,12 @@ static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
 
+	/* Abort options */
+	{Opt_abort, "abort=%s"},
+
 	/* Rescue options */
 	{Opt_rescue, "rescue=%s"},
+
 	/* Deprecated, with alias rescue=nologreplay */
 	{Opt_nologreplay, "nologreplay"},
 	/* Deprecated, with alias rescue=usebackuproot */
@@ -222,6 +231,12 @@ static const match_table_t rescue_tokens = {
 	{Opt_err, NULL},
 };
 
+static const match_table_t abort_tokens = {
+	{Opt_abort_super, "super"},
+	{Opt_abort_all, "all"},
+	{Opt_err, NULL},
+};
+
 static bool check_ro_option(struct btrfs_fs_info *fs_info, unsigned long opt,
 			    const char *opt_name)
 {
@@ -233,6 +248,48 @@ static bool check_ro_option(struct btrfs_fs_info *fs_info, unsigned long opt,
 	return false;
 }
 
+static int parse_abort_options(struct btrfs_fs_info *info, const char *options)
+{
+	char *opts;
+	char *orig;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int ret = 0;
+
+	opts = kstrdup(options, GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+	orig = opts;
+
+	while ((p = strsep(&opts, ":")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+		token = match_token(p, abort_tokens, args);
+		switch (token) {
+		case Opt_abort_super:
+			btrfs_set_and_info(info, ABORT_SUPER,
+				"will abort if any super block write back failed");
+			break;
+		case Opt_abort_all:
+			btrfs_info(info, "enabling all abort options");
+			btrfs_set_and_info(info, ABORT_SUPER,
+				"will abort if any super block write back failed");
+			break;
+		case Opt_err:
+			btrfs_info(info, "unrecognized abort option '%s'", p);
+			ret = -EINVAL;
+			goto out;
+		default:
+			break;
+		}
+	}
+out:
+	kfree(orig);
+	return ret;
+}
+
 static int parse_rescue_options(struct btrfs_fs_info *info, const char *options)
 {
 	char *opts;
@@ -736,6 +793,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			}
 			info->commit_interval = intarg;
 			break;
+		case Opt_abort:
+			ret = parse_abort_options(info, args[0].from);
+			if (ret < 0) {
+				btrfs_err(info, "unrecognized abort value %s",
+					  args[0].from);
+				goto out;
+			}
+			break;
 		case Opt_rescue:
 			ret = parse_rescue_options(info, args[0].from);
 			if (ret < 0) {
@@ -1191,12 +1256,19 @@ static void print_rescue_option(struct seq_file *seq, const char *s, bool *print
 	*printed = true;
 }
 
+static void print_abort_option(struct seq_file *seq, const char *s, bool *printed)
+{
+	seq_printf(seq, "%s%s", (*printed) ? ":" : ",abort=", s);
+	*printed = true;
+}
+
 static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 {
 	struct btrfs_fs_info *info = btrfs_sb(dentry->d_sb);
 	const char *compress_type;
 	const char *subvol_name;
-	bool printed = false;
+	bool rescue_printed = false;
+	bool abort_printed = false;
 
 	if (btrfs_test_opt(info, DEGRADED))
 		seq_puts(seq, ",degraded");
@@ -1229,13 +1301,15 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (btrfs_test_opt(info, NOTREELOG))
 		seq_puts(seq, ",notreelog");
 	if (btrfs_test_opt(info, NOLOGREPLAY))
-		print_rescue_option(seq, "nologreplay", &printed);
+		print_rescue_option(seq, "nologreplay", &rescue_printed);
 	if (btrfs_test_opt(info, USEBACKUPROOT))
-		print_rescue_option(seq, "usebackuproot", &printed);
+		print_rescue_option(seq, "usebackuproot", &rescue_printed);
 	if (btrfs_test_opt(info, IGNOREBADROOTS))
-		print_rescue_option(seq, "ignorebadroots", &printed);
+		print_rescue_option(seq, "ignorebadroots", &rescue_printed);
 	if (btrfs_test_opt(info, IGNOREDATACSUMS))
-		print_rescue_option(seq, "ignoredatacsums", &printed);
+		print_rescue_option(seq, "ignoredatacsums", &rescue_printed);
+	if (btrfs_test_opt(info, ABORT_SUPER))
+		print_abort_option(seq, "super", &abort_printed);
 	if (btrfs_test_opt(info, FLUSHONCOMMIT))
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] btrfs: introduce "abort=data" mount option
  2023-09-22  2:55 [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling Qu Wenruo
  2023-09-22  2:55 ` [PATCH 1/3] btrfs: explicitly mark BTRFS_MOUNT_ enum as 64bit Qu Wenruo
  2023-09-22  2:55 ` [PATCH 2/3] btrfs: introduce "abort=super" mount option Qu Wenruo
@ 2023-09-22  2:55 ` Qu Wenruo
  2023-09-22 14:55 ` [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling David Sterba
  3 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2023-09-22  2:55 UTC (permalink / raw)
  To: linux-btrfs

Currently if btrfs fails to write data blocks, it will not really cause
a great damage, but mostly -EIO for involved writeback functions like
fsync() or direct io for that inode.

Normally it's not a big deal, but it can be an indicator of a bigger
problem.

Thus here we introduce the new "abort=data" mount option to be
noisy and mark the whole filesystem read-only as an early warning.

This behavior covers buffered, direct writes.

Please note that, it only counts as an write error if the write to all
mirrors failed.
E.g. if data write into a RAID1 data chunk only failed for one mirror,
it will not be accounted as an write error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c |  8 +++++++-
 fs/btrfs/fs.h        |  1 +
 fs/btrfs/inode.c     |  9 ++++++++-
 fs/btrfs/super.c     | 10 ++++++++++
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5e5852a4ffb5..d38b01412615 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -483,8 +483,14 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
 				   bvec->bv_offset, bvec->bv_len);
 
 		btrfs_finish_ordered_extent(bbio->ordered, page, start, len, !error);
-		if (error)
+		if (error) {
 			mapping_set_error(page->mapping, error);
+			if (btrfs_test_opt(fs_info, ABORT_DATA))
+				btrfs_handle_fs_error(fs_info, -EIO,
+		"data write back failed, root %lld ino %llu fileoff %llu",
+					BTRFS_I(inode)->root->root_key.objectid,
+					btrfs_ino(BTRFS_I(inode)), start);
+		}
 		btrfs_page_clear_writeback(fs_info, page, start, len);
 	}
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 4fc0afcf1ef2..65bfe7e60b03 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -189,6 +189,7 @@ enum {
 	BTRFS_MOUNT_IGNOREDATACSUMS		= (1ULL << 28),
 	BTRFS_MOUNT_NODISCARD			= (1ULL << 29),
 	BTRFS_MOUNT_ABORT_SUPER			= (1ULL << 30),
+	BTRFS_MOUNT_ABORT_DATA			= (1ULL << 31),
 };
 
 /*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 514d2e8a4f52..20b1628b090a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7703,13 +7703,20 @@ static void btrfs_dio_end_io(struct btrfs_bio *bbio)
 	struct btrfs_dio_private *dip =
 		container_of(bbio, struct btrfs_dio_private, bbio);
 	struct btrfs_inode *inode = bbio->inode;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct bio *bio = &bbio->bio;
 
 	if (bio->bi_status) {
-		btrfs_warn(inode->root->fs_info,
+		btrfs_warn(fs_info,
 		"direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d",
 			   btrfs_ino(inode), bio->bi_opf,
 			   dip->file_offset, dip->bytes, bio->bi_status);
+		if (btrfs_test_opt(fs_info, ABORT_DATA))
+			btrfs_handle_fs_error(fs_info, -EIO,
+		"direct IO data write back failed, root %lld ino %llu fileoff %llu len %u",
+				inode->root->root_key.objectid,
+				btrfs_ino(inode), dip->file_offset,
+				dip->bytes);
 	}
 
 	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 41ab8c6e3fab..fb509507fb64 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -128,6 +128,7 @@ enum {
 	/* Extra abort options. */
 	Opt_abort,
 	Opt_abort_super,
+	Opt_abort_data,
 	Opt_abort_all,
 
 	/* Deprecated options */
@@ -233,6 +234,7 @@ static const match_table_t rescue_tokens = {
 
 static const match_table_t abort_tokens = {
 	{Opt_abort_super, "super"},
+	{Opt_abort_data, "data"},
 	{Opt_abort_all, "all"},
 	{Opt_err, NULL},
 };
@@ -272,10 +274,16 @@ static int parse_abort_options(struct btrfs_fs_info *info, const char *options)
 			btrfs_set_and_info(info, ABORT_SUPER,
 				"will abort if any super block write back failed");
 			break;
+		case Opt_abort_data:
+			btrfs_set_and_info(info, ABORT_DATA,
+				"will abort if any data write back failed");
+			break;
 		case Opt_abort_all:
 			btrfs_info(info, "enabling all abort options");
 			btrfs_set_and_info(info, ABORT_SUPER,
 				"will abort if any super block write back failed");
+			btrfs_set_and_info(info, ABORT_DATA,
+				"will abort if any data write back failed");
 			break;
 		case Opt_err:
 			btrfs_info(info, "unrecognized abort option '%s'", p);
@@ -1310,6 +1318,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		print_rescue_option(seq, "ignoredatacsums", &rescue_printed);
 	if (btrfs_test_opt(info, ABORT_SUPER))
 		print_abort_option(seq, "super", &abort_printed);
+	if (btrfs_test_opt(info, ABORT_DATA))
+		print_abort_option(seq, "data", &abort_printed);
 	if (btrfs_test_opt(info, FLUSHONCOMMIT))
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling
  2023-09-22  2:55 [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling Qu Wenruo
                   ` (2 preceding siblings ...)
  2023-09-22  2:55 ` [PATCH 3/3] btrfs: introduce "abort=data" " Qu Wenruo
@ 2023-09-22 14:55 ` David Sterba
  2023-09-22 21:16   ` Qu Wenruo
  3 siblings, 1 reply; 8+ messages in thread
From: David Sterba @ 2023-09-22 14:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
> During a very interesting (and weird) debugging session, it turns out
> that btrfs will ignore a lot of write errors until we hit some critical
> location, then btrfs started reacting, normally by aborting the
> transaction.
> 
> This can be problematic for two types of people:
> 
> - Developers
>   Sometimes we want to catch the earlies sign, continuing without any
>   obvious errors (other than kernel error messages) can make debugging
>   much harder.
> 
> - Sysadmins who wants to catch problems early
>   Dmesg is not really something users would check frequently, and even
>   they check it, it may already be too late.
>   Meanwhile if the fs flips read-only early it's not only noisy but also
>   saves the context much better (more relevant dmesgs etc).

For sysadmins I think that the preferred way is to get events (like via
the uevents interface) that can be monitored and then reacted to by some
tools.

> On the other hand, I totally understand if just a single sector failed
> to be write and we mark the whole fs read-only, it can be super
> frustrating for regular end users, thus we can not make it the default
> behavior.

I can't imagine a realistic scenario where a user would like this
behaviour, one EIO takes down whole filesystem could make sense only for
some testing environments.

> So here we introduce a mount option group "abort=", and make the
> following errors more noisy and abort early if specified by the user.

Default andswer for a new mount option is 'no', here we also have one
that is probably doing the same, 'fatal_errors', so if you really want
to do that by a mount option then please use this one.

> - Any super block write back failure
>   Currently we're very loose on the super block writeback failure.
>   The failure has to meet both conditions below:
>   * The primary super block writeback failed

Doesn't this fail with flip to read-only?

>   * Total failed devices go beyond tolerance
>     The tolerance is super high, num_devices - 1. To me this is
>     too high, but I don't have a better idea yet.

Does this depend on the profile constraints?

>   This new "rescue=super" may be more frequently used considering how
>   loose our existing tolerance is.
> 
> - Any data writeback failure
>   This is only for the data writeback at btrfs bio layer.
>   This means, if a data sector is going to be written to a RAID1 chunk,
>   and one mirror failed, we still consider the writeback succeeded.
> 
> There would be another one for btrfs bio layer, but I have found
> something weird in the code, thus it would only be introduced after I
> solved the problem there, meanwhile we can discuss on the usefulness of
> this patchset.

We can possibly enhance the error checking with additional knobs and
checkpoints that will have to survive or detect specific testing, but as
mount options it's not very flexible. We can possibly do it via sysfs or
BPF but this may not be the proper interface anyway.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling
  2023-09-22 14:55 ` [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling David Sterba
@ 2023-09-22 21:16   ` Qu Wenruo
  2023-09-25 16:37     ` David Sterba
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2023-09-22 21:16 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 2023/9/23 00:25, David Sterba wrote:
> On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
>> During a very interesting (and weird) debugging session, it turns out
>> that btrfs will ignore a lot of write errors until we hit some critical
>> location, then btrfs started reacting, normally by aborting the
>> transaction.
>>
>> This can be problematic for two types of people:
>>
>> - Developers
>>    Sometimes we want to catch the earlies sign, continuing without any
>>    obvious errors (other than kernel error messages) can make debugging
>>    much harder.
>>
>> - Sysadmins who wants to catch problems early
>>    Dmesg is not really something users would check frequently, and even
>>    they check it, it may already be too late.
>>    Meanwhile if the fs flips read-only early it's not only noisy but also
>>    saves the context much better (more relevant dmesgs etc).
> 
> For sysadmins I think that the preferred way is to get events (like via
> the uevents interface) that can be monitored and then reacted to by some
> tools.

I think this is a future development idea, to have a generic way to do 
error reporting.

> 
>> On the other hand, I totally understand if just a single sector failed
>> to be write and we mark the whole fs read-only, it can be super
>> frustrating for regular end users, thus we can not make it the default
>> behavior.
> 
> I can't imagine a realistic scenario where a user would like this
> behaviour, one EIO takes down whole filesystem could make sense only for
> some testing environments.

I doubt, for some environment with expensive hardware, one may not even 
expect any -EIO for valid operations.
If that happens, it may mean bad firmware or bad hardware, neither is a 
good thing especially if they paid extra money for the fancy hardware or 
the support fee.

> 
>> So here we introduce a mount option group "abort=", and make the
>> following errors more noisy and abort early if specified by the user.
> 
> Default andswer for a new mount option is 'no', here we also have one
> that is probably doing the same, 'fatal_errors', so if you really want
> to do that by a mount option then please use this one.

Or I can go sysfs interface and put it under some debug directory.

> 
>> - Any super block write back failure
>>    Currently we're very loose on the super block writeback failure.
>>    The failure has to meet both conditions below:
>>    * The primary super block writeback failed
> 
> Doesn't this fail with flip to read-only?

Nope, just by itself is not enough to go read-only, as long as we have 
other devices.

If the primary super block writeback failed, it would only make 
write_dev_supers() to return error.
But inside write_all_supers(), error from write_dev_super() would only 
increase @total_errors, not directly erroring out.

> 
>>    * Total failed devices go beyond tolerance
>>      The tolerance is super high, num_devices - 1. To me this is
>>      too high, but I don't have a better idea yet.
> 
> Does this depend on the profile constraints?

Nope, it's profile independent.

The @max_errors inside write_all_supers() is purely determined by 
btrfs_super_num_devices() - 1.

> 
>>    This new "rescue=super" may be more frequently used considering how
>>    loose our existing tolerance is.
>>
>> - Any data writeback failure
>>    This is only for the data writeback at btrfs bio layer.
>>    This means, if a data sector is going to be written to a RAID1 chunk,
>>    and one mirror failed, we still consider the writeback succeeded.
>>
>> There would be another one for btrfs bio layer, but I have found
>> something weird in the code, thus it would only be introduced after I
>> solved the problem there, meanwhile we can discuss on the usefulness of
>> this patchset.
> 
> We can possibly enhance the error checking with additional knobs and
> checkpoints that will have to survive or detect specific testing, but as
> mount options it's not very flexible. We can possibly do it via sysfs or
> BPF but this may not be the proper interface anyway.

I think sysfs would be better, but not familiar enough with BPF to 
determine if it's any better or worse.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling
  2023-09-22 21:16   ` Qu Wenruo
@ 2023-09-25 16:37     ` David Sterba
  2023-09-25 21:14       ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: David Sterba @ 2023-09-25 16:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Sat, Sep 23, 2023 at 06:46:26AM +0930, Qu Wenruo wrote:
> On 2023/9/23 00:25, David Sterba wrote:
> > On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
> >> On the other hand, I totally understand if just a single sector failed
> >> to be write and we mark the whole fs read-only, it can be super
> >> frustrating for regular end users, thus we can not make it the default
> >> behavior.
> > 
> > I can't imagine a realistic scenario where a user would like this
> > behaviour, one EIO takes down whole filesystem could make sense only for
> > some testing environments.
> 
> I doubt, for some environment with expensive hardware, one may not even 
> expect any -EIO for valid operations.
> If that happens, it may mean bad firmware or bad hardware, neither is a 
> good thing especially if they paid extra money for the fancy hardware or 
> the support fee.

So the semantics we'd need is like "fail on first error of <type>" where
we can define a set of errors, like EIO, superblock write erorr or
something related to devices.

> >> So here we introduce a mount option group "abort=", and make the
> >> following errors more noisy and abort early if specified by the user.
> > 
> > Default andswer for a new mount option is 'no', here we also have one
> > that is probably doing the same, 'fatal_errors', so if you really want
> > to do that by a mount option then please use this one.
> 
> Or I can go sysfs interface and put it under some debug directory.

For a prototype it's much more convenient until we understand what's the
actual usecase.

> > 
> >>    This new "rescue=super" may be more frequently used considering how
> >>    loose our existing tolerance is.
> >>
> >> - Any data writeback failure
> >>    This is only for the data writeback at btrfs bio layer.
> >>    This means, if a data sector is going to be written to a RAID1 chunk,
> >>    and one mirror failed, we still consider the writeback succeeded.
> >>
> >> There would be another one for btrfs bio layer, but I have found
> >> something weird in the code, thus it would only be introduced after I
> >> solved the problem there, meanwhile we can discuss on the usefulness of
> >> this patchset.
> > 
> > We can possibly enhance the error checking with additional knobs and
> > checkpoints that will have to survive or detect specific testing, but as
> > mount options it's not very flexible. We can possibly do it via sysfs or
> > BPF but this may not be the proper interface anyway.
> 
> I think sysfs would be better, but not familiar enough with BPF to 
> determine if it's any better or worse.

BPF is probably a bad idea, I mentioned only as a potential way, it's
another extensible interface.

What you suggest looks like the forced shutdown of filesystem. This can
be done internally or externally (ioctl). I haven't looked at the
interface to what extent it's configurable, but let's say there's a
bitmask set by admin and the filesystem checks that in case a given type
of error happens. Then forced shutown would be like a transaction abort,
followed by read-only. We have decent support for that but with the
shutdown some kind of audit would have to happen anyway, namely for the
EIO type of errors. Specific context like super block write error would
be relatively easy.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling
  2023-09-25 16:37     ` David Sterba
@ 2023-09-25 21:14       ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2023-09-25 21:14 UTC (permalink / raw)
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



On 2023/9/26 02:07, David Sterba wrote:
> On Sat, Sep 23, 2023 at 06:46:26AM +0930, Qu Wenruo wrote:
>> On 2023/9/23 00:25, David Sterba wrote:
>>> On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
>>>> On the other hand, I totally understand if just a single sector failed
>>>> to be write and we mark the whole fs read-only, it can be super
>>>> frustrating for regular end users, thus we can not make it the default
>>>> behavior.
>>>
>>> I can't imagine a realistic scenario where a user would like this
>>> behaviour, one EIO takes down whole filesystem could make sense only for
>>> some testing environments.
>>
>> I doubt, for some environment with expensive hardware, one may not even
>> expect any -EIO for valid operations.
>> If that happens, it may mean bad firmware or bad hardware, neither is a
>> good thing especially if they paid extra money for the fancy hardware or
>> the support fee.
>
> So the semantics we'd need is like "fail on first error of <type>" where
> we can define a set of errors, like EIO, superblock write erorr or
> something related to devices.

The "set of errors" would be very specific, thus -EIO is not a good idea
AFAIK.

>
>>>> So here we introduce a mount option group "abort=", and make the
>>>> following errors more noisy and abort early if specified by the user.
>>>
>>> Default andswer for a new mount option is 'no', here we also have one
>>> that is probably doing the same, 'fatal_errors', so if you really want
>>> to do that by a mount option then please use this one.
>>
>> Or I can go sysfs interface and put it under some debug directory.
>
> For a prototype it's much more convenient until we understand what's the
> actual usecase.

Well, not that convenient, as we need to expand the mount option bits to
U64, or on 32bit systems we're going to cause problems due to the fact
that we're go beyond 32 mount options. (the first patch)

>
>>>
>>>>     This new "rescue=super" may be more frequently used considering how
>>>>     loose our existing tolerance is.
>>>>
>>>> - Any data writeback failure
>>>>     This is only for the data writeback at btrfs bio layer.
>>>>     This means, if a data sector is going to be written to a RAID1 chunk,
>>>>     and one mirror failed, we still consider the writeback succeeded.
>>>>
>>>> There would be another one for btrfs bio layer, but I have found
>>>> something weird in the code, thus it would only be introduced after I
>>>> solved the problem there, meanwhile we can discuss on the usefulness of
>>>> this patchset.
>>>
>>> We can possibly enhance the error checking with additional knobs and
>>> checkpoints that will have to survive or detect specific testing, but as
>>> mount options it's not very flexible. We can possibly do it via sysfs or
>>> BPF but this may not be the proper interface anyway.
>>
>> I think sysfs would be better, but not familiar enough with BPF to
>> determine if it's any better or worse.
>
> BPF is probably a bad idea, I mentioned only as a potential way, it's
> another extensible interface.
>
> What you suggest looks like the forced shutdown of filesystem. This can
> be done internally or externally (ioctl). I haven't looked at the
> interface to what extent it's configurable, but let's say there's a
> bitmask set by admin and the filesystem checks that in case a given type
> of error happens. Then forced shutown would be like a transaction abort,
> followed by read-only. We have decent support for that but with the
> shutdown some kind of audit would have to happen anyway, namely for the
> EIO type of errors. Specific context like super block write error would
> be relatively easy.

I'm not sure if I understand the bitmask thing correctly.

The idea of the series is to catch certain types of error which are by
default ignored/handled gracefully.
The bitmask looks a little too generic, what we want is to catch
specific error at certain call sites (thus I can understand why you
mention ebpf).

Thus we can not just simple use a bitmask to handle all the generic
errors, but add checks into the route we're interested in.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-09-25 21:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-22  2:55 [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling Qu Wenruo
2023-09-22  2:55 ` [PATCH 1/3] btrfs: explicitly mark BTRFS_MOUNT_ enum as 64bit Qu Wenruo
2023-09-22  2:55 ` [PATCH 2/3] btrfs: introduce "abort=super" mount option Qu Wenruo
2023-09-22  2:55 ` [PATCH 3/3] btrfs: introduce "abort=data" " Qu Wenruo
2023-09-22 14:55 ` [PATCH 0/3] btrfs: introduce "abort=" groups for more strict error handling David Sterba
2023-09-22 21:16   ` Qu Wenruo
2023-09-25 16:37     ` David Sterba
2023-09-25 21:14       ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.