All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
@ 2022-07-25  5:37 Qu Wenruo
  2022-07-25  5:37 ` [PATCH 01/14] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
                   ` (15 more replies)
  0 siblings, 16 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v2->v1:
- Add mount time recovery functionality
  Now if a dirty bitmap is found, we will do proper recovery at mount
  time.

  The code is using scrub routine to do the proper recovery for both
  data and P/Q parity.

  Currently we can only test this by manually setting up a dirty bitmap,
  and corrupt the full stripe, then mounting it and verify the full
  stripe using "btrfs check --check-data-csum"

- Skip full stripe writes
  Full stripe writes are either:
  * New writes into unallocated space
    After powerloss, we won't read any data from the full stripe.

  * Writes into NODATACOW ranges
    We won't have csum for them anyway, thus new way to do any recovery.

- Fix a memory leakage caused by RO mount
  Previously we only cleanup the write-intent ctrl if it's RW mount,
  thus for RO mount we will cause memory leak.
 

RFC->v1:
- Fix a corner case in write_intent_set_bits()
  If the range covers the last existing entry, but still needs a new
  entry, the old code will not insert the new entry, causing
  write_intent_clear_bits() to cause a warning.

- Add selftests for the write intent bitmaps
  The write intent bitmaps is an sparse array of bitmaps.
  There are some corner cases tricky to get it done correctly in the
  first try (see above case).
  The test case would prevent such problems from happening again.

- Fix hang with dev-replace, and better bitmaps bio submission
  Previously we will hold device_list_mutex while submitting the bitmaps
  bio, this can lead to deadlock with dev-replace/dev-removal.
  Fix it by using RCU to keep an local copy of devices and use them
  to submit the bitmaps bio.

  Furthermore, there is no need to follow the way of superblocks
  writeback, as the content of bitmaps are always the same for all
  devices, we can just submitting the same page and use atomic counter
  to wait for them to finish.

  Now there is no more crash/warning/deadlock in btrfs/070.

[BACKGROUND]
Unlike md-raid, btrfs RAID56 has nothing to sync its devices when power
loss happens.

For pure mirror based profiles it's fine as btrfs can utilize its csums
to find the correct mirror the repair the bad ones.

But for RAID56, the repair itself needs the data from other devices,
thus any out-of-sync data can degrade the tolerance.

Even worse, incorrect RMW can use the stale data to generate P/Q,
removing the possibility of recovery the data.


For md-raid, it goes with write-intent bitmap, to do faster resilver,
and goes journal (partial parity log for RAID5) to ensure it can even
stand a powerloss + device lose.

[OBJECTIVE]

This patchset will introduce a btrfs specific write-intent bitmap.

The bitmap will locate at physical offset 1MiB of each device, and the
content is the same between all devices.

When there is a RAID56 write (currently all RAID56 write, including full
stripe write), before submitting all the real bios to disks,
write-intent bitmap will be updated and flushed to all writeable
devices.

So even if a powerloss happened, at the next mount time we know which
full stripes needs to check, and can start a scrub for those involved
logical bytenr ranges.

[ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]

Since btrfs can utilize csum for its metadata and CoWed data, unlike
dm-bitmap which can only be used for faster re-silver, we can fully
rebuild the full stripe, as long as:

1) There is no missing device
   For missing device case, we still need to go full journal.

2) Untouched data stays untouched
   This should be mostly sane for sane hardware.

And since the btrfs specific write-intent bitmaps are pretty small (4KiB
in size), the overhead much lower than full journal.

In the future, we may allow users to choose between just bitmaps or full
journal to meet their requirement.

[BITMAPS DESIGN]

The bitmaps on-disk format looks like this:

 [ super ][ entry 1 ][ entry 2 ] ... [entry N]
 |<---------  super::size (4K) ------------->|

Super block contains how many entires are in use.

Each entry is 128 bits (16 bytes) in size, containing one u64 for
bytenr, and u64 for one bitmap.

And all utilized entries will be sorted in their bytenr order, and no
bit can overlap.

The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
can contain at most 4MiB, and the whole bitmaps can contain 224 entries.

For the worst case, it can contain 14MiB dirty ranges.
(1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).

For the best case, it can contain 896MiB dirty ranges.
(all bits set per bitmap)

[WHY NOT BTRFS BTREE]

Current write-intent structure needs two features:

- Its data needs to survive cross stripe boundary
  Normally this means write-intent btree needs to acts like a proper
  tree root, has METADATA_ITEMs for all its tree blocks.

- Its data update must be outside of a transaction
  Currently only log tree can do such thing.
  But unfortunately log tree can not survive across transaction
  boundary.

Thus write-intent btree can only meet one of the requirement, not a
suitable solution here.

[TESTING AND BENCHMARK]

For performance benchmark, unfortunately I don't have 3 HDDs to test.
Will do the benchmark after secured enough hardware.

For testing, it can survive volume/raid/dev-replace test groups, and no
write-intent bitmap leakage.

Unfortunately there is still a warning triggered in btrfs/070, still
under investigation, hopefully to be a false alert in bitmap clearing
path.

[TODO]
- Extra optimizations
  * Enlarge the window between btrfs_write_intent_mark_dirty() and
    btrfs_write_intent_writeback()
    So that we can merge more dirty bites and cause less bitmaps
    writeback

- Proper performance benchmark
  Needs hardware/baremetal VMs, since I don't have any physical machine
  large enough to contian 3 3.5" HDDs.


Qu Wenruo (14):
  btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
  btrfs: introduce a new experimental compat RO flag,
    WRITE_INTENT_BITMAP
  btrfs: introduce the on-disk format of btrfs write intent bitmaps
  btrfs: load/create write-intent bitmaps at mount time
  btrfs: write-intent: write the newly created bitmaps to all disks
  btrfs: write-intent: introduce an internal helper to set bits for a
    range.
  btrfs: write-intent: introduce an internal helper to clear bits for a
    range.
  btrfs: selftests: add selftests for write-intent bitmaps
  btrfs: write back write intent bitmap after barrier_all_devices()
  btrfs: update and writeback the write-intent bitmap for RAID56 write.
  btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
  btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
  btrfs: avoid recording full stripe write into write-intent bitmaps
  btrfs: scrub the full stripe which had sub-stripe write at mount time

 fs/btrfs/Makefile                           |   5 +-
 fs/btrfs/ctree.h                            |  26 +-
 fs/btrfs/disk-io.c                          |  58 +-
 fs/btrfs/raid56.c                           |  27 +
 fs/btrfs/scrub.c                            | 177 +++-
 fs/btrfs/sysfs.c                            |   2 +
 fs/btrfs/tests/btrfs-tests.c                |   4 +
 fs/btrfs/tests/btrfs-tests.h                |   2 +
 fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
 fs/btrfs/volumes.c                          |  34 +-
 fs/btrfs/write-intent.c                     | 923 ++++++++++++++++++++
 fs/btrfs/write-intent.h                     | 303 +++++++
 fs/btrfs/zoned.c                            |   8 +
 include/uapi/linux/btrfs.h                  |  17 +
 14 files changed, 1812 insertions(+), 21 deletions(-)
 create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
 create mode 100644 fs/btrfs/write-intent.c
 create mode 100644 fs/btrfs/write-intent.h

-- 
2.37.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 01/14] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 02/14] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

From the beginning of btrfs kernel module, kernel will avoid allocating
dev extents into the first 1MiB of each device, not only for the
superblock at 64KiB, but also for legacy bootloaders to store their
data.

Here for later expansion, we introduce a new compat RO flag,
EXTRA_SUPER_RESERVED, this allows btrfs to have extra reserved space
beyond the default 1MiB.

The extra reserved space can be utilized by things like write-intent
log.

Several new checks are introduced:

- No super_reserved_bytes should be smaller than the old 1MiB limit
- No zoned device support
  Such reserved space will be utilized in a no COW way, thus it can
  not be supported by zoned device.

We still allow dev extents to exist in the new reserved_bytes range,
this is to allow btrfstune to set reserved_bytes, then rely on kernel
balance to reserve the new space.

But later, if there is any feature relying on the reserved_bytes, we
will disable those features automatically.

There is a special catch, the new member @reserved_bytes is located at
the end of the superblock, to utilize the padding bytes.

The reason is, due to some undetermined extent tree v2 on-disk format
mess, we don't have reliable location inside the reserved8[] array.
I just want a stable location undisturbed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h           | 14 +++++++++++---
 fs/btrfs/disk-io.c         |  9 +++++++++
 fs/btrfs/sysfs.c           |  2 ++
 fs/btrfs/volumes.c         | 22 ++++++++++++++++++----
 fs/btrfs/zoned.c           |  8 ++++++++
 include/uapi/linux/btrfs.h | 10 ++++++++++
 6 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4db85b9dc7ed..ef678419ff3f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -285,14 +285,19 @@ struct btrfs_super_block {
 	__le64 block_group_root_generation;
 	u8 block_group_root_level;
 
-	/* future expansion */
 	u8 reserved8[7];
 	__le64 reserved[25];
 	u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
 	struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 
+	/*
+	 * How many bytes are reserved at the beginning of a device.
+	 * Should be >= BTRFS_DEFAULT_RESERVED.
+	 */
+	__le32 reserved_bytes;
+
 	/* Padded to 4096 bytes */
-	u8 padding[565];
+	u8 padding[561];
 } __attribute__ ((__packed__));
 static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 
@@ -307,7 +312,8 @@ static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
 	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
-	 BTRFS_FEATURE_COMPAT_RO_VERITY)
+	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
+	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -2542,6 +2548,8 @@ BTRFS_SETGET_STACK_FUNCS(super_block_group_root_generation,
 			 block_group_root_generation, 64);
 BTRFS_SETGET_STACK_FUNCS(super_block_group_root_level, struct btrfs_super_block,
 			 block_group_root_level, 8);
+BTRFS_SETGET_STACK_FUNCS(super_reserved_bytes, struct btrfs_super_block,
+			 reserved_bytes, 32);
 
 int btrfs_super_csum_size(const struct btrfs_super_block *s);
 const char *btrfs_super_csum_name(u16 csum_type);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3fac429cf8a4..ed3bcd18e546 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2843,6 +2843,15 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 		ret = -EINVAL;
 	}
 
+	if (btrfs_super_compat_ro_flags(sb) &
+	    BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED &&
+	    btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+		btrfs_err(fs_info,
+"EXTRA_SUPER_RESERVED feature enabled, but reserved space is smaller than default (%u)",
+			    BTRFS_DEVICE_RANGE_RESERVED);
+		ret = -EINVAL;
+	}
+
 	/*
 	 * The generation is a global counter, we'll trust it more than the others
 	 * but it's still possible that it's the one that's wrong.
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index d5d0717fd09a..7fd38539315f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -287,6 +287,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_COMPAT_RO(extra_super_reserved, EXTRA_SUPER_RESERVED);
 #ifdef CONFIG_BLK_DEV_ZONED
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
@@ -317,6 +318,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(extra_super_reserved),
 #ifdef CONFIG_BLK_DEV_ZONED
 	BTRFS_FEAT_ATTR_PTR(zoned),
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 272901514b0c..412db05d55e3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1400,8 +1400,13 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
 
 static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 {
+	struct btrfs_fs_info *fs_info = device->fs_devices->fs_info;
+
 	switch (device->fs_devices->chunk_alloc_policy) {
 	case BTRFS_CHUNK_ALLOC_REGULAR:
+		if (btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED))
+			return max_t(u64, start,
+				btrfs_super_reserved_bytes(fs_info->super_copy));
 		return max_t(u64, start, BTRFS_DEVICE_RANGE_RESERVED);
 	case BTRFS_CHUNK_ALLOC_ZONED:
 		/*
@@ -7965,11 +7970,14 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	struct extent_map *em;
 	struct map_lookup *map;
 	struct btrfs_device *dev;
+	u32 super_reserved = BTRFS_DEVICE_RANGE_RESERVED;
 	u64 stripe_len;
 	bool found = false;
 	int ret = 0;
 	int i;
 
+	if (btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED))
+		super_reserved = btrfs_super_reserved_bytes(fs_info->super_copy);
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, chunk_offset, 1);
 	read_unlock(&em_tree->lock);
@@ -7994,11 +8002,17 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	}
 
 	/*
-	 * Very old mkfs.btrfs (before v4.1) will not respect the reserved
-	 * space. Although kernel can handle it without problem, better to warn
-	 * the users.
+	 * This can be caused by two cases:
+	 * - Very old mkfs.btrfs (before v4.1) and no balance at all
+	 *   This should be pretty rare now, as balance will relocate
+	 *   those dev extents in reserved range.
+	 *
+	 * - Newly set btrfs_super_block::reserved_bytes
+	 *   We are rely on this mount to relocate those dev extents.
+	 *
+	 * So here, we just give a warning and continue the mount.
 	 */
-	if (physical_offset < BTRFS_DEVICE_RANGE_RESERVED)
+	if (physical_offset < super_reserved)
 		btrfs_warn(fs_info,
 		"devid %llu physical %llu len %llu inside the reserved space",
 			   devid, physical_offset, physical_len);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b150b07ba1a7..c135950a8d96 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -719,6 +719,14 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (incompat_zoned && btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED)) {
+		btrfs_err(fs_info,
+			"zoned: incompatible optional feature detected: 0x%llx",
+			BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED);
+		ret = -EINVAL;
+		goto out;
+	}
+
 	/*
 	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
 	 * btrfs_create_chunk(). Since we want stripe_len == zone_size,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index f54dc91e4025..4a0c9f4f55d1 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -290,6 +290,16 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
 #define BTRFS_FEATURE_COMPAT_RO_VERITY			(1ULL << 2)
 
+/*
+ * Allow btrfs to have extra reserved space (other than the default 1MiB) at
+ * the beginning of each device.
+ *
+ * This feature will enable the usage of btrfs_super_block::reserved_bytes.
+ *
+ * This feature would only be available for non-zoned filesystems.
+ */
+#define BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED	(1ULL << 3)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 02/14] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
  2022-07-25  5:37 ` [PATCH 01/14] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 03/14] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

The new flag is for the incoming write intent bitmap, mostly to address
the RAID56 write-hole, by doing a mandatory scrub for partial written
stripes at mount time.

Currently the feature is still under development, this patch is mostly
a placeholder for the extra reserved bytes for write intent bitmap.

We will utilize the newly introduce EXTRA_SUPER_RESERVED compat RO flags
to enlarge the reserved bytes to at least (1MiB + 64KiB), and use that
64KiB (exact value is not yet fully determined) for write-intent bitmap.

Only one extra check is introduced, to ensure we have enough space to
place the write-intent bitmap at 1MiB physical offset.

This patch is only a place holder for the incoming on-disk format
change, no real write-intent functionality is implemented yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h           |  9 +++++++++
 fs/btrfs/disk-io.c         | 20 ++++++++++++++++++++
 fs/btrfs/volumes.c         | 14 +++++++++++++-
 include/uapi/linux/btrfs.h |  7 +++++++
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ef678419ff3f..47f5953731fb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -309,11 +309,20 @@ static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
+#ifdef CONFIG_BTRFS_DEBUG
+#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
+	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
+	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
+	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED |	\
+	 BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP)
+#else
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
 	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
 	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
 	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)
+#endif
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed3bcd18e546..13957b0e027a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2851,6 +2851,26 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 			    BTRFS_DEVICE_RANGE_RESERVED);
 		ret = -EINVAL;
 	}
+	if (btrfs_super_compat_ro_flags(sb) &
+	    BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP) {
+		/* Write intent bitmap requires extra reserve. */
+		if (!(btrfs_super_compat_ro_flags(sb) &
+		      BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)) {
+			btrfs_err(fs_info,
+"WRITE_INTENT_BITMAP feature enabled, but missing EXTRA_SUPER_RESERVED feature");
+			ret = -EINVAL;
+		}
+		/*
+		 * Write intent bitmap is always located at 1MiB.
+		 * Extra check like the length check against the reserved space
+		 * will happen at bitmap load time.
+		 */
+		if (btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+			btrfs_err(fs_info,
+			"not enough reserved space for write intent bitmap");
+			ret = -EINVAL;
+		}
+	}
 
 	/*
 	 * The generation is a global counter, we'll trust it more than the others
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 412db05d55e3..e8a8073a778b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -8012,11 +8012,23 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	 *
 	 * So here, we just give a warning and continue the mount.
 	 */
-	if (physical_offset < super_reserved)
+	if (physical_offset < super_reserved) {
 		btrfs_warn(fs_info,
 		"devid %llu physical %llu len %llu inside the reserved space",
 			   devid, physical_offset, physical_len);
 
+		/* Disable any feature relying on the new reserved_bytes. */
+		if (btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP)) {
+			struct btrfs_super_block *sb = fs_info->super_copy;
+
+			btrfs_warn(fs_info,
+	"disabling write intent bitmap due to the lack of reserved space.");
+			btrfs_set_super_compat_ro_flags(sb,
+				btrfs_super_compat_ro_flags(sb) |
+				~BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP);
+		}
+	}
+
 	for (i = 0; i < map->num_stripes; i++) {
 		if (map->stripes[i].dev->devid == devid &&
 		    map->stripes[i].physical == physical_offset) {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 4a0c9f4f55d1..38c74a50323e 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -300,6 +300,13 @@ struct btrfs_ioctl_fs_info_args {
  */
 #define BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED	(1ULL << 3)
 
+/*
+ * Allow btrfs to have per-device write-intent bitmap.
+ * Will be utilized to close the RAID56 write-hole (by forced scrub for dirty
+ * partial written stripes at mount time).
+ */
+#define BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP	(1ULL << 4)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 03/14] btrfs: introduce the on-disk format of btrfs write intent bitmaps
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
  2022-07-25  5:37 ` [PATCH 01/14] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
  2022-07-25  5:37 ` [PATCH 02/14] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 04/14] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

With extra comments explaining the on-disk format and the basic
workflow.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c      |   4 +-
 fs/btrfs/raid56.c       |   1 +
 fs/btrfs/write-intent.h | 199 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 203 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/write-intent.h

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 13957b0e027a..880e668e9660 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -43,6 +43,7 @@
 #include "space-info.h"
 #include "zoned.h"
 #include "subpage.h"
+#include "write-intent.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -2865,7 +2866,8 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 		 * Extra check like the length check against the reserved space
 		 * will happen at bitmap load time.
 		 */
-		if (btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+		if (btrfs_super_reserved_bytes(sb) <
+		    BTRFS_DEVICE_RANGE_RESERVED + WRITE_INTENT_BITMAPS_SIZE) {
 			btrfs_err(fs_info,
 			"not enough reserved space for write intent bitmap");
 			ret = -EINVAL;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1afe32d5ab01..9b79a212fd6a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "raid56.h"
 #include "async-thread.h"
+#include "write-intent.h"
 
 /* set when additional merges to this rbio are not allowed */
 #define RBIO_RMW_LOCKED_BIT	1
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
new file mode 100644
index 000000000000..b851917bb0b6
--- /dev/null
+++ b/fs/btrfs/write-intent.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Btrfs specific write-intent bitmaps. */
+
+#ifndef BTRFS_WRITE_INTENT_H
+#define BTRFS_WRITE_INTENT_H
+
+#include <linux/types.h>
+#include <linux/sizes.h>
+/* For BTRFS_STRIPE_LEN. */
+#include "volumes.h"
+
+#define WRITE_INTENT_SUPER_MAGIC 0x515F62536249775FULL /* ascii _wIbSb_Q */
+
+/* This write-intent bitmap records writes for RAID56 writes. */
+#define WRITE_INTENT_FLAGS_TARGET_RAID56	(1ULL << 0)
+
+/* This write-intent bitmap is internal, aka locates at 1MiB of each device. */
+#define WRITE_INTENT_FLAGS_INTERNAL		(1ULL << 1)
+
+/* This write-intent bitmap uses logical bytenr. */
+#define WRITE_INTENT_FLAGS_BYTENR_LOGICAL	(1ULL << 2)
+
+#define WRITE_INTENT_FLAGS_SUPPORTED			\
+	(WRITE_INTENT_FLAGS_TARGET_RAID56 |		\
+	 WRITE_INTENT_FLAGS_INTERNAL |			\
+	 WRITE_INTENT_FLAGS_BYTENR_LOGICAL)
+
+/*
+ * We use BTRFS_STRIPE_LEN as blocksize.
+ * This makes a RAID56 full stripe to @nr_data bits, and greatly
+ * enlarge how many bytes we can represent just using 4KiB.
+ */
+#define WRITE_INTENT_BLOCKSIZE			(BTRFS_STRIPE_LEN)
+
+/*
+ * For now, 4KiB is enough, as using 64KiB blocksize we can save bitmaps
+ * for 896MiB (224 entries, each entri can cache 64KiB * 64) sized logical
+ * range.
+ */
+#define WRITE_INTENT_BITMAPS_SIZE		(SZ_4K)
+
+
+/*
+ * The super block of write intent bitmaps, should be at physical offset 1MiB of
+ * every writeable device.
+ */
+struct write_intent_super {
+	/* Csum for the super block and all the internal entries. */
+	__u8 csum[BTRFS_CSUM_SIZE];
+	__u8 fsid[BTRFS_FSID_SIZE];
+
+	__le64 magic;
+
+	/* Important behavior flags would be set here. */
+	__le64 flags;
+
+	/*
+	 * Event counter for the bitmap, every time the bitmaps get written
+	 * this value increases.
+	 */
+	__le64 events;
+
+	/*
+	 * Total size of the bitmaps, including this super block and all the
+	 * entries.
+	 *
+	 * U32 should be enough for internal bitmaps, but just in case we want
+	 * to support external device as dedicated journal/bitmap device.
+	 */
+	__le64 size;
+
+	/* How many entries we have utilized. */
+	__le64 nr_entries;
+
+	/* How many bytes one bit represents. */
+	__le32 blocksize;
+	/*
+	 * This should be the same as btrfs_super_block::csum_type.
+	 * Cache csum type here so we read the write intent superblock without
+	 * a fully opened btrfs (for dump purpose).
+	 */
+	__le16 csum_type;
+
+	/* For future expansion, padding to 512 bytes. */
+	__u8 reserved1[418];
+} __attribute__ ((__packed__));
+
+static_assert(sizeof(struct write_intent_super) == 512);
+
+struct write_intent_entry {
+	/*
+	 * Bytenr 0 is special, means this entry is empty, and also means the
+	 * end of the bitmaps.
+	 */
+	__le64 bytenr;
+	__le64 bitmap;
+};
+
+/* How many entries we can have in the bitmaps. */
+#define WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES		\
+	((WRITE_INTENT_BITMAPS_SIZE -				\
+	  sizeof(struct write_intent_super)) /			\
+	 sizeof(struct write_intent_entry))
+
+/* The number of bits we can have in one entry. */
+#define WRITE_INTENT_BITS_PER_ENTRY		(64)
+
+/*
+ * ON-DISK FORMAT
+ * ==============
+ *
+ * [ super ][ entry 1 ][ entry 2 ] ... [entry N]
+ * |<------------  super::size --------------->|
+ *
+ * Normally it's 4KiB in size for internal bitmap.
+ *
+ * Currently the write-intent bitmaps is only for RAID56 writes, thus its
+ * blocksize is always 64KiB.
+ * Thus one entry can represent at most 4MiB (64 * 64 KiB) of logical range.
+ *
+ * When one raid56 full stripe needs partial writeback, the full stripe logical
+ * bytenr range will be included into at least one entry.
+ *
+ * After the last used entry, the remaining entries will all be filled with 0.
+ *
+ * WORKFLOW
+ * ========
+ *
+ * 1) Write bio arrive
+ *    Every write meets the requirement (so far, only RAID56 partial write) will
+ *    have its bio delayed, until corresponding range are marked in the entry.
+ *
+ * 2) Update the write-intent bitmaps
+ *    The entries will be populated, and write back to all writeable devices,
+ *    with FUA flag set.
+ *    Will wait until the write reaches disks.
+ *
+ * 3) Allow the involved write bios to be submitted
+ *
+ * 4) Write bios finish
+ *    The corresponding range will be recorded to be freed at next flush.
+ *
+ * 5) All devices get flushed (caused by btrfs super block writeback or bitmaps
+ *    pressure)
+ *    The recorded ranges will be cleared. And if an entry is empty, it will be
+ *    freed.
+ *    Then update the write-intent bitmaps with its superblock (writeback with FUA
+ *    flag and wait for it).
+ */
+
+#define WRITE_INTENT_SETGET_FUNCS(name, type, member, bits)	\
+static inline u##bits wi_##name(const type *s)			\
+{								\
+	return le##bits##_to_cpu(s->member);			\
+}								\
+static inline void wi_set_##name(type *s, u##bits val)		\
+{								\
+	s->member = cpu_to_le##bits(val);			\
+}
+
+WRITE_INTENT_SETGET_FUNCS(super_magic, struct write_intent_super, magic, 64);
+WRITE_INTENT_SETGET_FUNCS(super_flags, struct write_intent_super, flags, 64);
+WRITE_INTENT_SETGET_FUNCS(super_events, struct write_intent_super, events, 64);
+WRITE_INTENT_SETGET_FUNCS(super_size, struct write_intent_super, size, 64);
+WRITE_INTENT_SETGET_FUNCS(super_nr_entries, struct write_intent_super,
+			  nr_entries, 64);
+WRITE_INTENT_SETGET_FUNCS(super_blocksize, struct write_intent_super,
+			  blocksize, 32);
+WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
+			  csum_type, 16);
+WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
+
+static inline void wie_get_bitmap(struct write_intent_entry *entry,
+				  unsigned long *bitmap)
+{
+#ifdef __LITTLE_ENDIAN
+	bitmap_from_arr64(bitmap, &entry->bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+#else
+	u64 val = le64_to_cpu(entry->bitmap);
+
+	bitmap_from_arr64(bitmap, &val, WRITE_INTENT_BITS_PER_ENTRY);
+#endif
+}
+
+static inline void wie_set_bitmap(struct write_intent_entry *entry,
+				  unsigned long *bitmap)
+{
+#ifdef __LITTLE_ENDIAN
+	bitmap_to_arr64(&entry->bitmap, bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+#else
+	u64 val;
+
+	bitmap_to_arr64(&val, bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+	entry->bitmap = cpu_to_le64(val);
+#endif
+}
+
+#endif
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 04/14] btrfs: load/create write-intent bitmaps at mount time
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 03/14] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 05/14] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

This patch will introduce btrfs_fs_info::wi_ctrl, which will have a
non-highmem page for the write intent bitmaps.

Please note that, if we can't find a valid bitmaps, the newly create one
will only be in memory for now, the bitmaps writeback functionality will
be introduced in the next commit.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile       |   2 +-
 fs/btrfs/ctree.h        |   1 +
 fs/btrfs/disk-io.c      |   7 ++
 fs/btrfs/write-intent.c | 174 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  15 ++++
 5 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/write-intent.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea..af93119d52e2 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o write-intent.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 47f5953731fb..ade76ba98c68 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -983,6 +983,7 @@ struct btrfs_fs_info {
 	struct workqueue_struct *scrub_wr_completion_workers;
 	struct workqueue_struct *scrub_parity_workers;
 	struct btrfs_subpage_info *subpage_info;
+	struct write_intent_ctrl *wi_ctrl;
 
 	struct btrfs_discard_ctl discard_ctl;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 880e668e9660..98480bf6e5e2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3723,6 +3723,12 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			  ret);
 		goto fail_block_groups;
 	}
+	ret = btrfs_write_intent_init(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init write-intent bitmaps: %d",
+			  ret);
+		goto fail_block_groups;
+	}
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to recover balance: %d", ret);
@@ -4667,6 +4673,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
 	}
+	btrfs_write_intent_free(fs_info);
 
 	if (BTRFS_FS_ERROR(fs_info))
 		btrfs_error_commit_super(fs_info);
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
new file mode 100644
index 000000000000..a7ed21182525
--- /dev/null
+++ b/fs/btrfs/write-intent.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "ctree.h"
+#include "volumes.h"
+#include "write-intent.h"
+
+/*
+ * Return 0 if a valid write intent bitmap can be found.
+ * Return 1 if no valid write intent bitmap can be found.
+ * Return <0 for other fatal errors.
+ */
+static int write_intent_load(struct btrfs_device *device, struct page *dst)
+{
+	struct btrfs_fs_info *fs_info = device->fs_info;
+	struct write_intent_super *wis;
+	struct bio *bio;
+	int ret;
+
+	bio = bio_alloc(device->bdev, 1, REQ_OP_READ | REQ_SYNC | REQ_META,
+			GFP_NOFS);
+	/* It's backed by fs_bioset. */
+	ASSERT(bio);
+	bio->bi_iter.bi_sector = BTRFS_DEVICE_RANGE_RESERVED >> SECTOR_SHIFT;
+	__bio_add_page(bio, dst, WRITE_INTENT_BITMAPS_SIZE, 0);
+	ret = submit_bio_wait(bio);
+	if (ret < 0)
+		return ret;
+
+	wis = page_address(dst);
+	if (wi_super_magic(wis) != WRITE_INTENT_SUPER_MAGIC)
+		return 1;
+
+	/* Stale bitmaps, doesn't belong to our fs. */
+	if (memcmp(wis->fsid, device->fs_devices->fsid, BTRFS_FSID_SIZE))
+		return 1;
+
+	/* Above checks pass, but still csum mismatch, a big problem. */
+	if (btrfs_super_csum_type(fs_info->super_copy) !=
+	    wi_super_csum_type(wis)) {
+		btrfs_err(fs_info,
+		"csum type mismatch, write intent bitmap has %u fs has %u",
+			  wi_super_csum_type(wis),
+			  btrfs_super_csum_type(fs_info->super_copy));
+		return -EUCLEAN;
+	}
+
+	if (wi_super_flags(wis) & ~WRITE_INTENT_FLAGS_SUPPORTED) {
+		btrfs_err(fs_info, "unsupported flags 0x%llx",
+			  wi_super_flags(wis) & ~WRITE_INTENT_FLAGS_SUPPORTED);
+		return -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+static void write_intent_init(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+
+	ASSERT(ctrl);
+	ASSERT(ctrl->page);
+
+	/* Always start event count from 1. */
+	atomic64_set(&ctrl->event, 1);
+	ctrl->blocksize = WRITE_INTENT_BLOCKSIZE;
+	memzero_page(ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
+
+	wis = page_address(ctrl->page);
+	memcpy(wis->fsid, fs_info->fs_devices->fsid, BTRFS_FSID_SIZE);
+	wi_set_super_magic(wis, WRITE_INTENT_SUPER_MAGIC);
+	wi_set_super_csum_type(wis, btrfs_super_csum_type(fs_info->super_copy));
+	wi_set_super_events(wis, 1);
+	wi_set_super_flags(wis, WRITE_INTENT_FLAGS_SUPPORTED);
+	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
+	wi_set_super_blocksize(wis, ctrl->blocksize);
+	wi_set_super_nr_entries(wis, 0);
+	btrfs_info(fs_info, "creating new write intent bitmaps");
+}
+
+int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_device *highest_dev = NULL;
+	struct btrfs_device *dev;
+	struct write_intent_super *wis;
+	u64 highest_event = 0;
+	int ret;
+
+	ASSERT(fs_info->wi_ctrl == NULL);
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return 0;
+
+	fs_info->wi_ctrl = kzalloc(sizeof(*fs_info->wi_ctrl), GFP_NOFS);
+	if (!fs_info->wi_ctrl)
+		return -ENOMEM;
+
+	fs_info->wi_ctrl->page = alloc_page(GFP_NOFS);
+	if (!fs_info->wi_ctrl->page) {
+		ret = -ENOMEM;
+		goto cleanup;
+	}
+
+	/*
+	 * Go through every writeable device to find the highest event.
+	 *
+	 * Only the write-intent with highest event number makes sense.
+	 * If during bitmap writeback we lost power, some dev may have old
+	 * bitmap which is already stale.
+	 */
+	list_for_each_entry(dev, &fs_info->fs_devices->devices, dev_list) {
+		u64 cur_event;
+
+		if (!dev->bdev)
+			continue;
+
+		ret = write_intent_load(dev, fs_info->wi_ctrl->page);
+		if (ret > 0)
+			continue;
+		if (ret < 0) {
+			btrfs_err(fs_info,
+			"failed to load write intent from devid %llu: %d",
+				  dev->devid, ret);
+			goto cleanup;
+		}
+		wis = page_address(fs_info->wi_ctrl->page);
+		cur_event = wi_super_events(wis);
+		if (cur_event > highest_event) {
+			highest_dev = dev;
+			highest_event = cur_event;
+		}
+	}
+
+	/* Load the bitmap with lowest event as our bitmap. */
+	if (highest_dev) {
+		ret = write_intent_load(highest_dev, fs_info->wi_ctrl->page);
+		if (ret < 0) {
+			btrfs_err(fs_info,
+			"failed to load write intent from devid %llu: %d",
+				  dev->devid, ret);
+			goto cleanup;
+		}
+		wis = page_address(fs_info->wi_ctrl->page);
+		atomic64_set(&fs_info->wi_ctrl->event, wi_super_events(wis));
+		fs_info->wi_ctrl->blocksize = wi_super_blocksize(wis);
+		btrfs_info(fs_info,
+			"loaded write intent bitmaps, event count %llu",
+			atomic64_read(&fs_info->wi_ctrl->event));
+		return 0;
+	}
+
+	/* No valid bitmap found, create a new one. */
+	write_intent_init(fs_info);
+	return 0;
+cleanup:
+	if (fs_info->wi_ctrl) {
+		if (fs_info->wi_ctrl->page)
+			__free_page(fs_info->wi_ctrl->page);
+		kfree(fs_info->wi_ctrl);
+		fs_info->wi_ctrl = NULL;
+	}
+	return ret;
+}
+
+void btrfs_write_intent_free(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+
+	if (!ctrl)
+		return;
+	ASSERT(ctrl->page);
+	__free_page(ctrl->page);
+	kfree(ctrl);
+	fs_info->wi_ctrl = NULL;
+}
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index b851917bb0b6..2c5cd434e978 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -106,6 +106,18 @@ struct write_intent_entry {
 /* The number of bits we can have in one entry. */
 #define WRITE_INTENT_BITS_PER_ENTRY		(64)
 
+/* In-memory write-intent control structure. */
+struct write_intent_ctrl {
+	/* For the write_intent super and entries. */
+	struct page *page;
+
+	/* Cached event counter.*/
+	atomic64_t event;
+
+	/* Cached blocksize from write intent super. */
+	u32 blocksize;
+};
+
 /*
  * ON-DISK FORMAT
  * ==============
@@ -196,4 +208,7 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 #endif
 }
 
+int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
+void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
+
 #endif
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 05/14] btrfs: write-intent: write the newly created bitmaps to all disks
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 04/14] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 06/14] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

This write back will happen even for RO mounts.

This will ensure we always have write-intent bitmaps for fses with that
compat RO flags set.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 154 ++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/write-intent.h |   6 ++
 2 files changed, 155 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index a7ed21182525..d1c5e8e206ba 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -1,8 +1,140 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include <crypto/hash.h>
+#include <linux/bio.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "write-intent.h"
+#include "rcu-string.h"
+
+struct bitmap_writeback_contrl {
+	atomic_t pending_bios;
+	atomic_t errors;
+	wait_queue_head_t wait;
+};
+
+static void write_intent_end_write(struct bio *bio)
+{
+	struct bitmap_writeback_contrl *wb_ctrl = bio->bi_private;
+
+	if (bio->bi_status)
+		atomic_inc(&wb_ctrl->errors);
+	atomic_dec(&wb_ctrl->pending_bios);
+	wake_up(&wb_ctrl->wait);
+
+	bio_put(bio);
+}
+
+static int submit_one_device(struct btrfs_device *dev,
+			     struct bitmap_writeback_contrl *wb_ctrl)
+{
+	struct btrfs_fs_info *fs_info = dev->fs_info;
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct bio *bio;
+
+	if (!dev->bdev)
+		return -EIO;
+
+	if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+		return 0;
+
+	atomic_inc(&wb_ctrl->pending_bios);
+	bio = bio_alloc(dev->bdev, 1, REQ_OP_WRITE | REQ_SYNC |
+			REQ_META | REQ_PRIO | REQ_FUA, GFP_NOFS);
+	bio->bi_iter.bi_sector = BTRFS_DEVICE_RANGE_RESERVED >> SECTOR_SHIFT;
+	bio->bi_private = wb_ctrl;
+	bio->bi_end_io = write_intent_end_write;
+	__bio_add_page(bio, ctrl->commit_page, WRITE_INTENT_BITMAPS_SIZE,
+			offset_in_page(BTRFS_DEVICE_RANGE_RESERVED));
+	submit_bio(bio);
+	return 0;
+}
+
+/* Write back the bitmaps page to all devices. */
+static int write_intent_writeback(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	struct btrfs_device *dev;
+	struct btrfs_device **found_devs;
+	struct bitmap_writeback_contrl wb_ctrl = {0};
+	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	const int nr_devs_max = fs_info->fs_devices->open_devices + 4;
+	int nr_devs = 0;
+	int total_errors = 0;
+	int ret;
+	int i;
+
+	ASSERT(ctrl);
+
+	found_devs = kcalloc(nr_devs_max, sizeof(struct btrfs_device *),
+			     GFP_NOFS);
+	if (!found_devs)
+		return -ENOMEM;
+
+	shash->tfm = fs_info->csum_shash;
+
+	spin_lock(&ctrl->lock);
+	wis = page_address(ctrl->page);
+
+	/*
+	 * Bump up the event counter each time this bitmap is going to be
+	 * written.
+	 */
+	wi_set_super_events(wis, wi_super_events(wis) + 1);
+	crypto_shash_digest(shash, (unsigned char *)wis + BTRFS_CSUM_SIZE,
+			    WRITE_INTENT_BITMAPS_SIZE - BTRFS_CSUM_SIZE,
+			    wis->csum);
+	atomic64_inc(&ctrl->event);
+	memcpy_page(ctrl->commit_page, 0, ctrl->page, 0,
+		    WRITE_INTENT_BITMAPS_SIZE);
+	spin_unlock(&ctrl->lock);
+
+	init_waitqueue_head(&wb_ctrl.wait);
+	atomic_set(&wb_ctrl.pending_bios, 0);
+	atomic_set(&wb_ctrl.errors, 0);
+
+	rcu_read_lock();
+	/*
+	 * Record all the writeable devices into found_devs[].
+	 *
+	 * We have to do this to keep a consistent view of writeable devices,
+	 * without holding device_list_mutex.
+	 * As dev-replace/dev-removal will all hold that mutex and wait for
+	 * submitted bios to finish.
+	 * If we try to hold device_list_mutex at bio submission path, we will
+	 * deadlock with above dev-replace/dev-removal
+	 *
+	 * So here we just grab a local list of devices, and since we're at
+	 * bio submission path, the device will never disapper before the bio
+	 * finished.
+	 */
+	list_for_each_entry_rcu(dev, &fs_info->fs_devices->devices, dev_list) {
+		found_devs[nr_devs] = dev;
+		nr_devs++;
+
+		if (unlikely(nr_devs >= nr_devs_max))
+			break;
+	}
+	rcu_read_unlock();
+
+	/* Go through all the recorded devices, and submit the commit_page. */
+	for (i = 0; i < nr_devs; i++) {
+		ret = submit_one_device(found_devs[i], &wb_ctrl);
+		if (ret < 0)
+			total_errors++;
+	}
+	wait_event(wb_ctrl.wait, atomic_read(&wb_ctrl.pending_bios) == 0);
+
+	if (total_errors + atomic_read(&wb_ctrl.errors) >
+	    btrfs_super_num_devices(fs_info->super_copy) - 1) {
+		btrfs_err(fs_info, "failed to writeback write-intent bitmaps");
+		ret = -EIO;
+	}
+	kfree(found_devs);
+	return ret;
+}
 
 /*
  * Return 0 if a valid write intent bitmap can be found.
@@ -53,10 +185,11 @@ static int write_intent_load(struct btrfs_device *device, struct page *dst)
 	return ret;
 }
 
-static void write_intent_init(struct btrfs_fs_info *fs_info)
+static int write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
 	struct write_intent_super *wis;
+	int ret;
 
 	ASSERT(ctrl);
 	ASSERT(ctrl->page);
@@ -75,7 +208,12 @@ static void write_intent_init(struct btrfs_fs_info *fs_info)
 	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
 	wi_set_super_blocksize(wis, ctrl->blocksize);
 	wi_set_super_nr_entries(wis, 0);
-	btrfs_info(fs_info, "creating new write intent bitmaps");
+
+	ret = write_intent_writeback(fs_info);
+	if (ret < 0)
+		return ret;
+	btrfs_info(fs_info, "new write intent bitmaps created");
+	return 0;
 }
 
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
@@ -95,11 +233,14 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 		return -ENOMEM;
 
 	fs_info->wi_ctrl->page = alloc_page(GFP_NOFS);
-	if (!fs_info->wi_ctrl->page) {
+	fs_info->wi_ctrl->commit_page = alloc_page(GFP_NOFS);
+	if (!fs_info->wi_ctrl->page || !fs_info->wi_ctrl->commit_page) {
 		ret = -ENOMEM;
 		goto cleanup;
 	}
 
+	spin_lock_init(&fs_info->wi_ctrl->lock);
+
 	/*
 	 * Go through every writeable device to find the highest event.
 	 *
@@ -149,12 +290,15 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 	}
 
 	/* No valid bitmap found, create a new one. */
-	write_intent_init(fs_info);
-	return 0;
+	ret = write_intent_init(fs_info);
+
+	return ret;
 cleanup:
 	if (fs_info->wi_ctrl) {
 		if (fs_info->wi_ctrl->page)
 			__free_page(fs_info->wi_ctrl->page);
+		if (fs_info->wi_ctrl->commit_page)
+			__free_page(fs_info->wi_ctrl->commit_page);
 		kfree(fs_info->wi_ctrl);
 		fs_info->wi_ctrl = NULL;
 	}
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 2c5cd434e978..797e57aef0e1 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -111,9 +111,15 @@ struct write_intent_ctrl {
 	/* For the write_intent super and entries. */
 	struct page *page;
 
+	/* A copy for writeback. */
+	struct page *commit_page;
+
 	/* Cached event counter.*/
 	atomic64_t event;
 
+	/* Lock for reading/writing above @page. */
+	spinlock_t lock;
+
 	/* Cached blocksize from write intent super. */
 	u32 blocksize;
 };
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 06/14] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 05/14] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 07/14] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 246 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  19 ++++
 2 files changed, 265 insertions(+)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index d1c5e8e206ba..eaf6d010462e 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -216,6 +216,252 @@ static int write_intent_init(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
+static struct write_intent_entry *write_intent_entry_nr(
+				struct write_intent_ctrl *ctrl, int nr)
+{
+
+	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+	return (page_address(ctrl->page) +
+		sizeof(struct write_intent_super) +
+		nr * sizeof(struct write_intent_entry));
+}
+
+/*
+ * Return <0 if the bytenr is before the given entry.
+ * Return 0 if the bytenr is inside the given entry.
+ * Return >0 if the bytenr is after the given entry.
+ */
+static int compare_bytenr_to_range(u64 bytenr, u64 start, u32 len)
+{
+	if (bytenr < start)
+		return -1;
+	if (start <= bytenr && bytenr < start + len)
+		return 0;
+	return 1;
+}
+
+/*
+ * Move all non-empty entries starting from @nr, to the right, and make room
+ * for @nr_new entries.
+ * Those new entries will be all zero filled.
+ *
+ * Caller should ensure we have enough room for @nr_new new entries.
+ */
+static void move_entries_right(struct write_intent_ctrl *ctrl, int nr,
+			       int nr_new)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	int move_size;
+
+	ASSERT(nr_new > 0);
+	ASSERT(wi_super_nr_entries(wis) + nr_new <=
+	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+
+	move_size = (wi_super_nr_entries(wis) - nr) *
+		     sizeof(struct write_intent_entry);
+
+	memmove(write_intent_entry_nr(ctrl, nr + nr_new),
+		write_intent_entry_nr(ctrl, nr), move_size);
+	memset(write_intent_entry_nr(ctrl, nr), 0,
+	       nr_new * sizeof(struct write_intent_entry));
+	wi_set_super_nr_entries(wis, wi_super_nr_entries(wis) + nr_new);
+}
+
+static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
+				  struct write_intent_entry *entry,
+				  u64 bytenr, u32 len)
+{
+	const u64 entry_start = wi_entry_bytenr(entry);
+	const u32 entry_len = write_intent_entry_size(ctrl);
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
+	bitmap_set(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
+		   len / ctrl->blocksize);
+	wie_set_bitmap(entry, bitmaps);
+}
+
+/*
+ * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
+ * and fill the new entries with proper bytenr and bitmaps.
+ */
+static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
+			       u64 bytenr, u32 len)
+{
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	u64 entry_start;
+	u64 new_start = round_down(bytenr, entry_size);
+	u64 new_end;
+	int nr_new_entries;
+	u64 cur;
+
+	if (nr >= wi_super_nr_entries(page_address(ctrl->page)) ||
+	    nr >= WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES)
+		entry_start = U64_MAX;
+	else
+		entry_start = wi_entry_bytenr(write_intent_entry_nr(ctrl, nr));
+
+	ASSERT(bytenr < entry_start);
+
+	new_end = min(entry_start, round_up(bytenr + len, entry_size));
+	nr_new_entries = (new_end - new_start) / entry_size;
+
+	if (nr_new_entries == 0)
+		return;
+
+	move_entries_right(ctrl, nr, nr_new_entries);
+
+	for (cur = new_start; cur < new_end; cur += entry_size, nr++) {
+		struct write_intent_entry *entry =
+			write_intent_entry_nr(ctrl, nr);
+		u64 range_start = max(cur, bytenr);
+		u64 range_len = min(cur + entry_size, bytenr + len) -
+				range_start;
+
+		/* Fill the bytenr into the new empty entries.*/
+		wi_set_entry_bytenr(entry, cur);
+
+		/* And set the bitmap. */
+		set_bits_in_one_entry(ctrl, entry, range_start, range_len);
+	}
+}
+
+/*
+ * This should be only called when we have enough room in the bitmaps, and hold
+ * the wi_ctrl->lock.
+ *
+ * This function is only exported for selftests, which doesn't need to hold any
+ * lock.
+ */
+void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	int i;
+	u64 nr_entries = wi_super_nr_entries(wis);
+	u64 cur_bytenr;
+
+	/*
+	 * Currently we only accept full stripe length, which should be
+	 * aligned to 64KiB.
+	 */
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	/*
+	 * We should have room to contain the worst case scenario, in which we
+	 * need to create one or more new entry.
+	 */
+	ASSERT(nr_entries + bytes_to_entries(bytenr, len, BTRFS_STRIPE_LEN) <=
+	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+
+	/*
+	 * Iterate through the existing entries to insert new entries or set
+	 * bits in the existing ones.
+	 */
+	for (i = 0, cur_bytenr = bytenr;
+	     i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
+		struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
+		u64 entry_start = wi_entry_bytenr(entry);
+		u64 entry_end = entry_start + entry_size;
+
+		/*
+		 *			|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *		|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *	|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * We need to insert one or more new entries for the range not
+		 * covered by the existing entry.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start,
+					    entry_size) < 0) {
+			u64 new_range_end;
+
+			new_range_end = min(entry_start, bytenr + len);
+			insert_new_entries(ctrl, i, cur_bytenr,
+					   new_range_end - cur_bytenr);
+
+			cur_bytenr = new_range_end;
+			continue;
+		}
+		/*
+		 * |<-- entry -->|
+		 *	|<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 * |<-------- entry ------->|
+		 *	|<- bytenr/len ->|
+		 *
+		 * In this case, we just set the bitmap in the current entry, and
+		 * advance @cur_bytenr to the end of the existing entry.
+		 * By this, we either go check the range against the next entry,
+		 * or we finish our current range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start,
+					    entry_size) == 0) {
+			u64 range_end = min(entry_end, bytenr + len);
+
+			set_bits_in_one_entry(ctrl, entry, cur_bytenr,
+					      range_end - cur_bytenr);
+			cur_bytenr = range_end;
+			continue;
+		}
+
+		/*
+		 * (A)
+		 * |<-- entry -->|			|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (B)
+		 * |<-- entry -->|		|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (C)
+		 * |<-- entry -->|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (D)
+		 * |<-- entry -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * For all above cases, we just skip to the next entry.
+		 *
+		 * For case (A) and (B), we will insert new entries before
+		 * the next one at the next loop.
+		 *
+		 * For case (C), we just do the regular set bits.
+		 * Thus case (A) ~ (C) are all handled properly.
+		 *
+		 * For case (D), we will handle it after the loop.
+		 */
+	}
+
+	/*
+	 * We still have some range not handled after the existing entries,
+	 * just insert new entries.
+	 */
+	if (cur_bytenr < bytenr + len)
+		insert_new_entries(ctrl, i, cur_bytenr,
+				   bytenr + len - cur_bytenr);
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 797e57aef0e1..707ccf73e13a 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -106,6 +106,15 @@ struct write_intent_entry {
 /* The number of bits we can have in one entry. */
 #define WRITE_INTENT_BITS_PER_ENTRY		(64)
 
+static inline u32 bytes_to_entries(u64 start, u32 length, u32 blocksize)
+{
+	u32 entry_len = blocksize * WRITE_INTENT_BITS_PER_ENTRY;
+	u64 entry_start = round_down(start, entry_len);
+	u64 entry_end = round_up(start + length, entry_len);
+
+	return DIV_ROUND_UP((u32)(entry_end - entry_start), entry_len);
+}
+
 /* In-memory write-intent control structure. */
 struct write_intent_ctrl {
 	/* For the write_intent super and entries. */
@@ -189,6 +198,13 @@ WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
 			  csum_type, 16);
 WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
 
+static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+
+	return wi_super_blocksize(wis) * WRITE_INTENT_BITS_PER_ENTRY;
+}
+
 static inline void wie_get_bitmap(struct write_intent_entry *entry,
 				  unsigned long *bitmap)
 {
@@ -214,6 +230,9 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 #endif
 }
 
+/* This function is only exported for selftests. */
+void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len);
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
 void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
 
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 07/14] btrfs: write-intent: introduce an internal helper to clear bits for a range.
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 06/14] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 08/14] btrfs: selftests: add selftests for write-intent bitmaps Qu Wenruo
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

This new helper. write_intent_clear_bits(), is much simpler than the set
bits counter part.

As if we can not find a entry for our target range, then it must be
something wrong, and we only need to warn and skip to next entry.

Although there has one extra thing to do, if we have emptied one entry,
we have to delete that empty entry.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 172 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |   4 +-
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index eaf6d010462e..8a69bf39a994 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -283,6 +283,36 @@ static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
 	wie_set_bitmap(entry, bitmaps);
 }
 
+static bool is_entry_empty(struct write_intent_ctrl *ctrl,
+			   struct write_intent_entry *entry)
+{
+	unsigned long bitmaps[2];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	return bitmap_empty(bitmaps, WRITE_INTENT_BITS_PER_ENTRY);
+}
+
+/*
+ * NOTE: This function may clear all bits of an entry, caller must check if
+ * that's the case, and delete the empty entry if needed.
+ */
+static void clear_bits_in_one_entry(struct write_intent_ctrl *ctrl,
+				    struct write_intent_entry *entry,
+				    u64 bytenr, u32 len)
+{
+	const u64 entry_start = wi_entry_bytenr(entry);
+	const u32 entry_len = write_intent_entry_size(ctrl);
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
+	bitmap_clear(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
+		   len / ctrl->blocksize);
+	wie_set_bitmap(entry, bitmaps);
+}
+
 /*
  * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
  * and fill the new entries with proper bytenr and bitmaps.
@@ -328,6 +358,25 @@ static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
 	}
 }
 
+static void delete_one_entry(struct write_intent_ctrl *ctrl, int nr)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	int cur_nr_entries = wi_super_nr_entries(wis);
+
+	ASSERT(is_entry_empty(ctrl, write_intent_entry_nr(ctrl, nr)));
+	ASSERT(nr < cur_nr_entries);
+
+	/* Move all the entries after slot @nr by one slot. */
+	memmove(write_intent_entry_nr(ctrl, nr),
+		write_intent_entry_nr(ctrl, nr + 1),
+		(cur_nr_entries - nr - 1) * sizeof(struct write_intent_entry));
+
+	/* Memzero the right most entry. */
+	memset(write_intent_entry_nr(ctrl, cur_nr_entries - 1), 0,
+	       sizeof(struct write_intent_entry));
+	wi_set_super_nr_entries(wis, cur_nr_entries - 1);
+}
+
 /*
  * This should be only called when we have enough room in the bitmaps, and hold
  * the wi_ctrl->lock.
@@ -462,6 +511,129 @@ void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len)
 				   bytenr + len - cur_bytenr);
 }
 
+/* This should be only called with wi_ctrl->lock hold, except for selftests. */
+void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
+			     u32 len)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	int i;
+	u64 cur_bytenr;
+
+	/*
+	 * Currently we only accept full stripe length, which should be
+	 * aligned to 64KiB.
+	 */
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	/*
+	 * Iterate through the existing entries to delete entries or clear
+	 * bits in the existing ones.
+	 */
+	for (i = 0, cur_bytenr = bytenr;
+	     i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
+		struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
+		u64 entry_start = wi_entry_bytenr(entry);
+		u64 entry_end = entry_start + entry_size;
+
+		/*
+		 *			|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *		|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *	|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * This case should not happen, it means we have some logged
+		 * dirty range, but it's no longer there.
+		 * Just warn and skip to the next covered range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start, entry_size) < 0) {
+			WARN_ON_ONCE(1);
+			cur_bytenr = min(bytenr + len, entry_start);
+			continue;
+		}
+
+		/*
+		 * |<-- entry -->|
+		 *	|<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 * |<-------- entry ------->|
+		 *	|<- bytenr/len ->|
+		 *
+		 * In this case, we just clear the bitmap in current entry, and
+		 * advance @cur_bytenr.
+		 * By this, we either go check the range against the next entry,
+		 * or we finish our current range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start, entry_size) == 0) {
+			u64 range_end = min(entry_end, bytenr + len);
+
+			clear_bits_in_one_entry(ctrl, entry, cur_bytenr,
+						range_end - cur_bytenr);
+			cur_bytenr = range_end;
+			/*
+			 * If the current entry is empty, we need to delete the
+			 * entry and decrease @nr.
+			 */
+			if (is_entry_empty(ctrl, entry)) {
+				delete_one_entry(ctrl, i);
+				i--;
+			}
+			continue;
+		}
+
+		/*
+		 * (A)
+		 * |<-- entry -->|			|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (B)
+		 * |<-- entry -->|		|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (C)
+		 * |<-- entry -->|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (D)
+		 * |<-- entry -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * For all above cases, we just skip to the next entry.
+		 *
+		 * For case (A) and (B), we will trigger wanring as we
+		 * don't have expected entries to clear.
+		 *
+		 * For case (C), we just do the regular clear bits.
+		 * Thus case (A) ~ (C) are all handled properly.
+		 *
+		 * For case (D), we will handle it after the loop.
+		 */
+	}
+	/*
+	 * There is some range not handled, and there is no more entries,
+	 * another unexpected case.
+	 * Just do warning.
+	 */
+	if (cur_bytenr < bytenr + len)
+		WARN_ON_ONCE(1);
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 707ccf73e13a..0da1b7421590 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -230,8 +230,10 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 #endif
 }
 
-/* This function is only exported for selftests. */
+/* These two functions are only exported for selftests. */
 void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len);
+void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
+			     u32 len);
 
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
 void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 08/14] btrfs: selftests: add selftests for write-intent bitmaps
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 07/14] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 09/14] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

It turns out such sparse bitmap still has a lot of things to go wrong,
definitely needs some tests to cover all the different corner cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile                           |   3 +-
 fs/btrfs/tests/btrfs-tests.c                |   4 +
 fs/btrfs/tests/btrfs-tests.h                |   2 +
 fs/btrfs/tests/write-intent-bitmaps-tests.c | 245 ++++++++++++++++++++
 fs/btrfs/write-intent.c                     |  10 -
 fs/btrfs/write-intent.h                     |  12 +
 6 files changed, 265 insertions(+), 11 deletions(-)
 create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index af93119d52e2..a9658782d363 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -42,4 +42,5 @@ btrfs-$(CONFIG_FS_VERITY) += verity.o
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
 	tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \
-	tests/free-space-tree-tests.o tests/extent-map-tests.o
+	tests/free-space-tree-tests.o tests/extent-map-tests.o \
+	tests/write-intent-bitmaps-tests.o
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index cc9377cf56a3..118e95fbd548 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -27,6 +27,7 @@ const char *test_error[] = {
 	[TEST_ALLOC_INODE]	     = "cannot allocate inode",
 	[TEST_ALLOC_BLOCK_GROUP]     = "cannot allocate block group",
 	[TEST_ALLOC_EXTENT_MAP]      = "cannot allocate extent map",
+	[TEST_ALLOC_WRITE_INTENT_CTRL] = "cannot allocate write intent control",
 };
 
 static const struct super_operations btrfs_test_super_ops = {
@@ -296,6 +297,9 @@ int btrfs_run_sanity_tests(void)
 		}
 	}
 	ret = btrfs_test_extent_map();
+	if (ret)
+		goto out;
+	ret = btrfs_test_write_intent_bitmaps();
 
 out:
 	btrfs_destroy_test_fs();
diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
index 7a2d7ffbe30e..1845bfb6465d 100644
--- a/fs/btrfs/tests/btrfs-tests.h
+++ b/fs/btrfs/tests/btrfs-tests.h
@@ -23,6 +23,7 @@ enum {
 	TEST_ALLOC_INODE,
 	TEST_ALLOC_BLOCK_GROUP,
 	TEST_ALLOC_EXTENT_MAP,
+	TEST_ALLOC_WRITE_INTENT_CTRL,
 };
 
 extern const char *test_error[];
@@ -37,6 +38,7 @@ int btrfs_test_inodes(u32 sectorsize, u32 nodesize);
 int btrfs_test_qgroups(u32 sectorsize, u32 nodesize);
 int btrfs_test_free_space_tree(u32 sectorsize, u32 nodesize);
 int btrfs_test_extent_map(void);
+int btrfs_test_write_intent_bitmaps(void);
 struct inode *btrfs_new_test_inode(void);
 struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize);
 void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/tests/write-intent-bitmaps-tests.c b/fs/btrfs/tests/write-intent-bitmaps-tests.c
new file mode 100644
index 000000000000..c956acaea1c7
--- /dev/null
+++ b/fs/btrfs/tests/write-intent-bitmaps-tests.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../ctree.h"
+#include "../volumes.h"
+#include "../write-intent.h"
+#include "btrfs-tests.h"
+
+static struct write_intent_ctrl *alloc_dummy_ctrl(void)
+{
+	struct write_intent_ctrl *ctrl;
+	struct write_intent_super *wis;
+
+	ctrl = kzalloc(sizeof(*ctrl), GFP_NOFS);
+	if (!ctrl)
+		return NULL;
+	/*
+	 * For dummy tests, we only need the primary page, no need for the
+	 * commit page.
+	 */
+	ctrl->page = alloc_page(GFP_NOFS);
+	if (!ctrl->page) {
+		kfree(ctrl);
+		return NULL;
+	}
+	ctrl->blocksize = BTRFS_STRIPE_LEN;
+	atomic64_set(&ctrl->event, 1);
+	spin_lock_init(&ctrl->lock);
+	memzero_page(ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
+	wis = page_address(ctrl->page);
+	wi_set_super_magic(wis, WRITE_INTENT_SUPER_MAGIC);
+	wi_set_super_csum_type(wis, 0);
+	wi_set_super_events(wis, 1);
+	wi_set_super_flags(wis, WRITE_INTENT_FLAGS_SUPPORTED);
+	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
+	wi_set_super_blocksize(wis, ctrl->blocksize);
+	wi_set_super_nr_entries(wis, 0);
+	return ctrl;
+}
+
+static void zero_bitmaps(struct write_intent_ctrl *ctrl)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+
+	memzero_page(ctrl->page, sizeof(struct write_intent_super),
+			WRITE_INTENT_BITMAPS_SIZE -
+			sizeof(struct write_intent_super));
+	wi_set_super_nr_entries(wis, 0);
+}
+
+static int compare_bitmaps(struct write_intent_ctrl *ctrl,
+			   int nr_entries, u64 *bytenrs, u64 *bitmaps)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	struct write_intent_entry empty = {0};
+	int i;
+
+	if (wi_super_nr_entries(wis) != nr_entries) {
+		test_err("nr entries mismatch, has %llu expect %u",
+			wi_super_nr_entries(wis), nr_entries);
+		goto err;
+	}
+
+	for (i = 0; i < nr_entries; i++) {
+		struct write_intent_entry *entry =
+			write_intent_entry_nr(ctrl, i);
+
+		if (wi_entry_bytenr(entry) != bytenrs[i]) {
+			test_err("bytenr mismatch, has %llu expect %llu",
+				  wi_entry_bytenr(entry), bytenrs[i]);
+			goto err;
+		}
+		if (wi_entry_raw_bitmap(entry) != bitmaps[i]) {
+			test_err("bitmap mismatch, has 0x%016llx expect 0x%016llx",
+				  wi_entry_raw_bitmap(entry), bitmaps[i]);
+			goto err;
+		}
+	}
+
+	/* The unused entries should all be zero. */
+	for (i = nr_entries; i < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES;
+	     i++) {
+		if (memcmp(write_intent_entry_nr(ctrl, i), &empty,
+			   sizeof(empty))) {
+			test_err(
+			"unused entry is not empty, entry %u nr_entries %u",
+				 i, nr_entries);
+			goto err;
+		}
+	}
+	return 0;
+err:
+	/* Dump the bitmaps for better debugging. */
+	test_err("dumping bitmaps, nr_entries=%llu:", wi_super_nr_entries(wis));
+	for (i = 0; i < wi_super_nr_entries(wis); i++) {
+		struct write_intent_entry *entry =
+			write_intent_entry_nr(ctrl, i);
+
+		test_err("  entry=%u bytenr=%llu bitmap=0x%016llx\n",
+			 i, wi_entry_bytenr(entry), wi_entry_raw_bitmap(entry));
+	}
+	return -EUCLEAN;
+}
+
+static void free_dummy_ctrl(struct write_intent_ctrl *ctrl)
+{
+	__free_page(ctrl->page);
+	ASSERT(ctrl->commit_page == NULL);
+	kfree(ctrl);
+}
+
+/*
+ * Basic tests to ensure set and clear can properly handle bits set/clear in
+ * one entry.
+ */
+static int test_case_simple_entry(struct write_intent_ctrl *ctrl)
+{
+	const u32 blocksize = BTRFS_STRIPE_LEN;
+	u64 bitmaps[1] = { 0 };
+	u64 bytenrs[1] = { 0 };
+	int ret;
+
+	zero_bitmaps(ctrl);
+
+	write_intent_set_bits(ctrl, 0, blocksize * 3);
+
+	bitmaps[0] = 0x7;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, 0, blocksize * 3);
+	ret = compare_bitmaps(ctrl, 0, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_set_bits(ctrl, blocksize * 8, blocksize * 3);
+
+	bitmaps[0] = 0x700;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, blocksize * 9, blocksize * 2);
+	bitmaps[0] = 0x100;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, blocksize * 8, blocksize * 1);
+	ret = compare_bitmaps(ctrl, 0, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	/* Tests at high bits. */
+	write_intent_set_bits(ctrl, blocksize * 61, blocksize * 3);
+	bitmaps[0] = 0xe000000000000000L;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+	write_intent_clear_bits(ctrl, blocksize * 61, blocksize * 1);
+	bitmaps[0] = 0xc000000000000000L;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+	write_intent_clear_bits(ctrl, blocksize * 62, blocksize * 2);
+	ret = compare_bitmaps(ctrl, 0, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/* Tests set/clear that cross entry boundaries. */
+static int test_case_cross_entries(struct write_intent_ctrl *ctrl)
+{
+	const u32 blocksize = BTRFS_STRIPE_LEN;
+	u64 bitmaps[3] = { 0 };
+	u64 bytenrs[3] = { 0 };
+	int ret;
+
+	zero_bitmaps(ctrl);
+
+	write_intent_set_bits(ctrl, blocksize * 32, blocksize * 64);
+	bitmaps[0] = 0xffffffff00000000L;
+	bytenrs[0] = 0;
+	bitmaps[1] = 0x00000000ffffffffL;
+	bytenrs[1] = 4194304;
+	ret = compare_bitmaps(ctrl, 2, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_set_bits(ctrl, blocksize * 96, blocksize * 64);
+	bitmaps[0] = 0xffffffff00000000L;
+	bytenrs[0] = 0;
+	bitmaps[1] = 0xffffffffffffffffL;
+	bytenrs[1] = 4194304;
+	bitmaps[2] = 0x00000000ffffffffL;
+	bytenrs[2] = 8388608;
+	ret = compare_bitmaps(ctrl, 3, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, blocksize * 33, blocksize * 126);
+	bitmaps[0] = 0x0000000100000000L;
+	bytenrs[0] = 0;
+	bitmaps[1] = 0x0000000080000000L;
+	bytenrs[1] = 8388608;
+	ret = compare_bitmaps(ctrl, 2, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
+int btrfs_test_write_intent_bitmaps(void)
+{
+	struct write_intent_ctrl *ctrl;
+	int ret;
+
+	ctrl = alloc_dummy_ctrl();
+	if (!ctrl) {
+		test_std_err(TEST_ALLOC_WRITE_INTENT_CTRL);
+		return -ENOMEM;
+	}
+	test_msg("running extent_map tests");
+
+	ret = test_case_simple_entry(ctrl);
+	if (ret < 0) {
+		test_err("failed set/clear tests in one simple entry");
+		goto out;
+	}
+
+	ret = test_case_cross_entries(ctrl);
+	if (ret < 0) {
+		test_err("failed set/clear tests across entry boundaries");
+		goto out;
+	}
+out:
+	free_dummy_ctrl(ctrl);
+	return ret;
+}
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 8a69bf39a994..b4a205cb0c88 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -216,16 +216,6 @@ static int write_intent_init(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
-static struct write_intent_entry *write_intent_entry_nr(
-				struct write_intent_ctrl *ctrl, int nr)
-{
-
-	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
-	return (page_address(ctrl->page) +
-		sizeof(struct write_intent_super) +
-		nr * sizeof(struct write_intent_entry));
-}
-
 /*
  * Return <0 if the bytenr is before the given entry.
  * Return 0 if the bytenr is inside the given entry.
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 0da1b7421590..00eb116e4c38 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -197,6 +197,8 @@ WRITE_INTENT_SETGET_FUNCS(super_blocksize, struct write_intent_super,
 WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
 			  csum_type, 16);
 WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
+WRITE_INTENT_SETGET_FUNCS(entry_raw_bitmap, struct write_intent_entry,
+			  bitmap, 64);
 
 static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
 {
@@ -205,6 +207,16 @@ static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
 	return wi_super_blocksize(wis) * WRITE_INTENT_BITS_PER_ENTRY;
 }
 
+static inline struct write_intent_entry *write_intent_entry_nr(
+				struct write_intent_ctrl *ctrl, int nr)
+{
+
+	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+	return (page_address(ctrl->page) +
+		sizeof(struct write_intent_super) +
+		nr * sizeof(struct write_intent_entry));
+}
+
 static inline void wie_get_bitmap(struct write_intent_entry *entry,
 				  unsigned long *bitmap)
 {
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 09/14] btrfs: write back write intent bitmap after barrier_all_devices()
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (7 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 08/14] btrfs: selftests: add selftests for write-intent bitmaps Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 10/14] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

The new function, btrfs_write_intent_writeback(), will also accept a u64
parameter, @event, to do extra fastpath check to avoid unnecessary
writeback.

For now we just pass 0 to write the current bitmap to disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c                          |  9 +++
 fs/btrfs/tests/write-intent-bitmaps-tests.c |  2 +
 fs/btrfs/write-intent.c                     | 79 ++++++++++++++++++++-
 fs/btrfs/write-intent.h                     | 25 +++++++
 4 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 98480bf6e5e2..89bf3b2693a5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4381,6 +4381,15 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 		}
 	}
 
+	ret = btrfs_write_intent_writeback(fs_info, 0);
+	if (ret < 0) {
+		mutex_unlock(
+			&fs_info->fs_devices->device_list_mutex);
+		btrfs_handle_fs_error(fs_info, ret,
+			"errors while writing back write intent bitmaps.");
+		return ret;
+	}
+
 	list_for_each_entry(dev, head, dev_list) {
 		if (!dev->bdev) {
 			total_errors++;
diff --git a/fs/btrfs/tests/write-intent-bitmaps-tests.c b/fs/btrfs/tests/write-intent-bitmaps-tests.c
index c956acaea1c7..1b81720f21d1 100644
--- a/fs/btrfs/tests/write-intent-bitmaps-tests.c
+++ b/fs/btrfs/tests/write-intent-bitmaps-tests.c
@@ -25,6 +25,8 @@ static struct write_intent_ctrl *alloc_dummy_ctrl(void)
 	ctrl->blocksize = BTRFS_STRIPE_LEN;
 	atomic64_set(&ctrl->event, 1);
 	spin_lock_init(&ctrl->lock);
+	init_waitqueue_head(&ctrl->overflow_wait);
+	init_waitqueue_head(&ctrl->write_wait);
 	memzero_page(ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
 	wis = page_address(ctrl->page);
 	wi_set_super_magic(wis, WRITE_INTENT_SUPER_MAGIC);
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index b4a205cb0c88..5e26813a06e0 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -76,8 +76,17 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	shash->tfm = fs_info->csum_shash;
 
 	spin_lock(&ctrl->lock);
-	wis = page_address(ctrl->page);
 
+	/* No update on the bitmap, just skip this writeback. */
+	if (!memcmp(page_address(ctrl->page), page_address(ctrl->commit_page),
+		    WRITE_INTENT_BITMAPS_SIZE)) {
+		ctrl->writing_event = 0;
+		spin_unlock(&ctrl->lock);
+		wake_up(&ctrl->write_wait);
+		return 0;
+	}
+
+	wis = page_address(ctrl->page);
 	/*
 	 * Bump up the event counter each time this bitmap is going to be
 	 * written.
@@ -127,6 +136,13 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	}
 	wait_event(wb_ctrl.wait, atomic_read(&wb_ctrl.pending_bios) == 0);
 
+	spin_lock(&ctrl->lock);
+	if (ctrl->writing_event > ctrl->committed_event)
+		ctrl->committed_event = ctrl->writing_event;
+	ctrl->writing_event = 0;
+	spin_unlock(&ctrl->lock);
+	wake_up(&ctrl->write_wait);
+
 	if (total_errors + atomic_read(&wb_ctrl.errors) >
 	    btrfs_super_num_devices(fs_info->super_copy) - 1) {
 		btrfs_err(fs_info, "failed to writeback write-intent bitmaps");
@@ -624,6 +640,63 @@ void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 		WARN_ON_ONCE(1);
 }
 
+int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return 0;
+
+	ASSERT(ctrl);
+
+again:
+	spin_lock(&ctrl->lock);
+
+	/*
+	 * The bitmap has already been written to disk at least once. Our update
+	 * has already reached disk.
+	 */
+	if (event && ctrl->committed_event > event) {
+		spin_unlock(&ctrl->lock);
+		return 0;
+	}
+
+	/* Previous writing hasn't finished, wait for it and retry. */
+	if (ctrl->writing_event && ctrl->writing_event <= event) {
+		DEFINE_WAIT(__wait);
+
+		prepare_to_wait_event(&ctrl->write_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->write_wait, &__wait);
+		goto again;
+	}
+
+	/* Someone is already writing back the bitmap. */
+	if (ctrl->writing_event) {
+		DEFINE_WAIT(__wait);
+
+		ASSERT(ctrl->writing_event > event);
+		prepare_to_wait_event(&ctrl->write_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->write_wait, &__wait);
+		return 0;
+	}
+
+	/*
+	 * We're the one to write back the bitmap, update @writing_event so
+	 * all the other caller will just wait for us.
+	 */
+	ctrl->writing_event = atomic64_read(&ctrl->event) + 1;
+	spin_unlock(&ctrl->lock);
+
+	/* Slow path, do the submission and wait. */
+	return write_intent_writeback(fs_info);
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
@@ -648,6 +721,8 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 	}
 
 	spin_lock_init(&fs_info->wi_ctrl->lock);
+	init_waitqueue_head(&fs_info->wi_ctrl->overflow_wait);
+	init_waitqueue_head(&fs_info->wi_ctrl->write_wait);
 
 	/*
 	 * Go through every writeable device to find the highest event.
@@ -688,6 +763,8 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 				  dev->devid, ret);
 			goto cleanup;
 		}
+		memcpy_page(fs_info->wi_ctrl->commit_page, 0,
+			    fs_info->wi_ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
 		wis = page_address(fs_info->wi_ctrl->page);
 		atomic64_set(&fs_info->wi_ctrl->event, wi_super_events(wis));
 		fs_info->wi_ctrl->blocksize = wi_super_blocksize(wis);
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 00eb116e4c38..bf84737e0706 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -123,12 +123,30 @@ struct write_intent_ctrl {
 	/* A copy for writeback. */
 	struct page *commit_page;
 
+	/*
+	 * For callers who has updated their bitmap, wait for the bitmap to be
+	 * flushed to disk.
+	 */
+	wait_queue_head_t write_wait;
+
+	/*
+	 * For callers whose bits can not be updated as no enough space left
+	 * int the bitmap.
+	 */
+	wait_queue_head_t overflow_wait;
+
 	/* Cached event counter.*/
 	atomic64_t event;
 
 	/* Lock for reading/writing above @page. */
 	spinlock_t lock;
 
+	/* Event number for the bitmap being written. */
+	u64 writing_event;
+
+	/* Event number for the last committed bitmap. */
+	u64 committed_event;
+
 	/* Cached blocksize from write intent super. */
 	u32 blocksize;
 };
@@ -250,4 +268,11 @@ void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
 void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
 
+/*
+ * Ensure the bitmap for @event is already written to disk.
+ *
+ * If @event is 0, it means to write current bitmap to disk.
+ */
+int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
+
 #endif
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 10/14] btrfs: update and writeback the write-intent bitmap for RAID56 write.
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (8 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 09/14] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:37 ` [PATCH 11/14] btrfs: raid56: clear write-intent bimaps when a full stripe finishes Qu Wenruo
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

This allows us to mark the write-intent bitmaps dirty for later recovery
usage.

For now, we only mark the bitmaps dirty but without really clearing
them, this is going to cause problems (hang the fs if the bitmap is
full), but this also allows us to debug the bitmap with the new
dump-write-intent tool:

  csum_type		0 (crc32c)
  csum			0xad622029 [match]
  magic			_wIbSb_Q [match]
  fsid			46bcd711-6c9b-400f-aaba-bf99aa3dd321
  flags			0x7
  			( TARGET_RAID56 |
  			  INTERNAL |
  			  BYTENR_LOGICAL )
  events			10
  size			4096
  blocksize		65536
  nr_entries		1
  entry 0, bytenr 385875968, bitmap 0x000000000003fffc

This is doing 1MiB write for logical 388104192, which matches the above
output bitmap.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c       |  8 +++++++
 fs/btrfs/write-intent.c | 46 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  9 ++++++++
 3 files changed, 63 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 9b79a212fd6a..1b9a9f40df29 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1183,6 +1183,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	bool has_qstripe;
 	struct bio_list bio_list;
 	struct bio *bio;
+	u64 event;
 	int ret;
 
 	bio_list_init(&bio_list);
@@ -1334,6 +1335,13 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
 	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
+	/* Update the write intent bitmap before we start submitting bios. */
+	btrfs_write_intent_mark_dirty(bioc->fs_info, rbio->bioc->raid_map[0],
+				     rbio->nr_data * BTRFS_STRIPE_LEN, &event);
+	ret = btrfs_write_intent_writeback(bioc->fs_info, event);
+
+	if (ret < 0)
+		goto cleanup;
 	while ((bio = bio_list_pop(&bio_list))) {
 		bio->bi_end_io = raid_write_end_io;
 
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 5e26813a06e0..8520105d0d84 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -381,6 +381,9 @@ static void delete_one_entry(struct write_intent_ctrl *ctrl, int nr)
 	memset(write_intent_entry_nr(ctrl, cur_nr_entries - 1), 0,
 	       sizeof(struct write_intent_entry));
 	wi_set_super_nr_entries(wis, cur_nr_entries - 1);
+
+	/* We freed one entry, wake up who are waiting for the extra space. */
+	wake_up(&ctrl->overflow_wait);
 }
 
 /*
@@ -640,6 +643,49 @@ void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 		WARN_ON_ONCE(1);
 }
 
+void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 len, u64 *event_ret)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	u32 entry_len;
+	int nr_entries;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return;
+
+	ASSERT(ctrl);
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+again:
+	spin_lock(&ctrl->lock);
+	entry_len = ctrl->blocksize * WRITE_INTENT_BITS_PER_ENTRY;
+	nr_entries = (round_up(logical + len, entry_len) -
+		      round_down(logical, entry_len)) / entry_len;
+	wis = page_address(ctrl->page);
+
+	/*
+	 * May not have enough space left. This calculation is definitely
+	 * overkilled, but will ensure we have enough space for it.
+	 */
+	if (unlikely(wi_super_nr_entries(wis) + nr_entries) >=
+		     WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES) {
+		DEFINE_WAIT(__wait);
+
+		prepare_to_wait_event(&ctrl->overflow_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->write_wait, &__wait);
+		goto again;
+	}
+
+	/* Update the bitmap. */
+	write_intent_set_bits(ctrl, logical, len);
+	*event_ret = atomic64_read(&ctrl->event);
+	spin_unlock(&ctrl->lock);
+}
+
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index bf84737e0706..e1e3a16f8929 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -275,4 +275,13 @@ void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
  */
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
 
+/*
+ * Mark the range dirty in write intent bitmaps.
+ *
+ * May (but unlikely) sleep if there is not enough free entries.
+ * In that case, we will wait for enough free entries to be released.
+ */
+void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 len, u64 *event_ret);
+
 #endif
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 11/14] btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (9 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 10/14] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
@ 2022-07-25  5:37 ` Qu Wenruo
  2022-07-25  5:38 ` [PATCH 12/14] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:37 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c       |  7 +++++++
 fs/btrfs/write-intent.c | 45 +++++++++++++++++++++++++++++------------
 fs/btrfs/write-intent.h |  8 ++++++++
 3 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1b9a9f40df29..0a0a2a1e96c3 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -817,6 +817,13 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 
 	if (rbio->generic_bio_cnt)
 		btrfs_bio_counter_sub(rbio->bioc->fs_info, rbio->generic_bio_cnt);
+
+	/* Clear the write-intent bitmap range for write operation. */
+	if (rbio->operation == BTRFS_RBIO_WRITE)
+		btrfs_write_intent_clear_dirty(rbio->bioc->fs_info,
+				       rbio->bioc->raid_map[0],
+				       rbio->nr_data * BTRFS_STRIPE_LEN);
+
 	/*
 	 * Clear the data bitmap, as the rbio may be cached for later usage.
 	 * do this before before unlock_stripe() so there will be no new bio
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 8520105d0d84..40d579574f3d 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -61,6 +61,7 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	struct bitmap_writeback_contrl wb_ctrl = {0};
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	const int nr_devs_max = fs_info->fs_devices->open_devices + 4;
+	unsigned long flags;
 	int nr_devs = 0;
 	int total_errors = 0;
 	int ret;
@@ -75,13 +76,13 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 
 	shash->tfm = fs_info->csum_shash;
 
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 
 	/* No update on the bitmap, just skip this writeback. */
 	if (!memcmp(page_address(ctrl->page), page_address(ctrl->commit_page),
 		    WRITE_INTENT_BITMAPS_SIZE)) {
 		ctrl->writing_event = 0;
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		wake_up(&ctrl->write_wait);
 		return 0;
 	}
@@ -98,7 +99,7 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	atomic64_inc(&ctrl->event);
 	memcpy_page(ctrl->commit_page, 0, ctrl->page, 0,
 		    WRITE_INTENT_BITMAPS_SIZE);
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	init_waitqueue_head(&wb_ctrl.wait);
 	atomic_set(&wb_ctrl.pending_bios, 0);
@@ -136,11 +137,11 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	}
 	wait_event(wb_ctrl.wait, atomic_read(&wb_ctrl.pending_bios) == 0);
 
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 	if (ctrl->writing_event > ctrl->committed_event)
 		ctrl->committed_event = ctrl->writing_event;
 	ctrl->writing_event = 0;
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 	wake_up(&ctrl->write_wait);
 
 	if (total_errors + atomic_read(&wb_ctrl.errors) >
@@ -648,6 +649,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
 	struct write_intent_super *wis;
+	unsigned long flags;
 	u32 entry_len;
 	int nr_entries;
 
@@ -658,7 +660,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
 
 again:
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 	entry_len = ctrl->blocksize * WRITE_INTENT_BITS_PER_ENTRY;
 	nr_entries = (round_up(logical + len, entry_len) -
 		      round_down(logical, entry_len)) / entry_len;
@@ -674,7 +676,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 
 		prepare_to_wait_event(&ctrl->overflow_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->write_wait, &__wait);
 		goto again;
@@ -683,12 +685,29 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	/* Update the bitmap. */
 	write_intent_set_bits(ctrl, logical, len);
 	*event_ret = atomic64_read(&ctrl->event);
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+
+void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				    u32 len)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	unsigned long flags;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return;
+	ASSERT(ctrl);
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	write_intent_clear_bits(ctrl, logical, len);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	unsigned long flags;
 
 	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
 		return 0;
@@ -696,14 +715,14 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 	ASSERT(ctrl);
 
 again:
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 
 	/*
 	 * The bitmap has already been written to disk at least once. Our update
 	 * has already reached disk.
 	 */
 	if (event && ctrl->committed_event > event) {
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		return 0;
 	}
 
@@ -713,7 +732,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 
 		prepare_to_wait_event(&ctrl->write_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->write_wait, &__wait);
 		goto again;
@@ -726,7 +745,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 		ASSERT(ctrl->writing_event > event);
 		prepare_to_wait_event(&ctrl->write_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->write_wait, &__wait);
 		return 0;
@@ -737,7 +756,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 	 * all the other caller will just wait for us.
 	 */
 	ctrl->writing_event = atomic64_read(&ctrl->event) + 1;
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	/* Slow path, do the submission and wait. */
 	return write_intent_writeback(fs_info);
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index e1e3a16f8929..872a707ef67d 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -284,4 +284,12 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
 void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 				   u32 len, u64 *event_ret);
 
+/*
+ * Clear the range dirty in write intent bitmaps.
+ *
+ * This function should not sleep, and no need to wait for the bitmap to be
+ * flushed.
+ */
+void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				    u32 len);
 #endif
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 12/14] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (10 preceding siblings ...)
  2022-07-25  5:37 ` [PATCH 11/14] btrfs: raid56: clear write-intent bimaps when a full stripe finishes Qu Wenruo
@ 2022-07-25  5:38 ` Qu Wenruo
  2022-07-25  5:38 ` [PATCH 13/14] btrfs: avoid recording full stripe write into write-intent bitmaps Qu Wenruo
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:38 UTC (permalink / raw)
  To: linux-btrfs

To properly cleanup the bitmaps, we need to scrub the logical ranges in
the bitmaps.

Unfortunately there is no such convient interface at all (scrub only
works at device offset for now).

So just introduce a place holder to warn and clear the bitmap.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c      |  7 +++++++
 fs/btrfs/write-intent.c | 35 +++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  8 ++++++++
 3 files changed, 50 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 89bf3b2693a5..d29fad12d459 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3729,6 +3729,13 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			  ret);
 		goto fail_block_groups;
 	}
+
+	ret = btrfs_write_intent_recover(fs_info);
+	if (ret < 0) {
+		btrfs_err(fs_info, "failed to recover write-intent bitmap: %d",
+			  ret);
+		goto fail_block_groups;
+	}
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to recover balance: %d", ret);
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 40d579574f3d..82228713e621 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -704,6 +704,41 @@ void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
+int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	int ret = 0;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return ret;
+
+	ASSERT(ctrl);
+	wis = page_address(ctrl->page);
+
+	if (wi_super_nr_entries(wis) != 0) {
+		int i;
+
+		btrfs_warn(fs_info, "dirty write intent bitmap found:");
+		for (i = 0; i < wi_super_nr_entries(wis); i++) {
+			struct write_intent_entry *entry =
+				write_intent_entry_nr(ctrl, i);
+
+			btrfs_warn(fs_info,
+				"  entry=%u bytenr=%llu bitmap=0x%016llx\n", i,
+				   wi_entry_bytenr(entry),
+				   wi_entry_raw_bitmap(entry));
+		}
+		/* For now, we just clear the whole bitmap. */
+		memzero_page(ctrl->page, sizeof(struct write_intent_super),
+			     WRITE_INTENT_BITMAPS_SIZE -
+			     sizeof(struct write_intent_super));
+		wi_set_super_nr_entries(wis, 0);
+		ret = write_intent_writeback(fs_info);
+	}
+	return ret;
+}
+
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 872a707ef67d..bb6e9b599373 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -292,4 +292,12 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
  */
 void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 				    u32 len);
+
+/*
+ * Rebuild the range in the write-intent bitmaps.
+ *
+ * Currently not working, it will just output a warning and clear the bitmap.
+ */
+int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info);
+
 #endif
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 13/14] btrfs: avoid recording full stripe write into write-intent bitmaps
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (11 preceding siblings ...)
  2022-07-25  5:38 ` [PATCH 12/14] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
@ 2022-07-25  5:38 ` Qu Wenruo
  2022-07-25  5:38 ` [PATCH 14/14] btrfs: scrub the full stripe which had sub-stripe write at mount time Qu Wenruo
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:38 UTC (permalink / raw)
  To: linux-btrfs

Full stripe write can happen in the following cases:

- Writing into a completely new stripe
  In this case, even if powerloss happened, we won't have any committed
  metadata referring the new full stripe.

  Thus we don't need to recover.

- Writing into a NODATACOW range
  In this case, although in theory we should recovery after power loss,
  but NODATACOW implies NODATASUM, thus we have no way to determine
  which data is correct.

  Thus we don't need to and can't recover either.

So just avoid recording full stripe write into write-intent bitmaps.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0a0a2a1e96c3..37e5fd5df1f9 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -818,8 +818,13 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 	if (rbio->generic_bio_cnt)
 		btrfs_bio_counter_sub(rbio->bioc->fs_info, rbio->generic_bio_cnt);
 
-	/* Clear the write-intent bitmap range for write operation. */
-	if (rbio->operation == BTRFS_RBIO_WRITE)
+	/*
+	 * Clear the write-intent bitmap range for write operation.
+	 * For full stripe write we didn't record it into write-intent thus no
+	 * need to clear the bits for full stripe write.
+	 */
+	if (rbio->operation == BTRFS_RBIO_WRITE &&
+	    rbio->bio_list_bytes < rbio->nr_data * BTRFS_STRIPE_LEN)
 		btrfs_write_intent_clear_dirty(rbio->bioc->fs_info,
 				       rbio->bioc->raid_map[0],
 				       rbio->nr_data * BTRFS_STRIPE_LEN);
@@ -1342,13 +1347,19 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
 	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
-	/* Update the write intent bitmap before we start submitting bios. */
-	btrfs_write_intent_mark_dirty(bioc->fs_info, rbio->bioc->raid_map[0],
-				     rbio->nr_data * BTRFS_STRIPE_LEN, &event);
-	ret = btrfs_write_intent_writeback(bioc->fs_info, event);
+	/*
+	 * Update the write intent bitmap if it's a sub-stripe write,
+	 * before we start submitting bios.
+	 */
+	if (rbio->bio_list_bytes < rbio->nr_data * BTRFS_STRIPE_LEN) {
+		btrfs_write_intent_mark_dirty(bioc->fs_info,
+				rbio->bioc->raid_map[0],
+				rbio->nr_data * BTRFS_STRIPE_LEN, &event);
+		ret = btrfs_write_intent_writeback(bioc->fs_info, event);
+		if (ret < 0)
+			goto cleanup;
+	}
 
-	if (ret < 0)
-		goto cleanup;
 	while ((bio = bio_list_pop(&bio_list))) {
 		bio->bi_end_io = raid_write_end_io;
 
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 14/14] btrfs: scrub the full stripe which had sub-stripe write at mount time
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (12 preceding siblings ...)
  2022-07-25  5:38 ` [PATCH 13/14] btrfs: avoid recording full stripe write into write-intent bitmaps Qu Wenruo
@ 2022-07-25  5:38 ` Qu Wenruo
  2022-08-03  0:29 ` [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 me
  2022-08-03  8:48 ` Goffredo Baroncelli
  15 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-07-25  5:38 UTC (permalink / raw)
  To: linux-btrfs

Since we have write-intent bitmaps, we know exactly which full stripes
were dirty before last unclean shutdown.

As long as there is no missing device, we can just scrub all the data
stripes, then scrub the P/Q stripes.

This patch will implement the code for mount time recovery, by:

- Grab the full stripe
- Scrub the data stripes first
- Scrub the P/Q stripes
  This is not the optimal way to check, as previous scrub on data
  stripes has ensured all the data are correct, we can just
  re-generate the P/Q.

  But unfortunately we don't have a good way to re-use the sectors from
  previous scrub.
  So here we still go scrub_raid56_parity(), which would re-check the
  data stripes.

And since btrfs_write_intent_recover() needs block groups to be
initialized, and has to be called before balance, this patch also moves
the timing of btrfs_write_intent_recover() and btrfs_recover_balance()
after btrfs_read_block_groups().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h        |   2 +
 fs/btrfs/disk-io.c      |  24 +++---
 fs/btrfs/scrub.c        | 177 +++++++++++++++++++++++++++++++++++++---
 fs/btrfs/write-intent.c |  30 +++++--
 4 files changed, 206 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ade76ba98c68..7b22dd04fd4f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4003,6 +4003,8 @@ int btrfs_scrub_cancel(struct btrfs_fs_info *info);
 int btrfs_scrub_cancel_dev(struct btrfs_device *dev);
 int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
 			 struct btrfs_scrub_progress *progress);
+int btrfs_scrub_raid56_full_stripe(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 *full_stripe_ret);
 static inline void btrfs_init_full_stripe_locks_tree(
 			struct btrfs_full_stripe_locks_tree *locks_root)
 {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d29fad12d459..b1f8c17906ea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3730,18 +3730,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_block_groups;
 	}
 
-	ret = btrfs_write_intent_recover(fs_info);
-	if (ret < 0) {
-		btrfs_err(fs_info, "failed to recover write-intent bitmap: %d",
-			  ret);
-		goto fail_block_groups;
-	}
-	ret = btrfs_recover_balance(fs_info);
-	if (ret) {
-		btrfs_err(fs_info, "failed to recover balance: %d", ret);
-		goto fail_block_groups;
-	}
-
 	ret = btrfs_init_dev_stats(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to init dev_stats: %d", ret);
@@ -3786,6 +3774,18 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_sysfs;
 	}
 
+	ret = btrfs_write_intent_recover(fs_info);
+	if (ret < 0) {
+		btrfs_err(fs_info, "failed to recover write-intent bitmap: %d",
+			  ret);
+		goto fail_sysfs;
+	}
+	ret = btrfs_recover_balance(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to recover balance: %d", ret);
+		goto fail_sysfs;
+	}
+
 	btrfs_free_zone_cache(fs_info);
 
 	if (!sb_rdonly(sb) && fs_info->fs_devices->missing_devices &&
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3afe5fa50a63..4f6478072549 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4093,19 +4093,10 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
-int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
-		    u64 end, struct btrfs_scrub_progress *progress,
-		    int readonly, int is_dev_replace)
+static int check_scrub_requreiment(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_dev_lookup_args args = { .devid = devid };
-	struct scrub_ctx *sctx;
-	int ret;
-	struct btrfs_device *dev;
-	unsigned int nofs_flag;
-
 	if (btrfs_fs_closing(fs_info))
 		return -EAGAIN;
-
 	if (fs_info->nodesize > BTRFS_STRIPE_LEN) {
 		/*
 		 * in this case scrub is unable to calculate the checksum
@@ -4132,6 +4123,22 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		       fs_info->sectorsize, SCRUB_MAX_SECTORS_PER_BLOCK);
 		return -EINVAL;
 	}
+	return 0;
+}
+
+int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
+		    u64 end, struct btrfs_scrub_progress *progress,
+		    int readonly, int is_dev_replace)
+{
+	struct btrfs_dev_lookup_args args = { .devid = devid };
+	struct scrub_ctx *sctx;
+	int ret;
+	struct btrfs_device *dev;
+	unsigned int nofs_flag;
+
+	ret = check_scrub_requreiment(fs_info);
+	if (ret < 0)
+		return ret;
 
 	/* Allocate outside of device_list_mutex */
 	sctx = scrub_setup_ctx(fs_info, is_dev_replace);
@@ -4355,3 +4362,153 @@ static void scrub_find_good_copy(struct btrfs_fs_info *fs_info,
 	*extent_dev = bioc->stripes[0].dev;
 	btrfs_put_bioc(bioc);
 }
+
+/*
+ * This is for mount time write-intent recovery, to repair the full stripe
+ * which
+ */
+int btrfs_scrub_raid56_full_stripe(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 *full_stripe_ret)
+{
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, logical);
+	struct btrfs_root *csum_root = btrfs_csum_root(fs_info, logical);
+	struct scrub_ctx *sctx;
+	struct btrfs_block_group *bg;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 map_length = BTRFS_STRIPE_LEN;
+	int nr_data;
+	int ret;
+	int i;
+
+	/*
+	 * Use STRIPE_LEN as default return length. Will be updated after we
+	 * got a bioc.
+	 */
+	*full_stripe_ret = BTRFS_STRIPE_LEN;
+
+	ret = check_scrub_requreiment(fs_info);
+	if (ret < 0)
+		return ret;
+
+	bg = btrfs_lookup_block_group(fs_info, logical);
+	if (!bg) {
+		btrfs_err(fs_info, "can not find block group for logical %llu",
+			  logical);
+		return -ENOENT;
+	}
+
+	read_lock(&fs_info->mapping_tree.lock);
+	em = lookup_extent_mapping(&fs_info->mapping_tree, logical,
+				   fs_info->sectorsize);
+	read_unlock(&fs_info->mapping_tree.lock);
+	if (!em) {
+		btrfs_err(fs_info, "can not find chunk for logical %llu",
+			  logical);
+		ret = -ENOENT;
+		goto out_put_bg;
+	}
+
+	map = em->map_lookup;
+	if (!(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
+		btrfs_warn(fs_info, "logical %llu is not inside a RAID56 chunk",
+			   logical);
+		ret = -EINVAL;
+		goto out_free_em;
+	}
+	/* We need to make sure the logical bytenr is aligned to STRIPE_LEN. */
+	if (!IS_ALIGNED(logical - em->start, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info, "logical %llu is not stripe aligned", logical);
+		ret = -EINVAL;
+		goto out_free_em;
+	}
+
+	/* Also logical should be at a full stripe start. */
+	if (div_u64(logical - em->start, BTRFS_STRIPE_LEN) % map->num_stripes) {
+		btrfs_err(fs_info, "logical %llu is not the start of a full stripe",
+			  logical);
+		ret = -EINVAL;
+		goto out_free_em;
+	}
+
+	sctx = scrub_setup_ctx(fs_info, false);
+	if (IS_ERR(sctx)) {
+		ret = PTR_ERR(sctx);
+		goto out_free_em;
+	}
+
+	ret = scrub_workers_get(fs_info, false);
+	if (ret)
+		goto out_free_ctx;
+
+	/* Grab the full stripe mapping. */
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &map_length, &bioc);
+	if (ret < 0)
+		goto out_free_ctx;
+
+	if (!(bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
+		btrfs_warn(fs_info, "logical %llu is not inside a RAID56 chunk",
+			   logical);
+		ret = -EOPNOTSUPP;
+		goto out_put_bioc;
+	}
+
+	nr_data = bioc->num_stripes - bioc->num_tgtdevs -
+		  btrfs_nr_parity_stripes(bioc->map_type);
+	*full_stripe_ret = nr_data * BTRFS_STRIPE_LEN;
+
+	/* Scrub the data stripes first. */
+	for (i = 0; i < nr_data; i++, logical += BTRFS_STRIPE_LEN) {
+		struct btrfs_device *dev = bioc->stripes[i].dev;
+
+		ret = scrub_simple_mirror(sctx, extent_root, csum_root, bg,
+					  map, logical, BTRFS_STRIPE_LEN, dev,
+					  bioc->stripes[i].physical, 1);
+		scrub_submit(sctx);
+		mutex_lock(&sctx->wr_lock);
+		scrub_wr_submit(sctx);
+		mutex_unlock(&sctx->wr_lock);
+		if (ret < 0)
+			goto out_put_bioc;
+	}
+
+	/* Then scrub the P/Q stripes. */
+	ret = scrub_raid56_parity(sctx, map, bioc->stripes[i].dev,
+				  bioc->raid_map[0],
+				  bioc->raid_map[0] + nr_data * BTRFS_STRIPE_LEN);
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_lock);
+	if (ret < 0)
+		goto out_put_bioc;
+	i++;
+
+	if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
+		ret = scrub_raid56_parity(sctx, map, bioc->stripes[i].dev,
+					  bioc->raid_map[0],
+					  bioc->raid_map[0] + nr_data * BTRFS_STRIPE_LEN);
+		scrub_submit(sctx);
+		mutex_lock(&sctx->wr_lock);
+		scrub_wr_submit(sctx);
+		mutex_unlock(&sctx->wr_lock);
+		if (ret < 0)
+			goto out_put_bioc;
+	}
+	if (!ret)
+		btrfs_info(fs_info, "scrubbed full stripe at logical %llu",
+			   logical);
+
+out_put_bioc:
+	btrfs_put_bioc(bioc);
+out_free_ctx:
+	scrub_free_ctx(sctx);
+out_free_em:
+	free_extent_map(em);
+out_put_bg:
+	btrfs_put_block_group(bg);
+
+	return ret;
+}
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 82228713e621..f8912de016c9 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -704,6 +704,29 @@ void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
+static void check_one_entry(struct btrfs_fs_info *fs_info,
+			   struct write_intent_entry *entry)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+	int bit;
+
+	wie_get_bitmap(entry, bitmaps);
+
+	for_each_set_bit(bit, bitmaps, WRITE_INTENT_BITS_PER_ENTRY) {
+		u64 bytenr = wi_entry_bytenr(entry) + bit * ctrl->blocksize;
+		u32 full_stripe_len;
+		int ret;
+
+		ret = btrfs_scrub_raid56_full_stripe(fs_info, bytenr,
+						     &full_stripe_len);
+		if (ret < 0)
+			btrfs_err(fs_info,
+			"failed to scrub full stripe at logical %llu", bytenr);
+		bit += full_stripe_len / BTRFS_STRIPE_LEN;
+	}
+}
+
 int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
@@ -724,12 +747,9 @@ int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info)
 			struct write_intent_entry *entry =
 				write_intent_entry_nr(ctrl, i);
 
-			btrfs_warn(fs_info,
-				"  entry=%u bytenr=%llu bitmap=0x%016llx\n", i,
-				   wi_entry_bytenr(entry),
-				   wi_entry_raw_bitmap(entry));
+			check_one_entry(fs_info, entry);
 		}
-		/* For now, we just clear the whole bitmap. */
+		/* All checked, clear the whole bitmap. */
 		memzero_page(ctrl->page, sizeof(struct write_intent_super),
 			     WRITE_INTENT_BITMAPS_SIZE -
 			     sizeof(struct write_intent_super));
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (13 preceding siblings ...)
  2022-07-25  5:38 ` [PATCH 14/14] btrfs: scrub the full stripe which had sub-stripe write at mount time Qu Wenruo
@ 2022-08-03  0:29 ` me
  2022-08-03  0:58   ` Qu Wenruo
  2022-08-03  8:48 ` Goffredo Baroncelli
  15 siblings, 1 reply; 22+ messages in thread
From: me @ 2022-08-03  0:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hey there,
It would be nice if a bitmap could be used for other purposes too, not
just for RAID5/6. To not only improve resilver times without scrubbing
the entire array, but also for just syncing up NOCOW files. Corruption
sucks, and using NOCOW the user should assume you have no protection
from csums. However, the problem with NOCOW right now is on any
redundant profile, it seems once one copy goes out of sync with
another, unless the user balances the affected chunks or cp
--reflink=never the file, you can end up in a case where one copy may
always be out of sync with the other. Then, the application reading
the file can get different results depending on which disk it reads
from.

That seems worse than the corruption itself that can't be repaired due
to no CSUMs, and is by far worse than MD.

This is especially worrisome since a number of distros and userspace
tools and distros these days are defaulting to using the NOCOW
attribute for common applications. ie. Arch and mySQL, or libvirt.

Is there any way this could be used for these purposes as well, in
addition to the stated purpose for RAID5/6?

On Mon, Jul 25, 2022 at 2:38 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [CHANGELOG]
> v2->v1:
> - Add mount time recovery functionality
>   Now if a dirty bitmap is found, we will do proper recovery at mount
>   time.
>
>   The code is using scrub routine to do the proper recovery for both
>   data and P/Q parity.
>
>   Currently we can only test this by manually setting up a dirty bitmap,
>   and corrupt the full stripe, then mounting it and verify the full
>   stripe using "btrfs check --check-data-csum"
>
> - Skip full stripe writes
>   Full stripe writes are either:
>   * New writes into unallocated space
>     After powerloss, we won't read any data from the full stripe.
>
>   * Writes into NODATACOW ranges
>     We won't have csum for them anyway, thus new way to do any recovery.
>
> - Fix a memory leakage caused by RO mount
>   Previously we only cleanup the write-intent ctrl if it's RW mount,
>   thus for RO mount we will cause memory leak.
>
>
> RFC->v1:
> - Fix a corner case in write_intent_set_bits()
>   If the range covers the last existing entry, but still needs a new
>   entry, the old code will not insert the new entry, causing
>   write_intent_clear_bits() to cause a warning.
>
> - Add selftests for the write intent bitmaps
>   The write intent bitmaps is an sparse array of bitmaps.
>   There are some corner cases tricky to get it done correctly in the
>   first try (see above case).
>   The test case would prevent such problems from happening again.
>
> - Fix hang with dev-replace, and better bitmaps bio submission
>   Previously we will hold device_list_mutex while submitting the bitmaps
>   bio, this can lead to deadlock with dev-replace/dev-removal.
>   Fix it by using RCU to keep an local copy of devices and use them
>   to submit the bitmaps bio.
>
>   Furthermore, there is no need to follow the way of superblocks
>   writeback, as the content of bitmaps are always the same for all
>   devices, we can just submitting the same page and use atomic counter
>   to wait for them to finish.
>
>   Now there is no more crash/warning/deadlock in btrfs/070.
>
> [BACKGROUND]
> Unlike md-raid, btrfs RAID56 has nothing to sync its devices when power
> loss happens.
>
> For pure mirror based profiles it's fine as btrfs can utilize its csums
> to find the correct mirror the repair the bad ones.
>
> But for RAID56, the repair itself needs the data from other devices,
> thus any out-of-sync data can degrade the tolerance.
>
> Even worse, incorrect RMW can use the stale data to generate P/Q,
> removing the possibility of recovery the data.
>
>
> For md-raid, it goes with write-intent bitmap, to do faster resilver,
> and goes journal (partial parity log for RAID5) to ensure it can even
> stand a powerloss + device lose.
>
> [OBJECTIVE]
>
> This patchset will introduce a btrfs specific write-intent bitmap.
>
> The bitmap will locate at physical offset 1MiB of each device, and the
> content is the same between all devices.
>
> When there is a RAID56 write (currently all RAID56 write, including full
> stripe write), before submitting all the real bios to disks,
> write-intent bitmap will be updated and flushed to all writeable
> devices.
>
> So even if a powerloss happened, at the next mount time we know which
> full stripes needs to check, and can start a scrub for those involved
> logical bytenr ranges.
>
> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
>
> Since btrfs can utilize csum for its metadata and CoWed data, unlike
> dm-bitmap which can only be used for faster re-silver, we can fully
> rebuild the full stripe, as long as:
>
> 1) There is no missing device
>    For missing device case, we still need to go full journal.
>
> 2) Untouched data stays untouched
>    This should be mostly sane for sane hardware.
>
> And since the btrfs specific write-intent bitmaps are pretty small (4KiB
> in size), the overhead much lower than full journal.
>
> In the future, we may allow users to choose between just bitmaps or full
> journal to meet their requirement.
>
> [BITMAPS DESIGN]
>
> The bitmaps on-disk format looks like this:
>
>  [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>  |<---------  super::size (4K) ------------->|
>
> Super block contains how many entires are in use.
>
> Each entry is 128 bits (16 bytes) in size, containing one u64 for
> bytenr, and u64 for one bitmap.
>
> And all utilized entries will be sorted in their bytenr order, and no
> bit can overlap.
>
> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
>
> For the worst case, it can contain 14MiB dirty ranges.
> (1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).
>
> For the best case, it can contain 896MiB dirty ranges.
> (all bits set per bitmap)
>
> [WHY NOT BTRFS BTREE]
>
> Current write-intent structure needs two features:
>
> - Its data needs to survive cross stripe boundary
>   Normally this means write-intent btree needs to acts like a proper
>   tree root, has METADATA_ITEMs for all its tree blocks.
>
> - Its data update must be outside of a transaction
>   Currently only log tree can do such thing.
>   But unfortunately log tree can not survive across transaction
>   boundary.
>
> Thus write-intent btree can only meet one of the requirement, not a
> suitable solution here.
>
> [TESTING AND BENCHMARK]
>
> For performance benchmark, unfortunately I don't have 3 HDDs to test.
> Will do the benchmark after secured enough hardware.
>
> For testing, it can survive volume/raid/dev-replace test groups, and no
> write-intent bitmap leakage.
>
> Unfortunately there is still a warning triggered in btrfs/070, still
> under investigation, hopefully to be a false alert in bitmap clearing
> path.
>
> [TODO]
> - Extra optimizations
>   * Enlarge the window between btrfs_write_intent_mark_dirty() and
>     btrfs_write_intent_writeback()
>     So that we can merge more dirty bites and cause less bitmaps
>     writeback
>
> - Proper performance benchmark
>   Needs hardware/baremetal VMs, since I don't have any physical machine
>   large enough to contian 3 3.5" HDDs.
>
>
> Qu Wenruo (14):
>   btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>   btrfs: introduce a new experimental compat RO flag,
>     WRITE_INTENT_BITMAP
>   btrfs: introduce the on-disk format of btrfs write intent bitmaps
>   btrfs: load/create write-intent bitmaps at mount time
>   btrfs: write-intent: write the newly created bitmaps to all disks
>   btrfs: write-intent: introduce an internal helper to set bits for a
>     range.
>   btrfs: write-intent: introduce an internal helper to clear bits for a
>     range.
>   btrfs: selftests: add selftests for write-intent bitmaps
>   btrfs: write back write intent bitmap after barrier_all_devices()
>   btrfs: update and writeback the write-intent bitmap for RAID56 write.
>   btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
>   btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
>   btrfs: avoid recording full stripe write into write-intent bitmaps
>   btrfs: scrub the full stripe which had sub-stripe write at mount time
>
>  fs/btrfs/Makefile                           |   5 +-
>  fs/btrfs/ctree.h                            |  26 +-
>  fs/btrfs/disk-io.c                          |  58 +-
>  fs/btrfs/raid56.c                           |  27 +
>  fs/btrfs/scrub.c                            | 177 +++-
>  fs/btrfs/sysfs.c                            |   2 +
>  fs/btrfs/tests/btrfs-tests.c                |   4 +
>  fs/btrfs/tests/btrfs-tests.h                |   2 +
>  fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>  fs/btrfs/volumes.c                          |  34 +-
>  fs/btrfs/write-intent.c                     | 923 ++++++++++++++++++++
>  fs/btrfs/write-intent.h                     | 303 +++++++
>  fs/btrfs/zoned.c                            |   8 +
>  include/uapi/linux/btrfs.h                  |  17 +
>  14 files changed, 1812 insertions(+), 21 deletions(-)
>  create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>  create mode 100644 fs/btrfs/write-intent.c
>  create mode 100644 fs/btrfs/write-intent.h
>
> --
> 2.37.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
  2022-08-03  0:29 ` [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 me
@ 2022-08-03  0:58   ` Qu Wenruo
  2022-08-03  9:11     ` Goffredo Baroncelli
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2022-08-03  0:58 UTC (permalink / raw)
  To: me, Qu Wenruo; +Cc: linux-btrfs



On 2022/8/3 08:29, me@jse.io wrote:
> Hey there,
> It would be nice if a bitmap could be used for other purposes too, not
> just for RAID5/6. To not only improve resilver times without scrubbing
> the entire array,

That's already in the todo list.

> but also for just syncing up NOCOW files.

But I'm not sure if it can help for NOCOW files.

> Corruption
> sucks, and using NOCOW the user should assume you have no protection
> from csums. However, the problem with NOCOW right now is on any
> redundant profile, it seems once one copy goes out of sync with
> another, unless the user balances the affected chunks or cp
> --reflink=never the file, you can end up in a case where one copy may
> always be out of sync with the other. Then, the application reading
> the file can get different results depending on which disk it reads
> from.
>
> That seems worse than the corruption itself that can't be repaired due
> to no CSUMs, and is by far worse than MD.

The problem for NOCOW file is, it also implies NOCSUM, and even we can
detect mismatches in copies, it only really helps limited profiles like
RAID1C3, RAID1C4, and maybe RAID56.

The problem here is, for 2 mirrors profiles, even if we can detect the
mismatch, we have no idea which is correct.

We rely on csum to determine which copy is correct, but NOCOW/NOCSUM
makes it very hard to do.

For RAID1C3/C4 we can go democracy, but for RAID1C4 it's also possible
that we got two different content from all the mirrors.

If we really want to solve the NOCOW/NOCSUM problem, I guess full
journal is the only solution, and I'm pretty sure we will go that
direction anyway (after the write-intent part get merged, as the full
journal still relies on a lot of things from write-intent code).

>
> This is especially worrisome since a number of distros and userspace
> tools and distros these days are defaulting to using the NOCOW
> attribute for common applications. ie. Arch and mySQL, or libvirt.
>
> Is there any way this could be used for these purposes as well, in
> addition to the stated purpose for RAID5/6?

For now, the plan for future development on write-intent only includes:

- For degraded mount
- As basis for later full journal implement.

For your concern, it's in fact not related to write-intent, but in scrub
code.

In that part, I have some plan to rework the scrub interface completely,
and with the rework, it will have the ability to detect content
difference between mirrors, even without csum.

But that would take a longer time, and the main purpose is to improve
the RAID56 scrub perf, not really the problem you mentioned.

Thanks,
Qu

>
> On Mon, Jul 25, 2022 at 2:38 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [CHANGELOG]
>> v2->v1:
>> - Add mount time recovery functionality
>>    Now if a dirty bitmap is found, we will do proper recovery at mount
>>    time.
>>
>>    The code is using scrub routine to do the proper recovery for both
>>    data and P/Q parity.
>>
>>    Currently we can only test this by manually setting up a dirty bitmap,
>>    and corrupt the full stripe, then mounting it and verify the full
>>    stripe using "btrfs check --check-data-csum"
>>
>> - Skip full stripe writes
>>    Full stripe writes are either:
>>    * New writes into unallocated space
>>      After powerloss, we won't read any data from the full stripe.
>>
>>    * Writes into NODATACOW ranges
>>      We won't have csum for them anyway, thus new way to do any recovery.
>>
>> - Fix a memory leakage caused by RO mount
>>    Previously we only cleanup the write-intent ctrl if it's RW mount,
>>    thus for RO mount we will cause memory leak.
>>
>>
>> RFC->v1:
>> - Fix a corner case in write_intent_set_bits()
>>    If the range covers the last existing entry, but still needs a new
>>    entry, the old code will not insert the new entry, causing
>>    write_intent_clear_bits() to cause a warning.
>>
>> - Add selftests for the write intent bitmaps
>>    The write intent bitmaps is an sparse array of bitmaps.
>>    There are some corner cases tricky to get it done correctly in the
>>    first try (see above case).
>>    The test case would prevent such problems from happening again.
>>
>> - Fix hang with dev-replace, and better bitmaps bio submission
>>    Previously we will hold device_list_mutex while submitting the bitmaps
>>    bio, this can lead to deadlock with dev-replace/dev-removal.
>>    Fix it by using RCU to keep an local copy of devices and use them
>>    to submit the bitmaps bio.
>>
>>    Furthermore, there is no need to follow the way of superblocks
>>    writeback, as the content of bitmaps are always the same for all
>>    devices, we can just submitting the same page and use atomic counter
>>    to wait for them to finish.
>>
>>    Now there is no more crash/warning/deadlock in btrfs/070.
>>
>> [BACKGROUND]
>> Unlike md-raid, btrfs RAID56 has nothing to sync its devices when power
>> loss happens.
>>
>> For pure mirror based profiles it's fine as btrfs can utilize its csums
>> to find the correct mirror the repair the bad ones.
>>
>> But for RAID56, the repair itself needs the data from other devices,
>> thus any out-of-sync data can degrade the tolerance.
>>
>> Even worse, incorrect RMW can use the stale data to generate P/Q,
>> removing the possibility of recovery the data.
>>
>>
>> For md-raid, it goes with write-intent bitmap, to do faster resilver,
>> and goes journal (partial parity log for RAID5) to ensure it can even
>> stand a powerloss + device lose.
>>
>> [OBJECTIVE]
>>
>> This patchset will introduce a btrfs specific write-intent bitmap.
>>
>> The bitmap will locate at physical offset 1MiB of each device, and the
>> content is the same between all devices.
>>
>> When there is a RAID56 write (currently all RAID56 write, including full
>> stripe write), before submitting all the real bios to disks,
>> write-intent bitmap will be updated and flushed to all writeable
>> devices.
>>
>> So even if a powerloss happened, at the next mount time we know which
>> full stripes needs to check, and can start a scrub for those involved
>> logical bytenr ranges.
>>
>> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
>>
>> Since btrfs can utilize csum for its metadata and CoWed data, unlike
>> dm-bitmap which can only be used for faster re-silver, we can fully
>> rebuild the full stripe, as long as:
>>
>> 1) There is no missing device
>>     For missing device case, we still need to go full journal.
>>
>> 2) Untouched data stays untouched
>>     This should be mostly sane for sane hardware.
>>
>> And since the btrfs specific write-intent bitmaps are pretty small (4KiB
>> in size), the overhead much lower than full journal.
>>
>> In the future, we may allow users to choose between just bitmaps or full
>> journal to meet their requirement.
>>
>> [BITMAPS DESIGN]
>>
>> The bitmaps on-disk format looks like this:
>>
>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>   |<---------  super::size (4K) ------------->|
>>
>> Super block contains how many entires are in use.
>>
>> Each entry is 128 bits (16 bytes) in size, containing one u64 for
>> bytenr, and u64 for one bitmap.
>>
>> And all utilized entries will be sorted in their bytenr order, and no
>> bit can overlap.
>>
>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
>> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
>>
>> For the worst case, it can contain 14MiB dirty ranges.
>> (1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).
>>
>> For the best case, it can contain 896MiB dirty ranges.
>> (all bits set per bitmap)
>>
>> [WHY NOT BTRFS BTREE]
>>
>> Current write-intent structure needs two features:
>>
>> - Its data needs to survive cross stripe boundary
>>    Normally this means write-intent btree needs to acts like a proper
>>    tree root, has METADATA_ITEMs for all its tree blocks.
>>
>> - Its data update must be outside of a transaction
>>    Currently only log tree can do such thing.
>>    But unfortunately log tree can not survive across transaction
>>    boundary.
>>
>> Thus write-intent btree can only meet one of the requirement, not a
>> suitable solution here.
>>
>> [TESTING AND BENCHMARK]
>>
>> For performance benchmark, unfortunately I don't have 3 HDDs to test.
>> Will do the benchmark after secured enough hardware.
>>
>> For testing, it can survive volume/raid/dev-replace test groups, and no
>> write-intent bitmap leakage.
>>
>> Unfortunately there is still a warning triggered in btrfs/070, still
>> under investigation, hopefully to be a false alert in bitmap clearing
>> path.
>>
>> [TODO]
>> - Extra optimizations
>>    * Enlarge the window between btrfs_write_intent_mark_dirty() and
>>      btrfs_write_intent_writeback()
>>      So that we can merge more dirty bites and cause less bitmaps
>>      writeback
>>
>> - Proper performance benchmark
>>    Needs hardware/baremetal VMs, since I don't have any physical machine
>>    large enough to contian 3 3.5" HDDs.
>>
>>
>> Qu Wenruo (14):
>>    btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>>    btrfs: introduce a new experimental compat RO flag,
>>      WRITE_INTENT_BITMAP
>>    btrfs: introduce the on-disk format of btrfs write intent bitmaps
>>    btrfs: load/create write-intent bitmaps at mount time
>>    btrfs: write-intent: write the newly created bitmaps to all disks
>>    btrfs: write-intent: introduce an internal helper to set bits for a
>>      range.
>>    btrfs: write-intent: introduce an internal helper to clear bits for a
>>      range.
>>    btrfs: selftests: add selftests for write-intent bitmaps
>>    btrfs: write back write intent bitmap after barrier_all_devices()
>>    btrfs: update and writeback the write-intent bitmap for RAID56 write.
>>    btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
>>    btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
>>    btrfs: avoid recording full stripe write into write-intent bitmaps
>>    btrfs: scrub the full stripe which had sub-stripe write at mount time
>>
>>   fs/btrfs/Makefile                           |   5 +-
>>   fs/btrfs/ctree.h                            |  26 +-
>>   fs/btrfs/disk-io.c                          |  58 +-
>>   fs/btrfs/raid56.c                           |  27 +
>>   fs/btrfs/scrub.c                            | 177 +++-
>>   fs/btrfs/sysfs.c                            |   2 +
>>   fs/btrfs/tests/btrfs-tests.c                |   4 +
>>   fs/btrfs/tests/btrfs-tests.h                |   2 +
>>   fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>>   fs/btrfs/volumes.c                          |  34 +-
>>   fs/btrfs/write-intent.c                     | 923 ++++++++++++++++++++
>>   fs/btrfs/write-intent.h                     | 303 +++++++
>>   fs/btrfs/zoned.c                            |   8 +
>>   include/uapi/linux/btrfs.h                  |  17 +
>>   14 files changed, 1812 insertions(+), 21 deletions(-)
>>   create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>>   create mode 100644 fs/btrfs/write-intent.c
>>   create mode 100644 fs/btrfs/write-intent.h
>>
>> --
>> 2.37.0
>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (14 preceding siblings ...)
  2022-08-03  0:29 ` [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 me
@ 2022-08-03  8:48 ` Goffredo Baroncelli
  2022-08-03  9:52   ` Qu Wenruo
  15 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-08-03  8:48 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi Qu,
one minor tip in the cover letter

On 25/07/2022 07.37, Qu Wenruo wrote:
> [CHANGELOG]
> v2->v1:

[...]

> 
> When there is a RAID56 write (currently all RAID56 write, including full
> stripe write), before submitting all the real bios to disks,
> write-intent bitmap will be updated and flushed to all writeable
> devices.

I think that the previous statement is not fully correct; now (with patch #13)
the fully stripe is not logged anymore in the intent bitmap.

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
  2022-08-03  0:58   ` Qu Wenruo
@ 2022-08-03  9:11     ` Goffredo Baroncelli
  2022-08-03  9:39       ` Qu Wenruo
  2022-08-03 21:00       ` me
  0 siblings, 2 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-08-03  9:11 UTC (permalink / raw)
  To: Qu Wenruo, me, Qu Wenruo; +Cc: linux-btrfs

On 03/08/2022 02.58, Qu Wenruo wrote:
> 
> 
> On 2022/8/3 08:29, me@jse.io wrote:
>> Hey there,
[...]
>>
>> That seems worse than the corruption itself that can't be repaired due
>> to no CSUMs, and is by far worse than MD.
> 
> The problem for NOCOW file is, it also implies NOCSUM, and even we can
> detect mismatches in copies, it only really helps limited profiles like
> RAID1C3, RAID1C4, and maybe RAID56.
> 
> The problem here is, for 2 mirrors profiles, even if we can detect the
> mismatch, we have no idea which is correct.
> 
> We rely on csum to determine which copy is correct, but NOCOW/NOCSUM
> makes it very hard to do.
> 
> For RAID1C3/C4 we can go democracy, but for RAID1C4 it's also possible
> that we got two different content from all the mirrors.

The same is true for all the "even" redundancy. I.e. even for raid6
you can have two different matches: one with P and one with Q.

Anyway I think that the point of 'me' is to not have two different
data depending by the read path. To this, it is preferable to have always
the same data, even if it is wrong.

What does scrub when it works on a raid4c (or raid 5/6) with a nocow
data and each block is different ?

If this can "solve" the problem of 'me' we could discuss to refine
the logic to skip only under the conditions "full stripe" **and**
"cow data".

This is a problem that we need to discuss even with a
full journal (where still have the option to skip a full stripe).

BR
G.Baroncelli

> 
> If we really want to solve the NOCOW/NOCSUM problem, I guess full
> journal is the only solution, and I'm pretty sure we will go that
> direction anyway (after the write-intent part get merged, as the full
> journal still relies on a lot of things from write-intent code).
> 
>>
>> This is especially worrisome since a number of distros and userspace
>> tools and distros these days are defaulting to using the NOCOW
>> attribute for common applications. ie. Arch and mySQL, or libvirt.
>>
>> Is there any way this could be used for these purposes as well, in
>> addition to the stated purpose for RAID5/6?
> 
> For now, the plan for future development on write-intent only includes:
> 
> - For degraded mount
> - As basis for later full journal implement.
> 
> For your concern, it's in fact not related to write-intent, but in scrub
> code.
> 
> In that part, I have some plan to rework the scrub interface completely,
> and with the rework, it will have the ability to detect content
> difference between mirrors, even without csum.
> 
> But that would take a longer time, and the main purpose is to improve
> the RAID56 scrub perf, not really the problem you mentioned.
> 
> Thanks,
> Qu
> 
>>
>> On Mon, Jul 25, 2022 at 2:38 AM Qu Wenruo <wqu@suse.com> wrote:
>>>
>>> [CHANGELOG]
>>> v2->v1:
>>> - Add mount time recovery functionality
>>>    Now if a dirty bitmap is found, we will do proper recovery at mount
>>>    time.
>>>
>>>    The code is using scrub routine to do the proper recovery for both
>>>    data and P/Q parity.
>>>
>>>    Currently we can only test this by manually setting up a dirty bitmap,
>>>    and corrupt the full stripe, then mounting it and verify the full
>>>    stripe using "btrfs check --check-data-csum"
>>>
>>> - Skip full stripe writes
>>>    Full stripe writes are either:
>>>    * New writes into unallocated space
>>>      After powerloss, we won't read any data from the full stripe.
>>>
>>>    * Writes into NODATACOW ranges
>>>      We won't have csum for them anyway, thus new way to do any recovery.
>>>
>>> - Fix a memory leakage caused by RO mount
>>>    Previously we only cleanup the write-intent ctrl if it's RW mount,
>>>    thus for RO mount we will cause memory leak.
>>>
>>>
>>> RFC->v1:
>>> - Fix a corner case in write_intent_set_bits()
>>>    If the range covers the last existing entry, but still needs a new
>>>    entry, the old code will not insert the new entry, causing
>>>    write_intent_clear_bits() to cause a warning.
>>>
>>> - Add selftests for the write intent bitmaps
>>>    The write intent bitmaps is an sparse array of bitmaps.
>>>    There are some corner cases tricky to get it done correctly in the
>>>    first try (see above case).
>>>    The test case would prevent such problems from happening again.
>>>
>>> - Fix hang with dev-replace, and better bitmaps bio submission
>>>    Previously we will hold device_list_mutex while submitting the bitmaps
>>>    bio, this can lead to deadlock with dev-replace/dev-removal.
>>>    Fix it by using RCU to keep an local copy of devices and use them
>>>    to submit the bitmaps bio.
>>>
>>>    Furthermore, there is no need to follow the way of superblocks
>>>    writeback, as the content of bitmaps are always the same for all
>>>    devices, we can just submitting the same page and use atomic counter
>>>    to wait for them to finish.
>>>
>>>    Now there is no more crash/warning/deadlock in btrfs/070.
>>>
>>> [BACKGROUND]
>>> Unlike md-raid, btrfs RAID56 has nothing to sync its devices when power
>>> loss happens.
>>>
>>> For pure mirror based profiles it's fine as btrfs can utilize its csums
>>> to find the correct mirror the repair the bad ones.
>>>
>>> But for RAID56, the repair itself needs the data from other devices,
>>> thus any out-of-sync data can degrade the tolerance.
>>>
>>> Even worse, incorrect RMW can use the stale data to generate P/Q,
>>> removing the possibility of recovery the data.
>>>
>>>
>>> For md-raid, it goes with write-intent bitmap, to do faster resilver,
>>> and goes journal (partial parity log for RAID5) to ensure it can even
>>> stand a powerloss + device lose.
>>>
>>> [OBJECTIVE]
>>>
>>> This patchset will introduce a btrfs specific write-intent bitmap.
>>>
>>> The bitmap will locate at physical offset 1MiB of each device, and the
>>> content is the same between all devices.
>>>
>>> When there is a RAID56 write (currently all RAID56 write, including full
>>> stripe write), before submitting all the real bios to disks,
>>> write-intent bitmap will be updated and flushed to all writeable
>>> devices.
>>>
>>> So even if a powerloss happened, at the next mount time we know which
>>> full stripes needs to check, and can start a scrub for those involved
>>> logical bytenr ranges.
>>>
>>> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
>>>
>>> Since btrfs can utilize csum for its metadata and CoWed data, unlike
>>> dm-bitmap which can only be used for faster re-silver, we can fully
>>> rebuild the full stripe, as long as:
>>>
>>> 1) There is no missing device
>>>     For missing device case, we still need to go full journal.
>>>
>>> 2) Untouched data stays untouched
>>>     This should be mostly sane for sane hardware.
>>>
>>> And since the btrfs specific write-intent bitmaps are pretty small (4KiB
>>> in size), the overhead much lower than full journal.
>>>
>>> In the future, we may allow users to choose between just bitmaps or full
>>> journal to meet their requirement.
>>>
>>> [BITMAPS DESIGN]
>>>
>>> The bitmaps on-disk format looks like this:
>>>
>>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>>   |<---------  super::size (4K) ------------->|
>>>
>>> Super block contains how many entires are in use.
>>>
>>> Each entry is 128 bits (16 bytes) in size, containing one u64 for
>>> bytenr, and u64 for one bitmap.
>>>
>>> And all utilized entries will be sorted in their bytenr order, and no
>>> bit can overlap.
>>>
>>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
>>> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
>>>
>>> For the worst case, it can contain 14MiB dirty ranges.
>>> (1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).
>>>
>>> For the best case, it can contain 896MiB dirty ranges.
>>> (all bits set per bitmap)
>>>
>>> [WHY NOT BTRFS BTREE]
>>>
>>> Current write-intent structure needs two features:
>>>
>>> - Its data needs to survive cross stripe boundary
>>>    Normally this means write-intent btree needs to acts like a proper
>>>    tree root, has METADATA_ITEMs for all its tree blocks.
>>>
>>> - Its data update must be outside of a transaction
>>>    Currently only log tree can do such thing.
>>>    But unfortunately log tree can not survive across transaction
>>>    boundary.
>>>
>>> Thus write-intent btree can only meet one of the requirement, not a
>>> suitable solution here.
>>>
>>> [TESTING AND BENCHMARK]
>>>
>>> For performance benchmark, unfortunately I don't have 3 HDDs to test.
>>> Will do the benchmark after secured enough hardware.
>>>
>>> For testing, it can survive volume/raid/dev-replace test groups, and no
>>> write-intent bitmap leakage.
>>>
>>> Unfortunately there is still a warning triggered in btrfs/070, still
>>> under investigation, hopefully to be a false alert in bitmap clearing
>>> path.
>>>
>>> [TODO]
>>> - Extra optimizations
>>>    * Enlarge the window between btrfs_write_intent_mark_dirty() and
>>>      btrfs_write_intent_writeback()
>>>      So that we can merge more dirty bites and cause less bitmaps
>>>      writeback
>>>
>>> - Proper performance benchmark
>>>    Needs hardware/baremetal VMs, since I don't have any physical machine
>>>    large enough to contian 3 3.5" HDDs.
>>>
>>>
>>> Qu Wenruo (14):
>>>    btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>>>    btrfs: introduce a new experimental compat RO flag,
>>>      WRITE_INTENT_BITMAP
>>>    btrfs: introduce the on-disk format of btrfs write intent bitmaps
>>>    btrfs: load/create write-intent bitmaps at mount time
>>>    btrfs: write-intent: write the newly created bitmaps to all disks
>>>    btrfs: write-intent: introduce an internal helper to set bits for a
>>>      range.
>>>    btrfs: write-intent: introduce an internal helper to clear bits for a
>>>      range.
>>>    btrfs: selftests: add selftests for write-intent bitmaps
>>>    btrfs: write back write intent bitmap after barrier_all_devices()
>>>    btrfs: update and writeback the write-intent bitmap for RAID56 write.
>>>    btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
>>>    btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
>>>    btrfs: avoid recording full stripe write into write-intent bitmaps
>>>    btrfs: scrub the full stripe which had sub-stripe write at mount time
>>>
>>>   fs/btrfs/Makefile                           |   5 +-
>>>   fs/btrfs/ctree.h                            |  26 +-
>>>   fs/btrfs/disk-io.c                          |  58 +-
>>>   fs/btrfs/raid56.c                           |  27 +
>>>   fs/btrfs/scrub.c                            | 177 +++-
>>>   fs/btrfs/sysfs.c                            |   2 +
>>>   fs/btrfs/tests/btrfs-tests.c                |   4 +
>>>   fs/btrfs/tests/btrfs-tests.h                |   2 +
>>>   fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>>>   fs/btrfs/volumes.c                          |  34 +-
>>>   fs/btrfs/write-intent.c                     | 923 ++++++++++++++++++++
>>>   fs/btrfs/write-intent.h                     | 303 +++++++
>>>   fs/btrfs/zoned.c                            |   8 +
>>>   include/uapi/linux/btrfs.h                  |  17 +
>>>   14 files changed, 1812 insertions(+), 21 deletions(-)
>>>   create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>>>   create mode 100644 fs/btrfs/write-intent.c
>>>   create mode 100644 fs/btrfs/write-intent.h
>>>
>>> -- 
>>> 2.37.0
>>>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
  2022-08-03  9:11     ` Goffredo Baroncelli
@ 2022-08-03  9:39       ` Qu Wenruo
  2022-08-03 21:00       ` me
  1 sibling, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-08-03  9:39 UTC (permalink / raw)
  To: kreijack, me, Qu Wenruo; +Cc: linux-btrfs



On 2022/8/3 17:11, Goffredo Baroncelli wrote:
> On 03/08/2022 02.58, Qu Wenruo wrote:
>>
>>
>> On 2022/8/3 08:29, me@jse.io wrote:
>>> Hey there,
> [...]
>>>
>>> That seems worse than the corruption itself that can't be repaired due
>>> to no CSUMs, and is by far worse than MD.
>>
>> The problem for NOCOW file is, it also implies NOCSUM, and even we can
>> detect mismatches in copies, it only really helps limited profiles like
>> RAID1C3, RAID1C4, and maybe RAID56.
>>
>> The problem here is, for 2 mirrors profiles, even if we can detect the
>> mismatch, we have no idea which is correct.
>>
>> We rely on csum to determine which copy is correct, but NOCOW/NOCSUM
>> makes it very hard to do.
>>
>> For RAID1C3/C4 we can go democracy, but for RAID1C4 it's also possible
>> that we got two different content from all the mirrors.
>
> The same is true for all the "even" redundancy. I.e. even for raid6
> you can have two different matches: one with P and one with Q.
>
> Anyway I think that the point of 'me' is to not have two different
> data depending by the read path. To this, it is preferable to have always
> the same data, even if it is wrong.
>
> What does scrub when it works on a raid4c (or raid 5/6) with a nocow
> data and each block is different ?

The current scrub only read the data, furthermore current scrub only
cares the device it's scrubbing, thus unless hit a read error or csum
mismatch, it won't even try the remaining copies.

Thus for NOCSUM data, as long as it can be read, scrub won't bother it.

Thanks,
Qu
>
> If this can "solve" the problem of 'me' we could discuss to refine
> the logic to skip only under the conditions "full stripe" **and**
> "cow data".
>
> This is a problem that we need to discuss even with a
> full journal (where still have the option to skip a full stripe).
>
> BR
> G.Baroncelli
>
>>
>> If we really want to solve the NOCOW/NOCSUM problem, I guess full
>> journal is the only solution, and I'm pretty sure we will go that
>> direction anyway (after the write-intent part get merged, as the full
>> journal still relies on a lot of things from write-intent code).
>>
>>>
>>> This is especially worrisome since a number of distros and userspace
>>> tools and distros these days are defaulting to using the NOCOW
>>> attribute for common applications. ie. Arch and mySQL, or libvirt.
>>>
>>> Is there any way this could be used for these purposes as well, in
>>> addition to the stated purpose for RAID5/6?
>>
>> For now, the plan for future development on write-intent only includes:
>>
>> - For degraded mount
>> - As basis for later full journal implement.
>>
>> For your concern, it's in fact not related to write-intent, but in scrub
>> code.
>>
>> In that part, I have some plan to rework the scrub interface completely,
>> and with the rework, it will have the ability to detect content
>> difference between mirrors, even without csum.
>>
>> But that would take a longer time, and the main purpose is to improve
>> the RAID56 scrub perf, not really the problem you mentioned.
>>
>> Thanks,
>> Qu
>>
>>>
>>> On Mon, Jul 25, 2022 at 2:38 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> [CHANGELOG]
>>>> v2->v1:
>>>> - Add mount time recovery functionality
>>>>    Now if a dirty bitmap is found, we will do proper recovery at mount
>>>>    time.
>>>>
>>>>    The code is using scrub routine to do the proper recovery for both
>>>>    data and P/Q parity.
>>>>
>>>>    Currently we can only test this by manually setting up a dirty
>>>> bitmap,
>>>>    and corrupt the full stripe, then mounting it and verify the full
>>>>    stripe using "btrfs check --check-data-csum"
>>>>
>>>> - Skip full stripe writes
>>>>    Full stripe writes are either:
>>>>    * New writes into unallocated space
>>>>      After powerloss, we won't read any data from the full stripe.
>>>>
>>>>    * Writes into NODATACOW ranges
>>>>      We won't have csum for them anyway, thus new way to do any
>>>> recovery.
>>>>
>>>> - Fix a memory leakage caused by RO mount
>>>>    Previously we only cleanup the write-intent ctrl if it's RW mount,
>>>>    thus for RO mount we will cause memory leak.
>>>>
>>>>
>>>> RFC->v1:
>>>> - Fix a corner case in write_intent_set_bits()
>>>>    If the range covers the last existing entry, but still needs a new
>>>>    entry, the old code will not insert the new entry, causing
>>>>    write_intent_clear_bits() to cause a warning.
>>>>
>>>> - Add selftests for the write intent bitmaps
>>>>    The write intent bitmaps is an sparse array of bitmaps.
>>>>    There are some corner cases tricky to get it done correctly in the
>>>>    first try (see above case).
>>>>    The test case would prevent such problems from happening again.
>>>>
>>>> - Fix hang with dev-replace, and better bitmaps bio submission
>>>>    Previously we will hold device_list_mutex while submitting the
>>>> bitmaps
>>>>    bio, this can lead to deadlock with dev-replace/dev-removal.
>>>>    Fix it by using RCU to keep an local copy of devices and use them
>>>>    to submit the bitmaps bio.
>>>>
>>>>    Furthermore, there is no need to follow the way of superblocks
>>>>    writeback, as the content of bitmaps are always the same for all
>>>>    devices, we can just submitting the same page and use atomic counter
>>>>    to wait for them to finish.
>>>>
>>>>    Now there is no more crash/warning/deadlock in btrfs/070.
>>>>
>>>> [BACKGROUND]
>>>> Unlike md-raid, btrfs RAID56 has nothing to sync its devices when power
>>>> loss happens.
>>>>
>>>> For pure mirror based profiles it's fine as btrfs can utilize its csums
>>>> to find the correct mirror the repair the bad ones.
>>>>
>>>> But for RAID56, the repair itself needs the data from other devices,
>>>> thus any out-of-sync data can degrade the tolerance.
>>>>
>>>> Even worse, incorrect RMW can use the stale data to generate P/Q,
>>>> removing the possibility of recovery the data.
>>>>
>>>>
>>>> For md-raid, it goes with write-intent bitmap, to do faster resilver,
>>>> and goes journal (partial parity log for RAID5) to ensure it can even
>>>> stand a powerloss + device lose.
>>>>
>>>> [OBJECTIVE]
>>>>
>>>> This patchset will introduce a btrfs specific write-intent bitmap.
>>>>
>>>> The bitmap will locate at physical offset 1MiB of each device, and the
>>>> content is the same between all devices.
>>>>
>>>> When there is a RAID56 write (currently all RAID56 write, including
>>>> full
>>>> stripe write), before submitting all the real bios to disks,
>>>> write-intent bitmap will be updated and flushed to all writeable
>>>> devices.
>>>>
>>>> So even if a powerloss happened, at the next mount time we know which
>>>> full stripes needs to check, and can start a scrub for those involved
>>>> logical bytenr ranges.
>>>>
>>>> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
>>>>
>>>> Since btrfs can utilize csum for its metadata and CoWed data, unlike
>>>> dm-bitmap which can only be used for faster re-silver, we can fully
>>>> rebuild the full stripe, as long as:
>>>>
>>>> 1) There is no missing device
>>>>     For missing device case, we still need to go full journal.
>>>>
>>>> 2) Untouched data stays untouched
>>>>     This should be mostly sane for sane hardware.
>>>>
>>>> And since the btrfs specific write-intent bitmaps are pretty small
>>>> (4KiB
>>>> in size), the overhead much lower than full journal.
>>>>
>>>> In the future, we may allow users to choose between just bitmaps or
>>>> full
>>>> journal to meet their requirement.
>>>>
>>>> [BITMAPS DESIGN]
>>>>
>>>> The bitmaps on-disk format looks like this:
>>>>
>>>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>>>   |<---------  super::size (4K) ------------->|
>>>>
>>>> Super block contains how many entires are in use.
>>>>
>>>> Each entry is 128 bits (16 bytes) in size, containing one u64 for
>>>> bytenr, and u64 for one bitmap.
>>>>
>>>> And all utilized entries will be sorted in their bytenr order, and no
>>>> bit can overlap.
>>>>
>>>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
>>>> can contain at most 4MiB, and the whole bitmaps can contain 224
>>>> entries.
>>>>
>>>> For the worst case, it can contain 14MiB dirty ranges.
>>>> (1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).
>>>>
>>>> For the best case, it can contain 896MiB dirty ranges.
>>>> (all bits set per bitmap)
>>>>
>>>> [WHY NOT BTRFS BTREE]
>>>>
>>>> Current write-intent structure needs two features:
>>>>
>>>> - Its data needs to survive cross stripe boundary
>>>>    Normally this means write-intent btree needs to acts like a proper
>>>>    tree root, has METADATA_ITEMs for all its tree blocks.
>>>>
>>>> - Its data update must be outside of a transaction
>>>>    Currently only log tree can do such thing.
>>>>    But unfortunately log tree can not survive across transaction
>>>>    boundary.
>>>>
>>>> Thus write-intent btree can only meet one of the requirement, not a
>>>> suitable solution here.
>>>>
>>>> [TESTING AND BENCHMARK]
>>>>
>>>> For performance benchmark, unfortunately I don't have 3 HDDs to test.
>>>> Will do the benchmark after secured enough hardware.
>>>>
>>>> For testing, it can survive volume/raid/dev-replace test groups, and no
>>>> write-intent bitmap leakage.
>>>>
>>>> Unfortunately there is still a warning triggered in btrfs/070, still
>>>> under investigation, hopefully to be a false alert in bitmap clearing
>>>> path.
>>>>
>>>> [TODO]
>>>> - Extra optimizations
>>>>    * Enlarge the window between btrfs_write_intent_mark_dirty() and
>>>>      btrfs_write_intent_writeback()
>>>>      So that we can merge more dirty bites and cause less bitmaps
>>>>      writeback
>>>>
>>>> - Proper performance benchmark
>>>>    Needs hardware/baremetal VMs, since I don't have any physical
>>>> machine
>>>>    large enough to contian 3 3.5" HDDs.
>>>>
>>>>
>>>> Qu Wenruo (14):
>>>>    btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>>>>    btrfs: introduce a new experimental compat RO flag,
>>>>      WRITE_INTENT_BITMAP
>>>>    btrfs: introduce the on-disk format of btrfs write intent bitmaps
>>>>    btrfs: load/create write-intent bitmaps at mount time
>>>>    btrfs: write-intent: write the newly created bitmaps to all disks
>>>>    btrfs: write-intent: introduce an internal helper to set bits for a
>>>>      range.
>>>>    btrfs: write-intent: introduce an internal helper to clear bits
>>>> for a
>>>>      range.
>>>>    btrfs: selftests: add selftests for write-intent bitmaps
>>>>    btrfs: write back write intent bitmap after barrier_all_devices()
>>>>    btrfs: update and writeback the write-intent bitmap for RAID56
>>>> write.
>>>>    btrfs: raid56: clear write-intent bimaps when a full stripe
>>>> finishes.
>>>>    btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
>>>>    btrfs: avoid recording full stripe write into write-intent bitmaps
>>>>    btrfs: scrub the full stripe which had sub-stripe write at mount
>>>> time
>>>>
>>>>   fs/btrfs/Makefile                           |   5 +-
>>>>   fs/btrfs/ctree.h                            |  26 +-
>>>>   fs/btrfs/disk-io.c                          |  58 +-
>>>>   fs/btrfs/raid56.c                           |  27 +
>>>>   fs/btrfs/scrub.c                            | 177 +++-
>>>>   fs/btrfs/sysfs.c                            |   2 +
>>>>   fs/btrfs/tests/btrfs-tests.c                |   4 +
>>>>   fs/btrfs/tests/btrfs-tests.h                |   2 +
>>>>   fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>>>>   fs/btrfs/volumes.c                          |  34 +-
>>>>   fs/btrfs/write-intent.c                     | 923
>>>> ++++++++++++++++++++
>>>>   fs/btrfs/write-intent.h                     | 303 +++++++
>>>>   fs/btrfs/zoned.c                            |   8 +
>>>>   include/uapi/linux/btrfs.h                  |  17 +
>>>>   14 files changed, 1812 insertions(+), 21 deletions(-)
>>>>   create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>>>>   create mode 100644 fs/btrfs/write-intent.c
>>>>   create mode 100644 fs/btrfs/write-intent.h
>>>>
>>>> --
>>>> 2.37.0
>>>>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
  2022-08-03  8:48 ` Goffredo Baroncelli
@ 2022-08-03  9:52   ` Qu Wenruo
  0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2022-08-03  9:52 UTC (permalink / raw)
  To: kreijack, Qu Wenruo, linux-btrfs



On 2022/8/3 16:48, Goffredo Baroncelli wrote:
> Hi Qu,
> one minor tip in the cover letter
>
> On 25/07/2022 07.37, Qu Wenruo wrote:
>> [CHANGELOG]
>> v2->v1:
>
> [...]
>
>>
>> When there is a RAID56 write (currently all RAID56 write, including full
>> stripe write), before submitting all the real bios to disks,
>> write-intent bitmap will be updated and flushed to all writeable
>> devices.
>
> I think that the previous statement is not fully correct; now (with
> patch #13)
> the fully stripe is not logged anymore in the intent bitmap.

Oh right, forgot to update the cover letter.

Thanks for catching this out-of-date paragraph,
Qu
>
> [...]
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56
  2022-08-03  9:11     ` Goffredo Baroncelli
  2022-08-03  9:39       ` Qu Wenruo
@ 2022-08-03 21:00       ` me
  1 sibling, 0 replies; 22+ messages in thread
From: me @ 2022-08-03 21:00 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Aug 3, 2022 at 6:11 AM Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 03/08/2022 02.58, Qu Wenruo wrote:
> >
> >
> > On 2022/8/3 08:29, me@jse.io wrote:
> >> Hey there,
> [...]
> >>
> >> That seems worse than the corruption itself that can't be repaired due
> >> to no CSUMs, and is by far worse than MD.
> >
> > The problem for NOCOW file is, it also implies NOCSUM, and even we can
> > detect mismatches in copies, it only really helps limited profiles like
> > RAID1C3, RAID1C4, and maybe RAID56.
> >
> > The problem here is, for 2 mirrors profiles, even if we can detect the
> > mismatch, we have no idea which is correct.
> >
> > We rely on csum to determine which copy is correct, but NOCOW/NOCSUM
> > makes it very hard to do.
> >
> > For RAID1C3/C4 we can go democracy, but for RAID1C4 it's also possible
> > that we got two different content from all the mirrors.
>
> The same is true for all the "even" redundancy. I.e. even for raid6
> you can have two different matches: one with P and one with Q.
>
> Anyway I think that the point of 'me' is to not have two different
> data depending by the read path. To this, it is preferable to have always
> the same data, even if it is wrong.

You're exactly right. As it stands now, even if the corruption is read
by the application and repaired/worked around, depending on how the
application works, now you may have corruption on your other copy
until you manually intervene. If the user isn't aware of this
situation, then they'll just assume that "btrfs is unstable" and not
really understand why their VM or database keeps corrupting at
'random'. Btrfs doesn't really provide a way to easily identify when
you run into this situation anyway. I'm coming from the standpoint of
a user who just goes with defaults, given libvirt and distros are
defaulting the attribute for certain applications.

It would be preferable to be able to sync it from the get go through
the use of bitmaps, knowing that Btrfs cannot determine which copy is
correct, but neither can MD. It just knows the range on a specific
disk is 'dirty'. All we should expect of Btrfs in the NOCOW case is it
should at least produce similar behavior to how MD would work with a
traditional overwriting filesystem on it. It would sync the range in
question thanks to its bitmap.

>
> What does scrub when it works on a raid4c (or raid 5/6) with a nocow
> data and each block is different ?
>
> If this can "solve" the problem of 'me' we could discuss to refine
> the logic to skip only under the conditions "full stripe" **and**
> "cow data".
>
> This is a problem that we need to discuss even with a
> full journal (where still have the option to skip a full stripe).
>
> BR
> G.Baroncelli
>
> >
> > If we really want to solve the NOCOW/NOCSUM problem, I guess full
> > journal is the only solution, and I'm pretty sure we will go that
> > direction anyway (after the write-intent part get merged, as the full
> > journal still relies on a lot of things from write-intent code).
> >
> >>
> >> This is especially worrisome since a number of distros and userspace
> >> tools and distros these days are defaulting to using the NOCOW
> >> attribute for common applications. ie. Arch and mySQL, or libvirt.
> >>
> >> Is there any way this could be used for these purposes as well, in
> >> addition to the stated purpose for RAID5/6?
> >
> > For now, the plan for future development on write-intent only includes:
> >
> > - For degraded mount
> > - As basis for later full journal implement.
> >
> > For your concern, it's in fact not related to write-intent, but in scrub
> > code.
> >
> > In that part, I have some plan to rework the scrub interface completely,
> > and with the rework, it will have the ability to detect content
> > difference between mirrors, even without csum.
> >
> > But that would take a longer time, and the main purpose is to improve
> > the RAID56 scrub perf, not really the problem you mentioned.
> >
> > Thanks,
> > Qu
> >
> >>
> >> On Mon, Jul 25, 2022 at 2:38 AM Qu Wenruo <wqu@suse.com> wrote:
> >>>
> >>> [CHANGELOG]
> >>> v2->v1:
> >>> - Add mount time recovery functionality
> >>>    Now if a dirty bitmap is found, we will do proper recovery at mount
> >>>    time.
> >>>
> >>>    The code is using scrub routine to do the proper recovery for both
> >>>    data and P/Q parity.
> >>>
> >>>    Currently we can only test this by manually setting up a dirty bitmap,
> >>>    and corrupt the full stripe, then mounting it and verify the full
> >>>    stripe using "btrfs check --check-data-csum"
> >>>
> >>> - Skip full stripe writes
> >>>    Full stripe writes are either:
> >>>    * New writes into unallocated space
> >>>      After powerloss, we won't read any data from the full stripe.
> >>>
> >>>    * Writes into NODATACOW ranges
> >>>      We won't have csum for them anyway, thus new way to do any recovery.
> >>>
> >>> - Fix a memory leakage caused by RO mount
> >>>    Previously we only cleanup the write-intent ctrl if it's RW mount,
> >>>    thus for RO mount we will cause memory leak.
> >>>
> >>>
> >>> RFC->v1:
> >>> - Fix a corner case in write_intent_set_bits()
> >>>    If the range covers the last existing entry, but still needs a new
> >>>    entry, the old code will not insert the new entry, causing
> >>>    write_intent_clear_bits() to cause a warning.
> >>>
> >>> - Add selftests for the write intent bitmaps
> >>>    The write intent bitmaps is an sparse array of bitmaps.
> >>>    There are some corner cases tricky to get it done correctly in the
> >>>    first try (see above case).
> >>>    The test case would prevent such problems from happening again.
> >>>
> >>> - Fix hang with dev-replace, and better bitmaps bio submission
> >>>    Previously we will hold device_list_mutex while submitting the bitmaps
> >>>    bio, this can lead to deadlock with dev-replace/dev-removal.
> >>>    Fix it by using RCU to keep an local copy of devices and use them
> >>>    to submit the bitmaps bio.
> >>>
> >>>    Furthermore, there is no need to follow the way of superblocks
> >>>    writeback, as the content of bitmaps are always the same for all
> >>>    devices, we can just submitting the same page and use atomic counter
> >>>    to wait for them to finish.
> >>>
> >>>    Now there is no more crash/warning/deadlock in btrfs/070.
> >>>
> >>> [BACKGROUND]
> >>> Unlike md-raid, btrfs RAID56 has nothing to sync its devices when power
> >>> loss happens.
> >>>
> >>> For pure mirror based profiles it's fine as btrfs can utilize its csums
> >>> to find the correct mirror the repair the bad ones.
> >>>
> >>> But for RAID56, the repair itself needs the data from other devices,
> >>> thus any out-of-sync data can degrade the tolerance.
> >>>
> >>> Even worse, incorrect RMW can use the stale data to generate P/Q,
> >>> removing the possibility of recovery the data.
> >>>
> >>>
> >>> For md-raid, it goes with write-intent bitmap, to do faster resilver,
> >>> and goes journal (partial parity log for RAID5) to ensure it can even
> >>> stand a powerloss + device lose.
> >>>
> >>> [OBJECTIVE]
> >>>
> >>> This patchset will introduce a btrfs specific write-intent bitmap.
> >>>
> >>> The bitmap will locate at physical offset 1MiB of each device, and the
> >>> content is the same between all devices.
> >>>
> >>> When there is a RAID56 write (currently all RAID56 write, including full
> >>> stripe write), before submitting all the real bios to disks,
> >>> write-intent bitmap will be updated and flushed to all writeable
> >>> devices.
> >>>
> >>> So even if a powerloss happened, at the next mount time we know which
> >>> full stripes needs to check, and can start a scrub for those involved
> >>> logical bytenr ranges.
> >>>
> >>> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
> >>>
> >>> Since btrfs can utilize csum for its metadata and CoWed data, unlike
> >>> dm-bitmap which can only be used for faster re-silver, we can fully
> >>> rebuild the full stripe, as long as:
> >>>
> >>> 1) There is no missing device
> >>>     For missing device case, we still need to go full journal.
> >>>
> >>> 2) Untouched data stays untouched
> >>>     This should be mostly sane for sane hardware.
> >>>
> >>> And since the btrfs specific write-intent bitmaps are pretty small (4KiB
> >>> in size), the overhead much lower than full journal.
> >>>
> >>> In the future, we may allow users to choose between just bitmaps or full
> >>> journal to meet their requirement.
> >>>
> >>> [BITMAPS DESIGN]
> >>>
> >>> The bitmaps on-disk format looks like this:
> >>>
> >>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
> >>>   |<---------  super::size (4K) ------------->|
> >>>
> >>> Super block contains how many entires are in use.
> >>>
> >>> Each entry is 128 bits (16 bytes) in size, containing one u64 for
> >>> bytenr, and u64 for one bitmap.
> >>>
> >>> And all utilized entries will be sorted in their bytenr order, and no
> >>> bit can overlap.
> >>>
> >>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
> >>> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
> >>>
> >>> For the worst case, it can contain 14MiB dirty ranges.
> >>> (1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).
> >>>
> >>> For the best case, it can contain 896MiB dirty ranges.
> >>> (all bits set per bitmap)
> >>>
> >>> [WHY NOT BTRFS BTREE]
> >>>
> >>> Current write-intent structure needs two features:
> >>>
> >>> - Its data needs to survive cross stripe boundary
> >>>    Normally this means write-intent btree needs to acts like a proper
> >>>    tree root, has METADATA_ITEMs for all its tree blocks.
> >>>
> >>> - Its data update must be outside of a transaction
> >>>    Currently only log tree can do such thing.
> >>>    But unfortunately log tree can not survive across transaction
> >>>    boundary.
> >>>
> >>> Thus write-intent btree can only meet one of the requirement, not a
> >>> suitable solution here.
> >>>
> >>> [TESTING AND BENCHMARK]
> >>>
> >>> For performance benchmark, unfortunately I don't have 3 HDDs to test.
> >>> Will do the benchmark after secured enough hardware.
> >>>
> >>> For testing, it can survive volume/raid/dev-replace test groups, and no
> >>> write-intent bitmap leakage.
> >>>
> >>> Unfortunately there is still a warning triggered in btrfs/070, still
> >>> under investigation, hopefully to be a false alert in bitmap clearing
> >>> path.
> >>>
> >>> [TODO]
> >>> - Extra optimizations
> >>>    * Enlarge the window between btrfs_write_intent_mark_dirty() and
> >>>      btrfs_write_intent_writeback()
> >>>      So that we can merge more dirty bites and cause less bitmaps
> >>>      writeback
> >>>
> >>> - Proper performance benchmark
> >>>    Needs hardware/baremetal VMs, since I don't have any physical machine
> >>>    large enough to contian 3 3.5" HDDs.
> >>>
> >>>
> >>> Qu Wenruo (14):
> >>>    btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
> >>>    btrfs: introduce a new experimental compat RO flag,
> >>>      WRITE_INTENT_BITMAP
> >>>    btrfs: introduce the on-disk format of btrfs write intent bitmaps
> >>>    btrfs: load/create write-intent bitmaps at mount time
> >>>    btrfs: write-intent: write the newly created bitmaps to all disks
> >>>    btrfs: write-intent: introduce an internal helper to set bits for a
> >>>      range.
> >>>    btrfs: write-intent: introduce an internal helper to clear bits for a
> >>>      range.
> >>>    btrfs: selftests: add selftests for write-intent bitmaps
> >>>    btrfs: write back write intent bitmap after barrier_all_devices()
> >>>    btrfs: update and writeback the write-intent bitmap for RAID56 write.
> >>>    btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
> >>>    btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
> >>>    btrfs: avoid recording full stripe write into write-intent bitmaps
> >>>    btrfs: scrub the full stripe which had sub-stripe write at mount time
> >>>
> >>>   fs/btrfs/Makefile                           |   5 +-
> >>>   fs/btrfs/ctree.h                            |  26 +-
> >>>   fs/btrfs/disk-io.c                          |  58 +-
> >>>   fs/btrfs/raid56.c                           |  27 +
> >>>   fs/btrfs/scrub.c                            | 177 +++-
> >>>   fs/btrfs/sysfs.c                            |   2 +
> >>>   fs/btrfs/tests/btrfs-tests.c                |   4 +
> >>>   fs/btrfs/tests/btrfs-tests.h                |   2 +
> >>>   fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
> >>>   fs/btrfs/volumes.c                          |  34 +-
> >>>   fs/btrfs/write-intent.c                     | 923 ++++++++++++++++++++
> >>>   fs/btrfs/write-intent.h                     | 303 +++++++
> >>>   fs/btrfs/zoned.c                            |   8 +
> >>>   include/uapi/linux/btrfs.h                  |  17 +
> >>>   14 files changed, 1812 insertions(+), 21 deletions(-)
> >>>   create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
> >>>   create mode 100644 fs/btrfs/write-intent.c
> >>>   create mode 100644 fs/btrfs/write-intent.h
> >>>
> >>> --
> >>> 2.37.0
> >>>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-08-03 21:01 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-25  5:37 [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
2022-07-25  5:37 ` [PATCH 01/14] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
2022-07-25  5:37 ` [PATCH 02/14] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
2022-07-25  5:37 ` [PATCH 03/14] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
2022-07-25  5:37 ` [PATCH 04/14] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
2022-07-25  5:37 ` [PATCH 05/14] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
2022-07-25  5:37 ` [PATCH 06/14] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
2022-07-25  5:37 ` [PATCH 07/14] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
2022-07-25  5:37 ` [PATCH 08/14] btrfs: selftests: add selftests for write-intent bitmaps Qu Wenruo
2022-07-25  5:37 ` [PATCH 09/14] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
2022-07-25  5:37 ` [PATCH 10/14] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
2022-07-25  5:37 ` [PATCH 11/14] btrfs: raid56: clear write-intent bimaps when a full stripe finishes Qu Wenruo
2022-07-25  5:38 ` [PATCH 12/14] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
2022-07-25  5:38 ` [PATCH 13/14] btrfs: avoid recording full stripe write into write-intent bitmaps Qu Wenruo
2022-07-25  5:38 ` [PATCH 14/14] btrfs: scrub the full stripe which had sub-stripe write at mount time Qu Wenruo
2022-08-03  0:29 ` [PATCH 00/14] btrfs: introduce write-intent bitmaps for RAID56 me
2022-08-03  0:58   ` Qu Wenruo
2022-08-03  9:11     ` Goffredo Baroncelli
2022-08-03  9:39       ` Qu Wenruo
2022-08-03 21:00       ` me
2022-08-03  8:48 ` Goffredo Baroncelli
2022-08-03  9:52   ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.