linux-btrfs.vger.kernel.org archive mirror
* [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
@ 2022-07-07  5:32 Qu Wenruo
  2022-07-07  5:32 ` [PATCH 01/12] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
                   ` (13 more replies)
  0 siblings, 14 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
RFC->v1:
- Fix a corner case in write_intent_set_bits()
  If the range covers the last existing entry but still needs a new
  entry, the old code would not insert the new entry, causing
  write_intent_clear_bits() to trigger a warning.

- Add selftests for the write-intent bitmaps
  The write-intent bitmap is a sparse array of bitmaps.
  Some corner cases are tricky to get right on the first try (see the
  case above).
  The test cases will prevent such problems from happening again.

- Fix a hang with dev-replace, and better bitmap bio submission
  Previously we held device_list_mutex while submitting the bitmap
  bios, which could deadlock with dev-replace/dev-removal.
  Fix it by using RCU to keep a local copy of the devices and using
  that to submit the bitmap bios.

  Furthermore, there is no need to follow the superblock writeback
  scheme: since the bitmap content is always the same for all devices,
  we can just submit the same page to each device and use an atomic
  counter to wait for them to finish.

  Now there is no more crash/warning/deadlock in btrfs/070.
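The set_bits() corner case above can be illustrated with a minimal user-space model of the sparse entry array (one u64 bytenr plus one u64 bitmap per entry, 64KiB per bit). This is only a sketch; the helper names are made up and the real kernel code operates on the on-disk write_intent_entry array:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE	(64 * 1024ULL)		/* one bit covers 64KiB */
#define BITS_PER_ENTRY	64
#define ENTRY_COVER	(BITS_PER_ENTRY * BLOCKSIZE)	/* 4MiB per entry */
#define MAX_ENTRIES	224

struct entry {
	uint64_t bytenr;	/* aligned to ENTRY_COVER, 0 means unused */
	uint64_t bitmap;
};

static struct entry entries[MAX_ENTRIES];
static unsigned int nr_entries;

/* Find or insert the entry whose 4MiB window contains @bytenr, kept sorted. */
static struct entry *get_entry(uint64_t bytenr)
{
	uint64_t start = bytenr & ~(ENTRY_COVER - 1);
	unsigned int i;

	for (i = 0; i < nr_entries; i++) {
		if (entries[i].bytenr == start)
			return &entries[i];
		if (entries[i].bytenr > start)
			break;
	}
	/* Shift the tail to keep the array sorted, then insert. */
	memmove(&entries[i + 1], &entries[i],
		(nr_entries - i) * sizeof(struct entry));
	entries[i].bytenr = start;
	entries[i].bitmap = 0;
	nr_entries++;
	return &entries[i];
}

/* Set all bits covering [bytenr, bytenr + len); both assumed 64KiB aligned. */
static void set_bits(uint64_t bytenr, uint64_t len)
{
	uint64_t cur;

	/*
	 * The RFC bug: a range extending past the last existing entry must
	 * still allocate the trailing entry, or a later clear_bits() warns.
	 * Walking bit by bit and inserting on demand handles that case.
	 */
	for (cur = bytenr; cur < bytenr + len; cur += BLOCKSIZE) {
		struct entry *e = get_entry(cur);

		e->bitmap |= 1ULL << ((cur - e->bytenr) / BLOCKSIZE);
	}
}
```

For example, a 128KiB write straddling the 4MiB boundary must end up as two entries: bit 63 of the first and bit 0 of the second.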

[BACKGROUND]
Unlike md-raid, btrfs RAID56 has no way to re-sync its devices after a
power loss.

For pure mirror-based profiles this is fine, as btrfs can utilize its
csums to find the correct mirror and repair the bad ones.

But for RAID56, the repair itself needs the data from other devices,
thus any out-of-sync data can degrade the tolerance.

Even worse, an incorrect RMW can use the stale data to generate P/Q,
removing any possibility of recovering the data.


md-raid uses a write-intent bitmap for faster resilvering, and a
journal (partial parity log for RAID5) to ensure it can even survive a
power loss combined with a device loss.

[OBJECTIVE]

This patchset will introduce a btrfs specific write-intent bitmap.

The bitmap is located at physical offset 1MiB of each device, and its
content is the same on all devices.

When there is a RAID56 write (currently any RAID56 write, including
full stripe writes), before submitting the real bios to the disks, the
write-intent bitmap will be updated and flushed to all writeable
devices.

So even if a power loss happens, at the next mount we know which full
stripes need checking, and can start a scrub for the involved logical
bytenr ranges.
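A sketch of what that mount-time consumer could look like: walk each dirty entry and turn runs of set bits back into logical byte ranges to hand to scrub. The names here are illustrative only; the real recovery code is not implemented in this series yet:

```c
#include <assert.h>
#include <stdint.h>

#define WI_BLOCKSIZE	(64 * 1024ULL)	/* one bit covers 64KiB */

struct wi_entry {
	uint64_t bytenr;	/* logical bytenr covered by bit 0 */
	uint64_t bitmap;	/* 64 bits, 4MiB of coverage */
};

/*
 * Expand one dirty entry into [start, start + len) logical ranges,
 * merging runs of adjacent set bits so scrub gets fewer, larger ranges.
 * Returns the number of ranges written to @starts/@lens (up to 32).
 */
static int wi_entry_to_ranges(const struct wi_entry *e,
			      uint64_t *starts, uint64_t *lens)
{
	int nr = 0, bit = 0;

	while (bit < 64) {
		int first;

		/* Skip clear bits. */
		while (bit < 64 && !(e->bitmap & (1ULL << bit)))
			bit++;
		if (bit == 64)
			break;
		first = bit;
		/* Extend over the run of set bits. */
		while (bit < 64 && (e->bitmap & (1ULL << bit)))
			bit++;
		starts[nr] = e->bytenr + (uint64_t)first * WI_BLOCKSIZE;
		lens[nr] = (uint64_t)(bit - first) * WI_BLOCKSIZE;
		nr++;
	}
	return nr;
}
```

An entry with bits 1 and 2 set would yield a single 128KiB range starting 64KiB past the entry's bytenr.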

[NO RECOVERY CODE YET]

Unfortunately, this patchset only implements the write-intent bitmap
code; the recovery part is still a placeholder, as we need some scrub
refactoring to make it scrub only a logical bytenr range.

[ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]

Since btrfs can utilize csums for its metadata and CoWed data, unlike
dm-bitmap, which can only be used for faster resilvering, we can fully
rebuild the full stripe, as long as:

1) There is no missing device
   For the missing-device case, we would still need a full journal.

2) Untouched data stays untouched
   This should mostly hold true on sane hardware.

And since the btrfs specific write-intent bitmap is pretty small (4KiB
in size), the overhead is much lower than a full journal.

In the future, we may allow users to choose between just the bitmap or
a full journal to meet their requirements.

[BITMAPS DESIGN]

The bitmaps on-disk format looks like this:

 [ super ][ entry 1 ][ entry 2 ] ... [entry N]
 |<---------  super::size (4K) ------------->|

The super block records how many entries are in use.

Each entry is 128 bits (16 bytes) in size, containing one u64 for the
bytenr and one u64 for the bitmap.

All utilized entries are sorted by bytenr, and no bits can overlap.

The blocksize is currently fixed to BTRFS_STRIPE_LEN (64KiB), so each
entry can cover at most 4MiB, and the whole bitmap can contain 224
entries.

In the worst case, it can track 14MiB of dirty ranges
(1 bit set per entry, which also corresponds to 2-disk RAID5 or 3-disk
RAID6).

In the best case, it can track 896MiB of dirty ranges
(all bits set in every entry).
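The capacity figures above follow directly from the on-disk constants. A quick sanity check, using the sizes introduced later in the series (macro names here are local stand-ins for the kernel ones):

```c
#include <assert.h>
#include <stdint.h>

#define WI_SUPER_SIZE		512		/* sizeof(struct write_intent_super) */
#define WI_BITMAPS_SIZE		4096		/* WRITE_INTENT_BITMAPS_SIZE */
#define WI_ENTRY_SIZE		16		/* one u64 bytenr + one u64 bitmap */
#define WI_BITS_PER_ENTRY	64
#define WI_BLOCKSIZE		(64 * 1024ULL)	/* BTRFS_STRIPE_LEN */

/* 224: how many 16-byte entries fit after the 512-byte super block. */
static inline uint64_t wi_max_entries(void)
{
	return (WI_BITMAPS_SIZE - WI_SUPER_SIZE) / WI_ENTRY_SIZE;
}

/* Worst case: a single 64KiB bit dirty in every entry -> 14MiB. */
static inline uint64_t wi_worst_case_bytes(void)
{
	return wi_max_entries() * WI_BLOCKSIZE;
}

/* Best case: all 64 bits dirty in every entry -> 896MiB. */
static inline uint64_t wi_best_case_bytes(void)
{
	return wi_max_entries() * WI_BITS_PER_ENTRY * WI_BLOCKSIZE;
}
```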

[WHY NOT BTRFS BTREE]

The current write-intent structure has two requirements:

- Its data needs to survive across stripe boundaries
  Normally this means a write-intent btree would need to act like a
  proper tree root, with METADATA_ITEMs for all its tree blocks.

- Its data must be updated outside of a transaction
  Currently only the log tree can do such a thing,
  but unfortunately the log tree cannot survive across a transaction
  boundary.

Thus a write-intent btree could only meet one of the two requirements,
and is not a suitable solution here.

[TESTING AND BENCHMARK]

For a performance benchmark, unfortunately I don't have 3 HDDs to test
with. I will benchmark once I have secured enough hardware.

For testing, it survives the volume/raid/dev-replace test groups with
no write-intent bitmap leakage.

Unfortunately there is still a warning triggered in btrfs/070, which
is under investigation; hopefully it is a false alert in the bitmap
clearing path.

[TODO]
- Scrub refactor to allow us to do proper recovery at mount time
  Need to change scrub interface to scrub based on logical bytenr.

  This could be a very large amount of work, thus currently we will
  focus only on a new RAID56 scrub interface for write-intent recovery.

- Extra optimizations
  * Skip full stripe writes
  * Enlarge the window between btrfs_write_intent_mark_dirty() and
    btrfs_write_intent_writeback()
    So that we can merge more dirty bits and cause fewer bitmap
    writebacks

- Proper performance benchmark
  Needs hardware/bare-metal VMs, since I don't have any physical
  machine large enough to contain three 3.5" HDDs.


Qu Wenruo (12):
  btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
  btrfs: introduce a new experimental compat RO flag,
    WRITE_INTENT_BITMAP
  btrfs: introduce the on-disk format of btrfs write intent bitmaps
  btrfs: load/create write-intent bitmaps at mount time
  btrfs: write-intent: write the newly created bitmaps to all disks
  btrfs: write-intent: introduce an internal helper to set bits for a
    range.
  btrfs: write-intent: introduce an internal helper to clear bits for a
    range.
  btrfs: selftests: add selftests for write-intent bitmaps
  btrfs: write back write intent bitmap after barrier_all_devices()
  btrfs: update and writeback the write-intent bitmap for RAID56 write.
  btrfs: raid56: clear write-intent bitmaps when a full stripe finishes.
  btrfs: warn and clear bitmaps if there is dirty bitmap at mount time

 fs/btrfs/Makefile                           |   5 +-
 fs/btrfs/ctree.h                            |  24 +-
 fs/btrfs/disk-io.c                          |  54 ++
 fs/btrfs/raid56.c                           |  16 +
 fs/btrfs/sysfs.c                            |   2 +
 fs/btrfs/tests/btrfs-tests.c                |   4 +
 fs/btrfs/tests/btrfs-tests.h                |   2 +
 fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
 fs/btrfs/volumes.c                          |  34 +-
 fs/btrfs/write-intent.c                     | 903 ++++++++++++++++++++
 fs/btrfs/write-intent.h                     | 303 +++++++
 fs/btrfs/zoned.c                            |   8 +
 include/uapi/linux/btrfs.h                  |  17 +
 13 files changed, 1610 insertions(+), 9 deletions(-)
 create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
 create mode 100644 fs/btrfs/write-intent.c
 create mode 100644 fs/btrfs/write-intent.h

-- 
2.36.1



* [PATCH 01/12] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 02/12] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

Since the beginning of the btrfs kernel module, the kernel has avoided
allocating dev extents in the first 1MiB of each device, not only for
the superblock at 64KiB, but also to let legacy bootloaders store
their data.

For later expansion, we introduce a new compat RO flag,
EXTRA_SUPER_RESERVED, which allows btrfs to have extra reserved space
beyond the default 1MiB.

The extra reserved space can be utilized by things like the
write-intent log.

Several new checks are introduced:

- reserved_bytes must not be smaller than the old 1MiB limit
- No zoned device support
  The reserved space will be utilized in a non-CoW way, thus it cannot
  be supported on zoned devices.

We still allow dev extents to exist in the new reserved_bytes range;
this allows btrfstune to set reserved_bytes and then rely on kernel
balance to relocate those dev extents out of the new space.

But in that case, any feature relying on reserved_bytes will be
disabled automatically.

There is a special catch: the new member @reserved_bytes is located at
the end of the superblock, utilizing the padding bytes.

The reason is that, due to some undetermined extent tree v2 on-disk
format mess, we don't have a reliable location inside the reserved8[]
array, and I just want a stable, undisturbed location.
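The allocator-side effect of the flag can be modeled in a few lines. This mirrors the dev_extent_search_start() change in this patch, with abbreviated, hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

#define DEVICE_RANGE_RESERVED	(1024 * 1024ULL)	/* default 1MiB */

static inline uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Where dev-extent allocation may start on a device: never below the
 * reserved area, which is 1MiB by default, or super::reserved_bytes
 * when the EXTRA_SUPER_RESERVED compat RO flag is set.
 */
static uint64_t dev_extent_start(uint64_t hint, int extra_reserved,
				 uint32_t reserved_bytes)
{
	if (extra_reserved)
		return max_u64(hint, reserved_bytes);
	return max_u64(hint, DEVICE_RANGE_RESERVED);
}
```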

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h           | 14 +++++++++++---
 fs/btrfs/disk-io.c         |  9 +++++++++
 fs/btrfs/sysfs.c           |  2 ++
 fs/btrfs/volumes.c         | 22 ++++++++++++++++++----
 fs/btrfs/zoned.c           |  8 ++++++++
 include/uapi/linux/btrfs.h | 10 ++++++++++
 6 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4e2569f84aab..12019904f1cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -293,14 +293,19 @@ struct btrfs_super_block {
 	__le64 block_group_root_generation;
 	u8 block_group_root_level;
 
-	/* future expansion */
 	u8 reserved8[7];
 	__le64 reserved[25];
 	u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
 	struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 
+	/*
+	 * How many bytes are reserved at the beginning of a device.
+	 * Should be >= BTRFS_DEFAULT_RESERVED.
+	 */
+	__le32 reserved_bytes;
+
 	/* Padded to 4096 bytes */
-	u8 padding[565];
+	u8 padding[561];
 } __attribute__ ((__packed__));
 static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 
@@ -315,7 +320,8 @@ static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
 	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
-	 BTRFS_FEATURE_COMPAT_RO_VERITY)
+	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
+	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -2539,6 +2545,8 @@ BTRFS_SETGET_STACK_FUNCS(super_block_group_root_generation,
 			 block_group_root_generation, 64);
 BTRFS_SETGET_STACK_FUNCS(super_block_group_root_level, struct btrfs_super_block,
 			 block_group_root_level, 8);
+BTRFS_SETGET_STACK_FUNCS(super_reserved_bytes, struct btrfs_super_block,
+			 reserved_bytes, 32);
 
 int btrfs_super_csum_size(const struct btrfs_super_block *s);
 const char *btrfs_super_csum_name(u16 csum_type);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 70b388de4d66..1df2da2509ca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2822,6 +2822,15 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 		ret = -EINVAL;
 	}
 
+	if (btrfs_super_compat_ro_flags(sb) &
+	    BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED &&
+	    btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+		btrfs_err(fs_info,
+"EXTRA_SUPER_RESERVED feature enabled, but reserved space is smaller than default (%u)",
+			    BTRFS_DEVICE_RANGE_RESERVED);
+		ret = -EINVAL;
+	}
+
 	/*
 	 * The generation is a global counter, we'll trust it more than the others
 	 * but it's still possible that it's the one that's wrong.
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index d5d0717fd09a..7fd38539315f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -287,6 +287,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_COMPAT_RO(extra_super_reserved, EXTRA_SUPER_RESERVED);
 #ifdef CONFIG_BLK_DEV_ZONED
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
@@ -317,6 +318,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(extra_super_reserved),
 #ifdef CONFIG_BLK_DEV_ZONED
 	BTRFS_FEAT_ATTR_PTR(zoned),
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2d788a351c1f..2a4ac905e39f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1401,8 +1401,13 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
 
 static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 {
+	struct btrfs_fs_info *fs_info = device->fs_devices->fs_info;
+
 	switch (device->fs_devices->chunk_alloc_policy) {
 	case BTRFS_CHUNK_ALLOC_REGULAR:
+		if (btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED))
+			return max_t(u64, start,
+				btrfs_super_reserved_bytes(fs_info->super_copy));
 		return max_t(u64, start, BTRFS_DEVICE_RANGE_RESERVED);
 	case BTRFS_CHUNK_ALLOC_ZONED:
 		/*
@@ -7969,11 +7974,14 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	struct extent_map *em;
 	struct map_lookup *map;
 	struct btrfs_device *dev;
+	u32 super_reserved = BTRFS_DEVICE_RANGE_RESERVED;
 	u64 stripe_len;
 	bool found = false;
 	int ret = 0;
 	int i;
 
+	if (btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED))
+		super_reserved = btrfs_super_reserved_bytes(fs_info->super_copy);
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, chunk_offset, 1);
 	read_unlock(&em_tree->lock);
@@ -7998,11 +8006,17 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	}
 
 	/*
-	 * Very old mkfs.btrfs (before v4.1) will not respect the reserved
-	 * space. Although kernel can handle it without problem, better to warn
-	 * the users.
+	 * This can be caused by two cases:
+	 * - Very old mkfs.btrfs (before v4.1) and no balance at all
+	 *   This should be pretty rare now, as balance will relocate
+	 *   those dev extents in reserved range.
+	 *
+	 * - Newly set btrfs_super_block::reserved_bytes
+	 *   We rely on this mount to relocate those dev extents.
+	 *
+	 * So here, we just give a warning and continue the mount.
 	 */
-	if (physical_offset < BTRFS_DEVICE_RANGE_RESERVED)
+	if (physical_offset < super_reserved)
 		btrfs_warn(fs_info,
 		"devid %llu physical %llu len %llu inside the reserved space",
 			   devid, physical_offset, physical_len);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 7a0f8fa44800..ec16c0a6fb22 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -703,6 +703,14 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (incompat_zoned && btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED)) {
+		btrfs_err(fs_info,
+			"zoned: incompatible optional feature detected: 0x%llx",
+			BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED);
+		ret = -EINVAL;
+		goto out;
+	}
+
 	/*
 	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
 	 * btrfs_create_chunk(). Since we want stripe_len == zone_size,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index f54dc91e4025..4a0c9f4f55d1 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -290,6 +290,16 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
 #define BTRFS_FEATURE_COMPAT_RO_VERITY			(1ULL << 2)
 
+/*
+ * Allow btrfs to have extra reserved space (other than the default 1MiB) at
+ * the beginning of each device.
+ *
+ * This feature will enable the usage of btrfs_super_block::reserved_bytes.
+ *
+ * This feature would only be available for non-zoned filesystems.
+ */
+#define BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED	(1ULL << 3)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
-- 
2.36.1



* [PATCH 02/12] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
  2022-07-07  5:32 ` [PATCH 01/12] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 03/12] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

The new flag is for the incoming write-intent bitmap, mostly to
address the RAID56 write-hole by doing a mandatory scrub of partially
written stripes at mount time.

Currently the feature is still under development; this patch is mostly
a placeholder for the extra reserved bytes for the write-intent
bitmap.

We will utilize the newly introduced EXTRA_SUPER_RESERVED compat RO
flag to enlarge the reserved bytes to at least (1MiB + 64KiB), and use
that 64KiB (the exact value is not yet fully determined) for the
write-intent bitmap.

Only one extra check is introduced, to ensure we have enough space to
place the write-intent bitmap at the 1MiB physical offset.

This patch is only a placeholder for the incoming on-disk format
change; no real write-intent functionality is implemented yet.
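The validation added to validate_super() by this patch can be modeled as a pure predicate (constants per this series; BTRFS_DEVICE_RANGE_RESERVED is the old 1MiB limit, and a later patch tightens the size check further):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COMPAT_RO_EXTRA_SUPER_RESERVED	(1ULL << 3)
#define COMPAT_RO_WRITE_INTENT_BITMAP	(1ULL << 4)
#define DEVICE_RANGE_RESERVED		(1024U * 1024U)	/* 1MiB */

/* Mirror of the superblock checks added in this patch. */
static bool wi_flags_valid(uint64_t compat_ro, uint32_t reserved_bytes)
{
	if (!(compat_ro & COMPAT_RO_WRITE_INTENT_BITMAP))
		return true;
	/* The bitmap feature requires the extra-reserve feature... */
	if (!(compat_ro & COMPAT_RO_EXTRA_SUPER_RESERVED))
		return false;
	/* ...and enough reserved space to place the bitmap at 1MiB. */
	return reserved_bytes >= DEVICE_RANGE_RESERVED;
}
```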

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h           |  9 +++++++++
 fs/btrfs/disk-io.c         | 20 ++++++++++++++++++++
 fs/btrfs/volumes.c         | 14 +++++++++++++-
 include/uapi/linux/btrfs.h |  7 +++++++
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 12019904f1cf..908a735a66cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -317,11 +317,20 @@ static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
+#ifdef CONFIG_BTRFS_DEBUG
+#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
+	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
+	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
+	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED |	\
+	 BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP)
+#else
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
 	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
 	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
 	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)
+#endif
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1df2da2509ca..967c020c380a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2830,6 +2830,26 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 			    BTRFS_DEVICE_RANGE_RESERVED);
 		ret = -EINVAL;
 	}
+	if (btrfs_super_compat_ro_flags(sb) &
+	    BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP) {
+		/* Write intent bitmap requires extra reserve. */
+		if (!(btrfs_super_compat_ro_flags(sb) &
+		      BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)) {
+			btrfs_err(fs_info,
+"WRITE_INTENT_BITMAP feature enabled, but missing EXTRA_SUPER_RESERVED feature");
+			ret = -EINVAL;
+		}
+		/*
+		 * Write intent bitmap is always located at 1MiB.
+		 * Extra check like the length check against the reserved space
+		 * will happen at bitmap load time.
+		 */
+		if (btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+			btrfs_err(fs_info,
+			"not enough reserved space for write intent bitmap");
+			ret = -EINVAL;
+		}
+	}
 
 	/*
 	 * The generation is a global counter, we'll trust it more than the others
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2a4ac905e39f..4882c616768c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -8016,11 +8016,23 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	 *
 	 * So here, we just give a warning and continue the mount.
 	 */
-	if (physical_offset < super_reserved)
+	if (physical_offset < super_reserved) {
 		btrfs_warn(fs_info,
 		"devid %llu physical %llu len %llu inside the reserved space",
 			   devid, physical_offset, physical_len);
 
+		/* Disable any feature relying on the new reserved_bytes. */
+		if (btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP)) {
+			struct btrfs_super_block *sb = fs_info->super_copy;
+
+			btrfs_warn(fs_info,
+	"disabling write intent bitmap due to the lack of reserved space.");
+			btrfs_set_super_compat_ro_flags(sb,
+				btrfs_super_compat_ro_flags(sb) &
+				~BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP);
+		}
+	}
+
 	for (i = 0; i < map->num_stripes; i++) {
 		if (map->stripes[i].dev->devid == devid &&
 		    map->stripes[i].physical == physical_offset) {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 4a0c9f4f55d1..38c74a50323e 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -300,6 +300,13 @@ struct btrfs_ioctl_fs_info_args {
  */
 #define BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED	(1ULL << 3)
 
+/*
+ * Allow btrfs to have per-device write-intent bitmap.
+ * Will be utilized to close the RAID56 write-hole (by forced scrub for dirty
+ * partial written stripes at mount time).
+ */
+#define BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP	(1ULL << 4)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
-- 
2.36.1



* [PATCH 03/12] btrfs: introduce the on-disk format of btrfs write intent bitmaps
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
  2022-07-07  5:32 ` [PATCH 01/12] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
  2022-07-07  5:32 ` [PATCH 02/12] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 04/12] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

With extra comments explaining the on-disk format and the basic
workflow.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c      |   4 +-
 fs/btrfs/raid56.c       |   1 +
 fs/btrfs/write-intent.h | 199 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 203 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/write-intent.h

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 967c020c380a..963a89cd4bfb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "space-info.h"
 #include "zoned.h"
 #include "subpage.h"
+#include "write-intent.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -2844,7 +2845,8 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 		 * Extra check like the length check against the reserved space
 		 * will happen at bitmap load time.
 		 */
-		if (btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+		if (btrfs_super_reserved_bytes(sb) <
+		    BTRFS_DEVICE_RANGE_RESERVED + WRITE_INTENT_BITMAPS_SIZE) {
 			btrfs_err(fs_info,
 			"not enough reserved space for write intent bitmap");
 			ret = -EINVAL;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c6411c849fea..7b2d2b6c8c61 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "raid56.h"
 #include "async-thread.h"
+#include "write-intent.h"
 
 /* set when additional merges to this rbio are not allowed */
 #define RBIO_RMW_LOCKED_BIT	1
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
new file mode 100644
index 000000000000..b851917bb0b6
--- /dev/null
+++ b/fs/btrfs/write-intent.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Btrfs specific write-intent bitmaps. */
+
+#ifndef BTRFS_WRITE_INTENT_H
+#define BTRFS_WRITE_INTENT_H
+
+#include <linux/types.h>
+#include <linux/sizes.h>
+/* For BTRFS_STRIPE_LEN. */
+#include "volumes.h"
+
+#define WRITE_INTENT_SUPER_MAGIC 0x515F62536249775FULL /* ascii _wIbSb_Q */
+
+/* This write-intent bitmap records RAID56 writes. */
+#define WRITE_INTENT_FLAGS_TARGET_RAID56	(1ULL << 0)
+
+/* This write-intent bitmap is internal, i.e. located at 1MiB of each device. */
+#define WRITE_INTENT_FLAGS_INTERNAL		(1ULL << 1)
+
+/* This write-intent bitmap uses logical bytenr. */
+#define WRITE_INTENT_FLAGS_BYTENR_LOGICAL	(1ULL << 2)
+
+#define WRITE_INTENT_FLAGS_SUPPORTED			\
+	(WRITE_INTENT_FLAGS_TARGET_RAID56 |		\
+	 WRITE_INTENT_FLAGS_INTERNAL |			\
+	 WRITE_INTENT_FLAGS_BYTENR_LOGICAL)
+
+/*
+ * We use BTRFS_STRIPE_LEN as the blocksize.
+ * This maps a RAID56 full stripe to @nr_data bits, and greatly
+ * enlarges how many bytes we can represent using just 4KiB.
+ */
+#define WRITE_INTENT_BLOCKSIZE			(BTRFS_STRIPE_LEN)
+
+/*
+ * For now, 4KiB is enough, as using a 64KiB blocksize we can save bitmaps
+ * for an 896MiB (224 entries, each entry can cover 64KiB * 64) sized logical
+ * range.
+ */
+#define WRITE_INTENT_BITMAPS_SIZE		(SZ_4K)
+
+
+/*
+ * The super block of write intent bitmaps, should be at physical offset 1MiB of
+ * every writeable device.
+ */
+struct write_intent_super {
+	/* Csum for the super block and all the internal entries. */
+	__u8 csum[BTRFS_CSUM_SIZE];
+	__u8 fsid[BTRFS_FSID_SIZE];
+
+	__le64 magic;
+
+	/* Important behavior flags would be set here. */
+	__le64 flags;
+
+	/*
+	 * Event counter for the bitmap, every time the bitmaps get written
+	 * this value increases.
+	 */
+	__le64 events;
+
+	/*
+	 * Total size of the bitmaps, including this super block and all the
+	 * entries.
+	 *
+	 * U32 should be enough for internal bitmaps, but just in case we want
+	 * to support external device as dedicated journal/bitmap device.
+	 */
+	__le64 size;
+
+	/* How many entries we have utilized. */
+	__le64 nr_entries;
+
+	/* How many bytes one bit represents. */
+	__le32 blocksize;
+	/*
+	 * This should be the same as btrfs_super_block::csum_type.
+	 * Cache the csum type here so we can read the write-intent superblock
+	 * without a fully opened btrfs (for dump purposes).
+	 */
+	__le16 csum_type;
+
+	/* For future expansion, padding to 512 bytes. */
+	__u8 reserved1[418];
+} __attribute__ ((__packed__));
+
+static_assert(sizeof(struct write_intent_super) == 512);
+
+struct write_intent_entry {
+	/*
+	 * Bytenr 0 is special, means this entry is empty, and also means the
+	 * end of the bitmaps.
+	 */
+	__le64 bytenr;
+	__le64 bitmap;
+};
+
+/* How many entries we can have in the bitmaps. */
+#define WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES		\
+	((WRITE_INTENT_BITMAPS_SIZE -				\
+	  sizeof(struct write_intent_super)) /			\
+	 sizeof(struct write_intent_entry))
+
+/* The number of bits we can have in one entry. */
+#define WRITE_INTENT_BITS_PER_ENTRY		(64)
+
+/*
+ * ON-DISK FORMAT
+ * ==============
+ *
+ * [ super ][ entry 1 ][ entry 2 ] ... [entry N]
+ * |<------------  super::size --------------->|
+ *
+ * Normally it's 4KiB in size for internal bitmap.
+ *
+ * Currently the write-intent bitmaps is only for RAID56 writes, thus its
+ * blocksize is always 64KiB.
+ * Thus one entry can represent at most 4MiB (64 * 64 KiB) of logical range.
+ *
+ * When one raid56 full stripe needs partial writeback, the full stripe logical
+ * bytenr range will be included into at least one entry.
+ *
+ * After the last used entry, the remaining entries will all be filled with 0.
+ *
+ * WORKFLOW
+ * ========
+ *
+ * 1) Write bio arrives
+ *    Every write meeting the requirement (so far, only RAID56 partial writes)
+ *    will have its bio delayed until the corresponding range is marked.
+ *
+ * 2) Update the write-intent bitmaps
+ *    The entries will be populated, and write back to all writeable devices,
+ *    with FUA flag set.
+ *    Will wait until the write reaches disks.
+ *
+ * 3) Allow the involved write bios to be submitted
+ *
+ * 4) Write bios finish
+ *    The corresponding range will be recorded to be freed at next flush.
+ *
+ * 5) All devices get flushed (caused by btrfs super block writeback or bitmap
+ *    pressure)
+ *    The recorded ranges will be cleared. And if an entry is empty, it will be
+ *    freed.
+ *    Then update the write-intent bitmaps with its superblock (writeback with FUA
+ *    flag and wait for it).
+ */
+
+#define WRITE_INTENT_SETGET_FUNCS(name, type, member, bits)	\
+static inline u##bits wi_##name(const type *s)			\
+{								\
+	return le##bits##_to_cpu(s->member);			\
+}								\
+static inline void wi_set_##name(type *s, u##bits val)		\
+{								\
+	s->member = cpu_to_le##bits(val);			\
+}
+
+WRITE_INTENT_SETGET_FUNCS(super_magic, struct write_intent_super, magic, 64);
+WRITE_INTENT_SETGET_FUNCS(super_flags, struct write_intent_super, flags, 64);
+WRITE_INTENT_SETGET_FUNCS(super_events, struct write_intent_super, events, 64);
+WRITE_INTENT_SETGET_FUNCS(super_size, struct write_intent_super, size, 64);
+WRITE_INTENT_SETGET_FUNCS(super_nr_entries, struct write_intent_super,
+			  nr_entries, 64);
+WRITE_INTENT_SETGET_FUNCS(super_blocksize, struct write_intent_super,
+			  blocksize, 32);
+WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
+			  csum_type, 16);
+WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
+
+static inline void wie_get_bitmap(struct write_intent_entry *entry,
+				  unsigned long *bitmap)
+{
+#ifdef __LITTLE_ENDIAN
+	bitmap_from_arr64(bitmap, &entry->bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+#else
+	u64 val = le64_to_cpu(entry->bitmap);
+
+	bitmap_from_arr64(bitmap, &val, WRITE_INTENT_BITS_PER_ENTRY);
+#endif
+}
+
+static inline void wie_set_bitmap(struct write_intent_entry *entry,
+				  unsigned long *bitmap)
+{
+#ifdef __LITTLE_ENDIAN
+	bitmap_to_arr64(&entry->bitmap, bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+#else
+	u64 val;
+
+	bitmap_to_arr64(&val, bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+	entry->bitmap = cpu_to_le64(val);
+#endif
+}
+
+#endif
-- 
2.36.1



* [PATCH 04/12] btrfs: load/create write-intent bitmaps at mount time
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 03/12] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 05/12] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces btrfs_fs_info::wi_ctrl, which holds a
non-highmem page for the write-intent bitmap.

Please note that if we can't find a valid bitmap, the newly created
one will only exist in memory for now; the bitmap writeback
functionality will be introduced in the next commit.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile       |   2 +-
 fs/btrfs/ctree.h        |   1 +
 fs/btrfs/disk-io.c      |   7 ++
 fs/btrfs/write-intent.c | 174 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  15 ++++
 5 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/write-intent.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea..af93119d52e2 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o write-intent.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 908a735a66cf..fcc8ae4b7fb4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -989,6 +989,7 @@ struct btrfs_fs_info {
 	struct workqueue_struct *scrub_wr_completion_workers;
 	struct workqueue_struct *scrub_parity_workers;
 	struct btrfs_subpage_info *subpage_info;
+	struct write_intent_ctrl *wi_ctrl;
 
 	struct btrfs_discard_ctl discard_ctl;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 963a89cd4bfb..edbb21706bda 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3699,6 +3699,12 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			  ret);
 		goto fail_block_groups;
 	}
+	ret = btrfs_write_intent_init(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init write-intent bitmaps: %d",
+			  ret);
+		goto fail_block_groups;
+	}
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to recover balance: %d", ret);
@@ -4639,6 +4645,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 		ret = btrfs_commit_super(fs_info);
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
+		btrfs_write_intent_free(fs_info);
 	}
 
 	if (BTRFS_FS_ERROR(fs_info))
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
new file mode 100644
index 000000000000..a7ed21182525
--- /dev/null
+++ b/fs/btrfs/write-intent.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "ctree.h"
+#include "volumes.h"
+#include "write-intent.h"
+
+/*
+ * Return 0 if a valid write intent bitmap can be found.
+ * Return 1 if no valid write intent bitmap can be found.
+ * Return <0 for other fatal errors.
+ */
+static int write_intent_load(struct btrfs_device *device, struct page *dst)
+{
+	struct btrfs_fs_info *fs_info = device->fs_info;
+	struct write_intent_super *wis;
+	struct bio *bio;
+	int ret;
+
+	bio = bio_alloc(device->bdev, 1, REQ_OP_READ | REQ_SYNC | REQ_META,
+			GFP_NOFS);
+	/* It's backed by fs_bioset. */
+	ASSERT(bio);
+	bio->bi_iter.bi_sector = BTRFS_DEVICE_RANGE_RESERVED >> SECTOR_SHIFT;
+	__bio_add_page(bio, dst, WRITE_INTENT_BITMAPS_SIZE, 0);
+	ret = submit_bio_wait(bio);
+	if (ret < 0)
+		return ret;
+
+	wis = page_address(dst);
+	if (wi_super_magic(wis) != WRITE_INTENT_SUPER_MAGIC)
+		return 1;
+
+	/* Stale bitmaps, doesn't belong to our fs. */
+	if (memcmp(wis->fsid, device->fs_devices->fsid, BTRFS_FSID_SIZE))
+		return 1;
+
+	/* Above checks pass, but still csum mismatch, a big problem. */
+	if (btrfs_super_csum_type(fs_info->super_copy) !=
+	    wi_super_csum_type(wis)) {
+		btrfs_err(fs_info,
+		"csum type mismatch, write intent bitmap has %u fs has %u",
+			  wi_super_csum_type(wis),
+			  btrfs_super_csum_type(fs_info->super_copy));
+		return -EUCLEAN;
+	}
+
+	if (wi_super_flags(wis) & ~WRITE_INTENT_FLAGS_SUPPORTED) {
+		btrfs_err(fs_info, "unsupported flags 0x%llx",
+			  wi_super_flags(wis) & ~WRITE_INTENT_FLAGS_SUPPORTED);
+		return -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+static void write_intent_init(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+
+	ASSERT(ctrl);
+	ASSERT(ctrl->page);
+
+	/* Always start event count from 1. */
+	atomic64_set(&ctrl->event, 1);
+	ctrl->blocksize = WRITE_INTENT_BLOCKSIZE;
+	memzero_page(ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
+
+	wis = page_address(ctrl->page);
+	memcpy(wis->fsid, fs_info->fs_devices->fsid, BTRFS_FSID_SIZE);
+	wi_set_super_magic(wis, WRITE_INTENT_SUPER_MAGIC);
+	wi_set_super_csum_type(wis, btrfs_super_csum_type(fs_info->super_copy));
+	wi_set_super_events(wis, 1);
+	wi_set_super_flags(wis, WRITE_INTENT_FLAGS_SUPPORTED);
+	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
+	wi_set_super_blocksize(wis, ctrl->blocksize);
+	wi_set_super_nr_entries(wis, 0);
+	btrfs_info(fs_info, "creating new write intent bitmaps");
+}
+
+int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_device *highest_dev = NULL;
+	struct btrfs_device *dev;
+	struct write_intent_super *wis;
+	u64 highest_event = 0;
+	int ret;
+
+	ASSERT(fs_info->wi_ctrl == NULL);
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return 0;
+
+	fs_info->wi_ctrl = kzalloc(sizeof(*fs_info->wi_ctrl), GFP_NOFS);
+	if (!fs_info->wi_ctrl)
+		return -ENOMEM;
+
+	fs_info->wi_ctrl->page = alloc_page(GFP_NOFS);
+	if (!fs_info->wi_ctrl->page) {
+		ret = -ENOMEM;
+		goto cleanup;
+	}
+
+	/*
+	 * Go through every writeable device to find the highest event.
+	 *
+	 * Only the write-intent bitmap with the highest event number makes
+	 * sense. If we lost power during bitmap writeback, some devices may
+	 * still carry an old, stale bitmap.
+	 */
+	list_for_each_entry(dev, &fs_info->fs_devices->devices, dev_list) {
+		u64 cur_event;
+
+		if (!dev->bdev)
+			continue;
+
+		ret = write_intent_load(dev, fs_info->wi_ctrl->page);
+		if (ret > 0)
+			continue;
+		if (ret < 0) {
+			btrfs_err(fs_info,
+			"failed to load write intent from devid %llu: %d",
+				  dev->devid, ret);
+			goto cleanup;
+		}
+		wis = page_address(fs_info->wi_ctrl->page);
+		cur_event = wi_super_events(wis);
+		if (cur_event > highest_event) {
+			highest_dev = dev;
+			highest_event = cur_event;
+		}
+	}
+
+	/* Load the bitmap with the highest event as our bitmap. */
+	if (highest_dev) {
+		ret = write_intent_load(highest_dev, fs_info->wi_ctrl->page);
+		if (ret < 0) {
+			btrfs_err(fs_info,
+			"failed to load write intent from devid %llu: %d",
+				  highest_dev->devid, ret);
+			goto cleanup;
+		}
+		wis = page_address(fs_info->wi_ctrl->page);
+		atomic64_set(&fs_info->wi_ctrl->event, wi_super_events(wis));
+		fs_info->wi_ctrl->blocksize = wi_super_blocksize(wis);
+		btrfs_info(fs_info,
+			"loaded write intent bitmaps, event count %llu",
+			atomic64_read(&fs_info->wi_ctrl->event));
+		return 0;
+	}
+
+	/* No valid bitmap found, create a new one. */
+	write_intent_init(fs_info);
+	return 0;
+cleanup:
+	if (fs_info->wi_ctrl) {
+		if (fs_info->wi_ctrl->page)
+			__free_page(fs_info->wi_ctrl->page);
+		kfree(fs_info->wi_ctrl);
+		fs_info->wi_ctrl = NULL;
+	}
+	return ret;
+}
+
+void btrfs_write_intent_free(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+
+	if (!ctrl)
+		return;
+	ASSERT(ctrl->page);
+	__free_page(ctrl->page);
+	kfree(ctrl);
+	fs_info->wi_ctrl = NULL;
+}
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index b851917bb0b6..2c5cd434e978 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -106,6 +106,18 @@ struct write_intent_entry {
 /* The number of bits we can have in one entry. */
 #define WRITE_INTENT_BITS_PER_ENTRY		(64)
 
+/* In-memory write-intent control structure. */
+struct write_intent_ctrl {
+	/* For the write_intent super and entries. */
+	struct page *page;
+
+	/* Cached event counter.*/
+	atomic64_t event;
+
+	/* Cached blocksize from write intent super. */
+	u32 blocksize;
+};
+
 /*
  * ON-DISK FORMAT
  * ==============
@@ -196,4 +208,7 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 #endif
 }
 
+int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
+void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 05/12] btrfs: write-intent: write the newly created bitmaps to all disks
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 04/12] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

This writeback will happen even for RO mounts.

This ensures we always have write-intent bitmaps on disk for any
filesystem with the compat RO flag set.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 154 ++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/write-intent.h |   6 ++
 2 files changed, 155 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index a7ed21182525..d1c5e8e206ba 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -1,8 +1,140 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include <crypto/hash.h>
+#include <linux/bio.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "write-intent.h"
+#include "rcu-string.h"
+
+struct bitmap_writeback_ctrl {
+	atomic_t pending_bios;
+	atomic_t errors;
+	wait_queue_head_t wait;
+};
+
+static void write_intent_end_write(struct bio *bio)
+{
+	struct bitmap_writeback_ctrl *wb_ctrl = bio->bi_private;
+
+	if (bio->bi_status)
+		atomic_inc(&wb_ctrl->errors);
+	atomic_dec(&wb_ctrl->pending_bios);
+	wake_up(&wb_ctrl->wait);
+
+	bio_put(bio);
+}
+
+static int submit_one_device(struct btrfs_device *dev,
+			     struct bitmap_writeback_ctrl *wb_ctrl)
+{
+	struct btrfs_fs_info *fs_info = dev->fs_info;
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct bio *bio;
+
+	if (!dev->bdev)
+		return -EIO;
+
+	if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+		return 0;
+
+	atomic_inc(&wb_ctrl->pending_bios);
+	bio = bio_alloc(dev->bdev, 1, REQ_OP_WRITE | REQ_SYNC |
+			REQ_META | REQ_PRIO | REQ_FUA, GFP_NOFS);
+	bio->bi_iter.bi_sector = BTRFS_DEVICE_RANGE_RESERVED >> SECTOR_SHIFT;
+	bio->bi_private = wb_ctrl;
+	bio->bi_end_io = write_intent_end_write;
+	__bio_add_page(bio, ctrl->commit_page, WRITE_INTENT_BITMAPS_SIZE,
+			offset_in_page(BTRFS_DEVICE_RANGE_RESERVED));
+	submit_bio(bio);
+	return 0;
+}
+
+/* Write back the bitmaps page to all devices. */
+static int write_intent_writeback(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	struct btrfs_device *dev;
+	struct btrfs_device **found_devs;
+	struct bitmap_writeback_ctrl wb_ctrl = {0};
+	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	const int nr_devs_max = fs_info->fs_devices->open_devices + 4;
+	int nr_devs = 0;
+	int total_errors = 0;
+	int ret = 0;
+	int i;
+
+	ASSERT(ctrl);
+
+	found_devs = kcalloc(nr_devs_max, sizeof(struct btrfs_device *),
+			     GFP_NOFS);
+	if (!found_devs)
+		return -ENOMEM;
+
+	shash->tfm = fs_info->csum_shash;
+
+	spin_lock(&ctrl->lock);
+	wis = page_address(ctrl->page);
+
+	/*
+	 * Bump up the event counter each time this bitmap is going to be
+	 * written.
+	 */
+	wi_set_super_events(wis, wi_super_events(wis) + 1);
+	crypto_shash_digest(shash, (unsigned char *)wis + BTRFS_CSUM_SIZE,
+			    WRITE_INTENT_BITMAPS_SIZE - BTRFS_CSUM_SIZE,
+			    wis->csum);
+	atomic64_inc(&ctrl->event);
+	memcpy_page(ctrl->commit_page, 0, ctrl->page, 0,
+		    WRITE_INTENT_BITMAPS_SIZE);
+	spin_unlock(&ctrl->lock);
+
+	init_waitqueue_head(&wb_ctrl.wait);
+	atomic_set(&wb_ctrl.pending_bios, 0);
+	atomic_set(&wb_ctrl.errors, 0);
+
+	rcu_read_lock();
+	/*
+	 * Record all the writeable devices into found_devs[].
+	 *
+	 * We have to do this to keep a consistent view of writeable devices,
+	 * without holding device_list_mutex.
+	 * As dev-replace/dev-removal will hold that mutex and wait for
+	 * submitted bios to finish, if we tried to hold device_list_mutex
+	 * in the bio submission path, we would deadlock with the above
+	 * dev-replace/dev-removal.
+	 *
+	 * So here we just grab a local list of devices; since we're in the
+	 * bio submission path, a device can never disappear before its bio
+	 * finishes.
+	 */
+	list_for_each_entry_rcu(dev, &fs_info->fs_devices->devices, dev_list) {
+		found_devs[nr_devs] = dev;
+		nr_devs++;
+
+		if (unlikely(nr_devs >= nr_devs_max))
+			break;
+	}
+	rcu_read_unlock();
+
+	/* Go through all the recorded devices, and submit the commit_page. */
+	for (i = 0; i < nr_devs; i++) {
+		ret = submit_one_device(found_devs[i], &wb_ctrl);
+		if (ret < 0)
+			total_errors++;
+	}
+	wait_event(wb_ctrl.wait, atomic_read(&wb_ctrl.pending_bios) == 0);
+
+	if (total_errors + atomic_read(&wb_ctrl.errors) >
+	    btrfs_super_num_devices(fs_info->super_copy) - 1) {
+		btrfs_err(fs_info, "failed to writeback write-intent bitmaps");
+		ret = -EIO;
+	}
+	kfree(found_devs);
+	return ret;
+}
 
 /*
  * Return 0 if a valid write intent bitmap can be found.
@@ -53,10 +185,11 @@ static int write_intent_load(struct btrfs_device *device, struct page *dst)
 	return ret;
 }
 
-static void write_intent_init(struct btrfs_fs_info *fs_info)
+static int write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
 	struct write_intent_super *wis;
+	int ret;
 
 	ASSERT(ctrl);
 	ASSERT(ctrl->page);
@@ -75,7 +208,12 @@ static void write_intent_init(struct btrfs_fs_info *fs_info)
 	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
 	wi_set_super_blocksize(wis, ctrl->blocksize);
 	wi_set_super_nr_entries(wis, 0);
-	btrfs_info(fs_info, "creating new write intent bitmaps");
+
+	ret = write_intent_writeback(fs_info);
+	if (ret < 0)
+		return ret;
+	btrfs_info(fs_info, "new write intent bitmaps created");
+	return 0;
 }
 
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
@@ -95,11 +233,14 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 		return -ENOMEM;
 
 	fs_info->wi_ctrl->page = alloc_page(GFP_NOFS);
-	if (!fs_info->wi_ctrl->page) {
+	fs_info->wi_ctrl->commit_page = alloc_page(GFP_NOFS);
+	if (!fs_info->wi_ctrl->page || !fs_info->wi_ctrl->commit_page) {
 		ret = -ENOMEM;
 		goto cleanup;
 	}
 
+	spin_lock_init(&fs_info->wi_ctrl->lock);
+
 	/*
 	 * Go through every writeable device to find the highest event.
 	 *
@@ -149,12 +290,15 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 	}
 
 	/* No valid bitmap found, create a new one. */
-	write_intent_init(fs_info);
-	return 0;
+	ret = write_intent_init(fs_info);
+
+	return ret;
 cleanup:
 	if (fs_info->wi_ctrl) {
 		if (fs_info->wi_ctrl->page)
 			__free_page(fs_info->wi_ctrl->page);
+		if (fs_info->wi_ctrl->commit_page)
+			__free_page(fs_info->wi_ctrl->commit_page);
 		kfree(fs_info->wi_ctrl);
 		fs_info->wi_ctrl = NULL;
 	}
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 2c5cd434e978..797e57aef0e1 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -111,9 +111,15 @@ struct write_intent_ctrl {
 	/* For the write_intent super and entries. */
 	struct page *page;
 
+	/* A copy for writeback. */
+	struct page *commit_page;
+
 	/* Cached event counter.*/
 	atomic64_t event;
 
+	/* Lock for reading/writing above @page. */
+	spinlock_t lock;
+
 	/* Cached blocksize from write intent super. */
 	u32 blocksize;
 };
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 05/12] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-08  1:55   ` kernel test robot
  2022-07-08  7:23   ` kernel test robot
  2022-07-07  5:32 ` [PATCH 07/12] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 246 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  19 ++++
 2 files changed, 265 insertions(+)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index d1c5e8e206ba..eaf6d010462e 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -216,6 +216,252 @@ static int write_intent_init(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
+static struct write_intent_entry *write_intent_entry_nr(
+				struct write_intent_ctrl *ctrl, int nr)
+{
+
+	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+	return (page_address(ctrl->page) +
+		sizeof(struct write_intent_super) +
+		nr * sizeof(struct write_intent_entry));
+}
+
+/*
+ * Return <0 if the bytenr is before the given entry.
+ * Return 0 if the bytenr is inside the given entry.
+ * Return >0 if the bytenr is after the given entry.
+ */
+static int compare_bytenr_to_range(u64 bytenr, u64 start, u32 len)
+{
+	if (bytenr < start)
+		return -1;
+	if (start <= bytenr && bytenr < start + len)
+		return 0;
+	return 1;
+}
+
+/*
+ * Move all non-empty entries starting from @nr, to the right, and make room
+ * for @nr_new entries.
+ * Those new entries will be all zero filled.
+ *
+ * Caller should ensure we have enough room for @nr_new new entries.
+ */
+static void move_entries_right(struct write_intent_ctrl *ctrl, int nr,
+			       int nr_new)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	int move_size;
+
+	ASSERT(nr_new > 0);
+	ASSERT(wi_super_nr_entries(wis) + nr_new <=
+	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+
+	move_size = (wi_super_nr_entries(wis) - nr) *
+		     sizeof(struct write_intent_entry);
+
+	memmove(write_intent_entry_nr(ctrl, nr + nr_new),
+		write_intent_entry_nr(ctrl, nr), move_size);
+	memset(write_intent_entry_nr(ctrl, nr), 0,
+	       nr_new * sizeof(struct write_intent_entry));
+	wi_set_super_nr_entries(wis, wi_super_nr_entries(wis) + nr_new);
+}
+
+static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
+				  struct write_intent_entry *entry,
+				  u64 bytenr, u32 len)
+{
+	const u64 entry_start = wi_entry_bytenr(entry);
+	const u32 entry_len = write_intent_entry_size(ctrl);
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
+	bitmap_set(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
+		   len / ctrl->blocksize);
+	wie_set_bitmap(entry, bitmaps);
+}
+
+/*
+ * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
+ * and fill the new entries with proper bytenr and bitmaps.
+ */
+static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
+			       u64 bytenr, u32 len)
+{
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	u64 entry_start;
+	u64 new_start = round_down(bytenr, entry_size);
+	u64 new_end;
+	int nr_new_entries;
+	u64 cur;
+
+	if (nr >= wi_super_nr_entries(page_address(ctrl->page)) ||
+	    nr >= WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES)
+		entry_start = U64_MAX;
+	else
+		entry_start = wi_entry_bytenr(write_intent_entry_nr(ctrl, nr));
+
+	ASSERT(bytenr < entry_start);
+
+	new_end = min(entry_start, round_up(bytenr + len, entry_size));
+	nr_new_entries = (new_end - new_start) / entry_size;
+
+	if (nr_new_entries == 0)
+		return;
+
+	move_entries_right(ctrl, nr, nr_new_entries);
+
+	for (cur = new_start; cur < new_end; cur += entry_size, nr++) {
+		struct write_intent_entry *entry =
+			write_intent_entry_nr(ctrl, nr);
+		u64 range_start = max(cur, bytenr);
+		u64 range_len = min(cur + entry_size, bytenr + len) -
+				range_start;
+
+		/* Fill the bytenr into the new empty entries.*/
+		wi_set_entry_bytenr(entry, cur);
+
+		/* And set the bitmap. */
+		set_bits_in_one_entry(ctrl, entry, range_start, range_len);
+	}
+}
+
+/*
+ * This should be only called when we have enough room in the bitmaps, and hold
+ * the wi_ctrl->lock.
+ *
+ * This function is only exported for selftests, which doesn't need to hold any
+ * lock.
+ */
+void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	int i;
+	u64 nr_entries = wi_super_nr_entries(wis);
+	u64 cur_bytenr;
+
+	/*
+	 * Currently we only accept full stripe length, which should be
+	 * aligned to 64KiB.
+	 */
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	/*
+	 * We should have room to contain the worst case scenario, in which we
+	 * need to create one or more new entries.
+	 */
+	ASSERT(nr_entries + bytes_to_entries(bytenr, len, BTRFS_STRIPE_LEN) <=
+	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+
+	/*
+	 * Iterate through the existing entries to insert new entries or set
+	 * bits in the existing ones.
+	 */
+	for (i = 0, cur_bytenr = bytenr;
+	     i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
+		struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
+		u64 entry_start = wi_entry_bytenr(entry);
+		u64 entry_end = entry_start + entry_size;
+
+		/*
+		 *			|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *		|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *	|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * We need to insert one or more new entries for the range not
+		 * covered by the existing entry.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start,
+					    entry_size) < 0) {
+			u64 new_range_end;
+
+			new_range_end = min(entry_start, bytenr + len);
+			insert_new_entries(ctrl, i, cur_bytenr,
+					   new_range_end - cur_bytenr);
+
+			cur_bytenr = new_range_end;
+			continue;
+		}
+		/*
+		 * |<-- entry -->|
+		 *	|<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 * |<-------- entry ------->|
+		 *	|<- bytenr/len ->|
+		 *
+		 * In this case, we just set the bitmap in the current entry, and
+		 * advance @cur_bytenr to the end of the existing entry.
+		 * By this, we either go check the range against the next entry,
+		 * or we finish our current range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start,
+					    entry_size) == 0) {
+			u64 range_end = min(entry_end, bytenr + len);
+
+			set_bits_in_one_entry(ctrl, entry, cur_bytenr,
+					      range_end - cur_bytenr);
+			cur_bytenr = range_end;
+			continue;
+		}
+
+		/*
+		 * (A)
+		 * |<-- entry -->|			|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (B)
+		 * |<-- entry -->|		|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (C)
+		 * |<-- entry -->|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (D)
+		 * |<-- entry -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * For all above cases, we just skip to the next entry.
+		 *
+		 * For case (A) and (B), we will insert new entries before
+		 * the next one at the next loop.
+		 *
+		 * For case (C), we just do the regular set bits.
+		 * Thus case (A) ~ (C) are all handled properly.
+		 *
+		 * For case (D), we will handle it after the loop.
+		 */
+	}
+
+	/*
+	 * We still have some range not handled after the existing entries,
+	 * just insert new entries.
+	 */
+	if (cur_bytenr < bytenr + len)
+		insert_new_entries(ctrl, i, cur_bytenr,
+				   bytenr + len - cur_bytenr);
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 797e57aef0e1..707ccf73e13a 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -106,6 +106,15 @@ struct write_intent_entry {
 /* The number of bits we can have in one entry. */
 #define WRITE_INTENT_BITS_PER_ENTRY		(64)
 
+static inline u32 bytes_to_entries(u64 start, u32 length, u32 blocksize)
+{
+	u32 entry_len = blocksize * WRITE_INTENT_BITS_PER_ENTRY;
+	u64 entry_start = round_down(start, entry_len);
+	u64 entry_end = round_up(start + length, entry_len);
+
+	return DIV_ROUND_UP((u32)(entry_end - entry_start), entry_len);
+}
+
 /* In-memory write-intent control structure. */
 struct write_intent_ctrl {
 	/* For the write_intent super and entries. */
@@ -189,6 +198,13 @@ WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
 			  csum_type, 16);
 WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
 
+static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+
+	return wi_super_blocksize(wis) * WRITE_INTENT_BITS_PER_ENTRY;
+}
+
 static inline void wie_get_bitmap(struct write_intent_entry *entry,
 				  unsigned long *bitmap)
 {
@@ -214,6 +230,9 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 #endif
 }
 
+/* This function is only exported for selftests. */
+void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len);
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
 void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 07/12] btrfs: write-intent: introduce an internal helper to clear bits for a range.
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 08/12] btrfs: selftests: add selftests for write-intent bitmaps Qu Wenruo
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

This new helper, write_intent_clear_bits(), is much simpler than its
set bits counterpart.

If we can not find an entry for our target range, then something must
be wrong, and we only need to warn and skip to the next entry.

There is one extra thing to do though: if we have emptied an entry, we
have to delete that now-empty entry.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 172 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |   4 +-
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index eaf6d010462e..8a69bf39a994 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -283,6 +283,36 @@ static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
 	wie_set_bitmap(entry, bitmaps);
 }
 
+static bool is_entry_empty(struct write_intent_ctrl *ctrl,
+			   struct write_intent_entry *entry)
+{
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	return bitmap_empty(bitmaps, WRITE_INTENT_BITS_PER_ENTRY);
+}
+
+/*
+ * NOTE: This function may clear all bits of an entry, caller must check if
+ * that's the case, and delete the empty entry if needed.
+ */
+static void clear_bits_in_one_entry(struct write_intent_ctrl *ctrl,
+				    struct write_intent_entry *entry,
+				    u64 bytenr, u32 len)
+{
+	const u64 entry_start = wi_entry_bytenr(entry);
+	const u32 entry_len = write_intent_entry_size(ctrl);
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
+	bitmap_clear(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
+		   len / ctrl->blocksize);
+	wie_set_bitmap(entry, bitmaps);
+}
+
 /*
  * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
  * and fill the new entries with proper bytenr and bitmaps.
@@ -328,6 +358,25 @@ static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
 	}
 }
 
+static void delete_one_entry(struct write_intent_ctrl *ctrl, int nr)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	int cur_nr_entries = wi_super_nr_entries(wis);
+
+	ASSERT(is_entry_empty(ctrl, write_intent_entry_nr(ctrl, nr)));
+	ASSERT(nr < cur_nr_entries);
+
+	/* Move all the entries after slot @nr by one slot. */
+	memmove(write_intent_entry_nr(ctrl, nr),
+		write_intent_entry_nr(ctrl, nr + 1),
+		(cur_nr_entries - nr - 1) * sizeof(struct write_intent_entry));
+
+	/* Memzero the right most entry. */
+	memset(write_intent_entry_nr(ctrl, cur_nr_entries - 1), 0,
+	       sizeof(struct write_intent_entry));
+	wi_set_super_nr_entries(wis, cur_nr_entries - 1);
+}
+
 /*
  * This should be only called when we have enough room in the bitmaps, and hold
  * the wi_ctrl->lock.
@@ -462,6 +511,129 @@ void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len)
 				   bytenr + len - cur_bytenr);
 }
 
+/* This should be only called with wi_ctrl->lock hold, except for selftests. */
+void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
+			     u32 len)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	int i;
+	u64 cur_bytenr;
+
+	/*
+	 * Currently we only accept full stripe length, which should be
+	 * aligned to 64KiB.
+	 */
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	/*
+	 * Iterate through the existing entries to delete entries or clear
+	 * bits in the existing ones.
+	 */
+	for (i = 0, cur_bytenr = bytenr;
+	     i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
+		struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
+		u64 entry_start = wi_entry_bytenr(entry);
+		u64 entry_end = entry_start + entry_size;
+
+		/*
+		 *			|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *		|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *	|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * This case should not happen: we are asked to clear a
+		 * dirty range which was never logged in the bitmap.
+		 * Just warn and skip to the next covered range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start, entry_size) < 0) {
+			WARN_ON_ONCE(1);
+			cur_bytenr = min(bytenr + len, entry_start);
+			continue;
+		}
+
+		/*
+		 * |<-- entry -->|
+		 *	|<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 * |<-------- entry ------->|
+		 *	|<- bytenr/len ->|
+		 *
+		 * In this case, we just clear the bitmap in current entry, and
+		 * advance @cur_bytenr.
+		 * By this, we either go check the range against the next entry,
+		 * or we finish our current range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start, entry_size) == 0) {
+			u64 range_end = min(entry_end, bytenr + len);
+
+			clear_bits_in_one_entry(ctrl, entry, cur_bytenr,
+						range_end - cur_bytenr);
+			cur_bytenr = range_end;
+			/*
+			 * If the current entry is empty, we need to delete the
+			 * entry and decrease @nr.
+			 */
+			if (is_entry_empty(ctrl, entry)) {
+				delete_one_entry(ctrl, i);
+				i--;
+			}
+			continue;
+		}
+
+		/*
+		 * (A)
+		 * |<-- entry -->|			|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (B)
+		 * |<-- entry -->|		|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (C)
+		 * |<-- entry -->|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (D)
+		 * |<-- entry -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * For all above cases, we just skip to the next entry.
+		 *
+		 * For cases (A) and (B), we will trigger a warning as we
+		 * don't have the expected entries to clear.
+		 *
+		 * For case (C), we just do the regular clear bits.
+		 * Thus case (A) ~ (C) are all handled properly.
+		 *
+		 * For case (D), we will handle it after the loop.
+		 */
+	}
+	/*
+	 * There is still some range not handled, but there are no more
+	 * entries. This is another unexpected case.
+	 * Just warn about it.
+	 */
+	if (cur_bytenr < bytenr + len)
+		WARN_ON_ONCE(1);
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 707ccf73e13a..0da1b7421590 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -230,8 +230,10 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 #endif
 }
 
-/* This function is only exported for selftests. */
+/* These two functions are only exported for selftests. */
 void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr, u32 len);
+void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
+			     u32 len);
 
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
 void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 08/12] btrfs: selftests: add selftests for write-intent bitmaps
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 07/12] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 09/12] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

It turns out such a sparse bitmap still has a lot of ways to go wrong,
so it definitely needs some tests to cover all the different corner
cases.
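The trickiest corner case (mentioned in the cover letter changelog) is a
range that covers the last existing entry but still needs one more entry
appended. A minimal user-space model of such a sparse bitmap makes that
case easy to reason about; the types, sizes and helper names below are
illustrative only, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE      (64ULL * 1024)                /* one bit covers 64KiB */
#define BITS_PER_ENTRY 64ULL
#define ENTRY_SIZE     (BLOCKSIZE * BITS_PER_ENTRY)  /* 4MiB per entry */
#define MAX_ENTRIES    64

/* A sorted, sparse array of (bytenr, bitmap) pairs. */
struct entry {
	uint64_t bytenr;
	uint64_t bitmap;
};

static struct entry entries[MAX_ENTRIES];
static int nr_entries;

/* Return the index of the first entry at or after @start. */
static int find_slot(uint64_t start)
{
	int i;

	for (i = 0; i < nr_entries; i++)
		if (entries[i].bytenr >= start)
			break;
	return i;
}

/* Set all bits covering [@bytenr, @bytenr + @len), inserting entries. */
static void set_bits(uint64_t bytenr, uint64_t len)
{
	uint64_t cur;

	for (cur = bytenr; cur < bytenr + len; cur += BLOCKSIZE) {
		uint64_t start = cur / ENTRY_SIZE * ENTRY_SIZE;
		int i = find_slot(start);

		if (i == nr_entries || entries[i].bytenr != start) {
			/* Insert a new entry, keeping the array sorted. */
			memmove(&entries[i + 1], &entries[i],
				(nr_entries - i) * sizeof(entries[0]));
			entries[i].bytenr = start;
			entries[i].bitmap = 0;
			nr_entries++;
		}
		entries[i].bitmap |= 1ULL << ((cur - start) / BLOCKSIZE);
	}
}
```

A range such as [ENTRY_SIZE - BLOCKSIZE, ENTRY_SIZE + BLOCKSIZE) sets the
last bit of an existing entry and must still append a new entry for its
second block, which is exactly the kind of case the selftests cover.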

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile                           |   3 +-
 fs/btrfs/tests/btrfs-tests.c                |   4 +
 fs/btrfs/tests/btrfs-tests.h                |   2 +
 fs/btrfs/tests/write-intent-bitmaps-tests.c | 245 ++++++++++++++++++++
 fs/btrfs/write-intent.c                     |  10 -
 fs/btrfs/write-intent.h                     |  12 +
 6 files changed, 265 insertions(+), 11 deletions(-)
 create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index af93119d52e2..a9658782d363 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -42,4 +42,5 @@ btrfs-$(CONFIG_FS_VERITY) += verity.o
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
 	tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \
-	tests/free-space-tree-tests.o tests/extent-map-tests.o
+	tests/free-space-tree-tests.o tests/extent-map-tests.o \
+	tests/write-intent-bitmaps-tests.o
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index 1591bfa55bcc..27bb34acb156 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -27,6 +27,7 @@ const char *test_error[] = {
 	[TEST_ALLOC_INODE]	     = "cannot allocate inode",
 	[TEST_ALLOC_BLOCK_GROUP]     = "cannot allocate block group",
 	[TEST_ALLOC_EXTENT_MAP]      = "cannot allocate extent map",
+	[TEST_ALLOC_WRITE_INTENT_CTRL] = "cannot allocate write intent control",
 };
 
 static const struct super_operations btrfs_test_super_ops = {
@@ -279,6 +280,9 @@ int btrfs_run_sanity_tests(void)
 		}
 	}
 	ret = btrfs_test_extent_map();
+	if (ret)
+		goto out;
+	ret = btrfs_test_write_intent_bitmaps();
 
 out:
 	btrfs_destroy_test_fs();
diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
index 7a2d7ffbe30e..1845bfb6465d 100644
--- a/fs/btrfs/tests/btrfs-tests.h
+++ b/fs/btrfs/tests/btrfs-tests.h
@@ -23,6 +23,7 @@ enum {
 	TEST_ALLOC_INODE,
 	TEST_ALLOC_BLOCK_GROUP,
 	TEST_ALLOC_EXTENT_MAP,
+	TEST_ALLOC_WRITE_INTENT_CTRL,
 };
 
 extern const char *test_error[];
@@ -37,6 +38,7 @@ int btrfs_test_inodes(u32 sectorsize, u32 nodesize);
 int btrfs_test_qgroups(u32 sectorsize, u32 nodesize);
 int btrfs_test_free_space_tree(u32 sectorsize, u32 nodesize);
 int btrfs_test_extent_map(void);
+int btrfs_test_write_intent_bitmaps(void);
 struct inode *btrfs_new_test_inode(void);
 struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize);
 void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/tests/write-intent-bitmaps-tests.c b/fs/btrfs/tests/write-intent-bitmaps-tests.c
new file mode 100644
index 000000000000..c956acaea1c7
--- /dev/null
+++ b/fs/btrfs/tests/write-intent-bitmaps-tests.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../ctree.h"
+#include "../volumes.h"
+#include "../write-intent.h"
+#include "btrfs-tests.h"
+
+static struct write_intent_ctrl *alloc_dummy_ctrl(void)
+{
+	struct write_intent_ctrl *ctrl;
+	struct write_intent_super *wis;
+
+	ctrl = kzalloc(sizeof(*ctrl), GFP_NOFS);
+	if (!ctrl)
+		return NULL;
+	/*
+	 * For dummy tests, we only need the primary page, no need for the
+	 * commit page.
+	 */
+	ctrl->page = alloc_page(GFP_NOFS);
+	if (!ctrl->page) {
+		kfree(ctrl);
+		return NULL;
+	}
+	ctrl->blocksize = BTRFS_STRIPE_LEN;
+	atomic64_set(&ctrl->event, 1);
+	spin_lock_init(&ctrl->lock);
+	memzero_page(ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
+	wis = page_address(ctrl->page);
+	wi_set_super_magic(wis, WRITE_INTENT_SUPER_MAGIC);
+	wi_set_super_csum_type(wis, 0);
+	wi_set_super_events(wis, 1);
+	wi_set_super_flags(wis, WRITE_INTENT_FLAGS_SUPPORTED);
+	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
+	wi_set_super_blocksize(wis, ctrl->blocksize);
+	wi_set_super_nr_entries(wis, 0);
+	return ctrl;
+}
+
+static void zero_bitmaps(struct write_intent_ctrl *ctrl)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+
+	memzero_page(ctrl->page, sizeof(struct write_intent_super),
+			WRITE_INTENT_BITMAPS_SIZE -
+			sizeof(struct write_intent_super));
+	wi_set_super_nr_entries(wis, 0);
+}
+
+static int compare_bitmaps(struct write_intent_ctrl *ctrl,
+			   int nr_entries, u64 *bytenrs, u64 *bitmaps)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	struct write_intent_entry empty = {0};
+	int i;
+
+	if (wi_super_nr_entries(wis) != nr_entries) {
+		test_err("nr entries mismatch, has %llu expect %u",
+			wi_super_nr_entries(wis), nr_entries);
+		goto err;
+	}
+
+	for (i = 0; i < nr_entries; i++) {
+		struct write_intent_entry *entry =
+			write_intent_entry_nr(ctrl, i);
+
+		if (wi_entry_bytenr(entry) != bytenrs[i]) {
+			test_err("bytenr mismatch, has %llu expect %llu",
+				  wi_entry_bytenr(entry), bytenrs[i]);
+			goto err;
+		}
+		if (wi_entry_raw_bitmap(entry) != bitmaps[i]) {
+			test_err("bitmap mismatch, has 0x%016llx expect 0x%016llx",
+				  wi_entry_raw_bitmap(entry), bitmaps[i]);
+			goto err;
+		}
+	}
+
+	/* The unused entries should all be zero. */
+	for (i = nr_entries; i < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES;
+	     i++) {
+		if (memcmp(write_intent_entry_nr(ctrl, i), &empty,
+			   sizeof(empty))) {
+			test_err(
+			"unused entry is not empty, entry %u nr_entries %u",
+				 i, nr_entries);
+			goto err;
+		}
+	}
+	return 0;
+err:
+	/* Dump the bitmaps for better debugging. */
+	test_err("dumping bitmaps, nr_entries=%llu:", wi_super_nr_entries(wis));
+	for (i = 0; i < wi_super_nr_entries(wis); i++) {
+		struct write_intent_entry *entry =
+			write_intent_entry_nr(ctrl, i);
+
+		test_err("  entry=%u bytenr=%llu bitmap=0x%016llx",
+			 i, wi_entry_bytenr(entry), wi_entry_raw_bitmap(entry));
+	}
+	return -EUCLEAN;
+}
+
+static void free_dummy_ctrl(struct write_intent_ctrl *ctrl)
+{
+	__free_page(ctrl->page);
+	ASSERT(ctrl->commit_page == NULL);
+	kfree(ctrl);
+}
+
+/*
+ * Basic tests to ensure set and clear can properly handle bits set/clear in
+ * one entry.
+ */
+static int test_case_simple_entry(struct write_intent_ctrl *ctrl)
+{
+	const u32 blocksize = BTRFS_STRIPE_LEN;
+	u64 bitmaps[1] = { 0 };
+	u64 bytenrs[1] = { 0 };
+	int ret;
+
+	zero_bitmaps(ctrl);
+
+	write_intent_set_bits(ctrl, 0, blocksize * 3);
+
+	bitmaps[0] = 0x7;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, 0, blocksize * 3);
+	ret = compare_bitmaps(ctrl, 0, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_set_bits(ctrl, blocksize * 8, blocksize * 3);
+
+	bitmaps[0] = 0x700;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, blocksize * 9, blocksize * 2);
+	bitmaps[0] = 0x100;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, blocksize * 8, blocksize * 1);
+	ret = compare_bitmaps(ctrl, 0, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	/* Tests at high bits. */
+	write_intent_set_bits(ctrl, blocksize * 61, blocksize * 3);
+	bitmaps[0] = 0xe000000000000000L;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+	write_intent_clear_bits(ctrl, blocksize * 61, blocksize * 1);
+	bitmaps[0] = 0xc000000000000000L;
+	bytenrs[0] = 0;
+	ret = compare_bitmaps(ctrl, 1, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+	write_intent_clear_bits(ctrl, blocksize * 62, blocksize * 2);
+	ret = compare_bitmaps(ctrl, 0, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/* Tests set/clear that cross entry boundaries. */
+static int test_case_cross_entries(struct write_intent_ctrl *ctrl)
+{
+	const u32 blocksize = BTRFS_STRIPE_LEN;
+	u64 bitmaps[3] = { 0 };
+	u64 bytenrs[3] = { 0 };
+	int ret;
+
+	zero_bitmaps(ctrl);
+
+	write_intent_set_bits(ctrl, blocksize * 32, blocksize * 64);
+	bitmaps[0] = 0xffffffff00000000L;
+	bytenrs[0] = 0;
+	bitmaps[1] = 0x00000000ffffffffL;
+	bytenrs[1] = 4194304;
+	ret = compare_bitmaps(ctrl, 2, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_set_bits(ctrl, blocksize * 96, blocksize * 64);
+	bitmaps[0] = 0xffffffff00000000L;
+	bytenrs[0] = 0;
+	bitmaps[1] = 0xffffffffffffffffL;
+	bytenrs[1] = 4194304;
+	bitmaps[2] = 0x00000000ffffffffL;
+	bytenrs[2] = 8388608;
+	ret = compare_bitmaps(ctrl, 3, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	write_intent_clear_bits(ctrl, blocksize * 33, blocksize * 126);
+	bitmaps[0] = 0x0000000100000000L;
+	bytenrs[0] = 0;
+	bitmaps[1] = 0x0000000080000000L;
+	bytenrs[1] = 8388608;
+	ret = compare_bitmaps(ctrl, 2, bytenrs, bitmaps);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
+int btrfs_test_write_intent_bitmaps(void)
+{
+	struct write_intent_ctrl *ctrl;
+	int ret;
+
+	ctrl = alloc_dummy_ctrl();
+	if (!ctrl) {
+		test_std_err(TEST_ALLOC_WRITE_INTENT_CTRL);
+		return -ENOMEM;
+	}
+	test_msg("running write-intent bitmaps tests");
+
+	ret = test_case_simple_entry(ctrl);
+	if (ret < 0) {
+		test_err("failed set/clear tests in one simple entry");
+		goto out;
+	}
+
+	ret = test_case_cross_entries(ctrl);
+	if (ret < 0) {
+		test_err("failed set/clear tests across entry boundaries");
+		goto out;
+	}
+out:
+	free_dummy_ctrl(ctrl);
+	return ret;
+}
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 8a69bf39a994..b4a205cb0c88 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -216,16 +216,6 @@ static int write_intent_init(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
-static struct write_intent_entry *write_intent_entry_nr(
-				struct write_intent_ctrl *ctrl, int nr)
-{
-
-	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
-	return (page_address(ctrl->page) +
-		sizeof(struct write_intent_super) +
-		nr * sizeof(struct write_intent_entry));
-}
-
 /*
  * Return <0 if the bytenr is before the given entry.
  * Return 0 if the bytenr is inside the given entry.
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 0da1b7421590..00eb116e4c38 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -197,6 +197,8 @@ WRITE_INTENT_SETGET_FUNCS(super_blocksize, struct write_intent_super,
 WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
 			  csum_type, 16);
 WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
+WRITE_INTENT_SETGET_FUNCS(entry_raw_bitmap, struct write_intent_entry,
+			  bitmap, 64);
 
 static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
 {
@@ -205,6 +207,16 @@ static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
 	return wi_super_blocksize(wis) * WRITE_INTENT_BITS_PER_ENTRY;
 }
 
+static inline struct write_intent_entry *write_intent_entry_nr(
+				struct write_intent_ctrl *ctrl, int nr)
+{
+
+	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+	return (page_address(ctrl->page) +
+		sizeof(struct write_intent_super) +
+		nr * sizeof(struct write_intent_entry));
+}
+
 static inline void wie_get_bitmap(struct write_intent_entry *entry,
 				  unsigned long *bitmap)
 {
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 09/12] btrfs: write back write intent bitmap after barrier_all_devices()
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (7 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 08/12] btrfs: selftests: add selftests for write-intent bitmaps Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 10/12] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

The new function, btrfs_write_intent_writeback(), also accepts a u64
parameter, @event, to do an extra fast-path check and avoid unnecessary
writeback.

For now we just pass 0 to write the current bitmap to disk.
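The fast-path rule can be modelled in a few lines of user-space C. This
is only a single-threaded sketch of the idea (no locking and no
@writing_event wait), with illustrative names, not the kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Each bitmap commit bumps @event; a caller that dirtied the bitmap at
 * event E only needs a writeback if no commit with an event greater
 * than E has happened since. */
static uint64_t event = 1;       /* current in-memory event */
static uint64_t committed_event; /* last event written to disk */
static int disk_writes;          /* count of actual writebacks */

/* Dirty the bitmap; the caller remembers the returned event. */
static uint64_t mark_dirty(void)
{
	return event;
}

/* @want == 0 means "write the current bitmap unconditionally". */
static void writeback(uint64_t want)
{
	if (want && committed_event > want)
		return;          /* fast path: already on disk */
	event++;                 /* bump the event for this commit */
	committed_event = event;
	disk_writes++;
}
```

writeback(0) always writes; writeback(e) becomes a no-op once some later
commit has already pushed the bitmap containing our update to disk.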

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c                          |  9 +++
 fs/btrfs/tests/write-intent-bitmaps-tests.c |  2 +
 fs/btrfs/write-intent.c                     | 79 ++++++++++++++++++++-
 fs/btrfs/write-intent.h                     | 25 +++++++
 4 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index edbb21706bda..bd30decd38e4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4357,6 +4357,15 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 		}
 	}
 
+	ret = btrfs_write_intent_writeback(fs_info, 0);
+	if (ret < 0) {
+		mutex_unlock(
+			&fs_info->fs_devices->device_list_mutex);
+		btrfs_handle_fs_error(fs_info, ret,
+			"errors while writing back write intent bitmaps.");
+		return ret;
+	}
+
 	list_for_each_entry(dev, head, dev_list) {
 		if (!dev->bdev) {
 			total_errors++;
diff --git a/fs/btrfs/tests/write-intent-bitmaps-tests.c b/fs/btrfs/tests/write-intent-bitmaps-tests.c
index c956acaea1c7..1b81720f21d1 100644
--- a/fs/btrfs/tests/write-intent-bitmaps-tests.c
+++ b/fs/btrfs/tests/write-intent-bitmaps-tests.c
@@ -25,6 +25,8 @@ static struct write_intent_ctrl *alloc_dummy_ctrl(void)
 	ctrl->blocksize = BTRFS_STRIPE_LEN;
 	atomic64_set(&ctrl->event, 1);
 	spin_lock_init(&ctrl->lock);
+	init_waitqueue_head(&ctrl->overflow_wait);
+	init_waitqueue_head(&ctrl->write_wait);
 	memzero_page(ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
 	wis = page_address(ctrl->page);
 	wi_set_super_magic(wis, WRITE_INTENT_SUPER_MAGIC);
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index b4a205cb0c88..5e26813a06e0 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -76,8 +76,17 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	shash->tfm = fs_info->csum_shash;
 
 	spin_lock(&ctrl->lock);
-	wis = page_address(ctrl->page);
 
+	/* No update on the bitmap, just skip this writeback. */
+	if (!memcmp(page_address(ctrl->page), page_address(ctrl->commit_page),
+		    WRITE_INTENT_BITMAPS_SIZE)) {
+		ctrl->writing_event = 0;
+		spin_unlock(&ctrl->lock);
+		wake_up(&ctrl->write_wait);
+		return 0;
+	}
+
+	wis = page_address(ctrl->page);
 	/*
 	 * Bump up the event counter each time this bitmap is going to be
 	 * written.
@@ -127,6 +136,13 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	}
 	wait_event(wb_ctrl.wait, atomic_read(&wb_ctrl.pending_bios) == 0);
 
+	spin_lock(&ctrl->lock);
+	if (ctrl->writing_event > ctrl->committed_event)
+		ctrl->committed_event = ctrl->writing_event;
+	ctrl->writing_event = 0;
+	spin_unlock(&ctrl->lock);
+	wake_up(&ctrl->write_wait);
+
 	if (total_errors + atomic_read(&wb_ctrl.errors) >
 	    btrfs_super_num_devices(fs_info->super_copy) - 1) {
 		btrfs_err(fs_info, "failed to writeback write-intent bitmaps");
@@ -624,6 +640,63 @@ void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 		WARN_ON_ONCE(1);
 }
 
+int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return 0;
+
+	ASSERT(ctrl);
+
+again:
+	spin_lock(&ctrl->lock);
+
+	/*
+	 * The bitmap has already been written to disk at least once. Our update
+	 * has already reached disk.
+	 */
+	if (event && ctrl->committed_event > event) {
+		spin_unlock(&ctrl->lock);
+		return 0;
+	}
+
+	/* Previous writing hasn't finished, wait for it and retry. */
+	if (ctrl->writing_event && ctrl->writing_event <= event) {
+		DEFINE_WAIT(__wait);
+
+		prepare_to_wait_event(&ctrl->write_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->write_wait, &__wait);
+		goto again;
+	}
+
+	/* Someone is already writing back the bitmap. */
+	if (ctrl->writing_event) {
+		DEFINE_WAIT(__wait);
+
+		ASSERT(ctrl->writing_event > event);
+		prepare_to_wait_event(&ctrl->write_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->write_wait, &__wait);
+		return 0;
+	}
+
+	/*
+	 * We're the one to write back the bitmap, update @writing_event so
+	 * all the other callers will just wait for us.
+	 */
+	ctrl->writing_event = atomic64_read(&ctrl->event) + 1;
+	spin_unlock(&ctrl->lock);
+
+	/* Slow path, do the submission and wait. */
+	return write_intent_writeback(fs_info);
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
@@ -648,6 +721,8 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 	}
 
 	spin_lock_init(&fs_info->wi_ctrl->lock);
+	init_waitqueue_head(&fs_info->wi_ctrl->overflow_wait);
+	init_waitqueue_head(&fs_info->wi_ctrl->write_wait);
 
 	/*
 	 * Go through every writeable device to find the highest event.
@@ -688,6 +763,8 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 				  dev->devid, ret);
 			goto cleanup;
 		}
+		memcpy_page(fs_info->wi_ctrl->commit_page, 0,
+			    fs_info->wi_ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
 		wis = page_address(fs_info->wi_ctrl->page);
 		atomic64_set(&fs_info->wi_ctrl->event, wi_super_events(wis));
 		fs_info->wi_ctrl->blocksize = wi_super_blocksize(wis);
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 00eb116e4c38..bf84737e0706 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -123,12 +123,30 @@ struct write_intent_ctrl {
 	/* A copy for writeback. */
 	struct page *commit_page;
 
+	/*
+	 * For callers who have updated the bitmap, to wait for the bitmap to
+	 * be flushed to disk.
+	 */
+	wait_queue_head_t write_wait;
+
+	/*
+	 * For callers whose bits cannot be updated because there is not
+	 * enough space left in the bitmap.
+	 */
+	wait_queue_head_t overflow_wait;
+
 	/* Cached event counter.*/
 	atomic64_t event;
 
 	/* Lock for reading/writing above @page. */
 	spinlock_t lock;
 
+	/* Event number for the bitmap being written. */
+	u64 writing_event;
+
+	/* Event number for the last committed bitmap. */
+	u64 committed_event;
+
 	/* Cached blocksize from write intent super. */
 	u32 blocksize;
 };
@@ -250,4 +268,11 @@ void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
 void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
 
+/*
+ * Ensure the bitmap for @event is already written to disk.
+ *
+ * If @event is 0, write the current bitmap to disk unconditionally.
+ */
+int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 10/12] btrfs: update and writeback the write-intent bitmap for RAID56 write.
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (8 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 09/12] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 11/12] btrfs: raid56: clear write-intent bitmaps when a full stripe finishes Qu Wenruo
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

This allows us to mark the write-intent bitmaps dirty for later recovery
usage.

For now, we only mark the bitmaps dirty without really clearing them.
This will cause problems (hanging the fs once the bitmap is full), but
it already allows us to debug the bitmap with the new dump-write-intent
tool:

  csum_type		0 (crc32c)
  csum			0xad622029 [match]
  magic			_wIbSb_Q [match]
  fsid			46bcd711-6c9b-400f-aaba-bf99aa3dd321
  flags			0x7
  			( TARGET_RAID56 |
  			  INTERNAL |
  			  BYTENR_LOGICAL )
  events			10
  size			4096
  blocksize		65536
  nr_entries		1
  entry 0, bytenr 385875968, bitmap 0x000000000003fffc

This is from a 1MiB write at logical 388104192, which matches the bitmap
in the output above.
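The mapping from a logical write to an (entry bytenr, bitmap) pair can
be sketched with two small helpers. This assumes, for simplicity, that
every entry starts at a multiple of the 4MiB entry size (64 bits times
the 64KiB blocksize); the real on-disk placement may differ, so treat
this purely as an illustration:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKSIZE      (64ULL * 1024)                /* 64KiB, as in the dump */
#define BITS_PER_ENTRY 64ULL
#define ENTRY_SIZE     (BLOCKSIZE * BITS_PER_ENTRY)  /* 4MiB per entry */

/* Start bytenr of the (assumed entry-size aligned) entry covering @logical. */
static uint64_t entry_bytenr(uint64_t logical)
{
	return logical / ENTRY_SIZE * ENTRY_SIZE;
}

/* Bitmap an aligned write [@logical, @logical + @len) dirties in its entry. */
static uint64_t dirty_bits(uint64_t logical, uint64_t len)
{
	uint64_t first = (logical - entry_bytenr(logical)) / BLOCKSIZE;
	uint64_t nbits = len / BLOCKSIZE;

	/* Build a run of @nbits bits starting at bit @first. */
	return (nbits >= 64 ? ~0ULL : (1ULL << nbits) - 1) << first;
}
```

A 1MiB write (16 blocks) starting two blocks into an entry produces the
16-bit run 0x3fffc, the same bitmap value shown in the dump above.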

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c       |  8 +++++++
 fs/btrfs/write-intent.c | 46 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  9 ++++++++
 3 files changed, 63 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7b2d2b6c8c61..9f18d6f6f4dc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1185,6 +1185,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	bool has_qstripe;
 	struct bio_list bio_list;
 	struct bio *bio;
+	u64 event;
 	int ret;
 
 	bio_list_init(&bio_list);
@@ -1338,6 +1339,13 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
 	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
+	/* Update the write intent bitmap before we start submitting bios. */
+	btrfs_write_intent_mark_dirty(bioc->fs_info, rbio->bioc->raid_map[0],
+				      rbio->nr_data * BTRFS_STRIPE_LEN, &event);
+	ret = btrfs_write_intent_writeback(bioc->fs_info, event);
+	if (ret < 0)
+		goto cleanup;
+
 	while ((bio = bio_list_pop(&bio_list))) {
 		bio->bi_end_io = raid_write_end_io;
 
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 5e26813a06e0..8520105d0d84 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -381,6 +381,9 @@ static void delete_one_entry(struct write_intent_ctrl *ctrl, int nr)
 	memset(write_intent_entry_nr(ctrl, cur_nr_entries - 1), 0,
 	       sizeof(struct write_intent_entry));
 	wi_set_super_nr_entries(wis, cur_nr_entries - 1);
+
+	/* We freed one entry, wake up those waiting for extra space. */
+	wake_up(&ctrl->overflow_wait);
 }
 
 /*
@@ -640,6 +643,49 @@ void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 		WARN_ON_ONCE(1);
 }
 
+void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 len, u64 *event_ret)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	u32 entry_len;
+	int nr_entries;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return;
+
+	ASSERT(ctrl);
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+again:
+	spin_lock(&ctrl->lock);
+	entry_len = ctrl->blocksize * WRITE_INTENT_BITS_PER_ENTRY;
+	nr_entries = (round_up(logical + len, entry_len) -
+		      round_down(logical, entry_len)) / entry_len;
+	wis = page_address(ctrl->page);
+
+	/*
+	 * We may not have enough space left. This calculation definitely
+	 * overestimates, but it ensures we have enough space.
+	 */
+	if (unlikely(wi_super_nr_entries(wis) + nr_entries >=
+		     WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES)) {
+		DEFINE_WAIT(__wait);
+
+		prepare_to_wait_event(&ctrl->overflow_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->overflow_wait, &__wait);
+		goto again;
+	}
+
+	/* Update the bitmap. */
+	write_intent_set_bits(ctrl, logical, len);
+	*event_ret = atomic64_read(&ctrl->event);
+	spin_unlock(&ctrl->lock);
+}
+
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index bf84737e0706..e1e3a16f8929 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -275,4 +275,13 @@ void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
  */
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
 
+/*
+ * Mark the range dirty in write intent bitmaps.
+ *
+ * May sleep (although unlikely) if there are not enough free entries.
+ * In that case, we will wait for enough entries to be released.
+ */
+void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 len, u64 *event_ret);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 11/12] btrfs: raid56: clear write-intent bitmaps when a full stripe finishes.
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (9 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 10/12] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:32 ` [PATCH 12/12] btrfs: warn and clear bitmaps if there is a dirty bitmap at mount time Qu Wenruo
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c       |  7 +++++++
 fs/btrfs/write-intent.c | 45 +++++++++++++++++++++++++++++------------
 fs/btrfs/write-intent.h |  8 ++++++++
 3 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 9f18d6f6f4dc..f2b0ea2315c5 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -817,6 +817,13 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 
 	if (rbio->generic_bio_cnt)
 		btrfs_bio_counter_sub(rbio->bioc->fs_info, rbio->generic_bio_cnt);
+
+	/* Clear the write-intent bitmap range for write operation. */
+	if (rbio->operation == BTRFS_RBIO_WRITE)
+		btrfs_write_intent_clear_dirty(rbio->bioc->fs_info,
+				       rbio->bioc->raid_map[0],
+				       rbio->nr_data * BTRFS_STRIPE_LEN);
+
 	/*
 	 * Clear the data bitmap, as the rbio may be cached for later usage.
 	 * do this before before unlock_stripe() so there will be no new bio
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 8520105d0d84..40d579574f3d 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -61,6 +61,7 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	struct bitmap_writeback_contrl wb_ctrl = {0};
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	const int nr_devs_max = fs_info->fs_devices->open_devices + 4;
+	unsigned long flags;
 	int nr_devs = 0;
 	int total_errors = 0;
 	int ret;
@@ -75,13 +76,13 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 
 	shash->tfm = fs_info->csum_shash;
 
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 
 	/* No update on the bitmap, just skip this writeback. */
 	if (!memcmp(page_address(ctrl->page), page_address(ctrl->commit_page),
 		    WRITE_INTENT_BITMAPS_SIZE)) {
 		ctrl->writing_event = 0;
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		wake_up(&ctrl->write_wait);
 		return 0;
 	}
@@ -98,7 +99,7 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	atomic64_inc(&ctrl->event);
 	memcpy_page(ctrl->commit_page, 0, ctrl->page, 0,
 		    WRITE_INTENT_BITMAPS_SIZE);
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	init_waitqueue_head(&wb_ctrl.wait);
 	atomic_set(&wb_ctrl.pending_bios, 0);
@@ -136,11 +137,11 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	}
 	wait_event(wb_ctrl.wait, atomic_read(&wb_ctrl.pending_bios) == 0);
 
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 	if (ctrl->writing_event > ctrl->committed_event)
 		ctrl->committed_event = ctrl->writing_event;
 	ctrl->writing_event = 0;
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 	wake_up(&ctrl->write_wait);
 
 	if (total_errors + atomic_read(&wb_ctrl.errors) >
@@ -648,6 +649,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
 	struct write_intent_super *wis;
+	unsigned long flags;
 	u32 entry_len;
 	int nr_entries;
 
@@ -658,7 +660,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
 
 again:
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 	entry_len = ctrl->blocksize * WRITE_INTENT_BITS_PER_ENTRY;
 	nr_entries = (round_up(logical + len, entry_len) -
 		      round_down(logical, entry_len)) / entry_len;
@@ -674,7 +676,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 
 		prepare_to_wait_event(&ctrl->overflow_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->overflow_wait, &__wait);
 		goto again;
@@ -683,12 +685,29 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	/* Update the bitmap. */
 	write_intent_set_bits(ctrl, logical, len);
 	*event_ret = atomic64_read(&ctrl->event);
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+
+void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				    u32 len)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	unsigned long flags;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return;
+	ASSERT(ctrl);
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	write_intent_clear_bits(ctrl, logical, len);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	unsigned long flags;
 
 	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
 		return 0;
@@ -696,14 +715,14 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 	ASSERT(ctrl);
 
 again:
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 
 	/*
 	 * The bitmap has already been committed with a newer event, thus our
 	 * update has already reached disk.
 	 */
 	if (event && ctrl->committed_event > event) {
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		return 0;
 	}
 
@@ -713,7 +732,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 
 		prepare_to_wait_event(&ctrl->write_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->write_wait, &__wait);
 		goto again;
@@ -726,7 +745,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 		ASSERT(ctrl->writing_event > event);
 		prepare_to_wait_event(&ctrl->write_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->write_wait, &__wait);
 		return 0;
@@ -737,7 +756,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 	 * all the other caller will just wait for us.
 	 */
 	ctrl->writing_event = atomic64_read(&ctrl->event) + 1;
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	/* Slow path, do the submission and wait. */
 	return write_intent_writeback(fs_info);
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index e1e3a16f8929..872a707ef67d 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -284,4 +284,12 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
 void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 				   u32 len, u64 *event_ret);
 
+/*
+ * Clear the range dirty in write intent bitmaps.
+ *
+ * This function should not sleep, and no need to wait for the bitmap to be
+ * flushed.
+ */
+void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				    u32 len);
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 12/12] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (10 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 11/12] btrfs: raid56: clear write-intent bimaps when a full stripe finishes Qu Wenruo
@ 2022-07-07  5:32 ` Qu Wenruo
  2022-07-07  5:36 ` [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Christoph Hellwig
  2022-07-13 16:18 ` Lukas Straub
  13 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:32 UTC (permalink / raw)
  To: linux-btrfs

To properly clean up the bitmaps, we need to scrub the logical ranges in
the bitmaps.

Unfortunately there is no such convenient interface at all (scrub only
works at device offsets for now).

So just introduce a placeholder to warn and clear the bitmap.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c      |  7 +++++++
 fs/btrfs/write-intent.c | 35 +++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  8 ++++++++
 3 files changed, 50 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bd30decd38e4..ccc136023303 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3705,6 +3705,13 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			  ret);
 		goto fail_block_groups;
 	}
+
+	ret = btrfs_write_intent_recover(fs_info);
+	if (ret < 0) {
+		btrfs_err(fs_info, "failed to recover write-intent bitmap: %d",
+			  ret);
+		goto fail_block_groups;
+	}
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to recover balance: %d", ret);
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 40d579574f3d..82228713e621 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -704,6 +704,41 @@ void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
+int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	int ret = 0;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return ret;
+
+	ASSERT(ctrl);
+	wis = page_address(ctrl->page);
+
+	if (wi_super_nr_entries(wis) != 0) {
+		int i;
+
+		btrfs_warn(fs_info, "dirty write intent bitmap found:");
+		for (i = 0; i < wi_super_nr_entries(wis); i++) {
+			struct write_intent_entry *entry =
+				write_intent_entry_nr(ctrl, i);
+
+			btrfs_warn(fs_info,
+				"  entry=%u bytenr=%llu bitmap=0x%016llx\n", i,
+				   wi_entry_bytenr(entry),
+				   wi_entry_raw_bitmap(entry));
+		}
+		/* For now, we just clear the whole bitmap. */
+		memzero_page(ctrl->page, sizeof(struct write_intent_super),
+			     WRITE_INTENT_BITMAPS_SIZE -
+			     sizeof(struct write_intent_super));
+		wi_set_super_nr_entries(wis, 0);
+		ret = write_intent_writeback(fs_info);
+	}
+	return ret;
+}
+
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 872a707ef67d..bb6e9b599373 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -292,4 +292,12 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
  */
 void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 				    u32 len);
+
+/*
+ * Rebuild the range in the write-intent bitmaps.
+ *
+ * Currently not working, it will just output a warning and clear the bitmap.
+ */
+int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (11 preceding siblings ...)
  2022-07-07  5:32 ` [PATCH 12/12] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
@ 2022-07-07  5:36 ` Christoph Hellwig
  2022-07-07  5:48   ` Qu Wenruo
  2022-07-13 16:18 ` Lukas Straub
  13 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-07-07  5:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Just a high level comment from someone who has no direct stake in this:

The pain point of classic parity RAID is the read-modify-write cycles,
and this series is a bandaid to work around the write holes caused by
them, but does not help with the performance problems associated with
them.

But btrfs is a file system fundamentally designed to write out of place
and thus can get around these problems by writing data in a way that
fundamentally never does a read-modify-write for a set of parity RAID
stripes.

Wouldn't it be a better idea to use the raid stripe tree that Johannes
suggests and apply that to parity writes instead of beating the dead
horse of classic RAID?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  5:36 ` [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Christoph Hellwig
@ 2022-07-07  5:48   ` Qu Wenruo
  2022-07-07  9:37     ` Johannes Thumshirn
  2022-07-07 13:36     ` Christoph Hellwig
  0 siblings, 2 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:48 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/7/7 13:36, Christoph Hellwig wrote:
> Just a high level comment from someone who has no direct stake in this:
>
> The pain point of classic parity RAID is the read-modify-write cycles,
> and this series is a bandaid to work around the write holes caused by
> them, but does not help with the performance problems associated with
> them.
>
> But btrfs is a file system fundamentally designed to write out of place
> and thus can get around these problems by writing data in a way that
> fundamentally never does a read-modify-write for a set of parity RAID
> stripes.
>
> Wouldn't it be a better idea to use the raid stripe tree that Johannes
> suggests and apply that to parity writes instead of beating the dead
> horse of classic RAID?

This has been discussed before; in short, there are some unknown aspects
of the RST tree from Johannes:

- It may not support RAID56 for metadata forever.
   This may not be a big deal though.

- It is not yet determined how RST would handle non-COW RAID56 writes.
   Just CoW the partial data write and its parity to another location?
   How to reclaim the unreferenced part?

   Currently RST still mainly addresses RAID0/1 support for zoned
   devices; RAID56 support is a good thing to come, but is not yet the
   focus.

- ENOSPC
   If we go the COW way, the ENOSPC situation will be more pressing.

   Now all partial writes must be COWed.
   This is going to need way more work; I'm not sure the existing
   data space handling code is able to handle it at all.

In fact, I think RAID56 in a modern COW environment is worth a full
talk at some conference.
If I had the chance, I would really like to co-host a talk with Johannes
on this (but the stupid zero-covid policy is definitely not helping at
all here).

Thus I went with the tried and true solution: a write-intent bitmap, and
later a full journal. It provides a way to close the write hole before we
have a better solution, instead of marking RAID56 unsafe and driving away
new users.
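
[Editor's note] The mark-dirty / writeback / clear sequence discussed in
this thread can be modeled with a tiny stand-alone sketch. This is plain
userspace C, not the kernel code; `STRIPE_LEN` mirrors BTRFS_STRIPE_LEN,
and the single u64 entry is a deliberate simplification of the patchset's
sparse entry array:

```c
#include <assert.h>
#include <stdint.h>

#define STRIPE_LEN (64 * 1024)	/* mirrors BTRFS_STRIPE_LEN */

/* One u64 entry covering 64 * 64KiB of logical space; the real bitmap is
 * a sparse array of such entries keyed by bytenr. */
static uint64_t bitmap;

static void mark_dirty(uint64_t logical, uint32_t len)
{
	/* Set one bit per full stripe BEFORE the RMW is submitted. */
	for (uint64_t off = logical; off < logical + len; off += STRIPE_LEN)
		bitmap |= 1ULL << (off / STRIPE_LEN);
}

static void clear_dirty(uint64_t logical, uint32_t len)
{
	/* Clear the bits only AFTER the full stripe write has completed. */
	for (uint64_t off = logical; off < logical + len; off += STRIPE_LEN)
		bitmap &= ~(1ULL << (off / STRIPE_LEN));
}

static int is_dirty(uint64_t logical)
{
	return !!(bitmap & (1ULL << (logical / STRIPE_LEN)));
}
```

On recovery, any bit still set marks a full stripe whose parity may be
stale and should be resynced; that is exactly the set of ranges patch 12
currently only warns about and clears.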

Thanks,
Qu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  5:48   ` Qu Wenruo
@ 2022-07-07  9:37     ` Johannes Thumshirn
  2022-07-07  9:45       ` Qu Wenruo
  2022-07-07 13:36     ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2022-07-07  9:37 UTC (permalink / raw)
  To: Qu Wenruo, hch, Qu Wenruo; +Cc: linux-btrfs

On 07.07.22 07:49, Qu Wenruo wrote:
> 
> 
> On 2022/7/7 13:36, Christoph Hellwig wrote:
>> Just a high level comment from someone who has no direct stake in this:
>>
>> The pain point of classic parity RAID is the read-modify-write cycles,
>> and this series is a bandaid to work around the write holes caused by
>> them, but does not help with the performance problems associated with
>> them.
>>
>> But btrfs is a file system fundamentally designed to write out of place
>> and thus can get around these problems by writing data in a way that
>> fundamentally never does a read-modify-write for a set of parity RAID
>> stripes.
>>
>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>> suggests and apply that to parity writes instead of beating the dead
>> horse of classic RAID?
> 
> This has been discussed before, in short there are some unknown aspects
> of RST tree from Johannes:
> 
> - It may not support RAID56 for metadata forever.
>    This may not be a big deal though.

This might be a problem indeed, but a) we have the RAID1C[34] profiles and
b) we can place the RST for meta-data into the SYSTEM block-group.
 
> - Not yet determined how RST to handle non-COW RAID56 writes
>    Just CoW the partial data write and its parity to other location?
>    How to reclaim the unreferred part?
> 
>    Currently RST is still mainly to address RAID0/1 support for zoned
>    device, RAID56 support is a good thing to come, but not yet focused on
>    RAID56.

Well, RAID1 was the low-hanging fruit and I had to start somewhere. Now that
the overall concept and RAID1 look promising, I've started to look into RAID0.

RAID0 introduces striping for the 1st time in the context of RST, so there might
be dragons, but nothing unsolvable.

Once this is done, RAID10 should just fall into place (famous last words?) and,
with that completed, most of the things we need for RAID56 are there as well, from
the RST and volumes.c side of things. What's left is getting rid of the RMW cycles
that are done for sub-stripe-size writes.

> 
> - ENOSPC
>    If we go COW way, the ENOSPC situation will be more pressing.
> 
>    Now all partial writes must be COWed.
>    This is going to need way more work, I'm not sure if the existing
>    data space handling code is able to handle it at all.

Well, just as with normal btrfs, either you can overwrite the "garbage"
space with valid data again, or you need to do garbage collection in the
case of a zoned file-system.

> 
> In fact, I think the RAID56 in modern COW environment is worthy a full
> talk in some conference.
> If I had the chance, I really want to co-host a talk with Johannes on
> this (but the stupid zero-covid policy is definitely not helping at all
> here).
> 
> Thus I go the tried and true solution, write-intent bitmap and later
> full journal. To provide a way to close the write-hole before we had a
> better solution, instead of marking RAID56 unsafe and drive away new users.
> 
> Thanks,
> Qu
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  9:37     ` Johannes Thumshirn
@ 2022-07-07  9:45       ` Qu Wenruo
  2022-07-07 10:42         ` Qu Wenruo
  2022-07-07 12:23         ` Johannes Thumshirn
  0 siblings, 2 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07  9:45 UTC (permalink / raw)
  To: Johannes Thumshirn, hch, Qu Wenruo; +Cc: linux-btrfs



On 2022/7/7 17:37, Johannes Thumshirn wrote:
> On 07.07.22 07:49, Qu Wenruo wrote:
>>
>>
>> On 2022/7/7 13:36, Christoph Hellwig wrote:
>>> Just a high level comment from someone who has no direct stake in this:
>>>
>>> The pain point of classic parity RAID is the read-modify-write cycles,
>>> and this series is a bandaid to work around the write holes caused by
>>> them, but does not help with the performance problems associated with
>>> them.
>>>
>>> But btrfs is a file system fundamentally designed to write out of place
>>> and thus can get around these problems by writing data in a way that
>>> fundamentally never does a read-modify-write for a set of parity RAID
>>> stripes.
>>>
>>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>>> suggests and apply that to parity writes instead of beating the dead
>>> horse of classic RAID?
>>
>> This has been discussed before, in short there are some unknown aspects
>> of RST tree from Johannes:
>>
>> - It may not support RAID56 for metadata forever.
>>     This may not be a big deal though.
>
> This might be a problem indeed, but a) we have the RAID1C[34] profiles and
> b) we can place the RST for meta-data into the SYSTEM block-group.
>
>> - Not yet determined how RST to handle non-COW RAID56 writes
>>     Just CoW the partial data write and its parity to other location?
>>     How to reclaim the unreferred part?
>>
>>     Currently RST is still mainly to address RAID0/1 support for zoned
>>     device, RAID56 support is a good thing to come, but not yet focused on
>>     RAID56.
>
> Well RAID1 was the low hanging fruit and I had to start somewhere. Now that
> the overall concept and RAID1 looks promising I've started to look into RAID0.
>
> RAID0 introduces striping for the 1st time in the context of RST, so there might
> be dragons, but nothing unsolvable.
>
> Once this is done, RAID10 should just fall into place (famous last words?) and
> with this completed most of the things we need for RAID56 are there as well, from
> the RST and volumes.c side of things. What's left is getting rid of the RMW cycles
> that are done for sub stripe size writes.
>
>>
>> - ENOSPC
>>     If we go COW way, the ENOSPC situation will be more pressing.
>>
>>     Now all partial writes must be COWed.
>>     This is going to need way more work, I'm not sure if the existing
>>     data space handling code is able to handle it at all.
>
> Well just as with normal btrfs as well, either you can override the "garbage"
> space with valid data again or you need to do garbage collection in case of
> a zoned file-system.

My concern is (not considering zoned yet): if we do a partial write into
a full stripe, what would happen?

Like this:

D1	| Old data | Old data |
D2	| To write |  Unused  |
P	| Parity 1 | Parity 2 |

The "To write" part will definitely need a new RST entry, no doubt.

But what would happen to Parity 1? We need to do a CoW to some new
location, right?

So the old physical location of "Parity 1" will be marked reserved and
only freed after we commit a full transaction?


Another concern is, what if the following case happen?

D1	| Old 1 | Old 2 |
D2	| Old 3 | Old 4 |
P	|  P 1  |  P 2  |

If we only write part of the data into "Old 3", do we need to read the whole
"Old 3" out and CoW the full stripe? Or can we do sector-level CoW only?

Thanks,
Qu
>
>>
>> In fact, I think the RAID56 in modern COW environment is worthy a full
>> talk in some conference.
>> If I had the chance, I really want to co-host a talk with Johannes on
>> this (but the stupid zero-covid policy is definitely not helping at all
>> here).
>>
>> Thus I go the tried and true solution, write-intent bitmap and later
>> full journal. To provide a way to close the write-hole before we had a
>> better solution, instead of marking RAID56 unsafe and drive away new users.
>>
>> Thanks,
>> Qu
>>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  9:45       ` Qu Wenruo
@ 2022-07-07 10:42         ` Qu Wenruo
  2022-07-07 12:23         ` Johannes Thumshirn
  1 sibling, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07 10:42 UTC (permalink / raw)
  To: Johannes Thumshirn, hch, Qu Wenruo; +Cc: linux-btrfs



On 2022/7/7 17:45, Qu Wenruo wrote:
>
>
> On 2022/7/7 17:37, Johannes Thumshirn wrote:
>> On 07.07.22 07:49, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/7/7 13:36, Christoph Hellwig wrote:
>>>> Just a high level comment from someone who has no direct stake in this:
>>>>
>>>> The pain point of classic parity RAID is the read-modify-write cycles,
>>>> and this series is a bandaid to work around the write holes caused by
>>>> them, but does not help with the performance problems associated with
>>>> them.
>>>>
>>>> But btrfs is a file system fundamentally designed to write out of place
>>>> and thus can get around these problems by writing data in a way that
>>>> fundamentally never does a read-modify-write for a set of parity RAID
>>>> stripes.
>>>>
>>>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>>>> suggests and apply that to parity writes instead of beating the dead
>>>> horse of classic RAID?
>>>
>>> This has been discussed before, in short there are some unknown aspects
>>> of RST tree from Johannes:
>>>
>>> - It may not support RAID56 for metadata forever.
>>>     This may not be a big deal though.
>>
>> This might be a problem indeed, but a) we have the RAID1C[34] profiles
>> and
>> b) we can place the RST for meta-data into the SYSTEM block-group.
>>
>>> - Not yet determined how RST to handle non-COW RAID56 writes
>>>     Just CoW the partial data write and its parity to other location?
>>>     How to reclaim the unreferred part?
>>>
>>>     Currently RST is still mainly to address RAID0/1 support for zoned
>>>     device, RAID56 support is a good thing to come, but not yet
>>> focused on
>>>     RAID56.
>>
>> Well RAID1 was the low hanging fruit and I had to start somewhere. Now
>> that
>> the overall concept and RAID1 looks promising I've started to look
>> into RAID0.
>>
>> RAID0 introduces striping for the 1st time in the context of RST, so
>> there might
>> be dragons, but nothing unsolvable.
>>
>> Once this is done, RAID10 should just fall into place (famous last
>> words?) and
>> with this completed most of the things we need for RAID56 are there as
>> well, from
>> the RST and volumes.c side of things. What's left is getting rid of
>> the RMW cycles
>> that are done for sub stripe size writes.
>>
>>>
>>> - ENOSPC
>>>     If we go COW way, the ENOSPC situation will be more pressing.
>>>
>>>     Now all partial writes must be COWed.
>>>     This is going to need way more work, I'm not sure if the existing
>>>     data space handling code is able to handle it at all.
>>
>> Well just as with normal btrfs as well, either you can override the
>> "garbage"
>> space with valid data again or you need to do garbage collection in
>> case of
>> a zoned file-system.
>
> My concern is, (not considering zoned yet) if we do a partial write into
> a full stripe, what would happen?
>
> Like this:
>
> D1    | Old data | Old data |
> D2    | To write |  Unused  |
> P    | Parity 1 | Parity 2 |
>
> The "To write" part will definite need a new RST entry, so no double.
>
> But what would happen to Parity 1? We need to do a CoW to some new
> location, right?
>
> So the old physical location for "Parity 1" will be mark reserved and
> only freed after we did a full transaction?
>
>
> Another concern is, what if the following case happen?
>
> D1    | Old 1 | Old 2 |
> D2    | Old 3 | Old 4 |
> P    |  P 1  |  P 2  |
>
> If we only write part of data into "Old 3" do we need to read the whole
> "Old 3" out and CoW the full stripe? Or can we do sectors CoW only?

To be more accurate, if we go the COW path for data and parity stripes,
we will eventually hit the following situation:
	0		  64K
Disk 1  | Old D1  | Old D2  |
Disk 2  | Free D3 | Free D4 |
Disk 3  | Old P1  | Old P2  |

Currently the extent allocator will consider the ranges at D3 and D4
available.
But if we write into D3 and D4, we need write-intent (or a full journal)
to close the write hole (write-intent requires no missing devices, while
a full journal doesn't).


If we go CoW, this means that if we want to write into D3, we have to
CoW P1.
But if all full stripes are like the above, even with parity rotation,
every data/parity slot on disk 3 can already be in use.
Thus there is really no space left to do the extra CoW.

To me, as long as we go CoW for parity, the core idea is no different
from a new extent allocator policy that avoids partial writes.
And that can be very challenging work.
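
[Editor's note] The observation that writing into D3 forces P1 to move
follows from the basic parity identity. A one-byte XOR model (hypothetical
values; RAID6's Reed-Solomon Q is omitted, it only strengthens the point)
shows the parity block can never stay unchanged across a data update:

```c
#include <assert.h>
#include <stdint.h>

/* XOR parity over a two-data-disk stripe. */
static uint8_t parity(uint8_t d1, uint8_t d2)
{
	return d1 ^ d2;
}

/* The RMW shortcut: new parity depends on the OLD data as well, which is
 * why a partial write must either update P in place (risking the write
 * hole) or CoW it to a new location. */
static uint8_t rmw_parity(uint8_t old_p, uint8_t old_d, uint8_t new_d)
{
	return old_p ^ old_d ^ new_d;
}
```

Any non-trivial data change flips at least one parity bit, so a CoW-only
scheme must always find fresh space for the parity block too.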

Thanks,
Qu
>
> Thanks,
> Qu
>>
>>>
>>> In fact, I think the RAID56 in modern COW environment is worthy a full
>>> talk in some conference.
>>> If I had the chance, I really want to co-host a talk with Johannes on
>>> this (but the stupid zero-covid policy is definitely not helping at all
>>> here).
>>>
>>> Thus I go the tried and true solution, write-intent bitmap and later
>>> full journal. To provide a way to close the write-hole before we had a
>>> better solution, instead of marking RAID56 unsafe and drive away new
>>> users.
>>>
>>> Thanks,
>>> Qu
>>>
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  9:45       ` Qu Wenruo
  2022-07-07 10:42         ` Qu Wenruo
@ 2022-07-07 12:23         ` Johannes Thumshirn
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2022-07-07 12:23 UTC (permalink / raw)
  To: Qu Wenruo, hch, Qu Wenruo; +Cc: linux-btrfs

On 07.07.22 11:45, Qu Wenruo wrote:
> 
> 
> On 2022/7/7 17:37, Johannes Thumshirn wrote:
>> On 07.07.22 07:49, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/7/7 13:36, Christoph Hellwig wrote:
>>>> Just a high level comment from someone who has no direct stake in this:
>>>>
>>>> The pain point of classic parity RAID is the read-modify-write cycles,
>>>> and this series is a bandaid to work around the write holes caused by
>>>> them, but does not help with the performance problems associated with
>>>> them.
>>>>
>>>> But btrfs is a file system fundamentally designed to write out of place
>>>> and thus can get around these problems by writing data in a way that
>>>> fundamentally never does a read-modify-write for a set of parity RAID
>>>> stripes.
>>>>
>>>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>>>> suggests and apply that to parity writes instead of beating the dead
>>>> horse of classic RAID?
>>>
>>> This has been discussed before, in short there are some unknown aspects
>>> of RST tree from Johannes:
>>>
>>> - It may not support RAID56 for metadata forever.
>>>     This may not be a big deal though.
>>
>> This might be a problem indeed, but a) we have the RAID1C[34] profiles and
>> b) we can place the RST for meta-data into the SYSTEM block-group.
>>
>>> - Not yet determined how RST to handle non-COW RAID56 writes
>>>     Just CoW the partial data write and its parity to other location?
>>>     How to reclaim the unreferred part?
>>>
>>>     Currently RST is still mainly to address RAID0/1 support for zoned
>>>     device, RAID56 support is a good thing to come, but not yet focused on
>>>     RAID56.
>>
>> Well RAID1 was the low hanging fruit and I had to start somewhere. Now that
>> the overall concept and RAID1 looks promising I've started to look into RAID0.
>>
>> RAID0 introduces striping for the 1st time in the context of RST, so there might
>> be dragons, but nothing unsolvable.
>>
>> Once this is done, RAID10 should just fall into place (famous last words?) and
>> with this completed most of the things we need for RAID56 are there as well, from
>> the RST and volumes.c side of things. What's left is getting rid of the RMW cycles
>> that are done for sub stripe size writes.
>>
>>>
>>> - ENOSPC
>>>     If we go COW way, the ENOSPC situation will be more pressing.
>>>
>>>     Now all partial writes must be COWed.
>>>     This is going to need way more work, I'm not sure if the existing
>>>     data space handling code is able to handle it at all.
>>
>> Well just as with normal btrfs as well, either you can override the "garbage"
>> space with valid data again or you need to do garbage collection in case of
>> a zoned file-system.
> 
> My concern is, (not considering zoned yet) if we do a partial write into
> a full stripe, what would happen?
> 
> Like this:
> 
> D1	| Old data | Old data |
> D2	| To write |  Unused  |
> P	| Parity 1 | Parity 2 |
> 
> The "To write" part will definite need a new RST entry, so no double.
> 
> But what would happen to Parity 1? We need to do a CoW to some new
> location, right?
> 
> So the old physical location for "Parity 1" will be mark reserved and
> only freed after we did a full transaction?

Correct

> 
> 
> Another concern is, what if the following case happen?
> 
> D1	| Old 1 | Old 2 |
> D2	| Old 3 | Old 4 |
> P	|  P 1  |  P 2  |
> 
> If we only write part of data into "Old 3" do we need to read the whole
> "Old 3" out and CoW the full stripe? Or can we do sectors CoW only?

Well, that'll be an implementation detail. I'll see what I can come up with
to make this work.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  5:48   ` Qu Wenruo
  2022-07-07  9:37     ` Johannes Thumshirn
@ 2022-07-07 13:36     ` Christoph Hellwig
  2022-07-07 13:48       ` Qu Wenruo
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2022-07-07 13:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Thu, Jul 07, 2022 at 01:48:11PM +0800, Qu Wenruo wrote:
> - It may not support RAID56 for metadata forever.
>   This may not be a big deal though.

Does parity raid for metadata make much sense to start with?

> - Not yet determined how RST to handle non-COW RAID56 writes

The whole point is that with parity raid you really do not ever want
to do non-COW writes, as that is what gets you the raid hole problem.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07 13:36     ` Christoph Hellwig
@ 2022-07-07 13:48       ` Qu Wenruo
  0 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-07 13:48 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2022/7/7 21:36, Christoph Hellwig wrote:
> On Thu, Jul 07, 2022 at 01:48:11PM +0800, Qu Wenruo wrote:
>> - It may not support RAID56 for metadata forever.
>>    This may not be a big deal though.
> 
> Does parity raid for metadata make much sense to start with?
> 
>> - Not yet determined how RST to handle non-COW RAID56 writes
> 
> The whole point is that with parity raid you really do not ever want
> to do non-COW writes, as that is what gets you the raid hole problem.

You don't need non-COW writes to get screwed up.

Just write some new data into unused space while there is already some
other data in the same vertical stripe (aka a partial write), and the
write hole problem comes back.
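
[Editor's note] The partial-write hazard described here is the classic
write hole. A one-byte sketch (hypothetical values, XOR parity only)
shows how a crash between the data write and the parity write makes a
later degraded rebuild return garbage:

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a missing data block from the surviving data block and the
 * parity block of the same vertical stripe. */
static uint8_t rebuild(uint8_t surviving_d, uint8_t p)
{
	return surviving_d ^ p;
}
```

With consistent parity the rebuild returns the lost data exactly; with
parity left stale by an interrupted partial write, the innocent
neighboring block is reconstructed wrongly, even though it was never
touched. The write-intent bitmap marks such stripes so their parity can
be resynced after the crash.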

Let me say this again: the extent allocator is not P/Q friendly at all,
and even if we update the allocator policy, the ENOSPC problem will
come to bother the data profiles too.

In fact, considering real-world large fs data chunk usage (in most
cases over 95%, easily reaching 98%), I'm pretty sure ENOSPC will be a
real problem.

Thus any COW based policy for P/Q based RAID will not only need a big
update to the extent allocator, but will also bring ENOSPC related
problems.

To me, if we go the COW policy directly, we are not really going to
provide a better RAID56; we're just trading one problem for another.

That's why I still believe the traditional write-intent bitmap/journal
makes sense.
It doesn't rely on big extent allocator changes, but still solves the
write-hole problem.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-07  5:32 ` [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
@ 2022-07-08  1:55   ` kernel test robot
  2022-07-08  2:22     ` Qu Wenruo
  2022-07-08  7:23   ` kernel test robot
  1 sibling, 1 reply; 26+ messages in thread
From: kernel test robot @ 2022-07-08  1:55 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on next-20220707]
[cannot apply to linus/master v5.19-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-write-intent-bitmaps-for-RAID56/20220707-133435
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: mips-allyesconfig (https://download.01.org/0day-ci/archive/20220708/202207080925.VUcOcv89-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2b051857a66f0310589455c06f962908016b5f9b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-introduce-write-intent-bitmaps-for-RAID56/20220707-133435
        git checkout 2b051857a66f0310589455c06f962908016b5f9b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/mips/kernel/head.o: in function `kernel_entry':
   (.ref.text+0xac): relocation truncated to fit: R_MIPS_26 against `start_kernel'
   init/main.o: in function `set_reset_devices':
   main.c:(.init.text+0x20): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0x30): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   init/main.o: in function `debug_kernel':
   main.c:(.init.text+0xa4): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0xb4): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   init/main.o: in function `quiet_kernel':
   main.c:(.init.text+0x128): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0x138): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   init/main.o: in function `warn_bootconfig':
   main.c:(.init.text+0x1ac): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0x1bc): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   init/main.o: in function `init_setup':
   main.c:(.init.text+0x238): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0x258): additional relocation overflows omitted from the output
   mips-linux-ld: fs/btrfs/write-intent.o: in function `set_bits_in_one_entry':
>> write-intent.c:(.text.set_bits_in_one_entry+0x1ec): undefined reference to `__udivdi3'
>> mips-linux-ld: write-intent.c:(.text.set_bits_in_one_entry+0x2b0): undefined reference to `__udivdi3'
   mips-linux-ld: fs/btrfs/write-intent.o: in function `insert_new_entries':
>> write-intent.c:(.text.insert_new_entries+0x294): undefined reference to `__udivdi3'

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-08  1:55   ` kernel test robot
@ 2022-07-08  2:22     ` Qu Wenruo
  0 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-08  2:22 UTC (permalink / raw)
  To: kernel test robot, Qu Wenruo, linux-btrfs; +Cc: kbuild-all



On 2022/7/8 09:55, kernel test robot wrote:
> Hi Qu,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on kdave/for-next]
> [also build test ERROR on next-20220707]
> [cannot apply to linus/master v5.19-rc5]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-write-intent-bitmaps-for-RAID56/20220707-133435
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> config: mips-allyesconfig (https://download.01.org/0day-ci/archive/20220708/202207080925.VUcOcv89-lkp@intel.com/config)
> compiler: mips-linux-gcc (GCC) 11.3.0
> reproduce (this is a W=1 build):
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # https://github.com/intel-lab-lkp/linux/commit/2b051857a66f0310589455c06f962908016b5f9b
>          git remote add linux-review https://github.com/intel-lab-lkp/linux
>          git fetch --no-tags linux-review Qu-Wenruo/btrfs-introduce-write-intent-bitmaps-for-RAID56/20220707-133435
>          git checkout 2b051857a66f0310589455c06f962908016b5f9b
>          # save the config file
>          mkdir build_dir && cp config build_dir/.config
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>     arch/mips/kernel/head.o: in function `kernel_entry':
>     (.ref.text+0xac): relocation truncated to fit: R_MIPS_26 against `start_kernel'
>     init/main.o: in function `set_reset_devices':
>     main.c:(.init.text+0x20): relocation truncated to fit: R_MIPS_26 against `_mcount'
>     main.c:(.init.text+0x30): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
>     init/main.o: in function `debug_kernel':
>     main.c:(.init.text+0xa4): relocation truncated to fit: R_MIPS_26 against `_mcount'
>     main.c:(.init.text+0xb4): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
>     init/main.o: in function `quiet_kernel':
>     main.c:(.init.text+0x128): relocation truncated to fit: R_MIPS_26 against `_mcount'
>     main.c:(.init.text+0x138): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
>     init/main.o: in function `warn_bootconfig':
>     main.c:(.init.text+0x1ac): relocation truncated to fit: R_MIPS_26 against `_mcount'
>     main.c:(.init.text+0x1bc): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
>     init/main.o: in function `init_setup':
>     main.c:(.init.text+0x238): relocation truncated to fit: R_MIPS_26 against `_mcount'
>     main.c:(.init.text+0x258): additional relocation overflows omitted from the output
>     mips-linux-ld: fs/btrfs/write-intent.o: in function `set_bits_in_one_entry':
>>> write-intent.c:(.text.set_bits_in_one_entry+0x1ec): undefined reference to `__udivdi3'
>>> mips-linux-ld: write-intent.c:(.text.set_bits_in_one_entry+0x2b0): undefined reference to `__udivdi3'
>     mips-linux-ld: fs/btrfs/write-intent.o: in function `insert_new_entries':

Thanks for the report, it looks like there are still u64/u32 divisions
used in bitmap_clear()/bitmap_set().

In fact, those two locations can go with u32 for the dividend.

Anyway, I'd go the regular blocksize_bits way instead in the next update.
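For reference, a minimal userspace sketch of the two approaches (plain u64 division vs. the power-of-two shift); the function names and values here are illustrative, not the patchset's code:

```c
#include <stdint.h>

/* Sketch: on 32-bit targets a plain u64 division compiles to a libgcc
 * helper call (__udivdi3) that the kernel does not link against.  The
 * kernel-side portable helpers are div_u64()/div64_u64(); but since
 * BTRFS_STRIPE_LEN (64KiB) is a power of two, a shift also works. */
#define STRIPE_LEN   ((uint64_t)64 * 1024)
#define STRIPE_SHIFT 16 /* log2(64KiB) */

/* May emit an __udivdi3 call when built for 32-bit: */
uint64_t stripe_nr_div(uint64_t bytenr)
{
	return bytenr / STRIPE_LEN;
}

/* Equivalent result, and just a shift instruction on every target: */
uint64_t stripe_nr_shift(uint64_t bytenr)
{
	return bytenr >> STRIPE_SHIFT;
}
```

The shift variant is what the "blocksize_bits way" above refers to: store the log2 of the block size and shift, never divide.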

Thanks,
Qu

>>> write-intent.c:(.text.insert_new_entries+0x294): undefined reference to `__udivdi3'
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-07  5:32 ` [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
  2022-07-08  1:55   ` kernel test robot
@ 2022-07-08  7:23   ` kernel test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kernel test robot @ 2022-07-08  7:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on next-20220707]
[cannot apply to linus/master v5.19-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-write-intent-bitmaps-for-RAID56/20220707-133435
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: i386-randconfig-a003 (https://download.01.org/0day-ci/archive/20220708/202207081504.knAOU6FY-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/2b051857a66f0310589455c06f962908016b5f9b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-introduce-write-intent-bitmaps-for-RAID56/20220707-133435
        git checkout 2b051857a66f0310589455c06f962908016b5f9b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (12 preceding siblings ...)
  2022-07-07  5:36 ` [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Christoph Hellwig
@ 2022-07-13 16:18 ` Lukas Straub
  2022-07-13 23:00   ` Qu Wenruo
  13 siblings, 1 reply; 26+ messages in thread
From: Lukas Straub @ 2022-07-13 16:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

[-- Attachment #1: Type: text/plain, Size: 6687 bytes --]

Hello Qu,

I think I mentioned it elsewhere, but could this be made general to all
raid levels (e.g. raid1/10 too)? This way, the bitmap can also be used
to fix the corruption of nocow files on btrfs raid. IMHO this is very
important since openSUSE and Fedora use nocow for certain files
(databases, vm images) by default and currently anyone using btrfs raid
there will be shot in the foot by this.

More comments below.

On Thu,  7 Jul 2022 13:32:25 +0800
Qu Wenruo <wqu@suse.com> wrote:
> [...]
> [OBJECTIVE]
> 
> This patchset will introduce a btrfs specific write-intent bitmap.
> 
> The bitmap will be located at physical offset 1MiB of each device, and
> the content is the same across all devices.
> 
> When there is a RAID56 write (currently all RAID56 writes, _including
> full stripe writes_), before submitting all the real bios to disks, the
> write-intent bitmap will be updated and flushed to all writeable
> devices.

You'll need to update the bitmap even with full stripe writes. I don't
know btrfs internals well, but this example should apply:

1. Power loss happens during a full stripe write. If the bitmap wasn't set,
the whole stripe will contain inconsistent data:

	0		32K		64K
Disk 1	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 3	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)

2. Partial stripe write happens, only updates one data + parity:

	0		32K		64K
Disk 1	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 3	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)

3. We lose Disk 1. We try to recover Disk 1's data by using Disk 2's data
+ parity. Because Disk 2 is inconsistent, we get invalid data.

Thus, we need to scrub the stripe even after a full stripe write to
prevent this.
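Steps 1-3 can be sketched with one-byte "stripes" and XOR parity (all values here are made up for illustration):

```c
#include <stdint.h>

/* RAID5 invariant when consistent: d1 ^ d2 == parity. */
uint8_t raid5_parity(uint8_t d1, uint8_t d2)
{
	return d1 ^ d2;
}

uint8_t raid5_recover(uint8_t surviving, uint8_t parity)
{
	return surviving ^ parity;
}

/* What we "recover" for disk 1 after a torn full-stripe write:
 * parity on disk reflects the intended new d2, but the new d2 never
 * made it to disk 2, which still holds the old data. */
uint8_t recover_after_torn_full_stripe(void)
{
	uint8_t new_d1 = 0xAA, new_d2 = 0xBB; /* intended full stripe */
	uint8_t old_d2 = 0x22;                /* what disk 2 actually holds */
	uint8_t parity = raid5_parity(new_d1, new_d2); /* parity hit disk */

	return raid5_recover(old_d2, parity); /* != new_d1: garbage */
}
```

Since old_d2 differs from the d2 the parity was computed against, the reconstruction cannot return new_d1.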

> So even if a power loss happened, at the next mount time we know which
> full stripes need to be checked, and can start a scrub for the involved
> logical bytenr ranges.
> 
> [...]
> 
> [BITMAPS DESIGN]
> 
> The bitmaps on-disk format looks like this:
> 
>  [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>  |<---------  super::size (4K) ------------->|
> 
> The super block contains how many entries are in use.
> 
> Each entry is 128 bits (16 bytes) in size, containing one u64 for the
> bytenr and one u64 for the bitmap.
> 
> All utilized entries will be sorted in bytenr order, and no bits can
> overlap.
> 
> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
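The coverage arithmetic quoted above works out as follows (a sketch; the constant names are illustrative, not the patchset's):

```c
#include <stdint.h>

#define WI_STRIPE_LEN  ((uint64_t)64 * 1024) /* BTRFS_STRIPE_LEN, 64KiB */
#define WI_ENTRY_SIZE  16                    /* u64 bytenr + u64 bitmap */
#define WI_TOTAL_SIZE  4096                  /* super::size */
#define WI_NR_ENTRIES  224                   /* from the cover letter */

/* One u64 bitmap = 64 bits, each covering one 64KiB block: 4MiB. */
uint64_t wi_entry_coverage(void)
{
	return 64 * WI_STRIPE_LEN;
}

/* All bits set in every entry (best case): 896MiB. */
uint64_t wi_best_case(void)
{
	return WI_NR_ENTRIES * wi_entry_coverage();
}

/* One bit set per entry (worst case, e.g. 2-disk RAID5): 14MiB. */
uint64_t wi_worst_case(void)
{
	return WI_NR_ENTRIES * WI_STRIPE_LEN;
}

/* 224 entries * 16 bytes leaves 512 bytes for the super block. */
uint64_t wi_super_size(void)
{
	return WI_TOTAL_SIZE - WI_NR_ENTRIES * WI_ENTRY_SIZE;
}
```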

IMHO we can go much larger, mdraid for example uses a blocksize of
64MiB by default. Sure, we'll scrub many unrelated stripes on recovery
but write performance will be better.

Regards,
Lukas Straub

> For the worst case, it can contain 14MiB of dirty ranges.
> (1 bit set per bitmap, which also means 2-disk RAID5 or 3-disk RAID6.)
> 
> For the best case, it can contain 896MiB of dirty ranges.
> (all bits set per bitmap)
> 
> [WHY NOT BTRFS BTREE]
> 
> Current write-intent structure needs two features:
> 
> - Its data needs to survive across stripe boundaries
>   Normally this means the write-intent btree needs to act like a proper
>   tree root, with METADATA_ITEMs for all its tree blocks.
> 
> - Its data update must be outside of a transaction
>   Currently only the log tree can do such a thing.
>   But unfortunately the log tree can not survive across transaction
>   boundaries.
> 
> Thus the write-intent btree can only meet one of the requirements, and
> is not a suitable solution here.
> 
> [TESTING AND BENCHMARK]
> 
> For performance benchmarks, unfortunately I don't have 3 HDDs to test.
> Will do the benchmark after securing enough hardware.
> 
> For testing, it can survive volume/raid/dev-replace test groups, and no
> write-intent bitmap leakage.
> 
> Unfortunately there is still a warning triggered in btrfs/070, still
> under investigation, hopefully to be a false alert in bitmap clearing
> path.
> 
> [TODO]
> - Scrub refactor to allow us to do proper recovery at mount time
>   Need to change the scrub interface to scrub based on logical bytenr.
> 
>   This can be a super big work, thus currently we will focus on a new
>   RAID56 scrub interface for write-intent recovery only.
> 
> - Extra optimizations
>   * Skip full stripe writes
>   * Enlarge the window between btrfs_write_intent_mark_dirty() and
>     btrfs_write_intent_writeback()
>     So that we can merge more dirty bits and cause fewer bitmap
>     writebacks
> 
> - Proper performance benchmark
>   Needs hardware/baremetal VMs, since I don't have any physical machine
>   large enough to contain 3 3.5" HDDs.
> 
> 
> Qu Wenruo (12):
>   btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>   btrfs: introduce a new experimental compat RO flag,
>     WRITE_INTENT_BITMAP
>   btrfs: introduce the on-disk format of btrfs write intent bitmaps
>   btrfs: load/create write-intent bitmaps at mount time
>   btrfs: write-intent: write the newly created bitmaps to all disks
>   btrfs: write-intent: introduce an internal helper to set bits for a
>     range.
>   btrfs: write-intent: introduce an internal helper to clear bits for a
>     range.
>   btrfs: selftests: add selftests for write-intent bitmaps
>   btrfs: write back write intent bitmap after barrier_all_devices()
>   btrfs: update and writeback the write-intent bitmap for RAID56 write.
>   btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
>   btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
> 
>  fs/btrfs/Makefile                           |   5 +-
>  fs/btrfs/ctree.h                            |  24 +-
>  fs/btrfs/disk-io.c                          |  54 ++
>  fs/btrfs/raid56.c                           |  16 +
>  fs/btrfs/sysfs.c                            |   2 +
>  fs/btrfs/tests/btrfs-tests.c                |   4 +
>  fs/btrfs/tests/btrfs-tests.h                |   2 +
>  fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>  fs/btrfs/volumes.c                          |  34 +-
>  fs/btrfs/write-intent.c                     | 903 ++++++++++++++++++++
>  fs/btrfs/write-intent.h                     | 303 +++++++
>  fs/btrfs/zoned.c                            |   8 +
>  include/uapi/linux/btrfs.h                  |  17 +
>  13 files changed, 1610 insertions(+), 9 deletions(-)
>  create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>  create mode 100644 fs/btrfs/write-intent.c
>  create mode 100644 fs/btrfs/write-intent.h
> 



-- 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-13 16:18 ` Lukas Straub
@ 2022-07-13 23:00   ` Qu Wenruo
  0 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2022-07-13 23:00 UTC (permalink / raw)
  To: Lukas Straub, Qu Wenruo; +Cc: linux-btrfs, David Sterba



On 2022/7/14 00:18, Lukas Straub wrote:
> Hello Qu,
>
> I think I mentioned it elsewhere, but could this be made general to all
> raid levels (e.g. raid1/10 too)? This way, the bitmap can also be used
> to fix the corruption of nocow files on btrfs raid. IMHO this is very
> important since openSUSE and Fedora use nocow for certain files
> (databases, vm images) by default and currently anyone using btrfs raid
> there will be shot in the foot by this.

Yes, that's in the future plan.

Although currently we're still at the discussion stage to decide which
way we should really go (RST or write-intent).
>
> More comments below.
>
> On Thu,  7 Jul 2022 13:32:25 +0800
> Qu Wenruo <wqu@suse.com> wrote:
>> [...]
>> [OBJECTIVE]
>>
>> This patchset will introduce a btrfs specific write-intent bitmap.
>>
>> The bitmap will be located at physical offset 1MiB of each device, and
>> the content is the same across all devices.
>>
>> When there is a RAID56 write (currently all RAID56 writes, _including
>> full stripe writes_), before submitting all the real bios to disks, the
>> write-intent bitmap will be updated and flushed to all writeable
>> devices.
>
> You'll need to update the bitmap even with full stripe writes. I don't
> know btrfs internals well, but this example should apply:
>
> 1. Power loss happens during a full stripe write. If the bitmap wasn't set,
> the whole stripe will contain inconsistent data:

That's why btrfs COW shines.

If we are doing a full stripe write and hit a power loss, that full
stripe's data won't be read at mount at all.

All metadata will point back to old data, thus only partial writes
matter here.

Thanks,
Qu
>
> 	0		32K		64K
> Disk 1	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 3	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)
>
> 2. Partial stripe write happens, only updates one data + parity:
>
> 	0		32K		64K
> Disk 1	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 3	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)
>
> 3. We lose Disk 1. We try to recover Disk 1's data by using Disk 2's data
> + parity. Because Disk 2 is inconsistent, we get invalid data.
>
> Thus, we need to scrub the stripe even after a full stripe write to
> prevent this.
>
>> So even if a power loss happened, at the next mount time we know which
>> full stripes need to be checked, and can start a scrub for the involved
>> logical bytenr ranges.
>>
>> [...]
>>
>> [BITMAPS DESIGN]
>>
>> The bitmaps on-disk format looks like this:
>>
>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>   |<---------  super::size (4K) ------------->|
>>
>> The super block contains how many entries are in use.
>>
>> Each entry is 128 bits (16 bytes) in size, containing one u64 for the
>> bytenr and one u64 for the bitmap.
>>
>> All utilized entries will be sorted in bytenr order, and no bits can
>> overlap.
>>
>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
>> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
>
> IMHO we can go much larger, mdraid for example uses a blocksize of
> 64MiB by default. Sure, we'll scrub many unrelated stripes on recovery
> but write performance will be better.
>
> Regards,
> Lukas Straub
>
>> For the worst case, it can contain 14MiB of dirty ranges.
>> (1 bit set per bitmap, which also means 2-disk RAID5 or 3-disk RAID6.)
>>
>> For the best case, it can contain 896MiB of dirty ranges.
>> (all bits set per bitmap)
>>
>> [WHY NOT BTRFS BTREE]
>>
>> Current write-intent structure needs two features:
>>
>> - Its data needs to survive across stripe boundaries
>>    Normally this means the write-intent btree needs to act like a proper
>>    tree root, with METADATA_ITEMs for all its tree blocks.
>>
>> - Its data update must be outside of a transaction
>>    Currently only the log tree can do such a thing.
>>    But unfortunately the log tree can not survive across transaction
>>    boundaries.
>>
>> Thus the write-intent btree can only meet one of the requirements, and
>> is not a suitable solution here.
>>
>> [TESTING AND BENCHMARK]
>>
>> For performance benchmarks, unfortunately I don't have 3 HDDs to test.
>> Will do the benchmark after securing enough hardware.
>>
>> For testing, it can survive volume/raid/dev-replace test groups, and no
>> write-intent bitmap leakage.
>>
>> Unfortunately there is still a warning triggered in btrfs/070, still
>> under investigation, hopefully to be a false alert in bitmap clearing
>> path.
>>
>> [TODO]
>> - Scrub refactor to allow us to do proper recovery at mount time
>>    Need to change the scrub interface to scrub based on logical bytenr.
>>
>>    This can be a super big work, thus currently we will focus on a new
>>    RAID56 scrub interface for write-intent recovery only.
>>
>> - Extra optimizations
>>    * Skip full stripe writes
>>    * Enlarge the window between btrfs_write_intent_mark_dirty() and
>>      btrfs_write_intent_writeback()
>>      So that we can merge more dirty bits and cause fewer bitmap
>>      writebacks
>>
>> - Proper performance benchmark
>>    Needs hardware/baremetal VMs, since I don't have any physical machine
>>    large enough to contain 3 3.5" HDDs.
>>
>>
>> Qu Wenruo (12):
>>    btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>>    btrfs: introduce a new experimental compat RO flag,
>>      WRITE_INTENT_BITMAP
>>    btrfs: introduce the on-disk format of btrfs write intent bitmaps
>>    btrfs: load/create write-intent bitmaps at mount time
>>    btrfs: write-intent: write the newly created bitmaps to all disks
>>    btrfs: write-intent: introduce an internal helper to set bits for a
>>      range.
>>    btrfs: write-intent: introduce an internal helper to clear bits for a
>>      range.
>>    btrfs: selftests: add selftests for write-intent bitmaps
>>    btrfs: write back write intent bitmap after barrier_all_devices()
>>    btrfs: update and writeback the write-intent bitmap for RAID56 write.
>>    btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
>>    btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
>>
>>   fs/btrfs/Makefile                           |   5 +-
>>   fs/btrfs/ctree.h                            |  24 +-
>>   fs/btrfs/disk-io.c                          |  54 ++
>>   fs/btrfs/raid56.c                           |  16 +
>>   fs/btrfs/sysfs.c                            |   2 +
>>   fs/btrfs/tests/btrfs-tests.c                |   4 +
>>   fs/btrfs/tests/btrfs-tests.h                |   2 +
>>   fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>>   fs/btrfs/volumes.c                          |  34 +-
>>   fs/btrfs/write-intent.c                     | 903 ++++++++++++++++++++
>>   fs/btrfs/write-intent.h                     | 303 +++++++
>>   fs/btrfs/zoned.c                            |   8 +
>>   include/uapi/linux/btrfs.h                  |  17 +
>>   13 files changed, 1610 insertions(+), 9 deletions(-)
>>   create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>>   create mode 100644 fs/btrfs/write-intent.c
>>   create mode 100644 fs/btrfs/write-intent.h
>>
>
>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-07-13 23:01 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-07  5:32 [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
2022-07-07  5:32 ` [PATCH 01/12] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
2022-07-07  5:32 ` [PATCH 02/12] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
2022-07-07  5:32 ` [PATCH 03/12] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
2022-07-07  5:32 ` [PATCH 04/12] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
2022-07-07  5:32 ` [PATCH 05/12] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
2022-07-07  5:32 ` [PATCH 06/12] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
2022-07-08  1:55   ` kernel test robot
2022-07-08  2:22     ` Qu Wenruo
2022-07-08  7:23   ` kernel test robot
2022-07-07  5:32 ` [PATCH 07/12] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
2022-07-07  5:32 ` [PATCH 08/12] btrfs: selftests: add selftests for write-intent bitmaps Qu Wenruo
2022-07-07  5:32 ` [PATCH 09/12] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
2022-07-07  5:32 ` [PATCH 10/12] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
2022-07-07  5:32 ` [PATCH 11/12] btrfs: raid56: clear write-intent bimaps when a full stripe finishes Qu Wenruo
2022-07-07  5:32 ` [PATCH 12/12] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
2022-07-07  5:36 ` [PATCH 00/12] btrfs: introduce write-intent bitmaps for RAID56 Christoph Hellwig
2022-07-07  5:48   ` Qu Wenruo
2022-07-07  9:37     ` Johannes Thumshirn
2022-07-07  9:45       ` Qu Wenruo
2022-07-07 10:42         ` Qu Wenruo
2022-07-07 12:23         ` Johannes Thumshirn
2022-07-07 13:36     ` Christoph Hellwig
2022-07-07 13:48       ` Qu Wenruo
2022-07-13 16:18 ` Lukas Straub
2022-07-13 23:00   ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).