linux-btrfs.vger.kernel.org archive mirror
* [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
@ 2022-07-05  7:39 Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 01/11] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
                   ` (12 more replies)
  0 siblings, 13 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

[BACKGROUND]
Unlike md-raid, btrfs RAID56 has no mechanism to re-sync its devices
after a power loss.

For pure mirror based profiles this is fine, as btrfs can utilize its
csums to find the correct mirror and repair the bad ones.

But for RAID56, the repair itself needs the data from other devices,
thus any out-of-sync data can degrade the tolerance.

Even worse, an incorrect RMW can use the stale data to generate P/Q,
removing any possibility of recovering the data.


md-raid uses a write-intent bitmap for faster resilvering, and a
journal (partial parity log for RAID5) to ensure it can survive even a
power loss combined with a device loss.

[OBJECTIVE]

This patchset will introduce a btrfs specific write-intent bitmap.

The bitmap will be located at physical offset 1MiB of each device, and
its content is the same across all devices.

When there is a RAID56 write (currently every RAID56 write, including
full stripe writes), before submitting the real bios to the disks, the
write-intent bitmap will be updated and flushed to all writeable
devices.

So even if a power loss happens, at the next mount time we know which
full stripes need to be checked, and can start a scrub for the involved
logical bytenr ranges.

[NO RECOVERY CODE YET]

Unfortunately, this patchset only implements the write-intent bitmap
code; the recovery part is still a placeholder, as we need some scrub
refactoring to make it scrub only a logical bytenr range.

[ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]

Since btrfs can utilize csums for its metadata and CoWed data, unlike
dm-bitmap, which can only be used for faster resilvering, we can fully
rebuild the full stripe, as long as:

1) There is no missing device
   For the missing device case, we still need a full journal.

2) Untouched data stays untouched
   This should be mostly sane for sane hardware.

And since the btrfs specific write-intent bitmaps are pretty small (4KiB
in size), the overhead is much lower than a full journal.

In the future, we may allow users to choose between just bitmaps or a
full journal to meet their requirements.

[BITMAPS DESIGN]

The bitmaps on-disk format looks like this:

 [ super ][ entry 1 ][ entry 2 ] ... [entry N]
 |<---------  super::size (4K) ------------->|

The super block records how many entries are in use.

Each entry is 128 bits (16 bytes) in size, containing one u64 for the
bytenr and one u64 for the bitmap.

All utilized entries are sorted by bytenr, and no bits may overlap.

The blocksize is currently fixed to BTRFS_STRIPE_LEN (64KiB), so each
entry can cover at most 4MiB, and the whole bitmap can contain 224
entries.

In the worst case, it can hold 14MiB of dirty ranges.
(1 bit set per bitmap, which also means 2-disk RAID5 or 3-disk RAID6)

In the best case, it can hold 896MiB of dirty ranges.
(all bits set per bitmap)

[WHY NOT BTRFS BTREE]

The current write-intent structure needs two features:

- Its data needs to survive across stripe boundaries
  Normally this means a write-intent btree would need to act like a
  proper tree root, with METADATA_ITEMs for all its tree blocks.

- Its data must be updated outside of a transaction
  Currently only the log tree can do such a thing.
  But unfortunately the log tree can not survive across a transaction
  boundary.

Thus a write-intent btree can only meet one of the two requirements,
and is not a suitable solution here.

[TESTING AND BENCHMARK]

For performance benchmarks, unfortunately I don't have 3 HDDs to test
with. I will do the benchmark after securing enough hardware.

For testing, it survives the volume/raid/dev-replace test groups, with
no write-intent bitmap leakage.

Unfortunately there is still a warning triggered in btrfs/070, which is
under investigation; hopefully it is a false alert in the bitmap
clearing path.

[TODO]
- Scrub refactor to allow us to do proper recovery at mount time
  Needs a change to the scrub interface to scrub based on logical
  bytenr.

  This could be a very large amount of work, thus for now we will focus
  only on a new RAID56 scrub interface for write-intent recovery.

- Extra optimizations
  * Skip full stripe writes
  * Enlarge the window between btrfs_write_intent_mark_dirty() and
    btrfs_write_intent_writeback()

- Bug hunts and more fstests runs

- Proper performance benchmark
  Needs hardware/bare-metal VMs, since I don't have any physical
  machine large enough to contain three 3.5" HDDs.


Qu Wenruo (11):
  btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
  btrfs: introduce a new experimental compat RO flag,
    WRITE_INTENT_BITMAP
  btrfs: introduce the on-disk format of btrfs write intent bitmaps
  btrfs: load/create write-intent bitmaps at mount time
  btrfs: write-intent: write the newly created bitmaps to all disks
  btrfs: write-intent: introduce an internal helper to set bits for a
    range.
  btrfs: write-intent: introduce an internal helper to clear bits for a
    range.
  btrfs: write back write intent bitmap after barrier_all_devices()
  btrfs: update and writeback the write-intent bitmap for RAID56 write.
  btrfs: raid56: clear write-intent bitmaps when a full stripe finishes.
  btrfs: warn and clear bitmaps if there is dirty bitmap at mount time

 fs/btrfs/Makefile          |   2 +-
 fs/btrfs/ctree.h           |  24 +-
 fs/btrfs/disk-io.c         |  54 +++
 fs/btrfs/raid56.c          |  16 +
 fs/btrfs/sysfs.c           |   2 +
 fs/btrfs/volumes.c         |  34 +-
 fs/btrfs/write-intent.c    | 962 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h    | 288 +++++++++++
 fs/btrfs/zoned.c           |   8 +
 include/uapi/linux/btrfs.h |  17 +
 10 files changed, 1399 insertions(+), 8 deletions(-)
 create mode 100644 fs/btrfs/write-intent.c
 create mode 100644 fs/btrfs/write-intent.h

-- 
2.36.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC 01/11] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 02/11] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

Since the beginning of the btrfs kernel module, the kernel has avoided
allocating dev extents in the first 1MiB of each device, not only for
the superblock at 64KiB, but also to let legacy bootloaders store their
data.

For later expansion, we introduce a new compat RO flag,
EXTRA_SUPER_RESERVED, which allows btrfs to have extra reserved space
beyond the default 1MiB.

The extra reserved space can be utilized by features like the
write-intent log.

Several new checks are introduced:

- super_block::reserved_bytes must not be smaller than the old 1MiB
  limit
- No zoned device support
  The reserved space will be utilized in a non-COW way, thus it can not
  be supported on zoned devices.

We still allow dev extents to exist in the new reserved_bytes range;
this is to allow btrfstune to set reserved_bytes first, then rely on a
kernel balance to relocate dev extents out of the new range.

But later, if any feature relies on the reserved_bytes range, we will
disable those features automatically.

There is a special catch: the new member @reserved_bytes is located at
the end of the superblock, utilizing the padding bytes.

The reason is that, due to some undetermined extent tree v2 on-disk
format mess, we don't have a reliable location inside the reserved8[]
array; I just want a stable, undisturbed location.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h           | 14 +++++++++++---
 fs/btrfs/disk-io.c         |  9 +++++++++
 fs/btrfs/sysfs.c           |  2 ++
 fs/btrfs/volumes.c         | 22 ++++++++++++++++++----
 fs/btrfs/zoned.c           |  8 ++++++++
 include/uapi/linux/btrfs.h | 10 ++++++++++
 6 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4e2569f84aab..12019904f1cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -293,14 +293,19 @@ struct btrfs_super_block {
 	__le64 block_group_root_generation;
 	u8 block_group_root_level;
 
-	/* future expansion */
 	u8 reserved8[7];
 	__le64 reserved[25];
 	u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
 	struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 
+	/*
+	 * How many bytes are reserved at the beginning of a device.
+	 * Should be >= BTRFS_DEFAULT_RESERVED.
+	 */
+	__le32 reserved_bytes;
+
 	/* Padded to 4096 bytes */
-	u8 padding[565];
+	u8 padding[561];
 } __attribute__ ((__packed__));
 static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 
@@ -315,7 +320,8 @@ static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
 	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
-	 BTRFS_FEATURE_COMPAT_RO_VERITY)
+	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
+	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -2539,6 +2545,8 @@ BTRFS_SETGET_STACK_FUNCS(super_block_group_root_generation,
 			 block_group_root_generation, 64);
 BTRFS_SETGET_STACK_FUNCS(super_block_group_root_level, struct btrfs_super_block,
 			 block_group_root_level, 8);
+BTRFS_SETGET_STACK_FUNCS(super_reserved_bytes, struct btrfs_super_block,
+			 reserved_bytes, 32);
 
 int btrfs_super_csum_size(const struct btrfs_super_block *s);
 const char *btrfs_super_csum_name(u16 csum_type);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 70b388de4d66..1df2da2509ca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2822,6 +2822,15 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 		ret = -EINVAL;
 	}
 
+	if (btrfs_super_compat_ro_flags(sb) &
+	    BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED &&
+	    btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+		btrfs_err(fs_info,
+"EXTRA_SUPER_RESERVED feature enabled, but reserved space is smaller than default (%u)",
+			    BTRFS_DEVICE_RANGE_RESERVED);
+		ret = -EINVAL;
+	}
+
 	/*
 	 * The generation is a global counter, we'll trust it more than the others
 	 * but it's still possible that it's the one that's wrong.
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index d5d0717fd09a..7fd38539315f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -287,6 +287,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_COMPAT_RO(extra_super_reserved, EXTRA_SUPER_RESERVED);
 #ifdef CONFIG_BLK_DEV_ZONED
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
@@ -317,6 +318,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(extra_super_reserved),
 #ifdef CONFIG_BLK_DEV_ZONED
 	BTRFS_FEAT_ATTR_PTR(zoned),
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2d788a351c1f..2a4ac905e39f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1401,8 +1401,13 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
 
 static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 {
+	struct btrfs_fs_info *fs_info = device->fs_devices->fs_info;
+
 	switch (device->fs_devices->chunk_alloc_policy) {
 	case BTRFS_CHUNK_ALLOC_REGULAR:
+		if (btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED))
+			return max_t(u64, start,
+				btrfs_super_reserved_bytes(fs_info->super_copy));
 		return max_t(u64, start, BTRFS_DEVICE_RANGE_RESERVED);
 	case BTRFS_CHUNK_ALLOC_ZONED:
 		/*
@@ -7969,11 +7974,14 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	struct extent_map *em;
 	struct map_lookup *map;
 	struct btrfs_device *dev;
+	u32 super_reserved = BTRFS_DEVICE_RANGE_RESERVED;
 	u64 stripe_len;
 	bool found = false;
 	int ret = 0;
 	int i;
 
+	if (btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED))
+		super_reserved = btrfs_super_reserved_bytes(fs_info->super_copy);
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, chunk_offset, 1);
 	read_unlock(&em_tree->lock);
@@ -7998,11 +8006,17 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	}
 
 	/*
-	 * Very old mkfs.btrfs (before v4.1) will not respect the reserved
-	 * space. Although kernel can handle it without problem, better to warn
-	 * the users.
+	 * This can be caused by two cases:
+	 * - Very old mkfs.btrfs (before v4.1) and no balance at all
+	 *   This should be pretty rare now, as balance will relocate
+	 *   those dev extents in reserved range.
+	 *
+	 * - Newly set btrfs_super_block::reserved_bytes
+	 *   We rely on this mount to relocate those dev extents.
+	 *
+	 * So here, we just give a warning and continue the mount.
 	 */
-	if (physical_offset < BTRFS_DEVICE_RANGE_RESERVED)
+	if (physical_offset < super_reserved)
 		btrfs_warn(fs_info,
 		"devid %llu physical %llu len %llu inside the reserved space",
 			   devid, physical_offset, physical_len);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 7a0f8fa44800..ec16c0a6fb22 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -703,6 +703,14 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (incompat_zoned && btrfs_fs_compat_ro(fs_info, EXTRA_SUPER_RESERVED)) {
+		btrfs_err(fs_info,
+			"zoned: incompatible optional feature detected: 0x%llx",
+			BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED);
+		ret = -EINVAL;
+		goto out;
+	}
+
 	/*
 	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
 	 * btrfs_create_chunk(). Since we want stripe_len == zone_size,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index f54dc91e4025..4a0c9f4f55d1 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -290,6 +290,16 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
 #define BTRFS_FEATURE_COMPAT_RO_VERITY			(1ULL << 2)
 
+/*
+ * Allow btrfs to have extra reserved space (other than the default 1MiB) at
+ * the beginning of each device.
+ *
+ * This feature will enable the usage of btrfs_super_block::reserved_bytes.
+ *
+ * This feature would only be available for non-zoned filesystems.
+ */
+#define BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED	(1ULL << 3)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
-- 
2.36.1



* [PATCH RFC 02/11] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 01/11] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 03/11] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

The new flag is for the incoming write-intent bitmap, mostly to address
the RAID56 write-hole by doing a mandatory scrub of partially written
stripes at mount time.

Currently the feature is still under development; this patch is mostly
a placeholder for the extra reserved bytes needed by the write-intent
bitmap.

We will utilize the newly introduced EXTRA_SUPER_RESERVED compat RO
flag to enlarge the reserved bytes to at least (1MiB + 64KiB), and use
that 64KiB (the exact value is not yet fully determined) for the
write-intent bitmap.

Only one extra check is introduced, to ensure we have enough space to
place the write-intent bitmap at 1MiB physical offset.

This patch is only a placeholder for the incoming on-disk format
change; no real write-intent functionality is implemented yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h           |  9 +++++++++
 fs/btrfs/disk-io.c         | 20 ++++++++++++++++++++
 fs/btrfs/volumes.c         | 14 +++++++++++++-
 include/uapi/linux/btrfs.h |  7 +++++++
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 12019904f1cf..908a735a66cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -317,11 +317,20 @@ static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
+#ifdef CONFIG_BTRFS_DEBUG
+#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
+	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
+	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
+	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED |	\
+	 BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP)
+#else
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
 	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
 	 BTRFS_FEATURE_COMPAT_RO_VERITY |		\
 	 BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)
+#endif
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1df2da2509ca..967c020c380a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2830,6 +2830,26 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 			    BTRFS_DEVICE_RANGE_RESERVED);
 		ret = -EINVAL;
 	}
+	if (btrfs_super_compat_ro_flags(sb) &
+	    BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP) {
+		/* Write intent bitmap requires extra reserve. */
+		if (!(btrfs_super_compat_ro_flags(sb) &
+		      BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED)) {
+			btrfs_err(fs_info,
+"WRITE_INTENT_BITMAP feature enabled, but missing EXTRA_SUPER_RESERVED feature");
+			ret = -EINVAL;
+		}
+		/*
+		 * Write intent bitmap is always located at 1MiB.
+		 * Extra check like the length check against the reserved space
+		 * will happen at bitmap load time.
+		 */
+		if (btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+			btrfs_err(fs_info,
+			"not enough reserved space for write intent bitmap");
+			ret = -EINVAL;
+		}
+	}
 
 	/*
 	 * The generation is a global counter, we'll trust it more than the others
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2a4ac905e39f..4882c616768c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -8016,11 +8016,23 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	 *
 	 * So here, we just give a warning and continue the mount.
 	 */
-	if (physical_offset < super_reserved)
+	if (physical_offset < super_reserved) {
 		btrfs_warn(fs_info,
 		"devid %llu physical %llu len %llu inside the reserved space",
 			   devid, physical_offset, physical_len);
 
+		/* Disable any feature relying on the new reserved_bytes. */
+		if (btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP)) {
+			struct btrfs_super_block *sb = fs_info->super_copy;
+
+			btrfs_warn(fs_info,
+	"disabling write intent bitmap due to the lack of reserved space.");
+			btrfs_set_super_compat_ro_flags(sb,
+				btrfs_super_compat_ro_flags(sb) &
+				~BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP);
+		}
+	}
+
 	for (i = 0; i < map->num_stripes; i++) {
 		if (map->stripes[i].dev->devid == devid &&
 		    map->stripes[i].physical == physical_offset) {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 4a0c9f4f55d1..38c74a50323e 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -300,6 +300,13 @@ struct btrfs_ioctl_fs_info_args {
  */
 #define BTRFS_FEATURE_COMPAT_RO_EXTRA_SUPER_RESERVED	(1ULL << 3)
 
+/*
+ * Allow btrfs to have per-device write-intent bitmap.
+ * Will be utilized to close the RAID56 write-hole (by forced scrub for dirty
+ * partial written stripes at mount time).
+ */
+#define BTRFS_FEATURE_COMPAT_RO_WRITE_INTENT_BITMAP	(1ULL << 4)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
-- 
2.36.1



* [PATCH RFC 03/11] btrfs: introduce the on-disk format of btrfs write intent bitmaps
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 01/11] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 02/11] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 04/11] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

With extra comments explaining the on-disk format and the basic
workflow.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c      |   4 +-
 fs/btrfs/raid56.c       |   1 +
 fs/btrfs/write-intent.h | 199 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 203 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/write-intent.h

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 967c020c380a..963a89cd4bfb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "space-info.h"
 #include "zoned.h"
 #include "subpage.h"
+#include "write-intent.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -2844,7 +2845,8 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 		 * Extra check like the length check against the reserved space
 		 * will happen at bitmap load time.
 		 */
-		if (btrfs_super_reserved_bytes(sb) < BTRFS_DEVICE_RANGE_RESERVED) {
+		if (btrfs_super_reserved_bytes(sb) <
+		    BTRFS_DEVICE_RANGE_RESERVED + WRITE_INTENT_BITMAPS_SIZE) {
 			btrfs_err(fs_info,
 			"not enough reserved space for write intent bitmap");
 			ret = -EINVAL;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c6411c849fea..7b2d2b6c8c61 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "raid56.h"
 #include "async-thread.h"
+#include "write-intent.h"
 
 /* set when additional merges to this rbio are not allowed */
 #define RBIO_RMW_LOCKED_BIT	1
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
new file mode 100644
index 000000000000..b851917bb0b6
--- /dev/null
+++ b/fs/btrfs/write-intent.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Btrfs specific write-intent bitmaps. */
+
+#ifndef BTRFS_WRITE_INTENT_H
+#define BTRFS_WRITE_INTENT_H
+
+#include <linux/types.h>
+#include <linux/sizes.h>
+/* For BTRFS_STRIPE_LEN. */
+#include "volumes.h"
+
+#define WRITE_INTENT_SUPER_MAGIC 0x515F62536249775FULL /* ascii _wIbSb_Q */
+
+/* This write-intent bitmap records writes for RAID56 writes. */
+#define WRITE_INTENT_FLAGS_TARGET_RAID56	(1ULL << 0)
+
+/* This write-intent bitmap is internal, aka locates at 1MiB of each device. */
+#define WRITE_INTENT_FLAGS_INTERNAL		(1ULL << 1)
+
+/* This write-intent bitmap uses logical bytenr. */
+#define WRITE_INTENT_FLAGS_BYTENR_LOGICAL	(1ULL << 2)
+
+#define WRITE_INTENT_FLAGS_SUPPORTED			\
+	(WRITE_INTENT_FLAGS_TARGET_RAID56 |		\
+	 WRITE_INTENT_FLAGS_INTERNAL |			\
+	 WRITE_INTENT_FLAGS_BYTENR_LOGICAL)
+
+/*
+ * We use BTRFS_STRIPE_LEN as blocksize.
+ * This maps a RAID56 full stripe to @nr_data bits, and greatly
+ * enlarges how many bytes we can represent using just 4KiB.
+ */
+#define WRITE_INTENT_BLOCKSIZE			(BTRFS_STRIPE_LEN)
+
+/*
+ * For now, 4KiB is enough, as with a 64KiB blocksize we can save bitmaps
+ * for a 896MiB (224 entries, each entry can cover 64KiB * 64) sized logical
+ * range.
+ */
+#define WRITE_INTENT_BITMAPS_SIZE		(SZ_4K)
+
+
+/*
+ * The super block of write intent bitmaps, should be at physical offset 1MiB of
+ * every writeable device.
+ */
+struct write_intent_super {
+	/* Csum for the super block and all the internal entries. */
+	__u8 csum[BTRFS_CSUM_SIZE];
+	__u8 fsid[BTRFS_FSID_SIZE];
+
+	__le64 magic;
+
+	/* Important behavior flags would be set here. */
+	__le64 flags;
+
+	/*
+	 * Event counter for the bitmap, every time the bitmaps get written
+	 * this value increases.
+	 */
+	__le64 events;
+
+	/*
+	 * Total size of the bitmaps, including this super block and all the
+	 * entries.
+	 *
+	 * U32 should be enough for internal bitmaps, but just in case we want
+	 * to support external device as dedicated journal/bitmap device.
+	 */
+	__le64 size;
+
+	/* How many entries we have utilized. */
+	__le64 nr_entries;
+
+	/* How many bytes one bit represents. */
+	__le32 blocksize;
+	/*
+	 * This should be the same as btrfs_super_block::csum_type.
+	 * Cache the csum type here so we can read the write-intent superblock
+	 * without a fully opened btrfs (for dump purposes).
+	 */
+	__le16 csum_type;
+
+	/* For future expansion, padding to 512 bytes. */
+	__u8 reserved1[418];
+} __attribute__ ((__packed__));
+
+static_assert(sizeof(struct write_intent_super) == 512);
+
+struct write_intent_entry {
+	/*
+	 * Bytenr 0 is special, means this entry is empty, and also means the
+	 * end of the bitmaps.
+	 */
+	__le64 bytenr;
+	__le64 bitmap;
+};
+
+/* How many entries we can have in the bitmaps. */
+#define WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES		\
+	((WRITE_INTENT_BITMAPS_SIZE -				\
+	  sizeof(struct write_intent_super)) /			\
+	 sizeof(struct write_intent_entry))
+
+/* The number of bits we can have in one entry. */
+#define WRITE_INTENT_BITS_PER_ENTRY		(64)
+
+/*
+ * ON-DISK FORMAT
+ * ==============
+ *
+ * [ super ][ entry 1 ][ entry 2 ] ... [entry N]
+ * |<------------  super::size --------------->|
+ *
+ * Normally it's 4KiB in size for internal bitmap.
+ *
+ * Currently the write-intent bitmaps is only for RAID56 writes, thus its
+ * blocksize is always 64KiB.
+ * Thus one entry can represent at most 4MiB (64 * 64 KiB) of logical range.
+ *
+ * When one raid56 full stripe needs partial writeback, the full stripe logical
+ * bytenr range will be included into at least one entry.
+ *
+ * After the last used entry, the remaining entries will all be filled with 0.
+ *
+ * WORKFLOW
+ * ========
+ *
+ * 1) A write bio arrives
+ *    Every write that meets the requirement (so far, only RAID56 partial
+ *    writes) will have its bio delayed until the range is marked in an entry.
+ *
+ * 2) Update the write-intent bitmaps
+ *    The entries will be populated and written back to all writeable
+ *    devices, with the FUA flag set.
+ *    We will wait until the write reaches the disks.
+ *
+ * 3) Allow the involved write bios to be submitted
+ *
+ * 4) Write bios finish
+ *    The corresponding range will be recorded to be freed at next flush.
+ *
+ * 5) All devices get flushed (caused by btrfs super block writeback or
+ *    bitmap pressure)
+ *    The recorded ranges will be cleared, and if an entry becomes empty, it
+ *    will be freed.
+ *    Then the write-intent bitmaps and their superblock are written back
+ *    (with the FUA flag, waiting for completion).
+ */
+
+#define WRITE_INTENT_SETGET_FUNCS(name, type, member, bits)	\
+static inline u##bits wi_##name(const type *s)			\
+{								\
+	return le##bits##_to_cpu(s->member);			\
+}								\
+static inline void wi_set_##name(type *s, u##bits val)		\
+{								\
+	s->member = cpu_to_le##bits(val);			\
+}
+
+WRITE_INTENT_SETGET_FUNCS(super_magic, struct write_intent_super, magic, 64);
+WRITE_INTENT_SETGET_FUNCS(super_flags, struct write_intent_super, flags, 64);
+WRITE_INTENT_SETGET_FUNCS(super_events, struct write_intent_super, events, 64);
+WRITE_INTENT_SETGET_FUNCS(super_size, struct write_intent_super, size, 64);
+WRITE_INTENT_SETGET_FUNCS(super_nr_entries, struct write_intent_super,
+			  nr_entries, 64);
+WRITE_INTENT_SETGET_FUNCS(super_blocksize, struct write_intent_super,
+			  blocksize, 32);
+WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
+			  csum_type, 16);
+WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
+
+static inline void wie_get_bitmap(struct write_intent_entry *entry,
+				  unsigned long *bitmap)
+{
+#ifdef __LITTLE_ENDIAN
+	bitmap_from_arr64(bitmap, &entry->bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+#else
+	u64 val = le64_to_cpu(entry->bitmap);
+
+	bitmap_from_arr64(bitmap, &val, WRITE_INTENT_BITS_PER_ENTRY);
+#endif
+}
+
+static inline void wie_set_bitmap(struct write_intent_entry *entry,
+				  unsigned long *bitmap)
+{
+#ifdef __LITTLE_ENDIAN
+	bitmap_to_arr64(&entry->bitmap, bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+#else
+	u64 val;
+
+	bitmap_to_arr64(&val, bitmap, WRITE_INTENT_BITS_PER_ENTRY);
+	entry->bitmap = cpu_to_le64(val);
+#endif
+}
+
+#endif
-- 
2.36.1



* [PATCH RFC 04/11] btrfs: load/create write-intent bitmaps at mount time
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 03/11] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 05/11] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

This patch will introduce btrfs_fs_info::wi_ctrl, which will hold a
non-highmem page for the write-intent bitmaps.

Please note that if we can't find a valid bitmap, the newly created one
will only exist in memory for now; the bitmap writeback functionality
will be introduced in the next commit.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile       |   2 +-
 fs/btrfs/ctree.h        |   1 +
 fs/btrfs/disk-io.c      |   7 ++
 fs/btrfs/write-intent.c | 174 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  15 ++++
 5 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/write-intent.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea..af93119d52e2 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o write-intent.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 908a735a66cf..fcc8ae4b7fb4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -989,6 +989,7 @@ struct btrfs_fs_info {
 	struct workqueue_struct *scrub_wr_completion_workers;
 	struct workqueue_struct *scrub_parity_workers;
 	struct btrfs_subpage_info *subpage_info;
+	struct write_intent_ctrl *wi_ctrl;
 
 	struct btrfs_discard_ctl discard_ctl;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 963a89cd4bfb..edbb21706bda 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3699,6 +3699,12 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			  ret);
 		goto fail_block_groups;
 	}
+	ret = btrfs_write_intent_init(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init write-intent bitmaps: %d",
+			  ret);
+		goto fail_block_groups;
+	}
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to recover balance: %d", ret);
@@ -4639,6 +4645,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 		ret = btrfs_commit_super(fs_info);
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
+		btrfs_write_intent_free(fs_info);
 	}
 
 	if (BTRFS_FS_ERROR(fs_info))
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
new file mode 100644
index 000000000000..a7ed21182525
--- /dev/null
+++ b/fs/btrfs/write-intent.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "ctree.h"
+#include "volumes.h"
+#include "write-intent.h"
+
+/*
+ * Return 0 if a valid write intent bitmap can be found.
+ * Return 1 if no valid write intent bitmap can be found.
+ * Return <0 for other fatal errors.
+ */
+static int write_intent_load(struct btrfs_device *device, struct page *dst)
+{
+	struct btrfs_fs_info *fs_info = device->fs_info;
+	struct write_intent_super *wis;
+	struct bio *bio;
+	int ret;
+
+	bio = bio_alloc(device->bdev, 1, REQ_OP_READ | REQ_SYNC | REQ_META,
+			GFP_NOFS);
+	/* It's backed by fs_bioset. */
+	ASSERT(bio);
+	bio->bi_iter.bi_sector = BTRFS_DEVICE_RANGE_RESERVED >> SECTOR_SHIFT;
+	__bio_add_page(bio, dst, WRITE_INTENT_BITMAPS_SIZE, 0);
+	ret = submit_bio_wait(bio);
+	if (ret < 0)
+		return ret;
+
+	wis = page_address(dst);
+	if (wi_super_magic(wis) != WRITE_INTENT_SUPER_MAGIC)
+		return 1;
+
+	/* Stale bitmaps, doesn't belong to our fs. */
+	if (memcmp(wis->fsid, device->fs_devices->fsid, BTRFS_FSID_SIZE))
+		return 1;
+
+	/* Above checks passed, but the csum type mismatches, a big problem. */
+	if (btrfs_super_csum_type(fs_info->super_copy) !=
+	    wi_super_csum_type(wis)) {
+		btrfs_err(fs_info,
+		"csum type mismatch, write intent bitmap has %u fs has %u",
+			  wi_super_csum_type(wis),
+			  btrfs_super_csum_type(fs_info->super_copy));
+		return -EUCLEAN;
+	}
+
+	if (wi_super_flags(wis) & ~WRITE_INTENT_FLAGS_SUPPORTED) {
+		btrfs_err(fs_info, "unsupported flags 0x%llx",
+			  wi_super_flags(wis) & ~WRITE_INTENT_FLAGS_SUPPORTED);
+		return -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+static void write_intent_init(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+
+	ASSERT(ctrl);
+	ASSERT(ctrl->page);
+
+	/* Always start event count from 1. */
+	atomic64_set(&ctrl->event, 1);
+	ctrl->blocksize = WRITE_INTENT_BLOCKSIZE;
+	memzero_page(ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
+
+	wis = page_address(ctrl->page);
+	memcpy(wis->fsid, fs_info->fs_devices->fsid, BTRFS_FSID_SIZE);
+	wi_set_super_magic(wis, WRITE_INTENT_SUPER_MAGIC);
+	wi_set_super_csum_type(wis, btrfs_super_csum_type(fs_info->super_copy));
+	wi_set_super_events(wis, 1);
+	wi_set_super_flags(wis, WRITE_INTENT_FLAGS_SUPPORTED);
+	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
+	wi_set_super_blocksize(wis, ctrl->blocksize);
+	wi_set_super_nr_entries(wis, 0);
+	btrfs_info(fs_info, "creating new write intent bitmaps");
+}
+
+int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_device *highest_dev = NULL;
+	struct btrfs_device *dev;
+	struct write_intent_super *wis;
+	u64 highest_event = 0;
+	int ret;
+
+	ASSERT(fs_info->wi_ctrl == NULL);
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return 0;
+
+	fs_info->wi_ctrl = kzalloc(sizeof(*fs_info->wi_ctrl), GFP_NOFS);
+	if (!fs_info->wi_ctrl)
+		return -ENOMEM;
+
+	fs_info->wi_ctrl->page = alloc_page(GFP_NOFS);
+	if (!fs_info->wi_ctrl->page) {
+		ret = -ENOMEM;
+		goto cleanup;
+	}
+
+	/*
+	 * Go through every writeable device to find the highest event.
+	 *
+	 * Only the write-intent with highest event number makes sense.
+	 * If during bitmap writeback we lost power, some dev may have old
+	 * bitmap which is already stale.
+	 */
+	list_for_each_entry(dev, &fs_info->fs_devices->devices, dev_list) {
+		u64 cur_event;
+
+		if (!dev->bdev)
+			continue;
+
+		ret = write_intent_load(dev, fs_info->wi_ctrl->page);
+		if (ret > 0)
+			continue;
+		if (ret < 0) {
+			btrfs_err(fs_info,
+			"failed to load write intent from devid %llu: %d",
+				  dev->devid, ret);
+			goto cleanup;
+		}
+		wis = page_address(fs_info->wi_ctrl->page);
+		cur_event = wi_super_events(wis);
+		if (cur_event > highest_event) {
+			highest_dev = dev;
+			highest_event = cur_event;
+		}
+	}
+
+	/* Load the bitmap with the highest event as our bitmap. */
+	if (highest_dev) {
+		ret = write_intent_load(highest_dev, fs_info->wi_ctrl->page);
+		if (ret < 0) {
+			btrfs_err(fs_info,
+			"failed to load write intent from devid %llu: %d",
+				  highest_dev->devid, ret);
+			goto cleanup;
+		}
+		wis = page_address(fs_info->wi_ctrl->page);
+		atomic64_set(&fs_info->wi_ctrl->event, wi_super_events(wis));
+		fs_info->wi_ctrl->blocksize = wi_super_blocksize(wis);
+		btrfs_info(fs_info,
+			"loaded write intent bitmaps, event count %llu",
+			atomic64_read(&fs_info->wi_ctrl->event));
+		return 0;
+	}
+
+	/* No valid bitmap found, create a new one. */
+	write_intent_init(fs_info);
+	return 0;
+cleanup:
+	if (fs_info->wi_ctrl) {
+		if (fs_info->wi_ctrl->page)
+			__free_page(fs_info->wi_ctrl->page);
+		kfree(fs_info->wi_ctrl);
+		fs_info->wi_ctrl = NULL;
+	}
+	return ret;
+}
+
+void btrfs_write_intent_free(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+
+	if (!ctrl)
+		return;
+	ASSERT(ctrl->page);
+	__free_page(ctrl->page);
+	kfree(ctrl);
+	fs_info->wi_ctrl = NULL;
+}
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index b851917bb0b6..2c5cd434e978 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -106,6 +106,18 @@ struct write_intent_entry {
 /* The number of bits we can have in one entry. */
 #define WRITE_INTENT_BITS_PER_ENTRY		(64)
 
+/* In-memory write-intent control structure. */
+struct write_intent_ctrl {
+	/* For the write_intent super and entries. */
+	struct page *page;
+
+	/* Cached event counter.*/
+	atomic64_t event;
+
+	/* Cached blocksize from write intent super. */
+	u32 blocksize;
+};
+
 /*
  * ON-DISK FORMAT
  * ==============
@@ -196,4 +208,7 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 #endif
 }
 
+int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
+void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
+
 #endif
-- 
2.36.1



* [PATCH RFC 05/11] btrfs: write-intent: write the newly created bitmaps to all disks
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 04/11] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

This write back will happen even for RO mounts.

This ensures we always have write-intent bitmaps on disk for
filesystems with the compat RO flag set.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 197 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/write-intent.h |   6 ++
 2 files changed, 198 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index a7ed21182525..a43c6d94f8cd 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -1,8 +1,183 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include <crypto/hash.h>
+#include <linux/bio.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "write-intent.h"
+#include "rcu-string.h"
+
+static void write_intent_end_write(struct bio *bio)
+{
+	struct btrfs_device *device = bio->bi_private;
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+	struct page *page;
+
+	bio_for_each_segment_all(bvec, bio, iter_all) {
+		page = bvec->bv_page;
+
+		if (bio->bi_status) {
+			btrfs_warn_rl_in_rcu(device->fs_info,
+				"write-intent bitmap update failed on %s (%d)",
+				rcu_str_deref(device->name),
+				blk_status_to_errno(bio->bi_status));
+			ClearPageUptodate(page);
+			btrfs_dev_stat_inc_and_print(device,
+						     BTRFS_DEV_STAT_WRITE_ERRS);
+		} else {
+			SetPageUptodate(page);
+		}
+
+		unlock_page(page);
+		put_page(page);
+	}
+
+	bio_put(bio);
+}
+
+static int submit_one_device(struct btrfs_device *dev)
+{
+	struct btrfs_fs_info *fs_info = dev->fs_info;
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct address_space *mapping;
+	struct page *page;
+	struct bio *bio;
+
+	if (!dev->bdev)
+		return -EIO;
+
+	if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+		return 0;
+
+	mapping = dev->bdev->bd_inode->i_mapping;
+
+	page = find_or_create_page(mapping,
+				   BTRFS_DEVICE_RANGE_RESERVED >> PAGE_SHIFT,
+				   GFP_NOFS);
+	if (!page) {
+		btrfs_err(fs_info,
+		"couldn't get write intent bitmap page for devid %llu",
+			  dev->devid);
+		return -EIO;
+	}
+
+	/* Bump the refcount for later wait on this page */
+	get_page(page);
+	memcpy_page(page, offset_in_page(BTRFS_DEVICE_RANGE_RESERVED),
+		    ctrl->commit_page, 0, WRITE_INTENT_BITMAPS_SIZE);
+	bio = bio_alloc(dev->bdev, 1, REQ_OP_WRITE | REQ_SYNC |
+			REQ_META | REQ_PRIO | REQ_FUA, GFP_NOFS);
+	bio->bi_iter.bi_sector = BTRFS_DEVICE_RANGE_RESERVED >> SECTOR_SHIFT;
+	bio->bi_private = dev;
+	bio->bi_end_io = write_intent_end_write;
+	__bio_add_page(bio, page, WRITE_INTENT_BITMAPS_SIZE,
+			offset_in_page(BTRFS_DEVICE_RANGE_RESERVED));
+	submit_bio(bio);
+	return 0;
+}
+
+static int wait_one_device(struct btrfs_device *dev)
+{
+	struct btrfs_fs_info *fs_info = dev->fs_info;
+	struct page *page;
+	int ret = 0;
+
+	/*
+	 * This missing device has already been accounted in
+	 * submit_one_device(), no need to report error again.
+	 */
+	if (!dev->bdev)
+		return 0;
+
+	if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+		return 0;
+
+	page = find_get_page(dev->bdev->bd_inode->i_mapping,
+			     BTRFS_DEVICE_RANGE_RESERVED >> PAGE_SHIFT);
+	if (!page) {
+		btrfs_err(fs_info,
+		"couldn't wait write intent bitmap page for devid %llu",
+			  dev->devid);
+		return -EIO;
+	}
+
+	/* The endio will unlock the page. */
+	wait_on_page_locked(page);
+	if (!PageUptodate(page))
+		ret = -EIO;
+
+	/* Drop our reference */
+	put_page(page);
+
+	/* Drop the reference bumped by submit_one_device() */
+	put_page(page);
+
+	return ret;
+}
+
+/* Write back the page to all devices. */
+static int write_intent_writeback(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	struct btrfs_device *dev;
+	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	int total_errors = 0;
+	int ret;
+
+	ASSERT(ctrl);
+
+	shash->tfm = fs_info->csum_shash;
+
+	spin_lock(&ctrl->lock);
+	wis = page_address(ctrl->page);
+
+	/*
+	 * Bump up the event counter each time this bitmap is going to be
+	 * written.
+	 */
+	wi_set_super_events(wis, wi_super_events(wis) + 1);
+	crypto_shash_digest(shash, (unsigned char *)wis + BTRFS_CSUM_SIZE,
+			    WRITE_INTENT_BITMAPS_SIZE - BTRFS_CSUM_SIZE,
+			    wis->csum);
+	atomic64_inc(&ctrl->event);
+	memcpy_page(ctrl->commit_page, 0, ctrl->page, 0,
+		    WRITE_INTENT_BITMAPS_SIZE);
+	spin_unlock(&ctrl->lock);
+
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	/*
+	 * Go through all the writeable devices, copy the bitmap page into the
+	 * page cache, and submit them (without waiting).
+	 *
+	 * We will later check the page status to make sure they reached disk.
+	 */
+	list_for_each_entry(dev, &fs_info->fs_devices->devices, dev_list) {
+		ret = submit_one_device(dev);
+		if (ret < 0)
+			total_errors++;
+	}
+	/*
+	 * Wait for the submitted page to finish on each device.
+	 * By this we can submit all write intent bitmaps in one go, without
+	 * waiting each one.
+	 */
+	list_for_each_entry(dev, &fs_info->fs_devices->devices, dev_list) {
+		ret = wait_one_device(dev);
+		if (ret < 0)
+			total_errors++;
+	}
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+
+	if (total_errors > btrfs_super_num_devices(fs_info->super_copy) - 1) {
+		btrfs_err(fs_info, "failed to writeback write-intent bitmaps");
+		return -EIO;
+	}
+	return 0;
+}
 
 /*
  * Return 0 if a valid write intent bitmap can be found.
@@ -53,10 +228,11 @@ static int write_intent_load(struct btrfs_device *device, struct page *dst)
 	return ret;
 }
 
-static void write_intent_init(struct btrfs_fs_info *fs_info)
+static int write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
 	struct write_intent_super *wis;
+	int ret;
 
 	ASSERT(ctrl);
 	ASSERT(ctrl->page);
@@ -75,7 +251,12 @@ static void write_intent_init(struct btrfs_fs_info *fs_info)
 	wi_set_super_size(wis, WRITE_INTENT_BITMAPS_SIZE);
 	wi_set_super_blocksize(wis, ctrl->blocksize);
 	wi_set_super_nr_entries(wis, 0);
-	btrfs_info(fs_info, "creating new write intent bitmaps");
+
+	ret = write_intent_writeback(fs_info);
+	if (ret < 0)
+		return ret;
+	btrfs_info(fs_info, "new write intent bitmaps created");
+	return 0;
 }
 
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
@@ -95,11 +276,14 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 		return -ENOMEM;
 
 	fs_info->wi_ctrl->page = alloc_page(GFP_NOFS);
-	if (!fs_info->wi_ctrl->page) {
+	fs_info->wi_ctrl->commit_page = alloc_page(GFP_NOFS);
+	if (!fs_info->wi_ctrl->page || !fs_info->wi_ctrl->commit_page) {
 		ret = -ENOMEM;
 		goto cleanup;
 	}
 
+	spin_lock_init(&fs_info->wi_ctrl->lock);
+
 	/*
 	 * Go through every writeable device to find the highest event.
 	 *
@@ -149,12 +333,15 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 	}
 
 	/* No valid bitmap found, create a new one. */
-	write_intent_init(fs_info);
-	return 0;
+	ret = write_intent_init(fs_info);
+
+	return ret;
 cleanup:
 	if (fs_info->wi_ctrl) {
 		if (fs_info->wi_ctrl->page)
 			__free_page(fs_info->wi_ctrl->page);
+		if (fs_info->wi_ctrl->commit_page)
+			__free_page(fs_info->wi_ctrl->commit_page);
 		kfree(fs_info->wi_ctrl);
 		fs_info->wi_ctrl = NULL;
 	}
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 2c5cd434e978..797e57aef0e1 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -111,9 +111,15 @@ struct write_intent_ctrl {
 	/* For the write_intent super and entries. */
 	struct page *page;
 
+	/* A copy for writeback. */
+	struct page *commit_page;
+
 	/* Cached event counter.*/
 	atomic64_t event;
 
+	/* Lock for reading/writing above @page. */
+	spinlock_t lock;
+
 	/* Cached blocksize from write intent super. */
 	u32 blocksize;
 };
-- 
2.36.1



* [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 05/11] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-06  6:16   ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 07/11] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 251 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  16 +++
 2 files changed, 267 insertions(+)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index a43c6d94f8cd..0650f168db79 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -259,6 +259,257 @@ static int write_intent_init(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
+static struct write_intent_entry *write_intent_entry_nr(
+				struct write_intent_ctrl *ctrl, int nr)
+{
+
+	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+	return (page_address(ctrl->page) +
+		sizeof(struct write_intent_super) +
+		nr * sizeof(struct write_intent_entry));
+}
+
+/*
+ * Return <0 if the bytenr is before the given entry.
+ * Return 0 if the bytenr is inside the given entry.
+ * Return >0 if the bytenr is after the given entry.
+ */
+static int compare_bytenr_to_range(u64 bytenr, u64 start, u32 len)
+{
+	if (bytenr < start)
+		return -1;
+	if (start <= bytenr && bytenr < start + len)
+		return 0;
+	return 1;
+}
+
+/*
+ * Move all non-empty entries starting from @nr, to the right, and make room
+ * for @nr_new entries.
+ * Those new entries will be all zero filled.
+ *
+ * Caller should ensure we have enough room for @nr_new new entries.
+ */
+static void move_entries_right(struct write_intent_ctrl *ctrl, int nr,
+			       int nr_new)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	int move_size;
+
+	ASSERT(nr_new > 0);
+	ASSERT(wi_super_nr_entries(wis) + nr_new <=
+	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+
+	move_size = (wi_super_nr_entries(wis) - nr) *
+		     sizeof(struct write_intent_entry);
+
+	memmove(write_intent_entry_nr(ctrl, nr + nr_new),
+		write_intent_entry_nr(ctrl, nr), move_size);
+	memset(write_intent_entry_nr(ctrl, nr), 0,
+	       nr_new * sizeof(struct write_intent_entry));
+	wi_set_super_nr_entries(wis, wi_super_nr_entries(wis) + nr_new);
+}
+
+static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
+				  struct write_intent_entry *entry,
+				  u64 bytenr, u32 len)
+{
+	const u64 entry_start = wi_entry_bytenr(entry);
+	const u32 entry_len = write_intent_entry_size(ctrl);
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
+	bitmap_set(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
+		   len / ctrl->blocksize);
+	wie_set_bitmap(entry, bitmaps);
+}
+
+/*
+ * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
+ * and fill the new entries with proper bytenr and bitmaps.
+ */
+static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
+			       u64 bytenr, u32 len)
+{
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	u64 entry_start;
+	u64 new_start = round_down(bytenr, entry_size);
+	u64 new_end;
+	int nr_new_entries;
+	u64 cur;
+
+	if (nr >= wi_super_nr_entries(page_address(ctrl->page)) ||
+	    nr >= WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES)
+		entry_start = U64_MAX;
+	else
+		entry_start = wi_entry_bytenr(write_intent_entry_nr(ctrl, nr));
+
+	ASSERT(bytenr < entry_start);
+
+	new_end = min(entry_start, round_up(bytenr + len, entry_size));
+	nr_new_entries = (new_end - new_start) / entry_size;
+
+	if (nr_new_entries == 0)
+		return;
+
+	move_entries_right(ctrl, nr, nr_new_entries);
+
+	for (cur = new_start; cur < new_end; cur += entry_size, nr++) {
+		struct write_intent_entry *entry =
+			write_intent_entry_nr(ctrl, nr);
+		u64 range_start = max(cur, bytenr);
+		u64 range_len = min(cur + entry_size, bytenr + len) -
+				range_start;
+
+		/* Fill the bytenr into the new empty entries.*/
+		wi_set_entry_bytenr(entry, cur);
+
+		/* And set the bitmap. */
+		set_bits_in_one_entry(ctrl, entry, range_start, range_len);
+	}
+}
+
+/*
+ * This should be only called when we have enough room in the bitmaps, and hold
+ * the wi_ctrl->lock.
+ */
+static void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
+				  u32 len)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	int i;
+	u64 nr_entries = wi_super_nr_entries(wis);
+	u64 cur_bytenr;
+
+	/*
+	 * Currently we only accept full stripe length, which should be
+	 * aligned to 64KiB.
+	 */
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	/*
+	 * We should have room to contain the worst case scenario, in which we
+	 * need to create one or more new entry.
+	 */
+	ASSERT(nr_entries + bytes_to_entries(bytenr, len, BTRFS_STRIPE_LEN) <=
+	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
+
+	/* Empty bitmap, just insert new ones. */
+	if (wi_super_nr_entries(wis) == 0)
+		return insert_new_entries(ctrl, 0, bytenr, len);
+
+	/* Go through entries to find the one covering our range. */
+	for (i = 0, cur_bytenr = bytenr;
+	     i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
+		struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
+		u64 entry_start = wi_entry_bytenr(entry);
+		u64 entry_end = entry_start + entry_size;
+
+		/*
+		 *			|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *		|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *	|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * We need to insert one or more new entries for the range not
+		 * covered by the existing entry.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start,
+					    entry_size) < 0) {
+			u64 new_range_end;
+
+			new_range_end = min(entry_start, bytenr + len);
+			insert_new_entries(ctrl, i, cur_bytenr,
+					   new_range_end - cur_bytenr);
+
+			cur_bytenr = new_range_end;
+			continue;
+		}
+		/*
+		 * |<-- entry -->|
+		 *	|<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 * |<-------- entry ------->|
+		 *	|<- bytenr/len ->|
+		 *
+		 * In this case, we just set the bitmap in the current entry, and
+		 * advance @cur_bytenr to the end of the existing entry.
+		 * By this, we either go check the range against the next entry,
+		 * or we finish our current range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start,
+					    entry_size) == 0) {
+			u64 range_end = min(entry_end, bytenr + len);
+
+			set_bits_in_one_entry(ctrl, entry, cur_bytenr,
+					      range_end - cur_bytenr);
+			cur_bytenr = range_end;
+			continue;
+		}
+
+		/*
+		 * (A)
+		 * |<-- entry -->|			|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (B)
+		 * |<-- entry -->|		|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (C)
+		 * |<-- entry -->|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 */
+		if (i < wi_super_nr_entries(wis) - 1) {
+			struct write_intent_entry *next =
+				write_intent_entry_nr(ctrl, i + 1);
+			u64 next_start = wi_entry_bytenr(next);
+
+
+			/* Case (A) and (B), insert the new entries. */
+			if (cur_bytenr >= entry_end && cur_bytenr < next_start) {
+				insert_new_entries(ctrl, i + 1, cur_bytenr,
+						   bytenr + len - cur_bytenr);
+				cur_bytenr = next_start;
+				continue;
+			}
+
+			/* Case (C), just skip to next item */
+			continue;
+		}
+
+		/*
+		 * The remaining case is, @entry is already the last one.
+		 *
+		 * |<-- entry -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * We're beyond the last entry. Need to insert new entries.
+		 */
+		insert_new_entries(ctrl, i + 1, cur_bytenr,
+				   bytenr + len - cur_bytenr);
+
+		cur_bytenr = bytenr + len;
+	}
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 797e57aef0e1..d8f4d285512c 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -106,6 +106,15 @@ struct write_intent_entry {
 /* The number of bits we can have in one entry. */
 #define WRITE_INTENT_BITS_PER_ENTRY		(64)
 
+static inline u32 bytes_to_entries(u64 start, u32 length, u32 blocksize)
+{
+	u32 entry_len = blocksize * WRITE_INTENT_BITS_PER_ENTRY;
+	u64 entry_start = round_down(start, entry_len);
+	u64 entry_end = round_up(start + length, entry_len);
+
+	return DIV_ROUND_UP((u32)(entry_end - entry_start), entry_len);
+}
+
 /* In-memory write-intent control structure. */
 struct write_intent_ctrl {
 	/* For the write_intent super and entries. */
@@ -189,6 +198,13 @@ WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
 			  csum_type, 16);
 WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
 
+static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+
+	return wi_super_blocksize(wis) * WRITE_INTENT_BITS_PER_ENTRY;
+}
+
 static inline void wie_get_bitmap(struct write_intent_entry *entry,
 				  unsigned long *bitmap)
 {
-- 
2.36.1



* [PATCH RFC 07/11] btrfs: write-intent: introduce an internal helper to clear bits for a range.
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 08/11] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

This new helper, write_intent_clear_bits(), is much simpler than its
set-bits counterpart.

If we can not find an entry for our target range, something must be
wrong, and we only need to warn and skip to the next entry.

There is one extra thing to do though: if we have emptied an entry, we
have to delete that empty entry.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/write-intent.c | 175 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 175 insertions(+)

diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 0650f168db79..6b99e7c70908 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -326,6 +326,36 @@ static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
 	wie_set_bitmap(entry, bitmaps);
 }
 
+static bool is_entry_empty(struct write_intent_ctrl *ctrl,
+			   struct write_intent_entry *entry)
+{
+	unsigned long bitmaps[2];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	return bitmap_empty(bitmaps, WRITE_INTENT_BITS_PER_ENTRY);
+}
+
+/*
+ * NOTE: This function may clear all bits of an entry, caller must check if
+ * that's the case, and delete the empty entry if needed.
+ */
+static void clear_bits_in_one_entry(struct write_intent_ctrl *ctrl,
+				    struct write_intent_entry *entry,
+				    u64 bytenr, u32 len)
+{
+	const u64 entry_start = wi_entry_bytenr(entry);
+	const u32 entry_len = write_intent_entry_size(ctrl);
+	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
+
+	wie_get_bitmap(entry, bitmaps);
+
+	ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
+	bitmap_clear(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
+		   len / ctrl->blocksize);
+	wie_set_bitmap(entry, bitmaps);
+}
+
 /*
  * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
  * and fill the new entries with proper bytenr and bitmaps.
@@ -371,6 +401,25 @@ static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
 	}
 }
 
+static void delete_one_entry(struct write_intent_ctrl *ctrl, int nr)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	int cur_nr_entries = wi_super_nr_entries(wis);
+
+	ASSERT(is_entry_empty(ctrl, write_intent_entry_nr(ctrl, nr)));
+	ASSERT(nr < cur_nr_entries);
+
+	/* Move all the entries after slot @nr by one slot. */
+	memmove(write_intent_entry_nr(ctrl, nr),
+		write_intent_entry_nr(ctrl, nr + 1),
+		(cur_nr_entries - nr - 1) * sizeof(struct write_intent_entry));
+
+	/* Memzero the right most entry. */
+	memset(write_intent_entry_nr(ctrl, cur_nr_entries - 1), 0,
+	       sizeof(struct write_intent_entry));
+	wi_set_super_nr_entries(wis, cur_nr_entries - 1);
+}
+
 /*
  * This should be only called when we have enough room in the bitmaps, and hold
  * the wi_ctrl->lock.
@@ -510,6 +559,132 @@ static void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 	}
 }
 
+/* This should be only called with wi_ctrl->lock hold. */
+static void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
+				    u32 len)
+{
+	struct write_intent_super *wis = page_address(ctrl->page);
+	const u32 entry_size = write_intent_entry_size(ctrl);
+	int i;
+	u64 cur_bytenr;
+
+	/*
+	 * Currently we only accept full stripe length, which should be
+	 * aligned to 64KiB.
+	 */
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	for (i = 0, cur_bytenr = bytenr;
+	     i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
+		struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
+		u64 entry_start = wi_entry_bytenr(entry);
+		u64 entry_end = entry_start + entry_size;
+
+		/*
+		 *			|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *		|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 *	|<-- entry -->|
+		 * |<-- bytenr/len -->|
+		 *
+		 * This case should not happen, it means we have some logged
+		 * dirty range, but it's no longer there.
+		 * Just warn and skip to the next covered range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start, entry_size) < 0) {
+			WARN_ON_ONCE(1);
+			cur_bytenr = max(bytenr + len, entry_start);
+			continue;
+		}
+
+		/*
+		 * |<-- entry -->|
+		 *	|<-- bytenr/len -->|
+		 *
+		 * Or
+		 *
+		 * |<-------- entry ------->|
+		 *	|<- bytenr/len ->|
+		 *
+		 * In this case, we just clear the bitmap in current entry, and
+		 * advance @cur_bytenr.
+		 * By this, we either go check the range against the next entry,
+		 * or we finish our current range.
+		 */
+		if (compare_bytenr_to_range(cur_bytenr, entry_start, entry_size) == 0) {
+			u64 range_end = min(entry_end, bytenr + len);
+
+			clear_bits_in_one_entry(ctrl, entry, cur_bytenr,
+						range_end - cur_bytenr);
+			cur_bytenr = range_end;
+			/*
+			 * If the current entry is empty, we need to delete the
+			 * entry and decrease @nr.
+			 */
+			if (is_entry_empty(ctrl, entry)) {
+				delete_one_entry(ctrl, i);
+				i--;
+			}
+			continue;
+		}
+
+		/*
+		 * (A)
+		 * |<-- entry -->|			|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (B)
+		 * |<-- entry -->|		|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * OR
+		 *
+		 * (C)
+		 * |<-- entry -->|<--- next -->|
+		 *		   |<-- bytenr/len -->|
+		 */
+		if (i < wi_super_nr_entries(wis) - 1) {
+			struct write_intent_entry *next =
+				write_intent_entry_nr(ctrl, i + 1);
+			u64 next_start = wi_entry_bytenr(next);
+
+			/* Case (A) and (B), insert the new entries. */
+			if (cur_bytenr >= entry_end && cur_bytenr < next_start) {
+				/*
+				 * This should not happen, as we should always
+				 * have dirty bits in range.
+				 * Just warn and skip to next entry.
+				 */
+				cur_bytenr = next_start;
+				WARN_ON_ONCE(1);
+				continue;
+			}
+
+			/* Case (C), just skip to next item */
+			continue;
+		}
+
+		/*
+		 * The remaining case is, @entry is already the last one.
+		 *
+		 * |<-- entry -->|
+		 *		   |<-- bytenr/len -->|
+		 *
+		 * Another impossible case, just warn and continue.
+		 */
+		WARN_ON_ONCE(1);
+	}
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
-- 
2.36.1



* [PATCH RFC 08/11] btrfs: write back write intent bitmap after barrier_all_devices()
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 07/11] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 09/11] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

The new function, btrfs_write_intent_writeback(), also accepts a u64
parameter, @event, used for an extra fastpath check to avoid
unnecessary writeback.

For now we just pass 0 to write the current bitmap to disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c      |  9 +++++
 fs/btrfs/write-intent.c | 81 +++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/write-intent.h | 25 +++++++++++++
 3 files changed, 112 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index edbb21706bda..bd30decd38e4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4357,6 +4357,15 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 		}
 	}
 
+	ret = btrfs_write_intent_writeback(fs_info, 0);
+	if (ret < 0) {
+		mutex_unlock(
+			&fs_info->fs_devices->device_list_mutex);
+		btrfs_handle_fs_error(fs_info, ret,
+			"errors while writing back write intent bitmaps.");
+		return ret;
+	}
+
 	list_for_each_entry(dev, head, dev_list) {
 		if (!dev->bdev) {
 			total_errors++;
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 6b99e7c70908..dc8a4b46ca9e 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -133,8 +133,17 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	shash->tfm = fs_info->csum_shash;
 
 	spin_lock(&ctrl->lock);
-	wis = page_address(ctrl->page);
 
+	/* No update on the bitmap, just skip this writeback. */
+	if (!memcmp(page_address(ctrl->page), page_address(ctrl->commit_page),
+		    WRITE_INTENT_BITMAPS_SIZE)) {
+		ctrl->writing_event = 0;
+		spin_unlock(&ctrl->lock);
+		wake_up(&ctrl->write_wait);
+		return 0;
+	}
+
+	wis = page_address(ctrl->page);
 	/*
 	 * Bump up the event counter each time this bitmap is going to be
 	 * written.
@@ -148,7 +157,6 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 		    WRITE_INTENT_BITMAPS_SIZE);
 	spin_unlock(&ctrl->lock);
 
-	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	/*
 	 * Go through all the writeable devices, copy the bitmap page into the
 	 * page cache, and submit them (without waiting).
@@ -170,7 +178,13 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 		if (ret < 0)
 			total_errors++;
 	}
-	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+
+	spin_lock(&ctrl->lock);
+	if (ctrl->writing_event > ctrl->committed_event)
+		ctrl->committed_event = ctrl->writing_event;
+	ctrl->writing_event = 0;
+	spin_unlock(&ctrl->lock);
+	wake_up(&ctrl->write_wait);
 
 	if (total_errors > btrfs_super_num_devices(fs_info->super_copy) - 1) {
 		btrfs_err(fs_info, "failed to writeback write-intent bitmaps");
@@ -685,6 +699,63 @@ static void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 	}
 }
 
+int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return 0;
+
+	ASSERT(ctrl);
+
+again:
+	spin_lock(&ctrl->lock);
+
+	/*
+	 * The bitmap has already been written to disk at least once. Our update
+	 * has already reached disk.
+	 */
+	if (event && ctrl->committed_event > event) {
+		spin_unlock(&ctrl->lock);
+		return 0;
+	}
+
+	/* Previous writing hasn't finished, wait for it and retry. */
+	if (ctrl->writing_event && ctrl->writing_event <= event) {
+		DEFINE_WAIT(__wait);
+
+		prepare_to_wait_event(&ctrl->write_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->write_wait, &__wait);
+		goto again;
+	}
+
+	/* Someone is already writing back the bitmap. */
+	if (ctrl->writing_event) {
+		DEFINE_WAIT(__wait);
+
+		ASSERT(ctrl->writing_event > event);
+		prepare_to_wait_event(&ctrl->write_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->write_wait, &__wait);
+		return 0;
+	}
+
+	/*
+	 * We're the one to write back the bitmap, update @writing_event so
+	 * all the other callers will just wait for us.
+	 */
+	ctrl->writing_event = atomic64_read(&ctrl->event) + 1;
+	spin_unlock(&ctrl->lock);
+
+	/* Slow path, do the submission and wait. */
+	return write_intent_writeback(fs_info);
+}
+
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_device *highest_dev = NULL;
@@ -709,6 +780,8 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 	}
 
 	spin_lock_init(&fs_info->wi_ctrl->lock);
+	init_waitqueue_head(&fs_info->wi_ctrl->overflow_wait);
+	init_waitqueue_head(&fs_info->wi_ctrl->write_wait);
 
 	/*
 	 * Go through every writeable device to find the highest event.
@@ -749,6 +822,8 @@ int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
 				  dev->devid, ret);
 			goto cleanup;
 		}
+		memcpy_page(fs_info->wi_ctrl->commit_page, 0,
+			    fs_info->wi_ctrl->page, 0, WRITE_INTENT_BITMAPS_SIZE);
 		wis = page_address(fs_info->wi_ctrl->page);
 		atomic64_set(&fs_info->wi_ctrl->event, wi_super_events(wis));
 		fs_info->wi_ctrl->blocksize = wi_super_blocksize(wis);
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index d8f4d285512c..1a6cd9c723d2 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -123,12 +123,30 @@ struct write_intent_ctrl {
 	/* A copy for writeback. */
 	struct page *commit_page;
 
+	/*
+	 * For callers who have updated their bitmap, wait for the bitmap to be
+	 * flushed to disk.
+	 */
+	wait_queue_head_t write_wait;
+
+	/*
+	 * For callers whose bits cannot be updated because there is not
+	 * enough space left in the bitmap.
+	 */
+	wait_queue_head_t overflow_wait;
+
 	/* Cached event counter.*/
 	atomic64_t event;
 
 	/* Lock for reading/writing above @page. */
 	spinlock_t lock;
 
+	/* Event number for the bitmap being written. */
+	u64 writing_event;
+
+	/* Event number for the last committed bitmap. */
+	u64 committed_event;
+
 	/* Cached blocksize from write intent super. */
 	u32 blocksize;
 };
@@ -233,4 +251,11 @@ static inline void wie_set_bitmap(struct write_intent_entry *entry,
 int btrfs_write_intent_init(struct btrfs_fs_info *fs_info);
 void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
 
+/*
+ * Ensure the bitmap for @event is already written to disk.
+ *
+ * If @event is 0, write the current bitmap to disk unconditionally.
+ */
+int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 09/11] btrfs: update and writeback the write-intent bitmap for RAID56 write.
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (7 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 08/11] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 10/11] btrfs: raid56: clear write-intent bitmaps when a full stripe finishes Qu Wenruo
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

This allows us to mark the write-intent bitmaps dirty for later recovery
usage.

For now, we only mark the bitmaps dirty without really clearing them.
This will cause problems (hanging the fs if the bitmap is full), but it
also allows us to debug the bitmap with the new dump-write-intent
tool:

  csum_type		0 (crc32c)
  csum			0xad622029 [match]
  magic			_wIbSb_Q [match]
  fsid			46bcd711-6c9b-400f-aaba-bf99aa3dd321
  flags			0x7
  			( TARGET_RAID56 |
  			  INTERNAL |
  			  BYTENR_LOGICAL )
  events			10
  size			4096
  blocksize		65536
  nr_entries		1
  entry 0, bytenr 385875968, bitmap 0x000000000003fffc

This was a 1MiB write at logical 388104192, which matches the bitmap
output above.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c       |  8 +++++++
 fs/btrfs/write-intent.c | 46 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h |  9 ++++++++
 3 files changed, 63 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7b2d2b6c8c61..9f18d6f6f4dc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1185,6 +1185,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	bool has_qstripe;
 	struct bio_list bio_list;
 	struct bio *bio;
+	u64 event;
 	int ret;
 
 	bio_list_init(&bio_list);
@@ -1338,6 +1339,13 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
 	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
+	/* Update the write intent bitmap before we start submitting bios. */
+	btrfs_write_intent_mark_dirty(bioc->fs_info, rbio->bioc->raid_map[0],
+				     rbio->nr_data * BTRFS_STRIPE_LEN, &event);
+	ret = btrfs_write_intent_writeback(bioc->fs_info, event);
+
+	if (ret < 0)
+		goto cleanup;
 	while ((bio = bio_list_pop(&bio_list))) {
 		bio->bi_end_io = raid_write_end_io;
 
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index dc8a4b46ca9e..08384ace432b 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -432,6 +432,9 @@ static void delete_one_entry(struct write_intent_ctrl *ctrl, int nr)
 	memset(write_intent_entry_nr(ctrl, cur_nr_entries - 1), 0,
 	       sizeof(struct write_intent_entry));
 	wi_set_super_nr_entries(wis, cur_nr_entries - 1);
+
+	/* We freed one entry, wake up anyone waiting for the extra space. */
+	wake_up(&ctrl->overflow_wait);
 }
 
 /*
@@ -699,6 +702,49 @@ static void write_intent_clear_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
 	}
 }
 
+void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 len, u64 *event_ret)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	u32 entry_len;
+	int nr_entries;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return;
+
+	ASSERT(ctrl);
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+again:
+	spin_lock(&ctrl->lock);
+	entry_len = ctrl->blocksize * WRITE_INTENT_BITS_PER_ENTRY;
+	nr_entries = (round_up(logical + len, entry_len) -
+		      round_down(logical, entry_len)) / entry_len;
+	wis = page_address(ctrl->page);
+
+	/*
+	 * We may not have enough space left. This calculation is definitely
+	 * overkill, but it ensures we have enough space.
+	 */
+	if (unlikely(wi_super_nr_entries(wis) + nr_entries) >=
+		     WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES) {
+		DEFINE_WAIT(__wait);
+
+		prepare_to_wait_event(&ctrl->overflow_wait, &__wait,
+				      TASK_UNINTERRUPTIBLE);
+		spin_unlock(&ctrl->lock);
+		schedule();
+		finish_wait(&ctrl->overflow_wait, &__wait);
+		goto again;
+	}
+
+	/* Update the bitmap. */
+	write_intent_set_bits(ctrl, logical, len);
+	*event_ret = atomic64_read(&ctrl->event);
+	spin_unlock(&ctrl->lock);
+}
+
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 1a6cd9c723d2..385f65047707 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -258,4 +258,13 @@ void btrfs_write_intent_free(struct btrfs_fs_info *fs_info);
  */
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
 
+/*
+ * Mark the range dirty in write intent bitmaps.
+ *
+ * May sleep (though unlikely) if there are not enough free entries.
+ * In that case, we wait for enough free entries to be released.
+ */
+void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				   u32 len, u64 *event_ret);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 10/11] btrfs: raid56: clear write-intent bitmaps when a full stripe finishes.
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (8 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 09/11] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-05  7:39 ` [PATCH RFC 11/11] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c       |  7 +++++++
 fs/btrfs/write-intent.c | 45 +++++++++++++++++++++++++++++------------
 fs/btrfs/write-intent.h |  8 ++++++++
 3 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 9f18d6f6f4dc..f2b0ea2315c5 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -817,6 +817,13 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 
 	if (rbio->generic_bio_cnt)
 		btrfs_bio_counter_sub(rbio->bioc->fs_info, rbio->generic_bio_cnt);
+
+	/* Clear the write-intent bitmap range for write operation. */
+	if (rbio->operation == BTRFS_RBIO_WRITE)
+		btrfs_write_intent_clear_dirty(rbio->bioc->fs_info,
+				       rbio->bioc->raid_map[0],
+				       rbio->nr_data * BTRFS_STRIPE_LEN);
+
 	/*
 	 * Clear the data bitmap, as the rbio may be cached for later usage.
 	 * do this before before unlock_stripe() so there will be no new bio
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index 08384ace432b..c8bee0f89fc9 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -125,6 +125,7 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	struct write_intent_super *wis;
 	struct btrfs_device *dev;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+	unsigned long flags;
 	int total_errors = 0;
 	int ret;
 
@@ -132,13 +133,13 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 
 	shash->tfm = fs_info->csum_shash;
 
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 
 	/* No update on the bitmap, just skip this writeback. */
 	if (!memcmp(page_address(ctrl->page), page_address(ctrl->commit_page),
 		    WRITE_INTENT_BITMAPS_SIZE)) {
 		ctrl->writing_event = 0;
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		wake_up(&ctrl->write_wait);
 		return 0;
 	}
@@ -155,7 +156,7 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 	atomic64_inc(&ctrl->event);
 	memcpy_page(ctrl->commit_page, 0, ctrl->page, 0,
 		    WRITE_INTENT_BITMAPS_SIZE);
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	/*
 	 * Go through all the writeable devices, copy the bitmap page into the
@@ -179,11 +180,11 @@ static int write_intent_writeback(struct btrfs_fs_info *fs_info)
 			total_errors++;
 	}
 
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 	if (ctrl->writing_event > ctrl->committed_event)
 		ctrl->committed_event = ctrl->writing_event;
 	ctrl->writing_event = 0;
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 	wake_up(&ctrl->write_wait);
 
 	if (total_errors > btrfs_super_num_devices(fs_info->super_copy) - 1) {
@@ -707,6 +708,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
 	struct write_intent_super *wis;
+	unsigned long flags;
 	u32 entry_len;
 	int nr_entries;
 
@@ -717,7 +719,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
 
 again:
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 	entry_len = ctrl->blocksize * WRITE_INTENT_BITS_PER_ENTRY;
 	nr_entries = (round_up(logical + len, entry_len) -
 		      round_down(logical, entry_len)) / entry_len;
@@ -733,7 +735,7 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 
 		prepare_to_wait_event(&ctrl->overflow_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->overflow_wait, &__wait);
 		goto again;
@@ -742,12 +744,29 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	/* Update the bitmap. */
 	write_intent_set_bits(ctrl, logical, len);
 	*event_ret = atomic64_read(&ctrl->event);
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+
+void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				    u32 len)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	unsigned long flags;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return;
+	ASSERT(ctrl);
+	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	write_intent_clear_bits(ctrl, logical, len);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	unsigned long flags;
 
 	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
 		return 0;
@@ -755,14 +774,14 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 	ASSERT(ctrl);
 
 again:
-	spin_lock(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 
 	/*
 	 * The bitmap has already been written to disk at least once. Our update
 	 * has already reached disk.
 	 */
 	if (event && ctrl->committed_event > event) {
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		return 0;
 	}
 
@@ -772,7 +791,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 
 		prepare_to_wait_event(&ctrl->write_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->write_wait, &__wait);
 		goto again;
@@ -785,7 +804,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 		ASSERT(ctrl->writing_event > event);
 		prepare_to_wait_event(&ctrl->write_wait, &__wait,
 				      TASK_UNINTERRUPTIBLE);
-		spin_unlock(&ctrl->lock);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
 		schedule();
 		finish_wait(&ctrl->write_wait, &__wait);
 		return 0;
@@ -796,7 +815,7 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 	 * all the other callers will just wait for us.
 	 */
 	ctrl->writing_event = atomic64_read(&ctrl->event) + 1;
-	spin_unlock(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	/* Slow path, do the submission and wait. */
 	return write_intent_writeback(fs_info);
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 385f65047707..7c0ea3545e89 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -267,4 +267,12 @@ int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event);
 void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 				   u32 len, u64 *event_ret);
 
+/*
+ * Clear the given range in the write intent bitmaps.
+ *
+ * This function does not sleep, and there is no need to wait for the bitmap
+ * to be flushed.
+ */
+void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
+				    u32 len);
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC 11/11] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
                   ` (9 preceding siblings ...)
  2022-07-05  7:39 ` [PATCH RFC 10/11] btrfs: raid56: clear write-intent bitmaps when a full stripe finishes Qu Wenruo
@ 2022-07-05  7:39 ` Qu Wenruo
  2022-07-06 23:36 ` [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Wang Yugui
  2022-07-07  4:24 ` Wang Yugui
  12 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-05  7:39 UTC (permalink / raw)
  To: linux-btrfs

To properly clean up the bitmaps, we need to scrub the logical ranges
recorded in the bitmaps.

Unfortunately there is no such convenient interface yet (scrub only
works at a device offset for now).

So just introduce a placeholder that warns and clears the bitmap.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c      |  7 +++++++
 fs/btrfs/write-intent.c | 35 +++++++++++++++++++++++++++++++++++
 fs/btrfs/write-intent.h | 10 ++++++++++
 3 files changed, 52 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bd30decd38e4..ccc136023303 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3705,6 +3705,13 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			  ret);
 		goto fail_block_groups;
 	}
+
+	ret = btrfs_write_intent_recover(fs_info);
+	if (ret < 0) {
+		btrfs_err(fs_info, "failed to recover write-intent bitmap: %d",
+			  ret);
+		goto fail_block_groups;
+	}
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to recover balance: %d", ret);
diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
index c8bee0f89fc9..529690fdfb0c 100644
--- a/fs/btrfs/write-intent.c
+++ b/fs/btrfs/write-intent.c
@@ -763,6 +763,41 @@ void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
+int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info)
+{
+	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
+	struct write_intent_super *wis;
+	int ret = 0;
+
+	if (!btrfs_fs_compat_ro(fs_info, WRITE_INTENT_BITMAP))
+		return ret;
+
+	ASSERT(ctrl);
+	wis = page_address(ctrl->page);
+
+	if (wi_super_nr_entries(wis) != 0) {
+		int i;
+
+		btrfs_warn(fs_info, "dirty write intent bitmap found:");
+		for (i = 0; i < wi_super_nr_entries(wis); i++) {
+			struct write_intent_entry *entry =
+				write_intent_entry_nr(ctrl, i);
+
+			btrfs_warn(fs_info,
+				"  entry=%u bytenr=%llu bitmap=0x%016llx\n", i,
+				   wi_entry_bytenr(entry),
+				   wi_entry_raw_bitmap(entry));
+		}
+		/* For now, we just clear the whole bitmap. */
+		memzero_page(ctrl->page, sizeof(struct write_intent_super),
+			     WRITE_INTENT_BITMAPS_SIZE -
+			     sizeof(struct write_intent_super));
+		wi_set_super_nr_entries(wis, 0);
+		ret = write_intent_writeback(fs_info);
+	}
+	return ret;
+}
+
 int btrfs_write_intent_writeback(struct btrfs_fs_info *fs_info, u64 event)
 {
 	struct write_intent_ctrl *ctrl = fs_info->wi_ctrl;
diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
index 7c0ea3545e89..eafbd3e80c54 100644
--- a/fs/btrfs/write-intent.h
+++ b/fs/btrfs/write-intent.h
@@ -215,6 +215,8 @@ WRITE_INTENT_SETGET_FUNCS(super_blocksize, struct write_intent_super,
 WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
 			  csum_type, 16);
 WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
+WRITE_INTENT_SETGET_FUNCS(entry_raw_bitmap, struct write_intent_entry,
+			  bitmap, 64);
 
 static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
 {
@@ -275,4 +277,12 @@ void btrfs_write_intent_mark_dirty(struct btrfs_fs_info *fs_info, u64 logical,
  */
 void btrfs_write_intent_clear_dirty(struct btrfs_fs_info *fs_info, u64 logical,
 				    u32 len);
+
+/*
+ * Rebuild the range in the write-intent bitmaps.
+ *
+ * Currently not working, it will just output a warning and clear the bitmap.
+ */
+int btrfs_write_intent_recover(struct btrfs_fs_info *fs_info);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-05  7:39 ` [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
@ 2022-07-06  6:16   ` Qu Wenruo
  2022-07-06  9:00     ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2022-07-06  6:16 UTC (permalink / raw)
  To: linux-btrfs



On 2022/7/5 15:39, Qu Wenruo wrote:
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Unfortunately there seems to be a corner case that is not handled
properly.

If we have start=37617664, len=196608, and an existing entry 0 with
bitmap=0x0700000000000000.

Then after calling set_bits() with the above range, we only get:
entry 0 bitmap=0xc700000000000000.

Note 0xc = 1100 in binary, so we didn't create a new entry for the
remaining 1 bit, which later triggered the WARN_ON_ONCE() in clear_bits.
I'll fix the bug in the code and add a selftest case for it.

Thanks,
Qu

> ---
>   fs/btrfs/write-intent.c | 251 ++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/write-intent.h |  16 +++
>   2 files changed, 267 insertions(+)
> 
> diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
> index a43c6d94f8cd..0650f168db79 100644
> --- a/fs/btrfs/write-intent.c
> +++ b/fs/btrfs/write-intent.c
> @@ -259,6 +259,257 @@ static int write_intent_init(struct btrfs_fs_info *fs_info)
>   	return 0;
>   }
>   
> +static struct write_intent_entry *write_intent_entry_nr(
> +				struct write_intent_ctrl *ctrl, int nr)
> +{
> +
> +	ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
> +	return (page_address(ctrl->page) +
> +		sizeof(struct write_intent_super) +
> +		nr * sizeof(struct write_intent_entry));
> +}
> +
> +/*
> + * Return <0 if the bytenr is before the given entry.
> + * Return 0 if the bytenr is inside the given entry.
> + * Return >0 if the bytenr is after the given entry.
> + */
> +static int compare_bytenr_to_range(u64 bytenr, u64 start, u32 len)
> +{
> +	if (bytenr < start)
> +		return -1;
> +	if (start <= bytenr && bytenr < start + len)
> +		return 0;
> +	return 1;
> +}
> +
> +/*
> + * Move all non-empty entries starting from @nr, to the right, and make room
> + * for @nr_new entries.
> + * Those new entries will be all zero filled.
> + *
> + * Caller should ensure we have enough room for @nr_new new entries.
> + */
> +static void move_entries_right(struct write_intent_ctrl *ctrl, int nr,
> +			       int nr_new)
> +{
> +	struct write_intent_super *wis = page_address(ctrl->page);
> +	int move_size;
> +
> +	ASSERT(nr_new > 0);
> +	ASSERT(wi_super_nr_entries(wis) + nr_new <=
> +	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
> +
> +	move_size = (wi_super_nr_entries(wis) - nr) *
> +		     sizeof(struct write_intent_entry);
> +
> +	memmove(write_intent_entry_nr(ctrl, nr + nr_new),
> +		write_intent_entry_nr(ctrl, nr), move_size);
> +	memset(write_intent_entry_nr(ctrl, nr), 0,
> +	       nr_new * sizeof(struct write_intent_entry));
> +	wi_set_super_nr_entries(wis, wi_super_nr_entries(wis) + nr_new);
> +}
> +
> +static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
> +				  struct write_intent_entry *entry,
> +				  u64 bytenr, u32 len)
> +{
> +	const u64 entry_start = wi_entry_bytenr(entry);
> +	const u32 entry_len = write_intent_entry_size(ctrl);
> +	unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
> +
> +	wie_get_bitmap(entry, bitmaps);
> +
> +	ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
> +	bitmap_set(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
> +		   len / ctrl->blocksize);
> +	wie_set_bitmap(entry, bitmaps);
> +}
> +
> +/*
> + * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
> + * and fill the new entries with proper bytenr and bitmaps.
> + */
> +static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
> +			       u64 bytenr, u32 len)
> +{
> +	const u32 entry_size = write_intent_entry_size(ctrl);
> +	u64 entry_start;
> +	u64 new_start = round_down(bytenr, entry_size);
> +	u64 new_end;
> +	int nr_new_entries;
> +	u64 cur;
> +
> +	if (nr >= wi_super_nr_entries(page_address(ctrl->page)) ||
> +	    nr >= WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES)
> +		entry_start = U64_MAX;
> +	else
> +		entry_start = wi_entry_bytenr(write_intent_entry_nr(ctrl, nr));
> +
> +	ASSERT(bytenr < entry_start);
> +
> +	new_end = min(entry_start, round_up(bytenr + len, entry_size));
> +	nr_new_entries = (new_end - new_start) / entry_size;
> +
> +	if (nr_new_entries == 0)
> +		return;
> +
> +	move_entries_right(ctrl, nr, nr_new_entries);
> +
> +	for (cur = new_start; cur < new_end; cur += entry_size, nr++) {
> +		struct write_intent_entry *entry =
> +			write_intent_entry_nr(ctrl, nr);
> +		u64 range_start = max(cur, bytenr);
> +		u64 range_len = min(cur + entry_size, bytenr + len) -
> +				range_start;
> +
> +		/* Fill the bytenr into the new empty entries.*/
> +		wi_set_entry_bytenr(entry, cur);
> +
> +		/* And set the bitmap. */
> +		set_bits_in_one_entry(ctrl, entry, range_start, range_len);
> +	}
> +}
> +
> +/*
> + * This should be only called when we have enough room in the bitmaps, and hold
> + * the wi_ctrl->lock.
> + */
> +static void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
> +				  u32 len)
> +{
> +	struct write_intent_super *wis = page_address(ctrl->page);
> +	const u32 entry_size = write_intent_entry_size(ctrl);
> +	int i;
> +	u64 nr_entries = wi_super_nr_entries(wis);
> +	u64 cur_bytenr;
> +
> +	/*
> +	 * Currently we only accept full stripe length, which should be
> +	 * aligned to 64KiB.
> +	 */
> +	ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
> +
> +	/*
> +	 * We should have room to contain the worst case scenario, in which we
> +	 * need to create one or more new entry.
> +	 */
> +	ASSERT(nr_entries + bytes_to_entries(bytenr, len, BTRFS_STRIPE_LEN) <=
> +	       WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
> +
> +	/* Empty bitmap, just insert new ones. */
> +	if (wi_super_nr_entries(wis) == 0)
> +		return insert_new_entries(ctrl, 0, bytenr, len);
> +
> +	/* Go through entries to find the one covering our range. */
> +	for (i = 0, cur_bytenr = bytenr;
> +	     i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
> +		struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
> +		u64 entry_start = wi_entry_bytenr(entry);
> +		u64 entry_end = entry_start + entry_size;
> +
> +		/*
> +		 *			|<-- entry -->|
> +		 * |<-- bytenr/len -->|
> +		 *
> +		 * Or
> +		 *
> +		 *		|<-- entry -->|
> +		 * |<-- bytenr/len -->|
> +		 *
> +		 * Or
> +		 *
> +		 *	|<-- entry -->|
> +		 * |<-- bytenr/len -->|
> +		 *
> +		 * We need to insert one or more new entries for the range not
> +		 * covered by the existing entry.
> +		 */
> +		if (compare_bytenr_to_range(cur_bytenr, entry_start,
> +					    entry_size) < 0) {
> +			u64 new_range_end;
> +
> +			new_range_end = min(entry_start, bytenr + len);
> +			insert_new_entries(ctrl, i, cur_bytenr,
> +					   new_range_end - cur_bytenr);
> +
> +			cur_bytenr = new_range_end;
> +			continue;
> +		}
> +		/*
> +		 * |<-- entry -->|
> +		 *	|<-- bytenr/len -->|
> +		 *
> +		 * Or
> +		 *
> +		 * |<-------- entry ------->|
> +		 *	|<- bytenr/len ->|
> +		 *
> +		 * In this case, we just set the bitmap in the current entry, and
> +		 * advance @cur_bytenr to the end of the existing entry.
> +		 * By this, we either go check the range against the next entry,
> +		 * or we finish our current range.
> +		 */
> +		if (compare_bytenr_to_range(cur_bytenr, entry_start,
> +					    entry_size) == 0) {
> +			u64 range_end = min(entry_end, bytenr + len);
> +
> +			set_bits_in_one_entry(ctrl, entry, cur_bytenr,
> +					      range_end - cur_bytenr);
> +			cur_bytenr = range_end;
> +			continue;
> +		}
> +
> +		/*
> +		 * (A)
> +		 * |<-- entry -->|			|<--- next -->|
> +		 *		   |<-- bytenr/len -->|
> +		 *
> +		 * OR
> +		 *
> +		 * (B)
> +		 * |<-- entry -->|		|<--- next -->|
> +		 *		   |<-- bytenr/len -->|
> +		 *
> +		 * OR
> +		 *
> +		 * (C)
> +		 * |<-- entry -->|<--- next -->|
> +		 *		   |<-- bytenr/len -->|
> +		 */
> +		if (i < wi_super_nr_entries(wis) - 1) {
> +			struct write_intent_entry *next =
> +				write_intent_entry_nr(ctrl, i + 1);
> +			u64 next_start = wi_entry_bytenr(next);
> +
> +
> +			/* Case (A) and (B), insert the new entries. */
> +			if (cur_bytenr >= entry_end && cur_bytenr < next_start) {
> +				insert_new_entries(ctrl, i + 1, cur_bytenr,
> +						   bytenr + len - cur_bytenr);
> +				cur_bytenr = next_start;
> +				continue;
> +			}
> +
> +			/* Case (C), just skip to next item */
> +			continue;
> +		}
> +
> +		/*
> +		 * The remaining case is, @entry is already the last one.
> +		 *
> +		 * |<-- entry -->|
> +		 *		   |<-- bytenr/len -->|
> +		 *
> +		 * We're beyond the last entry. Need to insert new entries.
> +		 */
> +		insert_new_entries(ctrl, i + 1, cur_bytenr,
> +				   bytenr + len - cur_bytenr);
> +
> +		cur_bytenr = bytenr + len;
> +	}
> +}
> +
>   int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
>   {
>   	struct btrfs_device *highest_dev = NULL;
> diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
> index 797e57aef0e1..d8f4d285512c 100644
> --- a/fs/btrfs/write-intent.h
> +++ b/fs/btrfs/write-intent.h
> @@ -106,6 +106,15 @@ struct write_intent_entry {
>   /* The number of bits we can have in one entry. */
>   #define WRITE_INTENT_BITS_PER_ENTRY		(64)
>   
> +static inline u32 bytes_to_entries(u64 start, u32 length, u32 blocksize)
> +{
> +	u32 entry_len = blocksize * WRITE_INTENT_BITS_PER_ENTRY;
> +	u64 entry_start = round_down(start, entry_len);
> +	u64 entry_end = round_up(start + length, entry_len);
> +
> +	return DIV_ROUND_UP((u32)(entry_end - entry_start), entry_len);
> +}
> +
>   /* In-memory write-intent control structure. */
>   struct write_intent_ctrl {
>   	/* For the write_intent super and entries. */
> @@ -189,6 +198,13 @@ WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
>   			  csum_type, 16);
>   WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
>   
> +static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
> +{
> +	struct write_intent_super *wis = page_address(ctrl->page);
> +
> +	return wi_super_blocksize(wis) * WRITE_INTENT_BITS_PER_ENTRY;
> +}
> +
>   static inline void wie_get_bitmap(struct write_intent_entry *entry,
>   				  unsigned long *bitmap)
>   {


* Re: [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range.
  2022-07-06  6:16   ` Qu Wenruo
@ 2022-07-06  9:00     ` Qu Wenruo
From: Qu Wenruo @ 2022-07-06  9:00 UTC (permalink / raw)
  To: linux-btrfs



On 2022/7/6 14:16, Qu Wenruo wrote:
> 
> 
> On 2022/7/5 15:39, Qu Wenruo wrote:
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Unfortunately there seems to be a corner case that is not handled
> properly.
> 
> If we have start=37617664, len=196608, and an existing entry 0 with
> bitmap=0x0700000000000000, then after calling set_bits() with the above
> range, we only get:
> entry 0 bitmap=0xc700000000000000.
> 
> Note 0xc = 1100 in binary, thus we didn't create a new entry for the
> remaining 1 bit, and later triggered the WARN_ON_ONCE() in the
> clear_bits path.
> 
> I'll fix the bug in the code and add a selftest case for it.
> 
> Thanks,
> Qu
> 
>> ---
>>   fs/btrfs/write-intent.c | 251 ++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/write-intent.h |  16 +++
>>   2 files changed, 267 insertions(+)
>>
>> diff --git a/fs/btrfs/write-intent.c b/fs/btrfs/write-intent.c
>> index a43c6d94f8cd..0650f168db79 100644
>> --- a/fs/btrfs/write-intent.c
>> +++ b/fs/btrfs/write-intent.c
>> @@ -259,6 +259,257 @@ static int write_intent_init(struct btrfs_fs_info *fs_info)
>>       return 0;
>>   }
>> +static struct write_intent_entry *write_intent_entry_nr(
>> +                struct write_intent_ctrl *ctrl, int nr)
>> +{
>> +
>> +    ASSERT(nr < WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
>> +    return (page_address(ctrl->page) +
>> +        sizeof(struct write_intent_super) +
>> +        nr * sizeof(struct write_intent_entry));
>> +}
>> +
>> +/*
>> + * Return <0 if the bytenr is before the given entry.
>> + * Return 0 if the bytenr is inside the given entry.
>> + * Return >0 if the bytenr is after the given entry.
>> + */
>> +static int compare_bytenr_to_range(u64 bytenr, u64 start, u32 len)
>> +{
>> +    if (bytenr < start)
>> +        return -1;
>> +    if (start <= bytenr && bytenr < start + len)
>> +        return 0;
>> +    return 1;
>> +}
>> +
>> +/*
>> + * Move all non-empty entries starting from @nr, to the right, and make room
>> + * for @nr_new entries.
>> + * Those new entries will be all zero filled.
>> + *
>> + * Caller should ensure we have enough room for @nr_new new entries.
>> + */
>> +static void move_entries_right(struct write_intent_ctrl *ctrl, int nr,
>> +                   int nr_new)
>> +{
>> +    struct write_intent_super *wis = page_address(ctrl->page);
>> +    int move_size;
>> +
>> +    ASSERT(nr_new > 0);
>> +    ASSERT(wi_super_nr_entries(wis) + nr_new <=
>> +           WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
>> +
>> +    move_size = (wi_super_nr_entries(wis) - nr) *
>> +             sizeof(struct write_intent_entry);
>> +
>> +    memmove(write_intent_entry_nr(ctrl, nr + nr_new),
>> +        write_intent_entry_nr(ctrl, nr), move_size);
>> +    memset(write_intent_entry_nr(ctrl, nr), 0,
>> +           nr_new * sizeof(struct write_intent_entry));
>> +    wi_set_super_nr_entries(wis, wi_super_nr_entries(wis) + nr_new);
>> +}
>> +
>> +static void set_bits_in_one_entry(struct write_intent_ctrl *ctrl,
>> +                  struct write_intent_entry *entry,
>> +                  u64 bytenr, u32 len)
>> +{
>> +    const u64 entry_start = wi_entry_bytenr(entry);
>> +    const u32 entry_len = write_intent_entry_size(ctrl);
>> +    unsigned long bitmaps[WRITE_INTENT_BITS_PER_ENTRY / BITS_PER_LONG];
>> +
>> +    wie_get_bitmap(entry, bitmaps);
>> +
>> +    ASSERT(entry_start <= bytenr && bytenr + len <= entry_start + entry_len);
>> +    bitmap_set(bitmaps, (bytenr - entry_start) / ctrl->blocksize,
>> +           len / ctrl->blocksize);
>> +    wie_set_bitmap(entry, bitmaps);
>> +}
>> +
>> +/*
>> + * Insert new entries for the range [@bytenr, @bytenr + @len) at slot @nr
>> + * and fill the new entries with proper bytenr and bitmaps.
>> + */
>> +static void insert_new_entries(struct write_intent_ctrl *ctrl, int nr,
>> +                   u64 bytenr, u32 len)
>> +{
>> +    const u32 entry_size = write_intent_entry_size(ctrl);
>> +    u64 entry_start;
>> +    u64 new_start = round_down(bytenr, entry_size);
>> +    u64 new_end;
>> +    int nr_new_entries;
>> +    u64 cur;
>> +
>> +    if (nr >= wi_super_nr_entries(page_address(ctrl->page)) ||
>> +        nr >= WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES)
>> +        entry_start = U64_MAX;
>> +    else
>> +        entry_start = wi_entry_bytenr(write_intent_entry_nr(ctrl, nr));
>> +
>> +    ASSERT(bytenr < entry_start);
>> +
>> +    new_end = min(entry_start, round_up(bytenr + len, entry_size));
>> +    nr_new_entries = (new_end - new_start) / entry_size;
>> +
>> +    if (nr_new_entries == 0)
>> +        return;
>> +
>> +    move_entries_right(ctrl, nr, nr_new_entries);
>> +
>> +    for (cur = new_start; cur < new_end; cur += entry_size, nr++) {
>> +        struct write_intent_entry *entry =
>> +            write_intent_entry_nr(ctrl, nr);
>> +        u64 range_start = max(cur, bytenr);
>> +        u64 range_len = min(cur + entry_size, bytenr + len) -
>> +                range_start;
>> +
>> +        /* Fill the bytenr into the new empty entries.*/
>> +        wi_set_entry_bytenr(entry, cur);
>> +
>> +        /* And set the bitmap. */
>> +        set_bits_in_one_entry(ctrl, entry, range_start, range_len);
>> +    }
>> +}
>> +
>> +/*
>> + * This should be only called when we have enough room in the bitmaps, and hold
>> + * the wi_ctrl->lock.
>> + */
>> +static void write_intent_set_bits(struct write_intent_ctrl *ctrl, u64 bytenr,
>> +                  u32 len)
>> +{
>> +    struct write_intent_super *wis = page_address(ctrl->page);
>> +    const u32 entry_size = write_intent_entry_size(ctrl);
>> +    int i;
>> +    u64 nr_entries = wi_super_nr_entries(wis);
>> +    u64 cur_bytenr;
>> +
>> +    /*
>> +     * Currently we only accept full stripe length, which should be
>> +     * aligned to 64KiB.
>> +     */
>> +    ASSERT(IS_ALIGNED(len, BTRFS_STRIPE_LEN));
>> +
>> +    /*
>> +     * We should have room to contain the worst case scenario, in which we
>> +     * need to create one or more new entry.
>> +     */
>> +    ASSERT(nr_entries + bytes_to_entries(bytenr, len, BTRFS_STRIPE_LEN) <=
>> +           WRITE_INTENT_INTERNAL_BITMAPS_MAX_ENTRIES);
>> +
>> +    /* Empty bitmap, just insert new ones. */
>> +    if (wi_super_nr_entries(wis) == 0)
>> +        return insert_new_entries(ctrl, 0, bytenr, len);
>> +
>> +    /* Go through entries to find the one covering our range. */
>> +    for (i = 0, cur_bytenr = bytenr;
>> +         i < wi_super_nr_entries(wis) && cur_bytenr < bytenr + len; i++) {
>> +        struct write_intent_entry *entry = write_intent_entry_nr(ctrl, i);
>> +        u64 entry_start = wi_entry_bytenr(entry);
>> +        u64 entry_end = entry_start + entry_size;
>> +
>> +        /*
>> +         *            |<-- entry -->|
>> +         * |<-- bytenr/len -->|
>> +         *
>> +         * Or
>> +         *
>> +         *        |<-- entry -->|
>> +         * |<-- bytenr/len -->|
>> +         *
>> +         * Or
>> +         *
>> +         *    |<-- entry -->|
>> +         * |<-- bytenr/len -->|
>> +         *
>> +         * We need to insert one or more new entries for the range not
>> +         * covered by the existing entry.
>> +         */
>> +        if (compare_bytenr_to_range(cur_bytenr, entry_start,
>> +                        entry_size) < 0) {
>> +            u64 new_range_end;
>> +
>> +            new_range_end = min(entry_start, bytenr + len);
>> +            insert_new_entries(ctrl, i, cur_bytenr,
>> +                       new_range_end - cur_bytenr);
>> +
>> +            cur_bytenr = new_range_end;
>> +            continue;
>> +        }
>> +        /*
>> +         * |<-- entry -->|
>> +         *    |<-- bytenr/len -->|
>> +         *
>> +         * Or
>> +         *
>> +         * |<-------- entry ------->|
>> +         *    |<- bytenr/len ->|
>> +         *
>> +         * In this case, we just set the bitmap in the current entry, and
>> +         * advance @cur_bytenr to the end of the existing entry.
>> +         * By this, we either go check the range against the next entry,
>> +         * or we finish our current range.
>> +         */
>> +        if (compare_bytenr_to_range(cur_bytenr, entry_start,
>> +                        entry_size) == 0) {
>> +            u64 range_end = min(entry_end, bytenr + len);
>> +
>> +            set_bits_in_one_entry(ctrl, entry, cur_bytenr,
>> +                          range_end - cur_bytenr);
>> +            cur_bytenr = range_end;
>> +            continue;

This is where the problem happens.

If this entry is also the last entry, and we still have something left, 
then we're screwed as we will no longer insert new entries for it, 
leaving some bits not properly set.

Will rework the big for loop to make the last-entry insertion consistent.

Thanks,
Qu
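To make the expected end state concrete, here is a small Python model of the entry layout (purely illustrative; the constants follow the patch, and this set_bits() sketches the *intended* behavior, not the kernel code):

```python
# Illustrative model of the write-intent entries from this patchset.
# Each entry covers 64 bits * 64KiB blocksize = 4MiB, and bit N maps to
# byte offset N * 64KiB from the entry's bytenr.
BLOCKSIZE = 64 * 1024          # BTRFS_STRIPE_LEN
ENTRY_SIZE = BLOCKSIZE * 64    # bytes covered by one entry

def set_bits(entries, bytenr, length):
    """Set all bits for [bytenr, bytenr + length).

    entries maps entry bytenr -> u64 bitmap, mirroring the sorted
    on-disk entries.  New entries are created on demand, which is
    exactly the step the buggy loop skips for the last entry.
    """
    cur, end = bytenr, bytenr + length
    while cur < end:
        entry_start = cur - (cur % ENTRY_SIZE)
        entries.setdefault(entry_start, 0)
        range_end = min(entry_start + ENTRY_SIZE, end)
        first_bit = (cur - entry_start) // BLOCKSIZE
        nr_bits = (range_end - cur) // BLOCKSIZE
        entries[entry_start] |= ((1 << nr_bits) - 1) << first_bit
        cur = range_end
    return entries

# The reported case: start=37617664, len=196608 (3 blocks), with an
# existing entry at bytenr 33554432 holding bitmap 0x0700000000000000.
entries = set_bits({33554432: 0x0700000000000000}, 37617664, 196608)
# Two blocks land in the existing entry (bits 62-63: 0x07 -> 0xc7)...
assert entries[33554432] == 0xc700000000000000
# ...and the third block needs a brand-new entry at the next 4MiB
# boundary, which the buggy loop never inserted.
assert entries[37748736] == 0x1
```

With the buggy loop the second assertion has no entry to check at all, matching the WARN_ON_ONCE() later hit in the clear_bits path.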
>> +        }
>> +
>> +        /*
>> +         * (A)
>> +         * |<-- entry -->|            |<--- next -->|
>> +         *           |<-- bytenr/len -->|
>> +         *
>> +         * OR
>> +         *
>> +         * (B)
>> +         * |<-- entry -->|        |<--- next -->|
>> +         *           |<-- bytenr/len -->|
>> +         *
>> +         * OR
>> +         *
>> +         * (C)
>> +         * |<-- entry -->|<--- next -->|
>> +         *           |<-- bytenr/len -->|
>> +         */
>> +        if (i < wi_super_nr_entries(wis) - 1) {
>> +            struct write_intent_entry *next =
>> +                write_intent_entry_nr(ctrl, i + 1);
>> +            u64 next_start = wi_entry_bytenr(next);
>> +
>> +
>> +            /* Case (A) and (B), insert the new entries. */
>> +            if (cur_bytenr >= entry_end && cur_bytenr < next_start) {
>> +                insert_new_entries(ctrl, i + 1, cur_bytenr,
>> +                           bytenr + len - cur_bytenr);
>> +                cur_bytenr = next_start;
>> +                continue;
>> +            }
>> +
>> +            /* Case (C), just skip to next item */
>> +            continue;
>> +        }
>> +
>> +        /*
>> +         * The remaining case is, @entry is already the last one.
>> +         *
>> +         * |<-- entry -->|
>> +         *           |<-- bytenr/len -->|
>> +         *
>> +         * We're beyond the last entry. Need to insert new entries.
>> +         */
>> +        insert_new_entries(ctrl, i + 1, cur_bytenr,
>> +                   bytenr + len - cur_bytenr);
>> +
>> +        cur_bytenr = bytenr + len;
>> +    }
>> +}
>> +
>>   int btrfs_write_intent_init(struct btrfs_fs_info *fs_info)
>>   {
>>       struct btrfs_device *highest_dev = NULL;
>> diff --git a/fs/btrfs/write-intent.h b/fs/btrfs/write-intent.h
>> index 797e57aef0e1..d8f4d285512c 100644
>> --- a/fs/btrfs/write-intent.h
>> +++ b/fs/btrfs/write-intent.h
>> @@ -106,6 +106,15 @@ struct write_intent_entry {
>>   /* The number of bits we can have in one entry. */
>>   #define WRITE_INTENT_BITS_PER_ENTRY        (64)
>> +static inline u32 bytes_to_entries(u64 start, u32 length, u32 blocksize)
>> +{
>> +    u32 entry_len = blocksize * WRITE_INTENT_BITS_PER_ENTRY;
>> +    u64 entry_start = round_down(start, entry_len);
>> +    u64 entry_end = round_up(start + length, entry_len);
>> +
>> +    return DIV_ROUND_UP((u32)(entry_end - entry_start), entry_len);
>> +}
>> +
>>   /* In-memory write-intent control structure. */
>>   struct write_intent_ctrl {
>>       /* For the write_intent super and entries. */
>> @@ -189,6 +198,13 @@ WRITE_INTENT_SETGET_FUNCS(super_csum_type, struct write_intent_super,
>>                 csum_type, 16);
>>   WRITE_INTENT_SETGET_FUNCS(entry_bytenr, struct write_intent_entry, bytenr, 64);
>> +static inline u32 write_intent_entry_size(struct write_intent_ctrl *ctrl)
>> +{
>> +    struct write_intent_super *wis = page_address(ctrl->page);
>> +
>> +    return wi_super_blocksize(wis) * WRITE_INTENT_BITS_PER_ENTRY;
>> +}
>> +
>>   static inline void wie_get_bitmap(struct write_intent_entry *entry,
>>                     unsigned long *bitmap)
>>   {


* Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
@ 2022-07-06 23:36 ` Wang Yugui
  2022-07-07  1:14   ` Qu Wenruo
  2022-07-07  4:24 ` Wang Yugui
From: Wang Yugui @ 2022-07-06 23:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi,

> [BACKGROUND]
> Unlike md-raid, btrfs RAID56 has no way to sync its devices when a
> power loss happens.
> 
> For pure mirror based profiles it's fine as btrfs can utilize its csums
> to find the correct mirror and repair the bad ones.
> 
> But for RAID56, the repair itself needs the data from other devices,
> thus any out-of-sync data can degrade the fault tolerance.
> 
> Even worse, an incorrect RMW can use the stale data to generate P/Q,
> removing the possibility of recovering the data.
> 
> 
> For md-raid, it uses a write-intent bitmap for faster resilvering, and
> a journal (partial parity log for RAID5) to ensure it can even
> withstand a power loss plus a device loss.
> 
> [OBJECTIVE]
> 
> This patchset will introduce a btrfs specific write-intent bitmap.
> 
> The bitmap will be located at physical offset 1MiB of each device, and the
> content is the same between all devices.
> 
> When there is a RAID56 write (currently all RAID56 writes, including
> full stripe writes), before submitting the real bios to the disks, the
> write-intent bitmap will be updated and flushed to all writeable
> devices.
> 
> So even if a power loss happens, at the next mount time we know which
> full stripes need to be checked, and can start a scrub for the involved
> logical bytenr ranges.
> 
> [NO RECOVERY CODE YET]
> 
> Unfortunately, this patchset only implements the write-intent bitmap
> code; the recovery part is still a placeholder, as we need some scrub
> refactor to make it only scrub a logical bytenr range.
> 
> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
> 
> Since btrfs can utilize csum for its metadata and CoWed data, unlike
> dm-bitmap which can only be used for faster re-silvering, we can fully
> rebuild the full stripe, as long as:
> 
> 1) There is no missing device
>    For the missing device case, we still need the full journal.
> 
> 2) Untouched data stays untouched
>    This should be mostly sane for sane hardware.

Is there a case where the write-intent bitmap log is rotated?

For Broadcom/LSI hardware RAID, once a disk is unplugged, the whole
disk will be rebuilt after the disk is plugged in again.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/07/07



* Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-06 23:36 ` [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Wang Yugui
@ 2022-07-07  1:14   ` Qu Wenruo
From: Qu Wenruo @ 2022-07-07  1:14 UTC (permalink / raw)
  To: Wang Yugui, Qu Wenruo; +Cc: linux-btrfs



On 2022/7/7 07:36, Wang Yugui wrote:
> Hi,
>
>> [...]
>>
>> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
>>
>> Since btrfs can utilize csum for its metadata and CoWed data, unlike
>> dm-bitmap which can only be used for faster re-silvering, we can fully
>> rebuild the full stripe, as long as:
>>
>> 1) There is no missing device
>>     For the missing device case, we still need the full journal.
>>
>> 2) Untouched data stays untouched
>>     This should be mostly sane for sane hardware.
>
> Is there a case that write-intent bitmap log is rotated?
>
> For hardware raid of broadcom/LSI,  once a disk is unpluged,
> the whole disk will be rebuild after this disk  is plugged again.

There is a plan to add a new flag, WRITE_INTENT_TARGET_DEGRADED, to
record such new writes after a degraded read-write operation.

But that would be lower priority, coming after the full journal work.

Thanks,
Qu


* Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
@ 2022-07-07  4:24 ` Wang Yugui
  2022-07-07  4:28   ` Qu Wenruo
From: Wang Yugui @ 2022-07-07  4:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi,

> [...]
>
> [BITMAPS DESIGN]
> 
> The bitmaps on-disk format looks like this:
> 
>  [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>  |<---------  super::size (4K) ------------->|
> 
> Super block contains how many entries are in use.
> 
> Each entry is 128 bits (16 bytes) in size, containing one u64 for the
> bytenr and one u64 for the bitmap.
> 
> All utilized entries are sorted in bytenr order, and no bits can
> overlap.
> 
> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
> can cover at most 4MiB, and the whole bitmap area can contain 224 entries.

Can we skip the write-intent bitmap update if the range is already logged
in the last N records (logrotate-aware), to improve write performance?
HDD sync IOPS is very low.

And a bigger blocksize would help this case too.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/07/07


> For the worst case, it can contain 14MiB of dirty ranges.
> (1 bit set per bitmap, which also means 2-disk RAID5 or 3-disk RAID6).
> 
> For the best case, it can contain 896MiB of dirty ranges.
> (all bits set in every bitmap)
> 
> [WHY NOT BTRFS BTREE]
> 
> The current write-intent structure has two requirements:
> 
> - Its data needs to survive across the stripe boundary
>   Normally this means a write-intent btree would need to act like a
>   proper tree root, with METADATA_ITEMs for all its tree blocks.
> 
> - Its data updates must happen outside of a transaction
>   Currently only the log tree can do such a thing.
>   But unfortunately the log tree can not survive across a transaction
>   boundary.
> 
> Thus a write-intent btree can only meet one of the two requirements, and
> is not a suitable solution here.
> 
> [TESTING AND BENCHMARK]
> 
> For the performance benchmark, unfortunately I don't have 3 HDDs to test
> with. Will do the benchmark after securing enough hardware.
> 
> For testing, it survives the volume/raid/dev-replace test groups, with
> no write-intent bitmap leakage.
> 
> Unfortunately there is still a warning triggered in btrfs/070, still
> under investigation; hopefully it is a false alert in the bitmap
> clearing path.
> 
> [TODO]
> - Scrub refactor to allow us to do proper recovery at mount time
>   Need to change scrub interface to scrub based on logical bytenr.
> 
>   This can be a huge amount of work, thus currently we will focus only
>   on a new RAID56 scrub interface for write-intent recovery.
> 
> - Extra optimizations
>   * Skip full stripe writes
>   * Enlarge the window between btrfs_write_intent_mark_dirty() and
>     btrfs_write_intent_writeback()
> 
> - Bug hunts and more fstests runs
> 
> - Proper performance benchmark
>   Needs hardware/baremetal VMs, since I don't have any physical machine
>   large enough to contain 3 3.5" HDDs.
> 
> 
> Qu Wenruo (11):
>   btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>   btrfs: introduce a new experimental compat RO flag,
>     WRITE_INTENT_BITMAP
>   btrfs: introduce the on-disk format of btrfs write intent bitmaps
>   btrfs: load/create write-intent bitmaps at mount time
>   btrfs: write-intent: write the newly created bitmaps to all disks
>   btrfs: write-intent: introduce an internal helper to set bits for a
>     range.
>   btrfs: write-intent: introduce an internal helper to clear bits for a
>     range.
>   btrfs: write back write intent bitmap after barrier_all_devices()
>   btrfs: update and writeback the write-intent bitmap for RAID56 write.
>   btrfs: raid56: clear write-intent bitmaps when a full stripe finishes.
>   btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
> 
>  fs/btrfs/Makefile          |   2 +-
>  fs/btrfs/ctree.h           |  24 +-
>  fs/btrfs/disk-io.c         |  54 +++
>  fs/btrfs/raid56.c          |  16 +
>  fs/btrfs/sysfs.c           |   2 +
>  fs/btrfs/volumes.c         |  34 +-
>  fs/btrfs/write-intent.c    | 962 +++++++++++++++++++++++++++++++++++++
>  fs/btrfs/write-intent.h    | 288 +++++++++++
>  fs/btrfs/zoned.c           |   8 +
>  include/uapi/linux/btrfs.h |  17 +
>  10 files changed, 1399 insertions(+), 8 deletions(-)
>  create mode 100644 fs/btrfs/write-intent.c
>  create mode 100644 fs/btrfs/write-intent.h
> 
> -- 
> 2.36.1



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  4:24 ` Wang Yugui
@ 2022-07-07  4:28   ` Qu Wenruo
  2022-07-07  4:40     ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2022-07-07  4:28 UTC (permalink / raw)
  To: Wang Yugui, Qu Wenruo; +Cc: linux-btrfs



On 2022/7/7 12:24, Wang Yugui wrote:
> Hi,
>
>> [...]
>>
>> [BITMAPS DESIGN]
>>
>> The bitmaps on-disk format looks like this:
>>
>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>   |<---------  super::size (4K) ------------->|
>>
>> The super block records how many entries are in use.
>>
>> Each entry is 128 bits (16 bytes) in size, containing one u64 for
>> bytenr, and u64 for one bitmap.
>>
>> And all utilized entries will be sorted in their bytenr order, and no
>> bit can overlap.
>>
>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
>> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
>
> Can we skip the write-intent bitmap log if we already log it in last N records
> (logrotate aware) to improve the write performance?  because HDD sync
> IOPS is very small.

I'm not aware of the logrotate idea you mentioned, mind explaining it a
bit more?

But the overall idea of journal/write-intent bitmaps is to always ensure
there is something recording the full write or the write intention
before the real IO is submitted.

So I'm afraid such behavior can not be changed much.

Thanks,
Qu

>
> And a bigger blocksize help this case too.
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/07/07
>
>
>> [...]
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  4:28   ` Qu Wenruo
@ 2022-07-07  4:40     ` Wang Yugui
  2022-07-07  5:05       ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2022-07-07  4:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

Hi,

> On 2022/7/7 12:24, Wang Yugui wrote:
> > Hi,
> >
> >> [...]
> >>
> >> [BITMAPS DESIGN]
> >>
> >> The bitmaps on-disk format looks like this:
> >>
> >>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
> >>   |<---------  super::size (4K) ------------->|
> >>
> >> The super block records how many entries are in use.
> >>
> >> Each entry is 128 bits (16 bytes) in size, containing one u64 for
> >> bytenr, and u64 for one bitmap.
> >>
> >> And all utilized entries will be sorted in their bytenr order, and no
> >> bit can overlap.
> >>
> >> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
> >> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
> >
> > Can we skip the write-intent bitmap log if we already log it in last N records
> > (logrotate aware) to improve the write performance?  because HDD sync
> > IOPS is very small.
> 
> I'm not aware about the logrotate idea you mentioned, mind to explain it
> more?
> 
> But the overall idea of journal/write-intent bitmaps are, always ensure
> there is something recording the full write or the write-intention
> before the real IO is submitted.
> 
> So I'm afraid such behavior can not be changed much.

The basic idea is that we recover the data by scrub, not from a full
log, so log *1 is the same as log *2, but with fewer IOPS?

log *1
	write(0-64K) log
	write(64K-128K) log
	write(128K-192K) log
	write(192K-256K) log
log *2
	write(0-256K) log
	already write(0-256K), skip
	already write(0-256K), skip
	already write(0-256K), skip

We can search the entry currently in use, but cannot search the previous
entry, because it may have been rotated out?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/07/07


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
  2022-07-07  4:40     ` Wang Yugui
@ 2022-07-07  5:05       ` Qu Wenruo
  0 siblings, 0 replies; 20+ messages in thread
From: Qu Wenruo @ 2022-07-07  5:05 UTC (permalink / raw)
  To: Wang Yugui, Qu Wenruo; +Cc: linux-btrfs



On 2022/7/7 12:40, Wang Yugui wrote:
> Hi,
> 
>> On 2022/7/7 12:24, Wang Yugui wrote:
>>> Hi,
>>>
>>>> [...]
>>>>
>>>> [BITMAPS DESIGN]
>>>>
>>>> The bitmaps on-disk format looks like this:
>>>>
>>>>    [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>>>    |<---------  super::size (4K) ------------->|
>>>>
>>>> The super block records how many entries are in use.
>>>>
>>>> Each entry is 128 bits (16 bytes) in size, containing one u64 for
>>>> bytenr, and u64 for one bitmap.
>>>>
>>>> And all utilized entries will be sorted in their bytenr order, and no
>>>> bit can overlap.
>>>>
>>>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
>>>> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
>>>
>>> Can we skip the write-intent bitmap log if we already log it in last N records
>>> (logrotate aware) to improve the write performance?  because HDD sync
>>> IOPS is very small.
>>
>> I'm not aware about the logrotate idea you mentioned, mind to explain it
>> more?
>>
>> But the overall idea of journal/write-intent bitmaps are, always ensure
>> there is something recording the full write or the write-intention
>> before the real IO is submitted.
>>
>> So I'm afraid such behavior can not be changed much.
> 
> The basic idea is that we recover the data by scrub, not full log,
> so log *1 is same to log *2, but with less IOPS?
> 
> log *1
> 	write(0-64K) log
> 	write(64K-128K) log
> 	write(128K-192K) log
> 	write(192K-256K) log

It's already done using the 64KiB blocksize.

In that case, for a partial write of any range inside the full stripe (for
3-disk RAID5 the full stripe is 128KiB), the whole full stripe will be
marked dirty.

So for the log *1 case above, it will just set the 4 bits covering the
full stripe 0-256K (if that is one full stripe) in one go.

> log *2
> 	write(0-256K) log
> 	already write(0-256K), skip
> 	already write(0-256K), skip
> 	already write(0-256K), skip
> 
> we can search the entry we currently used, but can not search the prev
> entry, because it maybe be logrotated?

Logrotation doesn't really help here. After one entry is written, the
next write will still trigger a new write of the bitmap, with the same
amount of seeking.

And the existing code already has a window to merge different bitmap
writes under the same event count.

The only requirement is to ensure the bitmaps with the corresponding
bits reach the disks before the data IO is submitted.
That leaves a lot of room for us to merge the bits into one write.

Thanks,
Qu

> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/07/07
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-07-07  5:06 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 01/11] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 02/11] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 03/11] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 04/11] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 05/11] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
2022-07-06  6:16   ` Qu Wenruo
2022-07-06  9:00     ` Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 07/11] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 08/11] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 09/11] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 10/11] btrfs: raid56: clear write-intent bitmaps when a full stripe finishes Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 11/11] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
2022-07-06 23:36 ` [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Wang Yugui
2022-07-07  1:14   ` Qu Wenruo
2022-07-07  4:24 ` Wang Yugui
2022-07-07  4:28   ` Qu Wenruo
2022-07-07  4:40     ` Wang Yugui
2022-07-07  5:05       ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).