* [PATCH v5 00/13] btrfs: introduce RAID stripe tree
@ 2023-02-08 10:57 Johannes Thumshirn
  2023-02-08 10:57 ` [PATCH v5 01/13] btrfs: re-add trans parameter to insert_delayed_ref Johannes Thumshirn
                   ` (14 more replies)
  0 siblings, 15 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Updates of the raid-stripe-tree are done at delayed-ref time to save on
bandwidth, while for reading we do the stripe-tree lookup at bio mapping time,
i.e. when the logical to physical translation happens for regular btrfs RAID
as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
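
As an illustration, resolving a logical address against such an item boils
down to the following (a minimal userspace-style sketch with a simplified,
hypothetical raid_stride struct and helper; the real lookup is added later in
this series in btrfs_get_raid_extent_offset()):

#include <stdint.h>

/* simplified stand-in for the on-disk btrfs_raid_stride, one entry per copy */
struct raid_stride {
	uint64_t devid;
	uint64_t physical;
};

/*
 * The item at key (disk_bytenr RAID_STRIPE_KEY disk_num_bytes) holds an array
 * of strides; the physical address is the stride's start plus the offset of
 * the requested logical address into the extent.
 */
static uint64_t stride_physical(uint64_t disk_bytenr,
				const struct raid_stride *strides,
				int num_strides, uint64_t devid, uint64_t logical)
{
	for (int i = 0; i < num_strides; i++)
		if (strides[i].devid == devid)
			return strides[i].physical + (logical - disk_bytenr);
	return UINT64_MAX;	/* device is not part of this stripe extent */
}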

For an example 1 MiB write (split into 126976-byte segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows:

rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
btrfs-progs v5.16.1 
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
leaf 805847040 flags 0x1(WRITTEN) backref revision 1
checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
        item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
                        stripe 0 devid 1 offset 939524096
                        stripe 1 devid 2 offset 536870912
        item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
                        stripe 0 devid 1 offset 939651072
                        stripe 1 devid 2 offset 536997888
        item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
                        stripe 0 devid 1 offset 939778048
                        stripe 1 devid 2 offset 537124864
        item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
                        stripe 0 devid 1 offset 939905024
                        stripe 1 devid 2 offset 537251840
        item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
                        stripe 0 devid 1 offset 940032000
                        stripe 1 devid 2 offset 537378816
        item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
                        stripe 0 devid 1 offset 940158976
                        stripe 1 devid 2 offset 537505792
        item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
                        stripe 0 devid 1 offset 940285952
                        stripe 1 devid 2 offset 537632768
        item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
                        stripe 0 devid 1 offset 940412928
                        stripe 1 devid 2 offset 537759744
        item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
                        stripe 0 devid 1 offset 940539904
                        stripe 1 devid 2 offset 537886720
total bytes 26843545600
bytes used 1245184
uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
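
As a worked example from item 0 above: a read of logical 939589632 (64 KiB
into the first stripe extent) on devid 2 resolves to physical
536870912 + (939589632 - 939524096) = 536936448.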

A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true


Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing 
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out a RST once per delayed_ref head
- Added support for zoned data DUP with RST

Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches

v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Johannes Thumshirn (13):
  btrfs: re-add trans parameter to insert_delayed_ref
  btrfs: add raid stripe tree definitions
  btrfs: read raid-stripe-tree from disk
  btrfs: add support for inserting raid stripe extents
  btrfs: delete stripe extent on extent deletion
  btrfs: lookup physical address from stripe extent
  btrfs: add raid stripe tree pretty printer
  btrfs: zoned: allow zoned RAID
  btrfs: check for leaks of ordered stripes on umount
  btrfs: add tracepoints for ordered stripes
  btrfs: announce presence of raid-stripe-tree in sysfs
  btrfs: consult raid-stripe-tree when scrubbing
  btrfs: add raid-stripe-tree to features enabled with debug

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/accessors.h            |  29 +++
 fs/btrfs/bio.c                  |  29 +++
 fs/btrfs/bio.h                  |   2 +
 fs/btrfs/block-rsv.c            |   1 +
 fs/btrfs/delayed-ref.c          |  13 +-
 fs/btrfs/delayed-ref.h          |   2 +
 fs/btrfs/disk-io.c              |  30 ++-
 fs/btrfs/disk-io.h              |   5 +
 fs/btrfs/extent-tree.c          |  68 ++++++
 fs/btrfs/fs.h                   |   8 +-
 fs/btrfs/inode.c                |  15 +-
 fs/btrfs/print-tree.c           |  21 ++
 fs/btrfs/raid-stripe-tree.c     | 415 ++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  87 +++++++
 fs/btrfs/scrub.c                |  33 ++-
 fs/btrfs/super.c                |   1 +
 fs/btrfs/sysfs.c                |   3 +
 fs/btrfs/volumes.c              |  39 ++-
 fs/btrfs/volumes.h              |  12 +-
 fs/btrfs/zoned.c                |  49 +++-
 include/trace/events/btrfs.h    |  50 ++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  20 +-
 24 files changed, 905 insertions(+), 30 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

-- 
2.39.0



* [PATCH v5 01/13] btrfs: re-add trans parameter to insert_delayed_ref
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 19:41   ` Josef Bacik
  2023-02-08 10:57 ` [PATCH v5 02/13] btrfs: add raid stripe tree definitions Johannes Thumshirn
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Re-add the trans parameter to insert_delayed_ref as it is needed again
later in this series.

This reverts commit bccf28752a99 ("btrfs: drop trans parameter of insert_delayed_ref")

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/delayed-ref.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 886ffb232eac..7660ac642c81 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -598,7 +598,8 @@ void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
  * Return 0 for insert.
  * Return >0 for merge.
  */
-static int insert_delayed_ref(struct btrfs_delayed_ref_root *root,
+static int insert_delayed_ref(struct btrfs_trans_handle *trans,
+			      struct btrfs_delayed_ref_root *root,
 			      struct btrfs_delayed_ref_head *href,
 			      struct btrfs_delayed_ref_node *ref)
 {
@@ -974,7 +975,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	head_ref = add_delayed_ref_head(trans, head_ref, record,
 					action, &qrecord_inserted);
 
-	ret = insert_delayed_ref(delayed_refs, head_ref, &ref->node);
+	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
 	spin_unlock(&delayed_refs->lock);
 
 	/*
@@ -1066,7 +1067,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	head_ref = add_delayed_ref_head(trans, head_ref, record,
 					action, &qrecord_inserted);
 
-	ret = insert_delayed_ref(delayed_refs, head_ref, &ref->node);
+	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
 	spin_unlock(&delayed_refs->lock);
 
 	/*
-- 
2.39.0



* [PATCH v5 02/13] btrfs: add raid stripe tree definitions
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
  2023-02-08 10:57 ` [PATCH v5 01/13] btrfs: re-add trans parameter to insert_delayed_ref Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 19:42   ` Josef Bacik
                     ` (2 more replies)
  2023-02-08 10:57 ` [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
                   ` (12 subsequent siblings)
  14 siblings, 3 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Add definitions for the raid stripe tree. This tree will hold information
about the on-disk layout of the stripes in a RAID set.

Each stripe extent has a 1:1 relationship with an on-disk extent item and
provides the logical to per-drive physical address translation for the
extent item in question.
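
Note that the stripe extent item carries no explicit stride count; readers
derive it from the item size, e.g. (a sketch; the same computation is used by
the pretty printer added later in this series):

	num_stripes = btrfs_item_size(leaf, slot) / sizeof(struct btrfs_raid_stride);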

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/accessors.h            | 29 +++++++++++++++++++++++++++++
 include/uapi/linux/btrfs_tree.h | 20 ++++++++++++++++++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index ceadfc5d6c66..6e753b63faae 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -306,6 +306,35 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
 
+BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+
+static inline struct btrfs_raid_stride *btrfs_raid_stride_nr(
+					 struct btrfs_stripe_extent *dps, int nr)
+{
+	unsigned long offset = (unsigned long)dps;
+
+	offset += offsetof(struct btrfs_stripe_extent, strides);
+	offset += nr * sizeof(struct btrfs_raid_stride);
+	return (struct btrfs_raid_stride *)offset;
+}
+
+static inline u64 btrfs_raid_stride_devid_nr(const struct extent_buffer *eb,
+					       struct btrfs_stripe_extent *dps,
+					       int nr)
+{
+	return btrfs_raid_stride_devid(eb, btrfs_raid_stride_nr(dps, nr));
+}
+
+static inline u64 btrfs_raid_stride_physical_nr(const struct extent_buffer *eb,
+						  struct btrfs_stripe_extent *dps,
+						  int nr)
+{
+	return btrfs_raid_stride_physical(eb, btrfs_raid_stride_nr(dps, nr));
+}
+
 /* struct btrfs_dev_extent */
 BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
 BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ab38d0f411fa..64e6bf2a10d8 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -4,9 +4,8 @@
 
 #include <linux/btrfs.h>
 #include <linux/types.h>
-#ifdef __KERNEL__
 #include <linux/stddef.h>
-#else
+#ifndef __KERNEL__
 #include <stddef.h>
 #endif
 
@@ -73,6 +72,9 @@
 /* Holds the block group items for extent tree v2. */
 #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
 
+/* tracks RAID stripes in block groups. */
+#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -281,6 +283,8 @@
  */
 #define BTRFS_QGROUP_RELATION_KEY       246
 
+#define BTRFS_RAID_STRIPE_KEY		247
+
 /*
  * Obsolete name, see BTRFS_TEMPORARY_ITEM_KEY.
  */
@@ -715,6 +719,18 @@ struct btrfs_free_space_header {
 	__le64 num_bitmaps;
 } __attribute__ ((__packed__));
 
+struct btrfs_raid_stride {
+	/* btrfs device-id this raid extent lives on */
+	__le64 devid;
+	/* physical location on disk */
+	__le64 physical;
+};
+
+struct btrfs_stripe_extent {
+	/* array of raid strides this stripe is composed of */
+	__DECLARE_FLEX_ARRAY(struct btrfs_raid_stride, strides);
+};
+
 #define BTRFS_HEADER_FLAG_WRITTEN	(1ULL << 0)
 #define BTRFS_HEADER_FLAG_RELOC		(1ULL << 1)
 
-- 
2.39.0



* [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
  2023-02-08 10:57 ` [PATCH v5 01/13] btrfs: re-add trans parameter to insert_delayed_ref Johannes Thumshirn
  2023-02-08 10:57 ` [PATCH v5 02/13] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 19:43   ` Josef Bacik
  2023-02-13 11:35   ` Anand Jain
  2023-02-08 10:57 ` [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

If we find a raid-stripe-tree on mount, read it from disk.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-rsv.c       |  1 +
 fs/btrfs/disk-io.c         | 19 +++++++++++++++++++
 fs/btrfs/disk-io.h         |  5 +++++
 fs/btrfs/fs.h              |  1 +
 include/uapi/linux/btrfs.h |  1 +
 5 files changed, 27 insertions(+)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 5367a14d44d2..384987343a64 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -402,6 +402,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
 	case BTRFS_EXTENT_TREE_OBJECTID:
 	case BTRFS_FREE_SPACE_TREE_OBJECTID:
 	case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
 		root->block_rsv = &fs_info->delayed_refs_rsv;
 		break;
 	case BTRFS_ROOT_TREE_OBJECTID:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0da0bde347e5..ad64a79d052a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1454,6 +1454,9 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
 
 		return btrfs_grab_root(root) ? root : ERR_PTR(-ENOENT);
 	}
+	if (objectid == BTRFS_RAID_STRIPE_TREE_OBJECTID)
+		return btrfs_grab_root(fs_info->stripe_root) ?
+			fs_info->stripe_root : ERR_PTR(-ENOENT);
 	return NULL;
 }
 
@@ -1532,6 +1535,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->fs_root);
 	btrfs_put_root(fs_info->data_reloc_root);
 	btrfs_put_root(fs_info->block_group_root);
+	btrfs_put_root(fs_info->stripe_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
 	kfree(fs_info->super_copy);
@@ -2067,6 +2071,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
 	free_root_extent_buffers(info->fs_root);
 	free_root_extent_buffers(info->data_reloc_root);
 	free_root_extent_buffers(info->block_group_root);
+	free_root_extent_buffers(info->stripe_root);
 	if (free_chunk_root)
 		free_root_extent_buffers(info->chunk_root);
 }
@@ -2519,6 +2524,20 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 		fs_info->uuid_root = root;
 	}
 
+	if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
+		location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
+		root = btrfs_read_tree_root(tree_root, &location);
+		if (IS_ERR(root)) {
+			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
+				ret = PTR_ERR(root);
+				goto out;
+			}
+		} else {
+			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+			fs_info->stripe_root = root;
+		}
+	}
+
 	return 0;
 out:
 	btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 3b53fc29a858..a85a8922c3fd 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -106,6 +106,11 @@ static inline struct btrfs_root *btrfs_grab_root(struct btrfs_root *root)
 	return NULL;
 }
 
+static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
+{
+	return fs_info->stripe_root;
+}
+
 void btrfs_put_root(struct btrfs_root *root);
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 4c477eae6891..93f2499a9c5b 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -367,6 +367,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *data_reloc_root;
 	struct btrfs_root *block_group_root;
+	struct btrfs_root *stripe_root;
 
 	/* The log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4f0f9531119..593fb7930a37 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -322,6 +322,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.39.0



* [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 19:47   ` Josef Bacik
                     ` (3 more replies)
  2023-02-08 10:57 ` [PATCH v5 05/13] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
                   ` (10 subsequent siblings)
  14 siblings, 4 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed; this
is done to a) support zone-append and b) rule out the possibility of a RAID
write hole.

This is done by creating in-memory ordered stripe extents, just like the
in-memory ordered extents, on I/O completion. The on-disk raid stripe
extents then get created once we're running the delayed_refs for the extent
item this stripe extent is tied to.
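
Roughly, the lifetime of such an ordered stripe in terms of the helpers added
in this patch is the following (error handling omitted, sketch only):

	/* data write bio completes (btrfs_orig_write_end_io) */
	btrfs_add_ordered_stripe(bioc);		   /* record logical -> per-device physical */

	/* later, while running the delayed ref for the extent item */
	stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
	btrfs_insert_raid_extent(trans, stripe);   /* persist as a RAID_STRIPE_KEY item */
	btrfs_put_ordered_stripe(fs_info, stripe); /* once for us */
	btrfs_put_ordered_stripe(fs_info, stripe); /* once for the tree */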

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/bio.c              |  29 ++++++
 fs/btrfs/bio.h              |   2 +
 fs/btrfs/delayed-ref.c      |   6 +-
 fs/btrfs/delayed-ref.h      |   2 +
 fs/btrfs/disk-io.c          |   9 +-
 fs/btrfs/extent-tree.c      |  60 +++++++++++
 fs/btrfs/fs.h               |   4 +
 fs/btrfs/inode.c            |  15 ++-
 fs/btrfs/raid-stripe-tree.c | 202 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  71 +++++++++++++
 fs/btrfs/volumes.c          |   5 +-
 fs/btrfs/volumes.h          |  12 +--
 fs/btrfs/zoned.c            |   4 +
 14 files changed, 410 insertions(+), 13 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 90d53209755b..3bb869a84e54 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
 	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
-	   lru_cache.o
+	   lru_cache.o raid-stripe-tree.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index e6fe1b7dbc50..a01c6560ef49 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -15,6 +15,7 @@
 #include "rcu-string.h"
 #include "zoned.h"
 #include "file-item.h"
+#include "raid-stripe-tree.h"
 
 static struct bio_set btrfs_bioset;
 static struct bio_set btrfs_clone_bioset;
@@ -350,6 +351,21 @@ static void btrfs_raid56_end_io(struct bio *bio)
 	btrfs_put_bioc(bioc);
 }
 
+static void btrfs_raid_stripe_update(struct work_struct *work)
+{
+	struct btrfs_bio *bbio =
+		container_of(work, struct btrfs_bio, raid_stripe_work);
+	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
+	struct btrfs_io_context *bioc = stripe->bioc;
+	int ret;
+
+	ret = btrfs_add_ordered_stripe(bioc);
+	if (ret)
+		bbio->bio.bi_status = errno_to_blk_status(ret);
+	btrfs_orig_bbio_end_io(bbio);
+	btrfs_put_bioc(bioc);
+}
+
 static void btrfs_orig_write_end_io(struct bio *bio)
 {
 	struct btrfs_io_stripe *stripe = bio->bi_private;
@@ -372,6 +388,16 @@ static void btrfs_orig_write_end_io(struct bio *bio)
 	else
 		bio->bi_status = BLK_STS_OK;
 
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+
+	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
+		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
+		queue_work(bbio->inode->root->fs_info->raid_stripe_workers,
+			   &bbio->raid_stripe_work);
+		return;
+	}
+
 	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
 }
@@ -383,6 +409,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
 	if (bio->bi_status) {
 		atomic_inc(&stripe->bioc->error);
 		btrfs_log_dev_io_error(bio, stripe->dev);
+	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	}
 
 	/* Pass on control to the original bio this one was cloned from */
@@ -442,6 +470,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 	bio->bi_private = &bioc->stripes[dev_nr];
 	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
 	bioc->stripes[dev_nr].bioc = bioc;
+	bioc->size = bio->bi_iter.bi_size;
 	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
 }
 
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 20105806c8ac..bf5fbc105148 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -58,6 +58,8 @@ struct btrfs_bio {
 	atomic_t pending_ios;
 	struct work_struct end_io_work;
 
+	struct work_struct raid_stripe_work;
+
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 7660ac642c81..261f52ad8e12 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -14,6 +14,7 @@
 #include "space-info.h"
 #include "tree-mod-log.h"
 #include "fs.h"
+#include "raid-stripe-tree.h"
 
 struct kmem_cache *btrfs_delayed_ref_head_cachep;
 struct kmem_cache *btrfs_delayed_tree_ref_cachep;
@@ -637,8 +638,11 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
 	exist->ref_mod += mod;
 
 	/* remove existing tail if its ref_mod is zero */
-	if (exist->ref_mod == 0)
+	if (exist->ref_mod == 0) {
+		btrfs_drop_ordered_stripe(trans->fs_info, exist->bytenr);
 		drop_delayed_ref(root, href, exist);
+	}
+
 	spin_unlock(&href->lock);
 	return ret;
 inserted:
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 2eb34abf700f..5096c1a1ed3e 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -51,6 +51,8 @@ struct btrfs_delayed_ref_node {
 	/* is this node still in the rbtree? */
 	unsigned int is_head:1;
 	unsigned int in_tree:1;
+	/* Do we need RAID stripe tree modifications? */
+	unsigned int must_insert_stripe:1;
 };
 
 struct btrfs_delayed_extent_op {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ad64a79d052a..b130c8dcd8d9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2022,6 +2022,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 		destroy_workqueue(fs_info->rmw_workers);
 	if (fs_info->compressed_write_workers)
 		destroy_workqueue(fs_info->compressed_write_workers);
+	if (fs_info->raid_stripe_workers)
+		destroy_workqueue(fs_info->raid_stripe_workers);
 	btrfs_destroy_workqueue(fs_info->endio_write_workers);
 	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
 	btrfs_destroy_workqueue(fs_info->delayed_workers);
@@ -2240,12 +2242,14 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
 	fs_info->discard_ctl.discard_workers =
 		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
+	fs_info->raid_stripe_workers =
+		alloc_workqueue("btrfs-raid-stripe", flags, max_active);
 
 	if (!(fs_info->workers && fs_info->hipri_workers &&
 	      fs_info->delalloc_workers && fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->compressed_write_workers &&
-	      fs_info->endio_write_workers &&
+	      fs_info->endio_write_workers && fs_info->raid_stripe_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->fixup_workers &&
 	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
@@ -3046,6 +3050,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 
 	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
 	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
+
+	rwlock_init(&fs_info->stripe_update_lock);
+	fs_info->stripe_update_tree = RB_ROOT;
 }
 
 static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 688cdf816957..50b3a2c3c0dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -42,6 +42,7 @@
 #include "file-item.h"
 #include "orphan.h"
 #include "tree-checker.h"
+#include "raid-stripe-tree.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1497,6 +1498,56 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static bool delayed_ref_needs_rst_update(struct btrfs_fs_info *fs_info,
+					 struct btrfs_delayed_ref_head *head)
+{
+	struct extent_map *em;
+	struct map_lookup *map;
+	bool ret = false;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return ret;
+
+	em = btrfs_get_chunk_map(fs_info, head->bytenr, head->num_bytes);
+	if (!em)
+		return ret;
+
+	map = em->map_lookup;
+
+	if (btrfs_need_stripe_tree_update(fs_info, map->type))
+		ret = true;
+
+	free_extent_map(em);
+
+	return ret;
+}
+
+static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
+					    struct btrfs_delayed_ref_node *node)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_ordered_stripe *stripe;
+	int ret = 0;
+
+	stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
+	if (!stripe) {
+		btrfs_err(fs_info,
+			  "cannot get stripe extent for address %llu (%llu)",
+			  node->bytenr, node->num_bytes);
+		return -EINVAL;
+	}
+
+	ASSERT(stripe->logical == node->bytenr);
+
+	ret = btrfs_insert_raid_extent(trans, stripe);
+	/* once for us */
+	btrfs_put_ordered_stripe(fs_info, stripe);
+	/* once for the tree */
+	btrfs_put_ordered_stripe(fs_info, stripe);
+
+	return ret;
+}
+
 static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 				struct btrfs_delayed_ref_node *node,
 				struct btrfs_delayed_extent_op *extent_op,
@@ -1527,11 +1578,17 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 						 flags, ref->objectid,
 						 ref->offset, &ins,
 						 node->ref_mod);
+		if (ret)
+			return ret;
+		if (node->must_insert_stripe)
+			ret = add_stripe_entry_for_delayed_ref(trans, node);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
 		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
 					     ref->objectid, ref->offset,
 					     node->ref_mod, extent_op);
 	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
+		if (node->must_insert_stripe)
+			btrfs_drop_ordered_stripe(trans->fs_info, node->bytenr);
 		ret = __btrfs_free_extent(trans, node, parent,
 					  ref_root, ref->objectid,
 					  ref->offset, node->ref_mod,
@@ -1901,6 +1958,8 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_delayed_extent_op *extent_op;
 	struct btrfs_delayed_ref_node *ref;
+	const bool need_rst_update =
+		delayed_ref_needs_rst_update(fs_info, locked_ref);
 	int must_insert_reserved = 0;
 	int ret;
 
@@ -1951,6 +2010,7 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
 		locked_ref->extent_op = NULL;
 		spin_unlock(&locked_ref->lock);
 
+		ref->must_insert_stripe = need_rst_update;
 		ret = run_one_delayed_ref(trans, ref, extent_op,
 					  must_insert_reserved);
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 93f2499a9c5b..bee7ed0304cd 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -551,6 +551,7 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *endio_write_workers;
 	struct btrfs_workqueue *endio_freespace_worker;
 	struct btrfs_workqueue *caching_workers;
+	struct workqueue_struct *raid_stripe_workers;
 
 	/*
 	 * Fixup workers take dirty pages that didn't properly go through the
@@ -791,6 +792,9 @@ struct btrfs_fs_info {
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
 
+	rwlock_t stripe_update_lock;
+	struct rb_root stripe_update_tree;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 36ae541ad51b..74c0b486e496 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -70,6 +70,7 @@
 #include "verity.h"
 #include "super.h"
 #include "orphan.h"
+#include "raid-stripe-tree.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -9509,12 +9510,17 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
 	if (qgroup_released < 0)
 		return ERR_PTR(qgroup_released);
 
+	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
+						    start, len);
+	if (ret)
+		goto free_qgroup;
+
 	if (trans) {
 		ret = insert_reserved_file_extent(trans, inode,
 						  file_offset, &stack_fi,
 						  true, qgroup_released);
 		if (ret)
-			goto free_qgroup;
+			goto free_stripe_extent;
 		return trans;
 	}
 
@@ -9532,7 +9538,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
-		goto free_qgroup;
+		goto free_stripe_extent;
 	}
 
 	ret = btrfs_replace_file_extents(inode, path, file_offset,
@@ -9540,9 +9546,12 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
 				     &trans);
 	btrfs_free_path(path);
 	if (ret)
-		goto free_qgroup;
+		goto free_stripe_extent;
 	return trans;
 
+free_stripe_extent:
+	btrfs_drop_ordered_stripe(inode->root->fs_info, start);
+
 free_qgroup:
 	/*
 	 * We have released qgroup data range at the beginning of the function,
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
new file mode 100644
index 000000000000..d184cd9dc96e
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Western Digital Corporation or its affiliates.
+ */
+
+#include <linux/btrfs_tree.h>
+
+#include "ctree.h"
+#include "fs.h"
+#include "accessors.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "raid-stripe-tree.h"
+#include "volumes.h"
+#include "misc.h"
+#include "disk-io.h"
+#include "print-tree.h"
+
+static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
+{
+	struct btrfs_ordered_stripe *stripe =
+		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+	const u64 *logical = key;
+
+	if (*logical < stripe->logical)
+		return -1;
+	if (*logical >= stripe->logical + stripe->num_bytes)
+		return 1;
+	return 0;
+}
+
+static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
+{
+	struct btrfs_ordered_stripe *stripe =
+		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
+	return ordered_stripe_cmp(&stripe->logical, rbb);
+}
+
+int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
+{
+	struct btrfs_fs_info *fs_info = bioc->fs_info;
+	struct btrfs_ordered_stripe *stripe;
+	struct btrfs_io_stripe *tmp;
+	u64 logical = bioc->logical;
+	u64 length = bioc->size;
+	struct rb_node *node;
+	size_t size;
+
+	size = bioc->num_stripes * sizeof(struct btrfs_io_stripe);
+	stripe = kzalloc(sizeof(struct btrfs_ordered_stripe), GFP_NOFS);
+	if (!stripe)
+		return -ENOMEM;
+
+	spin_lock_init(&stripe->lock);
+	tmp = kmemdup(bioc->stripes, size, GFP_NOFS);
+	if (!tmp) {
+		kfree(stripe);
+		return -ENOMEM;
+	}
+
+	stripe->logical = logical;
+	stripe->num_bytes = length;
+	stripe->num_stripes = bioc->num_stripes;
+	spin_lock(&stripe->lock);
+	stripe->stripes = tmp;
+	spin_unlock(&stripe->lock);
+	refcount_set(&stripe->ref, 1);
+
+	write_lock(&fs_info->stripe_update_lock);
+	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
+	       ordered_stripe_less);
+	write_unlock(&fs_info->stripe_update_lock);
+	if (node) {
+		struct btrfs_ordered_stripe *old =
+			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+
+		btrfs_debug(fs_info, "logical: %llu, length: %llu already exists",
+			  logical, length);
+		ASSERT(logical == old->logical);
+		write_lock(&fs_info->stripe_update_lock);
+		rb_replace_node(node, &stripe->rb_node,
+				&fs_info->stripe_update_tree);
+		write_unlock(&fs_info->stripe_update_lock);
+	}
+
+	return 0;
+}
+
+struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *fs_info,
+							 u64 logical)
+{
+	struct rb_root *root = &fs_info->stripe_update_tree;
+	struct btrfs_ordered_stripe *stripe = NULL;
+	struct rb_node *node;
+
+	read_lock(&fs_info->stripe_update_lock);
+	node = rb_find(&logical, root, ordered_stripe_cmp);
+	if (node) {
+		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+		refcount_inc(&stripe->ref);
+	}
+	read_unlock(&fs_info->stripe_update_lock);
+
+	return stripe;
+}
+
+void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
+				 struct btrfs_ordered_stripe *stripe)
+{
+	write_lock(&fs_info->stripe_update_lock);
+	if (refcount_dec_and_test(&stripe->ref)) {
+		struct rb_node *node = &stripe->rb_node;
+
+		rb_erase(node, &fs_info->stripe_update_tree);
+		RB_CLEAR_NODE(node);
+
+		spin_lock(&stripe->lock);
+		kfree(stripe->stripes);
+		spin_unlock(&stripe->lock);
+		kfree(stripe);
+	}
+	write_unlock(&fs_info->stripe_update_lock);
+}
+
+int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
+					  u64 start, u64 len)
+{
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_ordered_stripe *stripe;
+	u64 map_length = len;
+	int ret;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return 0;
+
+	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, start, &map_length,
+			      &bioc, 0);
+	if (ret)
+		return ret;
+
+	bioc->size = len;
+
+	stripe = btrfs_lookup_ordered_stripe(fs_info, start);
+	if (!stripe) {
+		ret = btrfs_add_ordered_stripe(bioc);
+		if (ret)
+			return ret;
+	} else {
+		spin_lock(&stripe->lock);
+		memcpy(stripe->stripes, bioc->stripes,
+		       bioc->num_stripes * sizeof(struct btrfs_io_stripe));
+		spin_unlock(&stripe->lock);
+		btrfs_put_ordered_stripe(fs_info, stripe);
+	}
+
+	return 0;
+}
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+			     struct btrfs_ordered_stripe *stripe)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key stripe_key;
+	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+	struct btrfs_stripe_extent *stripe_extent;
+	size_t item_size;
+	int ret;
+
+	item_size = stripe->num_stripes * sizeof(struct btrfs_raid_stride);
+
+	stripe_extent = kzalloc(item_size, GFP_NOFS);
+	if (!stripe_extent) {
+		btrfs_abort_transaction(trans, -ENOMEM);
+		btrfs_end_transaction(trans);
+		return -ENOMEM;
+	}
+
+	spin_lock(&stripe->lock);
+	for (int i = 0; i < stripe->num_stripes; i++) {
+		u64 devid = stripe->stripes[i].dev->devid;
+		u64 physical = stripe->stripes[i].physical;
+		struct btrfs_raid_stride *raid_stride =
+						&stripe_extent->strides[i];
+
+		btrfs_set_stack_raid_stride_devid(raid_stride, devid);
+		btrfs_set_stack_raid_stride_physical(raid_stride, physical);
+	}
+	spin_unlock(&stripe->lock);
+
+	stripe_key.objectid = stripe->logical;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = stripe->num_bytes;
+
+	ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
+				item_size);
+	if (ret)
+		btrfs_abort_transaction(trans, ret);
+
+	kfree(stripe_extent);
+
+	return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
new file mode 100644
index 000000000000..60d3f8489cc9
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Western Digital Corporation or its affiliates.
+ */
+
+#ifndef BTRFS_RAID_STRIPE_TREE_H
+#define BTRFS_RAID_STRIPE_TREE_H
+
+#include "disk-io.h"
+#include "messages.h"
+
+struct btrfs_io_context;
+
+struct btrfs_ordered_stripe {
+	struct rb_node rb_node;
+
+	u64 logical;
+	u64 num_bytes;
+	int num_stripes;
+	struct btrfs_io_stripe *stripes;
+	spinlock_t lock;
+	refcount_t ref;
+};
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+			     struct btrfs_ordered_stripe *stripe);
+int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
+					  u64 start, u64 len);
+struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
+						 struct btrfs_fs_info *fs_info,
+						 u64 logical);
+int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
+void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
+					    struct btrfs_ordered_stripe *stripe);
+
+static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
+						 u64 map_type)
+{
+	u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
+	u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return false;
+
+	if (type != BTRFS_BLOCK_GROUP_DATA)
+		return false;
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
+		return true;
+
+	return false;
+}
+
+static inline void btrfs_drop_ordered_stripe(struct btrfs_fs_info *fs_info,
+					     u64 logical)
+{
+	struct btrfs_ordered_stripe *stripe;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return;
+
+	stripe = btrfs_lookup_ordered_stripe(fs_info, logical);
+	if (!stripe)
+		return;
+	ASSERT(refcount_read(&stripe->ref) == 2);
+	/* once for us */
+	btrfs_put_ordered_stripe(fs_info, stripe);
+	/* once for the tree */
+	btrfs_put_ordered_stripe(fs_info, stripe);
+}
+#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 707dd0456cea..e7c0353e5655 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5885,6 +5885,7 @@ static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
 }
 
 static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
+						       u64 logical,
 						       int total_stripes,
 						       int real_stripes)
 {
@@ -5908,6 +5909,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 	refcount_set(&bioc->refs, 1);
 
 	bioc->fs_info = fs_info;
+	bioc->logical = logical;
 	bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
 	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
 
@@ -6513,7 +6515,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		goto out;
 	}
 
-	bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes, tgtdev_indexes);
+	bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes,
+				      tgtdev_indexes);
 	if (!bioc) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7e51f2238f72..5d7547b5fa87 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -368,12 +368,10 @@ struct btrfs_fs_devices {
 
 struct btrfs_io_stripe {
 	struct btrfs_device *dev;
-	union {
-		/* Block mapping */
-		u64 physical;
-		/* For the endio handler */
-		struct btrfs_io_context *bioc;
-	};
+	/* Block mapping */
+	u64 physical;
+	/* For the endio handler */
+	struct btrfs_io_context *bioc;
 };
 
 struct btrfs_discard_stripe {
@@ -409,6 +407,8 @@ struct btrfs_io_context {
 	int mirror_num;
 	int num_tgtdevs;
 	int *tgtdev_map;
+	u64 logical;
+	u64 size;
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index d862477f79f3..ed49150e6e6f 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1687,6 +1687,10 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 	u64 *logical = NULL;
 	int nr, stripe_len;
 
+	/* Filesystems with a stripe tree have their own l2p mapping */
+	if (btrfs_stripe_tree_root(fs_info))
+		return;
+
 	/* Zoned devices should not have partitions. So, we can assume it is 0 */
 	ASSERT(!bdev_is_partition(ordered->bdev));
 	if (WARN_ON(!ordered->bdev))
-- 
2.39.0



* [PATCH v5 05/13] btrfs: delete stripe extent on extent deletion
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 20:00   ` Josef Bacik
  2023-02-08 10:57 ` [PATCH v5 06/13] btrfs: lookup physical address from stripe extent Johannes Thumshirn
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

As each stripe extent is tied to an extent item, delete the stripe extent
once the corresponding extent item is deleted.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent-tree.c      |  8 ++++++++
 fs/btrfs/raid-stripe-tree.c | 31 +++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  2 ++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 50b3a2c3c0dd..f08ee7d9211c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3238,6 +3238,14 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 
+		if (is_data) {
+			ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
+			if (ret) {
+				btrfs_abort_transaction(trans, ret);
+				return ret;
+			}
+		}
+
 		ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
 				      num_to_del);
 		if (ret) {
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index d184cd9dc96e..ff5787a19454 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -122,6 +122,37 @@ void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 	write_unlock(&fs_info->stripe_update_lock);
 }
 
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+			     u64 length)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+	struct btrfs_path *path;
+	struct btrfs_key stripe_key;
+	int ret;
+
+	if (!stripe_root)
+		return 0;
+
+	stripe_key.objectid = start;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(trans, stripe_root, &stripe_key, path, -1, 1);
+	if (ret < 0)
+		goto out;
+
+	ret = btrfs_del_item(trans, stripe_root, path);
+out:
+	btrfs_free_path(path);
+	return ret;
+
+}
+
 int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
 					  u64 start, u64 len)
 {
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 60d3f8489cc9..12d2f588b22d 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -22,6 +22,8 @@ struct btrfs_ordered_stripe {
 	refcount_t ref;
 };
 
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+			     u64 length);
 int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_ordered_stripe *stripe);
 int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
-- 
2.39.0



* [PATCH v5 06/13] btrfs: lookup physical address from stripe extent
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 05/13] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 20:16   ` Josef Bacik
  2023-02-08 10:57 ` [PATCH v5 07/13] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Look up the physical address in the raid stripe tree when a read is attempted
on a RAID volume formatted with the raid stripe tree.

If the requested logical address was not found in the stripe tree, it may
still be in the in-memory ordered stripe tree, so fall back to searching
the ordered stripe tree in this case.
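
Condensed, the lookup order in btrfs_get_raid_extent_offset() below is (sketch
only; the real function also trims *length when the mapped range extends past
the stored stripe extent, so that the caller splits the bio at that boundary):

	if (btrfs_physical_from_ordered_stripe(fs_info, logical, length,
					       num_stripes, stripe))
		return 0;	/* still only in memory, not yet committed */

	/*
	 * Otherwise search the tree for (logical, BTRFS_RAID_STRIPE_KEY, *) and
	 * pick the stride whose devid matches stripe->dev->devid.
	 */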

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c | 145 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |   3 +
 fs/btrfs/volumes.c          |  31 ++++++--
 3 files changed, 172 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index ff5787a19454..ba7015a8012c 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -231,3 +231,148 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
 
 	return ret;
 }
+
+static bool btrfs_physical_from_ordered_stripe(struct btrfs_fs_info *fs_info,
+					      u64 logical, u64 *length,
+					      int num_stripes,
+					      struct btrfs_io_stripe *stripe)
+{
+	struct btrfs_ordered_stripe *os;
+	u64 offset;
+	u64 found_end;
+	u64 end;
+	int i;
+
+	os = btrfs_lookup_ordered_stripe(fs_info, logical);
+	if (!os)
+		return false;
+
+	end = logical + *length;
+	found_end = os->logical + os->num_bytes;
+	if (end > found_end)
+		*length -= end - found_end;
+
+	for (i = 0; i < num_stripes; i++) {
+		if (os->stripes[i].dev != stripe->dev)
+			continue;
+
+		offset = logical - os->logical;
+		ASSERT(offset >= 0);
+		stripe->physical = os->stripes[i].physical + offset;
+		btrfs_put_ordered_stripe(fs_info, os);
+		break;
+	}
+
+	return true;
+}
+
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 *length, u64 map_type,
+				 struct btrfs_io_stripe *stripe)
+{
+	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+	int num_stripes = btrfs_bg_type_to_factor(map_type);
+	struct btrfs_stripe_extent *stripe_extent;
+	struct btrfs_key stripe_key;
+	struct btrfs_key found_key;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	u64 offset;
+	u64 found_logical;
+	u64 found_length;
+	u64 end;
+	u64 found_end;
+	int slot;
+	int ret;
+	int i;
+
+	/*
+	 * If we still have the stripe in the ordered stripe tree get it from
+	 * there
+	 */
+	if (btrfs_physical_from_ordered_stripe(fs_info, logical, length,
+					       num_stripes, stripe))
+		return 0;
+
+	stripe_key.objectid = logical;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
+	if (ret < 0)
+		goto free_path;
+	if (ret) {
+		if (path->slots[0] != 0)
+			path->slots[0]--;
+	}
+
+	end = logical + *length;
+
+	while (1) {
+		leaf = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(leaf, &found_key, slot);
+		found_logical = found_key.objectid;
+		found_length = found_key.offset;
+
+		if (found_logical > end)
+			break;
+
+		if (!in_range(logical, found_logical, found_length))
+			goto next;
+
+		offset = logical - found_logical;
+		found_end = found_logical + found_length;
+
+		/*
+		 * If we have a logically contiguous, but physically
+		 * noncontinuous range, we need to split the bio. Record the
+		 * length after which we must split the bio.
+		 */
+		if (end > found_end)
+			*length -= end - found_end;
+
+		stripe_extent =
+			btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+		for (i = 0; i < num_stripes; i++) {
+			if (btrfs_raid_stride_devid_nr(leaf,
+				       stripe_extent, i) != stripe->dev->devid)
+				continue;
+			stripe->physical = btrfs_raid_stride_physical_nr(leaf,
+						   stripe_extent, i) + offset;
+			ret = 0;
+			goto out;
+		}
+
+		/*
+		 * If we're here, we haven't found the requested devid in the
+		 * stripe.
+		 */
+		ret = -ENOENT;
+		goto out;
+next:
+		ret = btrfs_next_item(stripe_root, path);
+		if (ret)
+			break;
+	}
+
+out:
+	if (ret > 0)
+		ret = -ENOENT;
+	if (ret && ret != -EIO) {
+		btrfs_err(fs_info,
+			  "cannot find raid-stripe for logical [%llu, %llu]",
+			  logical, logical + *length);
+		btrfs_print_tree(leaf, 1);
+	}
+
+free_path:
+	btrfs_free_path(path);
+
+	return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 12d2f588b22d..9359df0ca3f1 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -22,6 +22,9 @@ struct btrfs_ordered_stripe {
 	refcount_t ref;
 };
 
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 *length, u64 map_type,
+				 struct btrfs_io_stripe *stripe);
 int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
 			     u64 length);
 int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e7c0353e5655..7a784bb511ed 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -35,6 +35,7 @@
 #include "relocation.h"
 #include "scrub.h"
 #include "super.h"
+#include "raid-stripe-tree.h"
 
 #define BTRFS_BLOCK_GROUP_STRIPE_MASK	(BTRFS_BLOCK_GROUP_RAID0 | \
 					 BTRFS_BLOCK_GROUP_RAID10 | \
@@ -6311,12 +6312,21 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
 	return U64_MAX;
 }
 
-static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
-		          u32 stripe_index, u64 stripe_offset, u64 stripe_nr)
+static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
+		      u64 logical, u64 *length, struct btrfs_io_stripe *dst,
+		      struct map_lookup *map, u32 stripe_index,
+		      u64 stripe_offset, u64 stripe_nr)
 {
 	dst->dev = map->stripes[stripe_index].dev;
+
+	if (op == BTRFS_MAP_READ &&
+	    btrfs_need_stripe_tree_update(fs_info, map->type))
+		return btrfs_get_raid_extent_offset(fs_info, logical, length,
+						    map->type, dst);
+
 	dst->physical = map->stripes[stripe_index].physical +
 			stripe_offset + stripe_nr * map->stripe_len;
+	return 0;
 }
 
 int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
@@ -6505,13 +6515,14 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 			smap->dev = dev_replace->tgtdev;
 			smap->physical = physical_to_patch_in_first_stripe;
 			*mirror_num_ret = map->num_stripes + 1;
+			ret = 0;
 		} else {
-			set_io_stripe(smap, map, stripe_index, stripe_offset,
-				      stripe_nr);
 			*mirror_num_ret = mirror_num;
+			ret = set_io_stripe(fs_info, op, logical, length, smap,
+					    map, stripe_index, stripe_offset,
+					    stripe_nr);
 		}
 		*bioc_ret = NULL;
-		ret = 0;
 		goto out;
 	}
 
@@ -6523,8 +6534,14 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	}
 
 	for (i = 0; i < num_stripes; i++) {
-		set_io_stripe(&bioc->stripes[i], map, stripe_index, stripe_offset,
-			      stripe_nr);
+		ret = set_io_stripe(fs_info, op, logical, length,
+				 &bioc->stripes[i], map, stripe_index,
+				 stripe_offset, stripe_nr);
+		if (ret) {
+			btrfs_put_bioc(bioc);
+			goto out;
+		}
+
 		stripe_index++;
 	}
 
-- 
2.39.0



* [PATCH v5 07/13] btrfs: add raid stripe tree pretty printer
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 06/13] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 10:57 ` [PATCH v5 08/13] btrfs: zoned: allow zoned RAID Johannes Thumshirn
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik

Decode raid-stripe-tree entries in btrfs_print_tree().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/print-tree.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index b93c96213304..d9506d54298b 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -9,6 +9,7 @@
 #include "print-tree.h"
 #include "accessors.h"
 #include "tree-checker.h"
+#include "raid-stripe-tree.h"
 
 struct root_name_map {
 	u64 id;
@@ -28,6 +29,7 @@ static const struct root_name_map root_map[] = {
 	{ BTRFS_FREE_SPACE_TREE_OBJECTID,	"FREE_SPACE_TREE"	},
 	{ BTRFS_BLOCK_GROUP_TREE_OBJECTID,	"BLOCK_GROUP_TREE"	},
 	{ BTRFS_DATA_RELOC_TREE_OBJECTID,	"DATA_RELOC_TREE"	},
+	{ BTRFS_RAID_STRIPE_TREE_OBJECTID,	"RAID_STRIPE_TREE"	},
 };
 
 const char *btrfs_root_name(const struct btrfs_key *key, char *buf)
@@ -187,6 +189,20 @@ static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
 	}
 }
 
+static void print_raid_stripe_key(struct extent_buffer *eb, u32 item_size,
+				  struct btrfs_stripe_extent *stripe)
+{
+	int num_stripes;
+	int i;
+
+	num_stripes = item_size / sizeof(struct btrfs_raid_stride);
+
+	for (i = 0; i < num_stripes; i++)
+		pr_info("\t\t\tstride %d devid %llu physical %llu\n", i,
+			btrfs_raid_stride_devid_nr(eb, stripe, i),
+			btrfs_raid_stride_physical_nr(eb, stripe, i));
+}
+
 /*
  * Helper to output refs and locking status of extent buffer.  Useful to debug
  * race condition related problems.
@@ -351,6 +367,11 @@ void btrfs_print_leaf(struct extent_buffer *l)
 			print_uuid_item(l, btrfs_item_ptr_offset(l, i),
 					btrfs_item_size(l, i));
 			break;
+		case BTRFS_RAID_STRIPE_KEY:
+			print_raid_stripe_key(l, btrfs_item_size(l, i),
+					      btrfs_item_ptr(l, i,
+							     struct btrfs_stripe_extent));
+			break;
 		}
 	}
 }
-- 
2.39.0



* [PATCH v5 08/13] btrfs: zoned: allow zoned RAID
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 07/13] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 20:18   ` Josef Bacik
  2023-02-08 10:57 ` [PATCH v5 09/13] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

With a raid-stripe-tree present, we can do RAID0, RAID1 (including
RAID1C3/C4) and RAID10 on zoned devices for data block groups. Metadata
block groups don't need anything special, as all metadata I/O is already
protected by btrfs_zoned_meta_io_lock().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c |  4 ++++
 fs/btrfs/raid-stripe-tree.h | 10 +++++++++
 fs/btrfs/volumes.c          |  5 ++++-
 fs/btrfs/zoned.c            | 45 +++++++++++++++++++++++++++++++++++--
 4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index ba7015a8012c..1eaa97378d1c 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -268,10 +268,12 @@ static bool btrfs_physical_from_ordered_stripe(struct btrfs_fs_info *fs_info,
 
 int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
 				 u64 logical, u64 *length, u64 map_type,
+				 u32 stripe_index,
 				 struct btrfs_io_stripe *stripe)
 {
 	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
 	int num_stripes = btrfs_bg_type_to_factor(map_type);
+	const bool is_dup = map_type & BTRFS_BLOCK_GROUP_DUP;
 	struct btrfs_stripe_extent *stripe_extent;
 	struct btrfs_key stripe_key;
 	struct btrfs_key found_key;
@@ -343,6 +345,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
 			if (btrfs_raid_stride_devid_nr(leaf,
 				       stripe_extent, i) != stripe->dev->devid)
 				continue;
+			if (is_dup && (stripe_index - 1) != i)
+				continue;
 			stripe->physical = btrfs_raid_stride_physical_nr(leaf,
 						   stripe_extent, i) + offset;
 			ret = 0;
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 9359df0ca3f1..c7f6c5377aaa 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -24,6 +24,7 @@ struct btrfs_ordered_stripe {
 
 int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
 				 u64 logical, u64 *length, u64 map_type,
+				 u32 stripe_index,
 				 struct btrfs_io_stripe *stripe);
 int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
 			     u64 length);
@@ -50,9 +51,18 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
 	if (type != BTRFS_BLOCK_GROUP_DATA)
 		return false;
 
+	if (profile & BTRFS_BLOCK_GROUP_DUP)
+		return true;
+
 	if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
 		return true;
 
+	if (profile & BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID10)
+		return true;
+
 	return false;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7a784bb511ed..ef626f932af5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6322,7 +6322,8 @@ static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	if (op == BTRFS_MAP_READ &&
 	    btrfs_need_stripe_tree_update(fs_info, map->type))
 		return btrfs_get_raid_extent_offset(fs_info, logical, length,
-						    map->type, dst);
+						    map->type, stripe_index,
+						    dst);
 
 	dst->physical = map->stripes[stripe_index].physical +
 			stripe_offset + stripe_nr * map->stripe_len;
@@ -6508,6 +6509,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	 * I/O context structure.
 	 */
 	if (smap && num_alloc_stripes == 1 &&
+	    !(btrfs_need_stripe_tree_update(fs_info, map->type) &&
+	      op != BTRFS_MAP_READ) &&
 	    !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1) &&
 	    (!need_full_stripe(op) || !dev_replace_is_ongoing ||
 	     !dev_replace->tgtdev)) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index ed49150e6e6f..9796f76cffd6 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1476,8 +1476,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 			set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags);
 		break;
 	case BTRFS_BLOCK_GROUP_DUP:
-		if (map->type & BTRFS_BLOCK_GROUP_DATA) {
-			btrfs_err(fs_info, "zoned: profile DUP not yet supported on data bg");
+		if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+		    !btrfs_stripe_tree_root(fs_info)) {
+			btrfs_err(fs_info, "zoned: data DUP profile needs stripe_root");
 			ret = -EINVAL;
 			goto out;
 		}
@@ -1515,8 +1516,48 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		cache->zone_capacity = min(caps[0], caps[1]);
 		break;
 	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 	case BTRFS_BLOCK_GROUP_RAID0:
 	case BTRFS_BLOCK_GROUP_RAID10:
+		if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+		    !btrfs_stripe_tree_root(fs_info)) {
+			btrfs_err(fs_info,
+				  "zoned: data %s needs stripe_root",
+				  btrfs_bg_type_to_raid_name(map->type));
+			ret = -EIO;
+			goto out;
+
+		}
+
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV ||
+			    alloc_offsets[i] == WP_CONVENTIONAL)
+				continue;
+
+			if (i == 0)
+				continue;
+
+			if (alloc_offsets[0] != alloc_offsets[i]) {
+				btrfs_err(fs_info,
+					  "zoned: write pointer offset mismatch of zones in RAID profile");
+				ret = -EIO;
+				goto out;
+			}
+			if (test_bit(0, active) != test_bit(i, active)) {
+				if (!btrfs_zone_activate(cache)) {
+					ret = -EIO;
+					goto out;
+				}
+			} else {
+				if (test_bit(0, active))
+					set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+						&cache->runtime_flags);
+			}
+			cache->zone_capacity = min(caps[0], caps[i]);
+		}
+		cache->alloc_offset = alloc_offsets[0];
+		break;
 	case BTRFS_BLOCK_GROUP_RAID5:
 	case BTRFS_BLOCK_GROUP_RAID6:
 		/* non-single profiles are not supported yet */
-- 
2.39.0



* [PATCH v5 09/13] btrfs: check for leaks of ordered stripes on umount
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 08/13] btrfs: zoned: allow zoned RAID Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 20:19   ` Josef Bacik
  2023-02-08 10:57 ` [PATCH v5 10/13] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Check if we're leaking any ordered stripes when unmounting a filesystem
with a raid stripe tree.

This check is gated behind CONFIG_BTRFS_DEBUG so it does not affect
production systems.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c          |  2 ++
 fs/btrfs/raid-stripe-tree.c | 30 ++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b130c8dcd8d9..f2de4d3600d6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -52,6 +52,7 @@
 #include "relocation.h"
 #include "scrub.h"
 #include "super.h"
+#include "raid-stripe-tree.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -1538,6 +1539,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->stripe_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
+	btrfs_check_ordered_stripe_leak(fs_info);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	kfree(fs_info->subpage_info);
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 1eaa97378d1c..32a428413f5b 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -36,6 +36,36 @@ static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
 	return ordered_stripe_cmp(&stripe->logical, rbb);
 }
 
+void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info)
+{
+#ifdef CONFIG_BTRFS_DEBUG
+	struct rb_node *node;
+
+	if (!btrfs_stripe_tree_root(fs_info) ||
+	    RB_EMPTY_ROOT(&fs_info->stripe_update_tree))
+		return;
+
+	WARN_ON_ONCE(1);
+	write_lock(&fs_info->stripe_update_lock);
+	while ((node = rb_first_postorder(&fs_info->stripe_update_tree))
+	       != NULL) {
+		struct btrfs_ordered_stripe *stripe =
+			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+
+		write_unlock(&fs_info->stripe_update_lock);
+		btrfs_err(fs_info,
+			  "ordered_stripe [%llu, %llu] leaked, refcount=%d",
+			  stripe->logical, stripe->logical + stripe->num_bytes,
+			  refcount_read(&stripe->ref));
+		while (refcount_read(&stripe->ref) > 1)
+			btrfs_put_ordered_stripe(fs_info, stripe);
+		btrfs_put_ordered_stripe(fs_info, stripe);
+		write_lock(&fs_info->stripe_update_lock);
+	}
+	write_unlock(&fs_info->stripe_update_lock);
+#endif
+}
+
 int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
 {
 	struct btrfs_fs_info *fs_info = bioc->fs_info;
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index c7f6c5377aaa..371409351d60 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -38,6 +38,7 @@ struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
 int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
 void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 					    struct btrfs_ordered_stripe *stripe);
+void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info);
 
 static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
 						 u64 map_type)
-- 
2.39.0



* [PATCH v5 10/13] btrfs: add tracepoints for ordered stripes
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (8 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 09/13] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 10:57 ` [PATCH v5 11/13] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik

Add tracepoints to check the lifetime of btrfs_ordered_stripe entries.
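
As a usage note (assuming the standard tracefs layout), the new events show up
under /sys/kernel/tracing/events/btrfs/ as btrfs_ordered_stripe_add,
btrfs_ordered_stripe_lookup and btrfs_ordered_stripe_put. Enabling those three
events and reading trace_pipe while writing to an RST filesystem makes it easy
to follow an ordered stripe from its creation at I/O completion to its final
put.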

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c  |  3 +++
 fs/btrfs/super.c             |  1 +
 include/trace/events/btrfs.h | 50 ++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 32a428413f5b..4d4c7870547c 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -113,6 +113,7 @@ int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
 		write_unlock(&fs_info->stripe_update_lock);
 	}
 
+	trace_btrfs_ordered_stripe_add(fs_info, stripe);
 	return 0;
 }
 
@@ -128,6 +129,7 @@ struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *f
 	if (node) {
 		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
 		refcount_inc(&stripe->ref);
+		trace_btrfs_ordered_stripe_lookup(fs_info, stripe);
 	}
 	read_unlock(&fs_info->stripe_update_lock);
 
@@ -138,6 +140,7 @@ void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 				 struct btrfs_ordered_stripe *stripe)
 {
 	write_lock(&fs_info->stripe_update_lock);
+	trace_btrfs_ordered_stripe_put(fs_info, stripe);
 	if (refcount_dec_and_test(&stripe->ref)) {
 		struct rb_node *node = &stripe->rb_node;
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index e5136baef9af..7235106e8d08 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -59,6 +59,7 @@
 #include "verity.h"
 #include "super.h"
 #include "extent-tree.h"
+#include "raid-stripe-tree.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 75d7d22c3a27..8efea1445dd9 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -33,6 +33,7 @@ struct btrfs_space_info;
 struct btrfs_raid_bio;
 struct raid56_bio_trace_info;
 struct find_free_extent_ctl;
+struct btrfs_ordered_stripe;
 
 #define show_ref_type(type)						\
 	__print_symbolic(type,						\
@@ -2492,6 +2493,55 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_scrub_read_recover,
 	TP_ARGS(rbio, bio, trace_info)
 );
 
+DECLARE_EVENT_CLASS(btrfs__ordered_stripe,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info,
+		 const struct btrfs_ordered_stripe *stripe),
+
+	TP_ARGS(fs_info, stripe),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	logical		)
+		__field(	u64,	num_bytes	)
+		__field(	int,	num_stripes	)
+		__field(	int,	ref		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->logical	= stripe->logical;
+		__entry->num_bytes	= stripe->num_bytes;
+		__entry->num_stripes	= stripe->num_stripes;
+		__entry->ref		= refcount_read(&stripe->ref);
+	),
+
+	TP_printk_btrfs("logical=%llu, num_bytes=%llu, num_stripes=%d, ref=%d",
+			__entry->logical, __entry->num_bytes,
+			__entry->num_stripes, __entry->ref)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_add,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_lookup,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_put,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.39.0



* [PATCH v5 11/13] btrfs: announce presence of raid-stripe-tree in sysfs
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (9 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 10/13] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 20:20   ` Josef Bacik
  2023-02-08 10:57 ` [PATCH v5 12/13] btrfs: consult raid-stripe-tree when scrubbing Johannes Thumshirn
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

If a filesystem with a raid-stripe-tree is mounted, show the RST feature
in sysfs.
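
Assuming the usual btrfs sysfs layout, the feature then shows up globally as
/sys/fs/btrfs/features/raid_stripe_tree and, for a mounted filesystem that has
the incompat bit set, in the per-filesystem /sys/fs/btrfs/<UUID>/features/
directory, so user space tooling can easily detect RST support.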

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/sysfs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 108aa3876186..776f9b897642 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -296,6 +296,8 @@ BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #ifdef CONFIG_BTRFS_DEBUG
 /* Remove once support for extent tree v2 is feature complete */
 BTRFS_FEAT_ATTR_INCOMPAT(extent_tree_v2, EXTENT_TREE_V2);
+/* Remove once support for raid stripe tree is feature complete */
+BTRFS_FEAT_ATTR_INCOMPAT(raid_stripe_tree, RAID_STRIPE_TREE);
 #endif
 #ifdef CONFIG_FS_VERITY
 BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
@@ -326,6 +328,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 #endif
 #ifdef CONFIG_BTRFS_DEBUG
 	BTRFS_FEAT_ATTR_PTR(extent_tree_v2),
+	BTRFS_FEAT_ATTR_PTR(raid_stripe_tree),
 #endif
 #ifdef CONFIG_FS_VERITY
 	BTRFS_FEAT_ATTR_PTR(verity),
-- 
2.39.0



* [PATCH v5 12/13] btrfs: consult raid-stripe-tree when scrubbing
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (10 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 11/13] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 20:21   ` Josef Bacik
  2023-02-08 10:57 ` [PATCH v5 13/13] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

When scrubbing a filesystem which uses the raid-stripe-tree for logical to
physical address translation, consult the RST to perform the address
translation instead of relying on fixed block group offsets.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/scrub.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a5d026041be4..d456dda8c5b0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -24,6 +24,7 @@
 #include "accessors.h"
 #include "file-item.h"
 #include "scrub.h"
+#include "raid-stripe-tree.h"
 
 /*
  * This is only the first step towards a full-features scrub. It reads all
@@ -2719,6 +2720,21 @@ static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map,
 	int ret;
 	u8 csum[BTRFS_CSUM_SIZE];
 	u32 blocksize;
+	struct btrfs_io_stripe stripe;
+	const bool stripe_update =
+		btrfs_need_stripe_tree_update(sctx->fs_info, map->type);
+
+	if (stripe_update) {
+		stripe.dev = src_dev;
+		ret = btrfs_get_raid_extent_offset(sctx->fs_info, logical,
+						   (u64 *)&len,
+						   map->type, mirror_num,
+						   &stripe);
+		if (ret)
+			return ret;
+
+		src_physical = stripe.physical;
+	}
 
 	if (flags & BTRFS_EXTENT_FLAG_DATA) {
 		if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
@@ -2772,8 +2788,21 @@ static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map,
 			return ret;
 		len -= l;
 		logical += l;
-		physical += l;
-		src_physical += l;
+		if (stripe_update && len) {
+
+			ret = btrfs_get_raid_extent_offset(sctx->fs_info,
+							   logical, (u64 *)&len,
+							   map->type, mirror_num,
+							   &stripe);
+			if (ret)
+				return ret;
+
+			src_physical = stripe.physical;
+			physical = stripe.physical;
+		} else {
+			physical += l;
+			src_physical += l;
+		}
 	}
 	return 0;
 }
-- 
2.39.0



* [PATCH v5 13/13] btrfs: add raid-stripe-tree to features enabled with debug
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (11 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 12/13] btrfs: consult raid-stripe-tree when scrubbing Johannes Thumshirn
@ 2023-02-08 10:57 ` Johannes Thumshirn
  2023-02-08 20:23   ` Josef Bacik
  2023-02-09  0:42 ` [PATCH v5 00/13] btrfs: introduce RAID stripe tree Qu Wenruo
  2023-02-09 15:57 ` Phillip Susi
  14 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-08 10:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Until the RAID stripe tree code is sufficiently tested and feature
complete, "hide" it behind CONFIG_BTRFS_DEBUG so only people who
explicitly want to use it can enable it.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/fs.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index bee7ed0304cd..c0d6dd89e3b0 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -214,7 +214,8 @@ enum {
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
 	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
 	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
-	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	|	\
+	 BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE)
 #else
 #define BTRFS_FEATURE_INCOMPAT_SUPP			\
 	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF |		\
-- 
2.39.0



* Re: [PATCH v5 01/13] btrfs: re-add trans parameter to insert_delayed_ref
  2023-02-08 10:57 ` [PATCH v5 01/13] btrfs: re-add trans parameter to insert_delayed_ref Johannes Thumshirn
@ 2023-02-08 19:41   ` Josef Bacik
  0 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 19:41 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:38AM -0800, Johannes Thumshirn wrote:
> Re-add the trans parameter to insert_delayed_ref as it is needed again
> later in this series.
> 
> This reverts commit bccf28752a99 ("btrfs: drop trans parameter of insert_delayed_ref")
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v5 02/13] btrfs: add raid stripe tree definitions
  2023-02-08 10:57 ` [PATCH v5 02/13] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2023-02-08 19:42   ` Josef Bacik
  2023-02-13  6:50   ` Christoph Hellwig
  2023-02-13 11:34   ` Anand Jain
  2 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 19:42 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:39AM -0800, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
> 
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk
  2023-02-08 10:57 ` [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
@ 2023-02-08 19:43   ` Josef Bacik
  2023-02-13 11:35   ` Anand Jain
  1 sibling, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 19:43 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:40AM -0800, Johannes Thumshirn wrote:
> If we find a raid-stripe-tree on mount, read it from disk.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-08 10:57 ` [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2023-02-08 19:47   ` Josef Bacik
  2023-02-08 19:52   ` Josef Bacik
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 19:47 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:41AM -0800, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
> 
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
> 
> This is done by creating in-memory ordered stripe extents, just like the
> in memory ordered extents, on I/O completion and the on-disk raid stripe
> extents get created once we're running the delayed_refs for the extent
> item this stripe extent is tied to.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/Makefile           |   2 +-
>  fs/btrfs/bio.c              |  29 ++++++
>  fs/btrfs/bio.h              |   2 +
>  fs/btrfs/delayed-ref.c      |   6 +-
>  fs/btrfs/delayed-ref.h      |   2 +
>  fs/btrfs/disk-io.c          |   9 +-
>  fs/btrfs/extent-tree.c      |  60 +++++++++++
>  fs/btrfs/fs.h               |   4 +
>  fs/btrfs/inode.c            |  15 ++-
>  fs/btrfs/raid-stripe-tree.c | 202 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  71 +++++++++++++
>  fs/btrfs/volumes.c          |   5 +-
>  fs/btrfs/volumes.h          |  12 +--
>  fs/btrfs/zoned.c            |   4 +
>  14 files changed, 410 insertions(+), 13 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 90d53209755b..3bb869a84e54 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
>  	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> -	   lru_cache.o
> +	   lru_cache.o raid-stripe-tree.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index e6fe1b7dbc50..a01c6560ef49 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -15,6 +15,7 @@
>  #include "rcu-string.h"
>  #include "zoned.h"
>  #include "file-item.h"
> +#include "raid-stripe-tree.h"
>  
>  static struct bio_set btrfs_bioset;
>  static struct bio_set btrfs_clone_bioset;
> @@ -350,6 +351,21 @@ static void btrfs_raid56_end_io(struct bio *bio)
>  	btrfs_put_bioc(bioc);
>  }
>  
> +static void btrfs_raid_stripe_update(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio =
> +		container_of(work, struct btrfs_bio, raid_stripe_work);
> +	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
> +	struct btrfs_io_context *bioc = stripe->bioc;
> +	int ret;
> +
> +	ret = btrfs_add_ordered_stripe(bioc);
> +	if (ret)
> +		bbio->bio.bi_status = errno_to_blk_status(ret);
> +	btrfs_orig_bbio_end_io(bbio);
> +	btrfs_put_bioc(bioc);
> +}
> +
>  static void btrfs_orig_write_end_io(struct bio *bio)
>  {
>  	struct btrfs_io_stripe *stripe = bio->bi_private;
> @@ -372,6 +388,16 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>  	else
>  		bio->bi_status = BLK_STS_OK;
>  
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
> +		queue_work(bbio->inode->root->fs_info->raid_stripe_workers,
> +			   &bbio->raid_stripe_work);
> +		return;
> +	}
> +
>  	btrfs_orig_bbio_end_io(bbio);
>  	btrfs_put_bioc(bioc);
>  }
> @@ -383,6 +409,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
>  	if (bio->bi_status) {
>  		atomic_inc(&stripe->bioc->error);
>  		btrfs_log_dev_io_error(bio, stripe->dev);
> +	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>  	}
>  
>  	/* Pass on control to the original bio this one was cloned from */
> @@ -442,6 +470,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
>  	bio->bi_private = &bioc->stripes[dev_nr];
>  	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
>  	bioc->stripes[dev_nr].bioc = bioc;
> +	bioc->size = bio->bi_iter.bi_size;
>  	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
>  }
>  
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 20105806c8ac..bf5fbc105148 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -58,6 +58,8 @@ struct btrfs_bio {
>  	atomic_t pending_ios;
>  	struct work_struct end_io_work;
>  
> +	struct work_struct raid_stripe_work;
> +
>  	/*
>  	 * This member must come last, bio_alloc_bioset will allocate enough
>  	 * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 7660ac642c81..261f52ad8e12 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -14,6 +14,7 @@
>  #include "space-info.h"
>  #include "tree-mod-log.h"
>  #include "fs.h"
> +#include "raid-stripe-tree.h"
>  
>  struct kmem_cache *btrfs_delayed_ref_head_cachep;
>  struct kmem_cache *btrfs_delayed_tree_ref_cachep;
> @@ -637,8 +638,11 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>  	exist->ref_mod += mod;
>  
>  	/* remove existing tail if its ref_mod is zero */
> -	if (exist->ref_mod == 0)
> +	if (exist->ref_mod == 0) {
> +		btrfs_drop_ordered_stripe(trans->fs_info, exist->bytenr);
>  		drop_delayed_ref(root, href, exist);
> +	}
> +
>  	spin_unlock(&href->lock);
>  	return ret;
>  inserted:
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 2eb34abf700f..5096c1a1ed3e 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -51,6 +51,8 @@ struct btrfs_delayed_ref_node {
>  	/* is this node still in the rbtree? */
>  	unsigned int is_head:1;
>  	unsigned int in_tree:1;
> +	/* Do we need RAID stripe tree modifications? */
> +	unsigned int must_insert_stripe:1;
>  };
>  
>  struct btrfs_delayed_extent_op {
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index ad64a79d052a..b130c8dcd8d9 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2022,6 +2022,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>  		destroy_workqueue(fs_info->rmw_workers);
>  	if (fs_info->compressed_write_workers)
>  		destroy_workqueue(fs_info->compressed_write_workers);
> +	if (fs_info->raid_stripe_workers)
> +		destroy_workqueue(fs_info->raid_stripe_workers);
>  	btrfs_destroy_workqueue(fs_info->endio_write_workers);
>  	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
>  	btrfs_destroy_workqueue(fs_info->delayed_workers);
> @@ -2240,12 +2242,14 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>  		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
>  	fs_info->discard_ctl.discard_workers =
>  		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> +	fs_info->raid_stripe_workers =
> +		alloc_workqueue("btrfs-raid-stripe", flags, max_active);
>  
>  	if (!(fs_info->workers && fs_info->hipri_workers &&
>  	      fs_info->delalloc_workers && fs_info->flush_workers &&
>  	      fs_info->endio_workers && fs_info->endio_meta_workers &&
>  	      fs_info->compressed_write_workers &&
> -	      fs_info->endio_write_workers &&
> +	      fs_info->endio_write_workers && fs_info->raid_stripe_workers &&
>  	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
>  	      fs_info->caching_workers && fs_info->fixup_workers &&
>  	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
> @@ -3046,6 +3050,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  
>  	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
>  	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
> +
> +	rwlock_init(&fs_info->stripe_update_lock);
> +	fs_info->stripe_update_tree = RB_ROOT;
>  }
>  
>  static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 688cdf816957..50b3a2c3c0dd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -42,6 +42,7 @@
>  #include "file-item.h"
>  #include "orphan.h"
>  #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -1497,6 +1498,56 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static bool delayed_ref_needs_rst_update(struct btrfs_fs_info *fs_info,
> +					 struct btrfs_delayed_ref_head *head)
> +{
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	bool ret = false;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return ret;
> +
> +	em = btrfs_get_chunk_map(fs_info, head->bytenr, head->num_bytes);
> +	if (!em)
> +		return ret;
> +
> +	map = em->map_lookup;
> +
> +	if (btrfs_need_stripe_tree_update(fs_info, map->type))
> +		ret = true;
> +
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> +
> +static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
> +					    struct btrfs_delayed_ref_node *node)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_ordered_stripe *stripe;
> +	int ret = 0;
> +
> +	stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
> +	if (!stripe) {
> +		btrfs_err(fs_info,
> +			  "cannot get stripe extent for address %llu (%llu)",
> +			  node->bytenr, node->num_bytes);
> +		return -EINVAL;
> +	}
> +
> +	ASSERT(stripe->logical == node->bytenr);
> +
> +	ret = btrfs_insert_raid_extent(trans, stripe);
> +	/* once for us */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +	/* once for the tree */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +
> +	return ret;
> +}
> +
>  static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  				struct btrfs_delayed_ref_node *node,
>  				struct btrfs_delayed_extent_op *extent_op,
> @@ -1527,11 +1578,17 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  						 flags, ref->objectid,
>  						 ref->offset, &ins,
>  						 node->ref_mod);
> +		if (ret)
> +			return ret;
> +		if (node->must_insert_stripe)
> +			ret = add_stripe_entry_for_delayed_ref(trans, node);
>  	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
>  		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
>  					     ref->objectid, ref->offset,
>  					     node->ref_mod, extent_op);
>  	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
> +		if (node->must_insert_stripe)
> +			btrfs_drop_ordered_stripe(trans->fs_info, node->bytenr);
>  		ret = __btrfs_free_extent(trans, node, parent,
>  					  ref_root, ref->objectid,
>  					  ref->offset, node->ref_mod,
> @@ -1901,6 +1958,8 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
>  	struct btrfs_delayed_ref_root *delayed_refs;
>  	struct btrfs_delayed_extent_op *extent_op;
>  	struct btrfs_delayed_ref_node *ref;
> +	const bool need_rst_update =
> +		delayed_ref_needs_rst_update(fs_info, locked_ref);
>  	int must_insert_reserved = 0;
>  	int ret;
>  
> @@ -1951,6 +2010,7 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
>  		locked_ref->extent_op = NULL;
>  		spin_unlock(&locked_ref->lock);
>  
> +		ref->must_insert_stripe = need_rst_update;
>  		ret = run_one_delayed_ref(trans, ref, extent_op,
>  					  must_insert_reserved);
>  
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 93f2499a9c5b..bee7ed0304cd 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -551,6 +551,7 @@ struct btrfs_fs_info {
>  	struct btrfs_workqueue *endio_write_workers;
>  	struct btrfs_workqueue *endio_freespace_worker;
>  	struct btrfs_workqueue *caching_workers;
> +	struct workqueue_struct *raid_stripe_workers;
>  
>  	/*
>  	 * Fixup workers take dirty pages that didn't properly go through the
> @@ -791,6 +792,9 @@ struct btrfs_fs_info {
>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>  	struct lockdep_map btrfs_ordered_extent_map;
>  
> +	rwlock_t stripe_update_lock;
> +	struct rb_root stripe_update_tree;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 36ae541ad51b..74c0b486e496 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -70,6 +70,7 @@
>  #include "verity.h"
>  #include "super.h"
>  #include "orphan.h"
> +#include "raid-stripe-tree.h"
>  
>  struct btrfs_iget_args {
>  	u64 ino;
> @@ -9509,12 +9510,17 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  	if (qgroup_released < 0)
>  		return ERR_PTR(qgroup_released);
>  
> +	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
> +						    start, len);
> +	if (ret)
> +		goto free_qgroup;
> +
>  	if (trans) {
>  		ret = insert_reserved_file_extent(trans, inode,
>  						  file_offset, &stack_fi,
>  						  true, qgroup_released);
>  		if (ret)
> -			goto free_qgroup;
> +			goto free_stripe_extent;
>  		return trans;
>  	}
>  
> @@ -9532,7 +9538,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  	path = btrfs_alloc_path();
>  	if (!path) {
>  		ret = -ENOMEM;
> -		goto free_qgroup;
> +		goto free_stripe_extent;
>  	}
>  
>  	ret = btrfs_replace_file_extents(inode, path, file_offset,
> @@ -9540,9 +9546,12 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  				     &trans);
>  	btrfs_free_path(path);
>  	if (ret)
> -		goto free_qgroup;
> +		goto free_stripe_extent;
>  	return trans;
>  
> +free_stripe_extent:
> +	btrfs_drop_ordered_stripe(inode->root->fs_info, start);
> +
>  free_qgroup:
>  	/*
>  	 * We have released qgroup data range at the beginning of the function,
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..d184cd9dc96e
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,202 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Western Digital Corporation or its affiliates.
> + */
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "fs.h"
> +#include "accessors.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "disk-io.h"
> +#include "print-tree.h"
> +
> +static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +	const u64 *logical = key;
> +
> +	if (*logical < stripe->logical)
> +		return -1;
> +	if (*logical >= stripe->logical + stripe->num_bytes)
> +		return 1;
> +	return 0;
> +}
> +
> +static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
> +	return ordered_stripe_cmp(&stripe->logical, rbb);
> +}
> +
> +int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
> +{
> +	struct btrfs_fs_info *fs_info = bioc->fs_info;
> +	struct btrfs_ordered_stripe *stripe;
> +	struct btrfs_io_stripe *tmp;
> +	u64 logical = bioc->logical;
> +	u64 length = bioc->size;
> +	struct rb_node *node;
> +	size_t size;
> +
> +	size = bioc->num_stripes * sizeof(struct btrfs_io_stripe);
> +	stripe = kzalloc(sizeof(struct btrfs_ordered_stripe), GFP_NOFS);
> +	if (!stripe)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&stripe->lock);
> +	tmp = kmemdup(bioc->stripes, size, GFP_NOFS);
> +	if (!tmp) {
> +		kfree(stripe);
> +		return -ENOMEM;
> +	}
> +
> +	stripe->logical = logical;
> +	stripe->num_bytes = length;
> +	stripe->num_stripes = bioc->num_stripes;
> +	spin_lock(&stripe->lock);
> +	stripe->stripes = tmp;
> +	spin_unlock(&stripe->lock);
> +	refcount_set(&stripe->ref, 1);
> +
> +	write_lock(&fs_info->stripe_update_lock);
> +	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
> +	       ordered_stripe_less);
> +	write_unlock(&fs_info->stripe_update_lock);
> +	if (node) {
> +		struct btrfs_ordered_stripe *old =
> +			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +

This is unsafe because we're not holding the lock anymore.

> +		btrfs_debug(fs_info, "logical: %llu, length: %llu already exists",
> +			  logical, length);
> +		ASSERT(logical == old->logical);
> +		write_lock(&fs_info->stripe_update_lock);
> +		rb_replace_node(node, &stripe->rb_node,
> +				&fs_info->stripe_update_tree);
> +		write_unlock(&fs_info->stripe_update_lock);

I don't love this, it feels like we can lookup and find the existing guy in
another thread, and then do this replace thing and fuck something up.  I'd
rather we keep all of this in the lock, so

write_lock();
node = rb_find_add();
if (node) {
	old = rb_entry();
	replace
}
write_unlock();

This may be theoretical and not really a problem in real life, but I'd rather
not have to squint at this part and be convinced it's the problem when it really
isn't.  Thanks,

Josef
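
Spelled out, that suggestion amounts to roughly the following fragment of
btrfs_add_ordered_stripe() (a sketch only, reusing the names from the v5
patch; dropping the reference on the replaced entry is left out here):

	write_lock(&fs_info->stripe_update_lock);
	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
			   ordered_stripe_less);
	if (node) {
		struct btrfs_ordered_stripe *old =
			rb_entry(node, struct btrfs_ordered_stripe, rb_node);

		/* The existing entry must cover the same logical address. */
		ASSERT(logical == old->logical);
		/* Swap in the new entry while still holding the lock. */
		rb_replace_node(node, &stripe->rb_node,
				&fs_info->stripe_update_tree);
	}
	write_unlock(&fs_info->stripe_update_lock);

That way no other thread can look up (or drop) the old node between the find
and the replace.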


* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-08 10:57 ` [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
  2023-02-08 19:47   ` Josef Bacik
@ 2023-02-08 19:52   ` Josef Bacik
  2023-02-13  6:57   ` Christoph Hellwig
  2023-02-13  7:40   ` Qu Wenruo
  3 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 19:52 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:41AM -0800, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
> 
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
> 
> This is done by creating in-memory ordered stripe extents, just like the
> in memory ordered extents, on I/O completion and the on-disk raid stripe
> extents get created once we're running the delayed_refs for the extent
> item this stripe extent is tied to.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/Makefile           |   2 +-
>  fs/btrfs/bio.c              |  29 ++++++
>  fs/btrfs/bio.h              |   2 +
>  fs/btrfs/delayed-ref.c      |   6 +-
>  fs/btrfs/delayed-ref.h      |   2 +
>  fs/btrfs/disk-io.c          |   9 +-
>  fs/btrfs/extent-tree.c      |  60 +++++++++++
>  fs/btrfs/fs.h               |   4 +
>  fs/btrfs/inode.c            |  15 ++-
>  fs/btrfs/raid-stripe-tree.c | 202 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  71 +++++++++++++
>  fs/btrfs/volumes.c          |   5 +-
>  fs/btrfs/volumes.h          |  12 +--
>  fs/btrfs/zoned.c            |   4 +
>  14 files changed, 410 insertions(+), 13 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 90d53209755b..3bb869a84e54 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
>  	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> -	   lru_cache.o
> +	   lru_cache.o raid-stripe-tree.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index e6fe1b7dbc50..a01c6560ef49 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -15,6 +15,7 @@
>  #include "rcu-string.h"
>  #include "zoned.h"
>  #include "file-item.h"
> +#include "raid-stripe-tree.h"
>  
>  static struct bio_set btrfs_bioset;
>  static struct bio_set btrfs_clone_bioset;
> @@ -350,6 +351,21 @@ static void btrfs_raid56_end_io(struct bio *bio)
>  	btrfs_put_bioc(bioc);
>  }
>  
> +static void btrfs_raid_stripe_update(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio =
> +		container_of(work, struct btrfs_bio, raid_stripe_work);
> +	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
> +	struct btrfs_io_context *bioc = stripe->bioc;
> +	int ret;
> +
> +	ret = btrfs_add_ordered_stripe(bioc);
> +	if (ret)
> +		bbio->bio.bi_status = errno_to_blk_status(ret);
> +	btrfs_orig_bbio_end_io(bbio);
> +	btrfs_put_bioc(bioc);
> +}
> +
>  static void btrfs_orig_write_end_io(struct bio *bio)
>  {
>  	struct btrfs_io_stripe *stripe = bio->bi_private;
> @@ -372,6 +388,16 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>  	else
>  		bio->bi_status = BLK_STS_OK;
>  
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
> +		queue_work(bbio->inode->root->fs_info->raid_stripe_workers,
> +			   &bbio->raid_stripe_work);
> +		return;
> +	}
> +
>  	btrfs_orig_bbio_end_io(bbio);
>  	btrfs_put_bioc(bioc);
>  }
> @@ -383,6 +409,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
>  	if (bio->bi_status) {
>  		atomic_inc(&stripe->bioc->error);
>  		btrfs_log_dev_io_error(bio, stripe->dev);
> +	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>  	}
>  
>  	/* Pass on control to the original bio this one was cloned from */
> @@ -442,6 +470,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
>  	bio->bi_private = &bioc->stripes[dev_nr];
>  	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
>  	bioc->stripes[dev_nr].bioc = bioc;
> +	bioc->size = bio->bi_iter.bi_size;
>  	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
>  }
>  
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 20105806c8ac..bf5fbc105148 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -58,6 +58,8 @@ struct btrfs_bio {
>  	atomic_t pending_ios;
>  	struct work_struct end_io_work;
>  
> +	struct work_struct raid_stripe_work;
> +
>  	/*
>  	 * This member must come last, bio_alloc_bioset will allocate enough
>  	 * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 7660ac642c81..261f52ad8e12 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -14,6 +14,7 @@
>  #include "space-info.h"
>  #include "tree-mod-log.h"
>  #include "fs.h"
> +#include "raid-stripe-tree.h"
>  
>  struct kmem_cache *btrfs_delayed_ref_head_cachep;
>  struct kmem_cache *btrfs_delayed_tree_ref_cachep;
> @@ -637,8 +638,11 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>  	exist->ref_mod += mod;
>  
>  	/* remove existing tail if its ref_mod is zero */
> -	if (exist->ref_mod == 0)
> +	if (exist->ref_mod == 0) {
> +		btrfs_drop_ordered_stripe(trans->fs_info, exist->bytenr);
>  		drop_delayed_ref(root, href, exist);
> +	}
> +
>  	spin_unlock(&href->lock);
>  	return ret;
>  inserted:
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 2eb34abf700f..5096c1a1ed3e 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -51,6 +51,8 @@ struct btrfs_delayed_ref_node {
>  	/* is this node still in the rbtree? */
>  	unsigned int is_head:1;
>  	unsigned int in_tree:1;
> +	/* Do we need RAID stripe tree modifications? */
> +	unsigned int must_insert_stripe:1;
>  };
>  
>  struct btrfs_delayed_extent_op {
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index ad64a79d052a..b130c8dcd8d9 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2022,6 +2022,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>  		destroy_workqueue(fs_info->rmw_workers);
>  	if (fs_info->compressed_write_workers)
>  		destroy_workqueue(fs_info->compressed_write_workers);
> +	if (fs_info->raid_stripe_workers)
> +		destroy_workqueue(fs_info->raid_stripe_workers);
>  	btrfs_destroy_workqueue(fs_info->endio_write_workers);
>  	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
>  	btrfs_destroy_workqueue(fs_info->delayed_workers);
> @@ -2240,12 +2242,14 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>  		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
>  	fs_info->discard_ctl.discard_workers =
>  		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> +	fs_info->raid_stripe_workers =
> +		alloc_workqueue("btrfs-raid-stripe", flags, max_active);
>  
>  	if (!(fs_info->workers && fs_info->hipri_workers &&
>  	      fs_info->delalloc_workers && fs_info->flush_workers &&
>  	      fs_info->endio_workers && fs_info->endio_meta_workers &&
>  	      fs_info->compressed_write_workers &&
> -	      fs_info->endio_write_workers &&
> +	      fs_info->endio_write_workers && fs_info->raid_stripe_workers &&
>  	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
>  	      fs_info->caching_workers && fs_info->fixup_workers &&
>  	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
> @@ -3046,6 +3050,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  
>  	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
>  	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
> +
> +	rwlock_init(&fs_info->stripe_update_lock);
> +	fs_info->stripe_update_tree = RB_ROOT;
>  }
>  
>  static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 688cdf816957..50b3a2c3c0dd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -42,6 +42,7 @@
>  #include "file-item.h"
>  #include "orphan.h"
>  #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -1497,6 +1498,56 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static bool delayed_ref_needs_rst_update(struct btrfs_fs_info *fs_info,
> +					 struct btrfs_delayed_ref_head *head)
> +{
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	bool ret = false;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return ret;
> +
> +	em = btrfs_get_chunk_map(fs_info, head->bytenr, head->num_bytes);
> +	if (!em)
> +		return ret;
> +
> +	map = em->map_lookup;
> +
> +	if (btrfs_need_stripe_tree_update(fs_info, map->type))
> +		ret = true;
> +
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> +
> +static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
> +					    struct btrfs_delayed_ref_node *node)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_ordered_stripe *stripe;
> +	int ret = 0;
> +
> +	stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
> +	if (!stripe) {
> +		btrfs_err(fs_info,
> +			  "cannot get stripe extent for address %llu (%llu)",
> +			  node->bytenr, node->num_bytes);
> +		return -EINVAL;
> +	}
> +
> +	ASSERT(stripe->logical == node->bytenr);
> +
> +	ret = btrfs_insert_raid_extent(trans, stripe);
> +	/* once for us */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +	/* once for the tree */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +
> +	return ret;
> +}
> +
>  static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  				struct btrfs_delayed_ref_node *node,
>  				struct btrfs_delayed_extent_op *extent_op,
> @@ -1527,11 +1578,17 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  						 flags, ref->objectid,
>  						 ref->offset, &ins,
>  						 node->ref_mod);
> +		if (ret)
> +			return ret;
> +		if (node->must_insert_stripe)
> +			ret = add_stripe_entry_for_delayed_ref(trans, node);
>  	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
>  		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
>  					     ref->objectid, ref->offset,
>  					     node->ref_mod, extent_op);
>  	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
> +		if (node->must_insert_stripe)
> +			btrfs_drop_ordered_stripe(trans->fs_info, node->bytenr);
>  		ret = __btrfs_free_extent(trans, node, parent,
>  					  ref_root, ref->objectid,
>  					  ref->offset, node->ref_mod,
> @@ -1901,6 +1958,8 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
>  	struct btrfs_delayed_ref_root *delayed_refs;
>  	struct btrfs_delayed_extent_op *extent_op;
>  	struct btrfs_delayed_ref_node *ref;
> +	const bool need_rst_update =
> +		delayed_ref_needs_rst_update(fs_info, locked_ref);
>  	int must_insert_reserved = 0;
>  	int ret;
>  
> @@ -1951,6 +2010,7 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
>  		locked_ref->extent_op = NULL;
>  		spin_unlock(&locked_ref->lock);
>  
> +		ref->must_insert_stripe = need_rst_update;
>  		ret = run_one_delayed_ref(trans, ref, extent_op,
>  					  must_insert_reserved);
>  
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 93f2499a9c5b..bee7ed0304cd 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -551,6 +551,7 @@ struct btrfs_fs_info {
>  	struct btrfs_workqueue *endio_write_workers;
>  	struct btrfs_workqueue *endio_freespace_worker;
>  	struct btrfs_workqueue *caching_workers;
> +	struct workqueue_struct *raid_stripe_workers;
>  
>  	/*
>  	 * Fixup workers take dirty pages that didn't properly go through the
> @@ -791,6 +792,9 @@ struct btrfs_fs_info {
>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>  	struct lockdep_map btrfs_ordered_extent_map;
>  
> +	rwlock_t stripe_update_lock;
> +	struct rb_root stripe_update_tree;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 36ae541ad51b..74c0b486e496 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -70,6 +70,7 @@
>  #include "verity.h"
>  #include "super.h"
>  #include "orphan.h"
> +#include "raid-stripe-tree.h"
>  
>  struct btrfs_iget_args {
>  	u64 ino;
> @@ -9509,12 +9510,17 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  	if (qgroup_released < 0)
>  		return ERR_PTR(qgroup_released);
>  
> +	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
> +						    start, len);
> +	if (ret)
> +		goto free_qgroup;
> +
>  	if (trans) {
>  		ret = insert_reserved_file_extent(trans, inode,
>  						  file_offset, &stack_fi,
>  						  true, qgroup_released);
>  		if (ret)
> -			goto free_qgroup;
> +			goto free_stripe_extent;
>  		return trans;
>  	}
>  
> @@ -9532,7 +9538,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  	path = btrfs_alloc_path();
>  	if (!path) {
>  		ret = -ENOMEM;
> -		goto free_qgroup;
> +		goto free_stripe_extent;
>  	}
>  
>  	ret = btrfs_replace_file_extents(inode, path, file_offset,
> @@ -9540,9 +9546,12 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  				     &trans);
>  	btrfs_free_path(path);
>  	if (ret)
> -		goto free_qgroup;
> +		goto free_stripe_extent;
>  	return trans;
>  
> +free_stripe_extent:
> +	btrfs_drop_ordered_stripe(inode->root->fs_info, start);
> +
>  free_qgroup:
>  	/*
>  	 * We have released qgroup data range at the beginning of the function,
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..d184cd9dc96e
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,202 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Western Digital Corporation or its affiliates.
> + */
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "fs.h"
> +#include "accessors.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "disk-io.h"
> +#include "print-tree.h"
> +
> +static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +	const u64 *logical = key;
> +
> +	if (*logical < stripe->logical)
> +		return -1;
> +	if (*logical >= stripe->logical + stripe->num_bytes)
> +		return 1;
> +	return 0;
> +}
> +
> +static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
> +	return ordered_stripe_cmp(&stripe->logical, rbb);
> +}
> +
> +int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
> +{
> +	struct btrfs_fs_info *fs_info = bioc->fs_info;
> +	struct btrfs_ordered_stripe *stripe;
> +	struct btrfs_io_stripe *tmp;
> +	u64 logical = bioc->logical;
> +	u64 length = bioc->size;
> +	struct rb_node *node;
> +	size_t size;
> +
> +	size = bioc->num_stripes * sizeof(struct btrfs_io_stripe);
> +	stripe = kzalloc(sizeof(struct btrfs_ordered_stripe), GFP_NOFS);
> +	if (!stripe)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&stripe->lock);
> +	tmp = kmemdup(bioc->stripes, size, GFP_NOFS);
> +	if (!tmp) {
> +		kfree(stripe);
> +		return -ENOMEM;
> +	}
> +
> +	stripe->logical = logical;
> +	stripe->num_bytes = length;
> +	stripe->num_stripes = bioc->num_stripes;
> +	spin_lock(&stripe->lock);
> +	stripe->stripes = tmp;
> +	spin_unlock(&stripe->lock);
> +	refcount_set(&stripe->ref, 1);
> +
> +	write_lock(&fs_info->stripe_update_lock);
> +	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
> +	       ordered_stripe_less);
> +	write_unlock(&fs_info->stripe_update_lock);
> +	if (node) {
> +		struct btrfs_ordered_stripe *old =
> +			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +
> +		btrfs_debug(fs_info, "logical: %llu, length: %llu already exists",
> +			  logical, length);
> +		ASSERT(logical == old->logical);
> +		write_lock(&fs_info->stripe_update_lock);
> +		rb_replace_node(node, &stripe->rb_node,
> +				&fs_info->stripe_update_tree);
> +		write_unlock(&fs_info->stripe_update_lock);
> +	}
> +
> +	return 0;
> +}
> +
> +struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *fs_info,
> +							 u64 logical)
> +{
> +	struct rb_root *root = &fs_info->stripe_update_tree;
> +	struct btrfs_ordered_stripe *stripe = NULL;
> +	struct rb_node *node;
> +
> +	read_lock(&fs_info->stripe_update_lock);
> +	node = rb_find(&logical, root, ordered_stripe_cmp);
> +	if (node) {
> +		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +		refcount_inc(&stripe->ref);
> +	}
> +	read_unlock(&fs_info->stripe_update_lock);
> +
> +	return stripe;
> +}
> +
> +void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
> +				 struct btrfs_ordered_stripe *stripe)
> +{
> +	write_lock(&fs_info->stripe_update_lock);
> +	if (refcount_dec_and_test(&stripe->ref)) {

Can we re-work this to not take the write_lock() unconditionally?  For the
lookup do something like

node = rb_find();
if (node) {
	stripe = rb_entry();
	if (!refcount_inc_not_zero(stripe))
		stripe = NULL;
}

and then we can do 

if (refcount_dec_and_test()) {
	write_lock();
	rb_erase();
	RB_CLEAR_NODE(node);
	write_unlock();
}
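
Spelled out against the helpers in this patch, that could look roughly like the
following (untested sketch, keeping the existing names and fields):

struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *fs_info,
							 u64 logical)
{
	struct btrfs_ordered_stripe *stripe = NULL;
	struct rb_node *node;

	read_lock(&fs_info->stripe_update_lock);
	node = rb_find(&logical, &fs_info->stripe_update_tree,
		       ordered_stripe_cmp);
	if (node) {
		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
		/* Skip entries that are already on their way out. */
		if (!refcount_inc_not_zero(&stripe->ref))
			stripe = NULL;
	}
	read_unlock(&fs_info->stripe_update_lock);

	return stripe;
}

void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
			      struct btrfs_ordered_stripe *stripe)
{
	if (refcount_dec_and_test(&stripe->ref)) {
		struct rb_node *node = &stripe->rb_node;

		/* Only take the write lock for the actual erase. */
		write_lock(&fs_info->stripe_update_lock);
		rb_erase(node, &fs_info->stripe_update_tree);
		RB_CLEAR_NODE(node);
		write_unlock(&fs_info->stripe_update_lock);

		kfree(stripe->stripes);
		kfree(stripe);
	}
}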

Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 05/13] btrfs: delete stripe extent on extent deletion
  2023-02-08 10:57 ` [PATCH v5 05/13] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
@ 2023-02-08 20:00   ` Josef Bacik
  2023-02-13 15:21     ` Johannes Thumshirn
  0 siblings, 1 reply; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 20:00 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:42AM -0800, Johannes Thumshirn wrote:
> As each stripe extent is tied to an extent item, delete the stripe extent
> once the corresponding extent item is deleted.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/extent-tree.c      |  8 ++++++++
>  fs/btrfs/raid-stripe-tree.c | 31 +++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  2 ++
>  3 files changed, 41 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 50b3a2c3c0dd..f08ee7d9211c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3238,6 +3238,14 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  			}
>  		}
>  
> +		if (is_data) {
> +			ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
> +			if (ret) {
> +				btrfs_abort_transaction(trans, ret);
> +				return ret;
> +			}
> +		}
> +

We're still holding the path open, so now we have a lockdep dependency of extent
root -> RST, which will for sure bite us in the ass in the future.  Move this
hunk down to just after the btrfs_release_path() below.
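
I.e. roughly something like this (sketch only; it also folds in the path reuse
suggested just below, so btrfs_delete_raid_extent() taking a path argument is
assumed here and not what the current patch has):

		btrfs_release_path(path);

		if (is_data) {
			ret = btrfs_delete_raid_extent(trans, path, bytenr,
						       num_bytes);
			if (ret) {
				btrfs_abort_transaction(trans, ret);
				return ret;
			}
		}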

Also since we have a path here, add it to the arguments for
btrfs_delete_raid_extent() so we don't have to allocate a new path.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 06/13] btrfs: lookup physical address from stripe extent
  2023-02-08 10:57 ` [PATCH v5 06/13] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2023-02-08 20:16   ` Josef Bacik
  0 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 20:16 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:43AM -0800, Johannes Thumshirn wrote:
> Look up the physical address from the raid stripe tree when a read on a
> RAID volume formatted with the raid stripe tree is attempted.
> 
> If the requested logical address was not found in the stripe tree, it may
> still be in the in-memory ordered stripe tree, so fall back to searching
> the ordered stripe tree in this case.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.c | 145 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |   3 +
>  fs/btrfs/volumes.c          |  31 ++++++--
>  3 files changed, 172 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index ff5787a19454..ba7015a8012c 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -231,3 +231,148 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
>  
>  	return ret;
>  }
> +
> +static bool btrfs_physical_from_ordered_stripe(struct btrfs_fs_info *fs_info,
> +					      u64 logical, u64 *length,
> +					      int num_stripes,
> +					      struct btrfs_io_stripe *stripe)
> +{
> +	struct btrfs_ordered_stripe *os;
> +	u64 offset;
> +	u64 found_end;
> +	u64 end;
> +	int i;
> +
> +	os = btrfs_lookup_ordered_stripe(fs_info, logical);
> +	if (!os)
> +		return false;
> +
> +	end = logical + *length;
> +	found_end = os->logical + os->num_bytes;
> +	if (end > found_end)
> +		*length -= end - found_end;
> +
> +	for (i = 0; i < num_stripes; i++) {
> +		if (os->stripes[i].dev != stripe->dev)
> +			continue;
> +
> +		offset = logical - os->logical;
> +		ASSERT(offset >= 0);

This is always going to be true, should probably be

ASSERT(logical >= os->logical);

otherwise you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 08/13] btrfs: zoned: allow zoned RAID
  2023-02-08 10:57 ` [PATCH v5 08/13] btrfs: zoned: allow zoned RAID Johannes Thumshirn
@ 2023-02-08 20:18   ` Josef Bacik
  0 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 20:18 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:45AM -0800, Johannes Thumshirn wrote:
> When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for
> data block-groups. For meta-data block-groups, we don't actually need
> anything special, as all meta-data I/O is protected by the
> btrfs_zoned_meta_io_lock() already.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 09/13] btrfs: check for leaks of ordered stripes on umount
  2023-02-08 10:57 ` [PATCH v5 09/13] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
@ 2023-02-08 20:19   ` Josef Bacik
  0 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 20:19 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:46AM -0800, Johannes Thumshirn wrote:
> Check if we're leaking any ordered stripes when unmounting a filesystem
> with a stripe tree.
> 
> This check is gated behind CONFIG_BTRFS_DEBUG to not affect any production
> type systems.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 11/13] btrfs: announce presence of raid-stripe-tree in sysfs
  2023-02-08 10:57 ` [PATCH v5 11/13] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
@ 2023-02-08 20:20   ` Josef Bacik
  0 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 20:20 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:48AM -0800, Johannes Thumshirn wrote:
> If a filesystem with a raid-stripe-tree is mounted, show the RST feature
> in sysfs.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 12/13] btrfs: consult raid-stripe-tree when scrubbing
  2023-02-08 10:57 ` [PATCH v5 12/13] btrfs: consult raid-stripe-tree when scrubbing Johannes Thumshirn
@ 2023-02-08 20:21   ` Josef Bacik
  0 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 20:21 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:49AM -0800, Johannes Thumshirn wrote:
> When scrubbing a filesystem which uses the raid-stripe-tree for logical to
> physical address translation, consult the RST to perform the address
> translation instead of relying on fixed block group offsets.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 13/13] btrfs: add raid-stripe-tree to features enabled with debug
  2023-02-08 10:57 ` [PATCH v5 13/13] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
@ 2023-02-08 20:23   ` Josef Bacik
  0 siblings, 0 replies; 48+ messages in thread
From: Josef Bacik @ 2023-02-08 20:23 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:50AM -0800, Johannes Thumshirn wrote:
> Until the RAID stripe tree code is well enough tested and feature
> complete, "hide" it behind CONFIG_BTRFS_DEBUG so only people who
> want to use it are actually using it.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (12 preceding siblings ...)
  2023-02-08 10:57 ` [PATCH v5 13/13] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
@ 2023-02-09  0:42 ` Qu Wenruo
  2023-02-09  8:47   ` Johannes Thumshirn
  2023-02-09 15:57 ` Phillip Susi
  14 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-02-09  0:42 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-btrfs



On 2023/2/8 18:57, Johannes Thumshirn wrote:
> Updates of the raid-stripe-tree are done at delayed-ref time to safe on
> bandwidth while for reading we do the stripe-tree lookup on bio mapping time,
> i.e. when the logical to physical translation happens for regular btrfs RAID
> as well.
> 
> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
> it's contents are the respective physical device id and position.
> 
> For an example 1M write (split into 126K segments due to zone-append)
> rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
> 
> The tree will look as follows:
> 
> rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
> btrfs-progs v5.16.1
> raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
> leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
> leaf 805847040 flags 0x1(WRITTEN) backref revision 1
> checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
> checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
> fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
> chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
>          item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
>                          stripe 0 devid 1 offset 939524096
>                          stripe 1 devid 2 offset 536870912

Considering we already have the length as the key offset, can we merge 
contiguous entries?

I'm pretty sure that if we go down this path, the RST tree itself can grow 
too large, and it's better we consider this before it becomes a problem.
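
Just to illustrate the idea (purely hypothetical, not something this series
implements): when queueing a new stripe extent, one could check whether it is
a plain continuation of the previous one and extend that entry instead of
adding a new item:

/*
 * Hypothetical merge check: 'prev' is the last recorded stripe extent,
 * 'next' is the one about to be inserted.
 */
static bool can_merge_stripe_extents(const struct btrfs_ordered_stripe *prev,
				     const struct btrfs_ordered_stripe *next)
{
	if (prev->logical + prev->num_bytes != next->logical)
		return false;
	if (prev->num_stripes != next->num_stripes)
		return false;

	/* Every stride must continue contiguously on the same device. */
	for (int i = 0; i < prev->num_stripes; i++) {
		if (prev->stripes[i].dev != next->stripes[i].dev)
			return false;
		if (prev->stripes[i].physical + prev->num_bytes !=
		    next->stripes[i].physical)
			return false;
	}

	return true;
}

If it merges, the previous item's key offset would simply grow by
next->num_bytes instead of a second RAID_STRIPE item being inserted.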

Thanks,
Qu

>          item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
>                          stripe 0 devid 1 offset 939651072
>                          stripe 1 devid 2 offset 536997888
>          item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
>                          stripe 0 devid 1 offset 939778048
>                          stripe 1 devid 2 offset 537124864
>          item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
>                          stripe 0 devid 1 offset 939905024
>                          stripe 1 devid 2 offset 537251840
>          item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
>                          stripe 0 devid 1 offset 940032000
>                          stripe 1 devid 2 offset 537378816
>          item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
>                          stripe 0 devid 1 offset 940158976
>                          stripe 1 devid 2 offset 537505792
>          item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
>                          stripe 0 devid 1 offset 940285952
>                          stripe 1 devid 2 offset 537632768
>          item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
>                          stripe 0 devid 1 offset 940412928
>                          stripe 1 devid 2 offset 537759744
>          item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
>                          stripe 0 devid 1 offset 940539904
>                          stripe 1 devid 2 offset 537886720
> total bytes 26843545600
> bytes used 1245184
> uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
> 
> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
> 
> 
> Changes to v4:
> - Added patch to check for RST feature in sysfs
> - Added RST lookups for scrubbing
> - Fixed the error handling bug Josef pointed out
> - Only check if we need to write out a RST once per delayed_ref head
> - Added support for zoned data DUP with RST
> 
> Changes to v3:
> - Rebased onto 20221120124734.18634-1-hch@lst.de
> - Incorporated Josef's review
> - Merged related patches
> 
> v3 of the patchset can be found here:
> https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com
> 
> Changes to v2:
> - Bug fixes
> - Rebased onto 20220901074216.1849941-1-hch@lst.de
> - Added tracepoints
> - Added leak checker
> - Added RAID0 and RAID10
> 
> v2 of the patchset can be found here:
> https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com
> 
> Changes to v1:
> - Write the stripe-tree at delayed-ref time (Qu)
> - Add a different write path for preallocation
> 
> v1 of the patchset can be found here:
> https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/
> 
> Johannes Thumshirn (13):
>    btrfs: re-add trans parameter to insert_delayed_ref
>    btrfs: add raid stripe tree definitions
>    btrfs: read raid-stripe-tree from disk
>    btrfs: add support for inserting raid stripe extents
>    btrfs: delete stripe extent on extent deletion
>    btrfs: lookup physical address from stripe extent
>    btrfs: add raid stripe tree pretty printer
>    btrfs: zoned: allow zoned RAID
>    btrfs: check for leaks of ordered stripes on umount
>    btrfs: add tracepoints for ordered stripes
>    btrfs: announce presence of raid-stripe-tree in sysfs
>    btrfs: consult raid-stripe-tree when scrubbing
>    btrfs: add raid-stripe-tree to features enabled with debug
> 
>   fs/btrfs/Makefile               |   2 +-
>   fs/btrfs/accessors.h            |  29 +++
>   fs/btrfs/bio.c                  |  29 +++
>   fs/btrfs/bio.h                  |   2 +
>   fs/btrfs/block-rsv.c            |   1 +
>   fs/btrfs/delayed-ref.c          |  13 +-
>   fs/btrfs/delayed-ref.h          |   2 +
>   fs/btrfs/disk-io.c              |  30 ++-
>   fs/btrfs/disk-io.h              |   5 +
>   fs/btrfs/extent-tree.c          |  68 ++++++
>   fs/btrfs/fs.h                   |   8 +-
>   fs/btrfs/inode.c                |  15 +-
>   fs/btrfs/print-tree.c           |  21 ++
>   fs/btrfs/raid-stripe-tree.c     | 415 ++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.h     |  87 +++++++
>   fs/btrfs/scrub.c                |  33 ++-
>   fs/btrfs/super.c                |   1 +
>   fs/btrfs/sysfs.c                |   3 +
>   fs/btrfs/volumes.c              |  39 ++-
>   fs/btrfs/volumes.h              |  12 +-
>   fs/btrfs/zoned.c                |  49 +++-
>   include/trace/events/btrfs.h    |  50 ++++
>   include/uapi/linux/btrfs.h      |   1 +
>   include/uapi/linux/btrfs_tree.h |  20 +-
>   24 files changed, 905 insertions(+), 30 deletions(-)
>   create mode 100644 fs/btrfs/raid-stripe-tree.c
>   create mode 100644 fs/btrfs/raid-stripe-tree.h
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-09  0:42 ` [PATCH v5 00/13] btrfs: introduce RAID stripe tree Qu Wenruo
@ 2023-02-09  8:47   ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-09  8:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 09.02.23 01:42, Qu Wenruo wrote:
> Considering we already have the length as the key offset, can we merge 
> continuous entries?
> 
> I'm pretty sure if we go this path, the rst tree itself can be too 
> large, and it's better we consider this before it's too problematic.

Yes, this is something I was considering doing as a 3rd (or 4th) step,
once the basics have landed.

It can easily be done afterwards without breaking any existing
installations.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (13 preceding siblings ...)
  2023-02-09  0:42 ` [PATCH v5 00/13] btrfs: introduce RAID stripe tree Qu Wenruo
@ 2023-02-09 15:57 ` Phillip Susi
  2023-02-10  8:44   ` Johannes Thumshirn
  14 siblings, 1 reply; 48+ messages in thread
From: Phillip Susi @ 2023-02-09 15:57 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs


Johannes Thumshirn <johannes.thumshirn@wdc.com> writes:

> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

Nice document, but I'm still not quite sure I understand the problem.
As long as both disks have the same zone layout, and the raid chunk is
aligned to the start of a zone, then shouldn't they be appended together
and have a deterministic layout?

If so, then is this additional metadata just needed in the case where
the disks *don't* have the same zone layout?

If so, then is this an optional feature that would only be enabled when
the disks don't have the same zone layout?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-09 15:57 ` Phillip Susi
@ 2023-02-10  8:44   ` Johannes Thumshirn
  2023-02-10 10:33     ` Johannes Thumshirn
  2023-02-13 16:42     ` Phillip Susi
  0 siblings, 2 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-10  8:44 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-btrfs

On 09.02.23 17:01, Phillip Susi wrote:
> 
> Johannes Thumshirn <johannes.thumshirn@wdc.com> writes:
> 
>> A design document can be found here:
>> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
> 
> Nice document, but I'm still not quite sure I understand the problem.
> As long as both disks have the same zone layout, and the raid chunk is
> aligned to the start of a zone, then shouldn't they be appended together
> and have a deterministic layout?
> 
> If so, then is this additional metadata just needed in the case where
> the disks *don't* have the same zone layout?
> 
> If so, then is this an optional feature that would only be enabled when
> the disks don't have the same zone layout?
> 
> 

No. With zoned drives we're writing using the Zone Append command [1].
This has several advantages, one being that you can issue IO at a high
queue depth without needing any locking. But it has one downside for
the RAID application: you don't have any control over the LBA where the
data lands, only over the zone.

Therefore we need another logical to physical mapping layer, which is
the RAID stripe tree. Conveniently we can also use this tree to do the
l2p mapping for RAID5/6 and eliminate the write hole this way.
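
This is also why the per-device physical address can only be recorded at I/O
completion time. A minimal sketch of the pattern (mirroring what the series
does in its write end_io handler):

static void zone_append_end_io(struct bio *bio)
{
	struct btrfs_io_stripe *stripe = bio->bi_private;

	/*
	 * For REQ_OP_ZONE_APPEND the block layer hands back the location
	 * the device actually wrote to in bi_sector, so the logical to
	 * physical mapping is only known here.
	 */
	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;

	/* ... and this physical address is what ends up in the RST entry. */
}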


[1] https://zonedstorage.io/docs/introduction/zns#zone-append

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-10  8:44   ` Johannes Thumshirn
@ 2023-02-10 10:33     ` Johannes Thumshirn
  2023-02-13 16:42     ` Phillip Susi
  1 sibling, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-10 10:33 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-btrfs

On 10.02.23 09:44, Johannes Thumshirn wrote:
> On 09.02.23 17:01, Phillip Susi wrote:
>>
>> Johannes Thumshirn <johannes.thumshirn@wdc.com> writes:
>>
>>> A design document can be found here:
>>> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
>>
>> Nice document, but I'm still not quite sure I understand the problem.
>> As long as both disks have the same zone layout, and the raid chunk is
>> aligned to the start of a zone, then shouldn't they be appended together
>> and have a deterministic layout?
>>
>> If so, then is this additional metadata just needed in the case where
>> the disks *don't* have the same zone layout?
>>
>> If so, then is this an optional feature that would only be enabled when
>> the disks don't have the same zone layout?
>>
>>
> 
> No. With zoned drives we're writing using the Zone Append command [1].
> This has several advantages, one being that you can issue IO at a high
> queue depth and don't need any locking to. But it has one downside for
> the RAID application, that is, that you don't have any control of the 
> LBA where the data lands, only the zone.
> 
> Therefor we need another logical to physical mapping layer, which is
> the RAID stripe tree. Coincidentally we can also use this tree to do
> l2p mapping for RAID5/6 and eliminate the write hole this way.
> 
> 
> [1] https://zonedstorage.io/docs/introduction/zns#zone-append
> 

Actually that's the one I was looking for:
https://zonedstorage.io/docs/introduction/zoned-storage#zone-append

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 02/13] btrfs: add raid stripe tree definitions
  2023-02-08 10:57 ` [PATCH v5 02/13] btrfs: add raid stripe tree definitions Johannes Thumshirn
  2023-02-08 19:42   ` Josef Bacik
@ 2023-02-13  6:50   ` Christoph Hellwig
  2023-02-13 11:10     ` Johannes Thumshirn
  2023-02-13 11:34   ` Anand Jain
  2 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-02-13  6:50 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Wed, Feb 08, 2023 at 02:57:39AM -0800, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
> 
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.

So this basically removes the need to track the physical address in
the chunk tree.  Is there any way to stop maintaining it at all?
If not, why?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-08 10:57 ` [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
  2023-02-08 19:47   ` Josef Bacik
  2023-02-08 19:52   ` Josef Bacik
@ 2023-02-13  6:57   ` Christoph Hellwig
  2023-02-13 10:43     ` Johannes Thumshirn
  2023-02-13  7:40   ` Qu Wenruo
  3 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-02-13  6:57 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -58,6 +58,8 @@ struct btrfs_bio {
>  	atomic_t pending_ios;
>  	struct work_struct end_io_work;
>  
> +	struct work_struct raid_stripe_work;

You should be able to reuse end_io_work here, as it is only used
for reads currently.

>  
> +static bool delayed_ref_needs_rst_update(struct btrfs_fs_info *fs_info,
> +					 struct btrfs_delayed_ref_head *head)
> +{
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	bool ret = false;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return ret;
> +
> +	em = btrfs_get_chunk_map(fs_info, head->bytenr, head->num_bytes);
> +	if (!em)
> +		return ret;
> +
> +	map = em->map_lookup;
> +
> +	if (btrfs_need_stripe_tree_update(fs_info, map->type))

This just seems very expensive.  Is there no way to propagate
this information without doing a chunk map lookup every time?

> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1687,6 +1687,10 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
>  	u64 *logical = NULL;
>  	int nr, stripe_len;
>  
> +	/* Filesystems with a stripe tree have their own l2p mapping */
> +	if (btrfs_stripe_tree_root(fs_info))
> +		return;

I don't think we should even be able to reach this, as the call to
btrfs_rewrite_logical_zoned is guarded by having a valid
ordered->physical… and that is only set in btrfs_simple_end_io.
So this could just be an assert.
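
I.e. something like this (sketch):

	/*
	 * Filesystems with a stripe tree do their own l2p mapping and must
	 * never end up here.
	 */
	ASSERT(!btrfs_stripe_tree_root(fs_info));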


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-08 10:57 ` [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
                     ` (2 preceding siblings ...)
  2023-02-13  6:57   ` Christoph Hellwig
@ 2023-02-13  7:40   ` Qu Wenruo
  2023-02-13 10:49     ` Johannes Thumshirn
  3 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-02-13  7:40 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-btrfs



On 2023/2/8 18:57, Johannes Thumshirn wrote:
[...]
>   }
>   
> +static void btrfs_raid_stripe_update(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio =
> +		container_of(work, struct btrfs_bio, raid_stripe_work);
> +	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
> +	struct btrfs_io_context *bioc = stripe->bioc;
> +	int ret;
> +
> +	ret = btrfs_add_ordered_stripe(bioc);
> +	if (ret)
> +		bbio->bio.bi_status = errno_to_blk_status(ret);
> +	btrfs_orig_bbio_end_io(bbio);
> +	btrfs_put_bioc(bioc);
> +}
> +
>   static void btrfs_orig_write_end_io(struct bio *bio)
>   {
>   	struct btrfs_io_stripe *stripe = bio->bi_private;
> @@ -372,6 +388,16 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>   	else
>   		bio->bi_status = BLK_STS_OK;
>   
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);

So the in-memory stripe tree entry insertion is delayed.

Could the following race happen?

              T1                  |              T2
---------------------------------+----------------------------------
write_pages()                    |
btrfs_orig_write_end_io()        |
|- INIT_WORK();                  |
`- queue_work();                 |
                                  | btrfs_invalidate_folio()
                                  | `- the pages are no longer cached
                                  |
                                  | btrfs_do_readpage()
                                  | |- do whatever the rst lookup
workqueue                        |
`- btrfs_raid_stripe_update()    |
    `- btrfs_add_ordered_stripe() |

In the above case, the T2 read will fail as it cannot grab the RST mapping.

I really believe the in-memory rst update should not be delayed into a 
workqueue, but done inside the write endio function.

Thanks,
Qu


> +		queue_work(bbio->inode->root->fs_info->raid_stripe_workers,
> +			   &bbio->raid_stripe_work);
> +		return;
> +	}
> +
>   	btrfs_orig_bbio_end_io(bbio);
>   	btrfs_put_bioc(bioc);
>   }
> @@ -383,6 +409,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
>   	if (bio->bi_status) {
>   		atomic_inc(&stripe->bioc->error);
>   		btrfs_log_dev_io_error(bio, stripe->dev);
> +	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>   	}
>   
>   	/* Pass on control to the original bio this one was cloned from */
> @@ -442,6 +470,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
>   	bio->bi_private = &bioc->stripes[dev_nr];
>   	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
>   	bioc->stripes[dev_nr].bioc = bioc;
> +	bioc->size = bio->bi_iter.bi_size;
>   	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
>   }
>   
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 20105806c8ac..bf5fbc105148 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -58,6 +58,8 @@ struct btrfs_bio {
>   	atomic_t pending_ios;
>   	struct work_struct end_io_work;
>   
> +	struct work_struct raid_stripe_work;
> +
>   	/*
>   	 * This member must come last, bio_alloc_bioset will allocate enough
>   	 * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 7660ac642c81..261f52ad8e12 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -14,6 +14,7 @@
>   #include "space-info.h"
>   #include "tree-mod-log.h"
>   #include "fs.h"
> +#include "raid-stripe-tree.h"
>   
>   struct kmem_cache *btrfs_delayed_ref_head_cachep;
>   struct kmem_cache *btrfs_delayed_tree_ref_cachep;
> @@ -637,8 +638,11 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>   	exist->ref_mod += mod;
>   
>   	/* remove existing tail if its ref_mod is zero */
> -	if (exist->ref_mod == 0)
> +	if (exist->ref_mod == 0) {
> +		btrfs_drop_ordered_stripe(trans->fs_info, exist->bytenr);
>   		drop_delayed_ref(root, href, exist);
> +	}
> +
>   	spin_unlock(&href->lock);
>   	return ret;
>   inserted:
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 2eb34abf700f..5096c1a1ed3e 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -51,6 +51,8 @@ struct btrfs_delayed_ref_node {
>   	/* is this node still in the rbtree? */
>   	unsigned int is_head:1;
>   	unsigned int in_tree:1;
> +	/* Do we need RAID stripe tree modifications? */
> +	unsigned int must_insert_stripe:1;
>   };
>   
>   struct btrfs_delayed_extent_op {
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index ad64a79d052a..b130c8dcd8d9 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2022,6 +2022,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>   		destroy_workqueue(fs_info->rmw_workers);
>   	if (fs_info->compressed_write_workers)
>   		destroy_workqueue(fs_info->compressed_write_workers);
> +	if (fs_info->raid_stripe_workers)
> +		destroy_workqueue(fs_info->raid_stripe_workers);
>   	btrfs_destroy_workqueue(fs_info->endio_write_workers);
>   	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
>   	btrfs_destroy_workqueue(fs_info->delayed_workers);
> @@ -2240,12 +2242,14 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
>   	fs_info->discard_ctl.discard_workers =
>   		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> +	fs_info->raid_stripe_workers =
> +		alloc_workqueue("btrfs-raid-stripe", flags, max_active);
>   
>   	if (!(fs_info->workers && fs_info->hipri_workers &&
>   	      fs_info->delalloc_workers && fs_info->flush_workers &&
>   	      fs_info->endio_workers && fs_info->endio_meta_workers &&
>   	      fs_info->compressed_write_workers &&
> -	      fs_info->endio_write_workers &&
> +	      fs_info->endio_write_workers && fs_info->raid_stripe_workers &&
>   	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
>   	      fs_info->caching_workers && fs_info->fixup_workers &&
>   	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
> @@ -3046,6 +3050,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>   
>   	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
>   	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
> +
> +	rwlock_init(&fs_info->stripe_update_lock);
> +	fs_info->stripe_update_tree = RB_ROOT;
>   }
>   
>   static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 688cdf816957..50b3a2c3c0dd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -42,6 +42,7 @@
>   #include "file-item.h"
>   #include "orphan.h"
>   #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>   
>   #undef SCRAMBLE_DELAYED_REFS
>   
> @@ -1497,6 +1498,56 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>   
> +static bool delayed_ref_needs_rst_update(struct btrfs_fs_info *fs_info,
> +					 struct btrfs_delayed_ref_head *head)
> +{
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	bool ret = false;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return ret;
> +
> +	em = btrfs_get_chunk_map(fs_info, head->bytenr, head->num_bytes);
> +	if (!em)
> +		return ret;
> +
> +	map = em->map_lookup;
> +
> +	if (btrfs_need_stripe_tree_update(fs_info, map->type))
> +		ret = true;
> +
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> +
> +static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
> +					    struct btrfs_delayed_ref_node *node)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_ordered_stripe *stripe;
> +	int ret = 0;
> +
> +	stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
> +	if (!stripe) {
> +		btrfs_err(fs_info,
> +			  "cannot get stripe extent for address %llu (%llu)",
> +			  node->bytenr, node->num_bytes);
> +		return -EINVAL;
> +	}
> +
> +	ASSERT(stripe->logical == node->bytenr);
> +
> +	ret = btrfs_insert_raid_extent(trans, stripe);
> +	/* once for us */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +	/* once for the tree */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +
> +	return ret;
> +}
> +
>   static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>   				struct btrfs_delayed_ref_node *node,
>   				struct btrfs_delayed_extent_op *extent_op,
> @@ -1527,11 +1578,17 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>   						 flags, ref->objectid,
>   						 ref->offset, &ins,
>   						 node->ref_mod);
> +		if (ret)
> +			return ret;
> +		if (node->must_insert_stripe)
> +			ret = add_stripe_entry_for_delayed_ref(trans, node);
>   	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
>   		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
>   					     ref->objectid, ref->offset,
>   					     node->ref_mod, extent_op);
>   	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
> +		if (node->must_insert_stripe)
> +			btrfs_drop_ordered_stripe(trans->fs_info, node->bytenr);
>   		ret = __btrfs_free_extent(trans, node, parent,
>   					  ref_root, ref->objectid,
>   					  ref->offset, node->ref_mod,
> @@ -1901,6 +1958,8 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
>   	struct btrfs_delayed_ref_root *delayed_refs;
>   	struct btrfs_delayed_extent_op *extent_op;
>   	struct btrfs_delayed_ref_node *ref;
> +	const bool need_rst_update =
> +		delayed_ref_needs_rst_update(fs_info, locked_ref);
>   	int must_insert_reserved = 0;
>   	int ret;
>   
> @@ -1951,6 +2010,7 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
>   		locked_ref->extent_op = NULL;
>   		spin_unlock(&locked_ref->lock);
>   
> +		ref->must_insert_stripe = need_rst_update;
>   		ret = run_one_delayed_ref(trans, ref, extent_op,
>   					  must_insert_reserved);
>   
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 93f2499a9c5b..bee7ed0304cd 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -551,6 +551,7 @@ struct btrfs_fs_info {
>   	struct btrfs_workqueue *endio_write_workers;
>   	struct btrfs_workqueue *endio_freespace_worker;
>   	struct btrfs_workqueue *caching_workers;
> +	struct workqueue_struct *raid_stripe_workers;
>   
>   	/*
>   	 * Fixup workers take dirty pages that didn't properly go through the
> @@ -791,6 +792,9 @@ struct btrfs_fs_info {
>   	struct lockdep_map btrfs_trans_pending_ordered_map;
>   	struct lockdep_map btrfs_ordered_extent_map;
>   
> +	rwlock_t stripe_update_lock;
> +	struct rb_root stripe_update_tree;
> +
>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>   	spinlock_t ref_verify_lock;
>   	struct rb_root block_tree;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 36ae541ad51b..74c0b486e496 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -70,6 +70,7 @@
>   #include "verity.h"
>   #include "super.h"
>   #include "orphan.h"
> +#include "raid-stripe-tree.h"
>   
>   struct btrfs_iget_args {
>   	u64 ino;
> @@ -9509,12 +9510,17 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>   	if (qgroup_released < 0)
>   		return ERR_PTR(qgroup_released);
>   
> +	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
> +						    start, len);
> +	if (ret)
> +		goto free_qgroup;
> +
>   	if (trans) {
>   		ret = insert_reserved_file_extent(trans, inode,
>   						  file_offset, &stack_fi,
>   						  true, qgroup_released);
>   		if (ret)
> -			goto free_qgroup;
> +			goto free_stripe_extent;
>   		return trans;
>   	}
>   
> @@ -9532,7 +9538,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>   	path = btrfs_alloc_path();
>   	if (!path) {
>   		ret = -ENOMEM;
> -		goto free_qgroup;
> +		goto free_stripe_extent;
>   	}
>   
>   	ret = btrfs_replace_file_extents(inode, path, file_offset,
> @@ -9540,9 +9546,12 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>   				     &trans);
>   	btrfs_free_path(path);
>   	if (ret)
> -		goto free_qgroup;
> +		goto free_stripe_extent;
>   	return trans;
>   
> +free_stripe_extent:
> +	btrfs_drop_ordered_stripe(inode->root->fs_info, start);
> +
>   free_qgroup:
>   	/*
>   	 * We have released qgroup data range at the beginning of the function,
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..d184cd9dc96e
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,202 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Western Digital Corporation or its affiliates.
> + */
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "fs.h"
> +#include "accessors.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "disk-io.h"
> +#include "print-tree.h"
> +
> +static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +	const u64 *logical = key;
> +
> +	if (*logical < stripe->logical)
> +		return -1;
> +	if (*logical >= stripe->logical + stripe->num_bytes)
> +		return 1;
> +	return 0;
> +}
> +
> +static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
> +	return ordered_stripe_cmp(&stripe->logical, rbb);
> +}
> +
> +int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
> +{
> +	struct btrfs_fs_info *fs_info = bioc->fs_info;
> +	struct btrfs_ordered_stripe *stripe;
> +	struct btrfs_io_stripe *tmp;
> +	u64 logical = bioc->logical;
> +	u64 length = bioc->size;
> +	struct rb_node *node;
> +	size_t size;
> +
> +	size = bioc->num_stripes * sizeof(struct btrfs_io_stripe);
> +	stripe = kzalloc(sizeof(struct btrfs_ordered_stripe), GFP_NOFS);
> +	if (!stripe)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&stripe->lock);
> +	tmp = kmemdup(bioc->stripes, size, GFP_NOFS);
> +	if (!tmp) {
> +		kfree(stripe);
> +		return -ENOMEM;
> +	}
> +
> +	stripe->logical = logical;
> +	stripe->num_bytes = length;
> +	stripe->num_stripes = bioc->num_stripes;
> +	spin_lock(&stripe->lock);
> +	stripe->stripes = tmp;
> +	spin_unlock(&stripe->lock);
> +	refcount_set(&stripe->ref, 1);
> +
> +	write_lock(&fs_info->stripe_update_lock);
> +	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
> +	       ordered_stripe_less);
> +	write_unlock(&fs_info->stripe_update_lock);
> +	if (node) {
> +		struct btrfs_ordered_stripe *old =
> +			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +
> +		btrfs_debug(fs_info, "logical: %llu, length: %llu already exists",
> +			  logical, length);
> +		ASSERT(logical == old->logical);
> +		write_lock(&fs_info->stripe_update_lock);
> +		rb_replace_node(node, &stripe->rb_node,
> +				&fs_info->stripe_update_tree);
> +		write_unlock(&fs_info->stripe_update_lock);
> +	}
> +
> +	return 0;
> +}
> +
> +struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *fs_info,
> +							 u64 logical)
> +{
> +	struct rb_root *root = &fs_info->stripe_update_tree;
> +	struct btrfs_ordered_stripe *stripe = NULL;
> +	struct rb_node *node;
> +
> +	read_lock(&fs_info->stripe_update_lock);
> +	node = rb_find(&logical, root, ordered_stripe_cmp);
> +	if (node) {
> +		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +		refcount_inc(&stripe->ref);
> +	}
> +	read_unlock(&fs_info->stripe_update_lock);
> +
> +	return stripe;
> +}
> +
> +void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
> +				 struct btrfs_ordered_stripe *stripe)
> +{
> +	write_lock(&fs_info->stripe_update_lock);
> +	if (refcount_dec_and_test(&stripe->ref)) {
> +		struct rb_node *node = &stripe->rb_node;
> +
> +		rb_erase(node, &fs_info->stripe_update_tree);
> +		RB_CLEAR_NODE(node);
> +
> +		spin_lock(&stripe->lock);
> +		kfree(stripe->stripes);
> +		spin_unlock(&stripe->lock);
> +		kfree(stripe);
> +	}
> +	write_unlock(&fs_info->stripe_update_lock);
> +}
> +
> +int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
> +					  u64 start, u64 len)
> +{
> +	struct btrfs_io_context *bioc = NULL;
> +	struct btrfs_ordered_stripe *stripe;
> +	u64 map_length = len;
> +	int ret;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return 0;
> +
> +	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, start, &map_length,
> +			      &bioc, 0);
> +	if (ret)
> +		return ret;
> +
> +	bioc->size = len;
> +
> +	stripe = btrfs_lookup_ordered_stripe(fs_info, start);
> +	if (!stripe) {
> +		ret = btrfs_add_ordered_stripe(bioc);
> +		if (ret)
> +			return ret;
> +	} else {
> +		spin_lock(&stripe->lock);
> +		memcpy(stripe->stripes, bioc->stripes,
> +		       bioc->num_stripes * sizeof(struct btrfs_io_stripe));
> +		spin_unlock(&stripe->lock);
> +		btrfs_put_ordered_stripe(fs_info, stripe);
> +	}
> +
> +	return 0;
> +}
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> +			     struct btrfs_ordered_stripe *stripe)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_key stripe_key;
> +	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
> +	struct btrfs_stripe_extent *stripe_extent;
> +	size_t item_size;
> +	int ret;
> +
> +	item_size = stripe->num_stripes * sizeof(struct btrfs_raid_stride);
> +
> +	stripe_extent = kzalloc(item_size, GFP_NOFS);
> +	if (!stripe_extent) {
> +		btrfs_abort_transaction(trans, -ENOMEM);
> +		btrfs_end_transaction(trans);
> +		return -ENOMEM;
> +	}
> +
> +	spin_lock(&stripe->lock);
> +	for (int i = 0; i < stripe->num_stripes; i++) {
> +		u64 devid = stripe->stripes[i].dev->devid;
> +		u64 physical = stripe->stripes[i].physical;
> +		struct btrfs_raid_stride *raid_stride =
> +						&stripe_extent->strides[i];
> +
> +		btrfs_set_stack_raid_stride_devid(raid_stride, devid);
> +		btrfs_set_stack_raid_stride_physical(raid_stride, physical);
> +	}
> +	spin_unlock(&stripe->lock);
> +
> +	stripe_key.objectid = stripe->logical;
> +	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> +	stripe_key.offset = stripe->num_bytes;
> +
> +	ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> +				item_size);
> +	if (ret)
> +		btrfs_abort_transaction(trans, ret);
> +
> +	kfree(stripe_extent);
> +
> +	return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> new file mode 100644
> index 000000000000..60d3f8489cc9
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2022 Western Digital Corporation or its affiliates.
> + */
> +
> +#ifndef BTRFS_RAID_STRIPE_TREE_H
> +#define BTRFS_RAID_STRIPE_TREE_H
> +
> +#include "disk-io.h"
> +#include "messages.h"
> +
> +struct btrfs_io_context;
> +
> +struct btrfs_ordered_stripe {
> +	struct rb_node rb_node;
> +
> +	u64 logical;
> +	u64 num_bytes;
> +	int num_stripes;
> +	struct btrfs_io_stripe *stripes;
> +	spinlock_t lock;
> +	refcount_t ref;
> +};
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> +			     struct btrfs_ordered_stripe *stripe);
> +int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
> +					  u64 start, u64 len);
> +struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
> +						 struct btrfs_fs_info *fs_info,
> +						 u64 logical);
> +int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
> +void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
> +					    struct btrfs_ordered_stripe *stripe);
> +
> +static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
> +						 u64 map_type)
> +{
> +	u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
> +	u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return false;
> +
> +	if (type != BTRFS_BLOCK_GROUP_DATA)
> +		return false;
> +
> +	if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline void btrfs_drop_ordered_stripe(struct btrfs_fs_info *fs_info,
> +					     u64 logical)
> +{
> +	struct btrfs_ordered_stripe *stripe;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return;
> +
> +	stripe = btrfs_lookup_ordered_stripe(fs_info, logical);
> +	if (!stripe)
> +		return;
> +	ASSERT(refcount_read(&stripe->ref) == 2);
> +	/* once for us */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +	/* once for the tree */
> +	btrfs_put_ordered_stripe(fs_info, stripe);
> +}
> +#endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 707dd0456cea..e7c0353e5655 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5885,6 +5885,7 @@ static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
>   }
>   
>   static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
> +						       u64 logical,
>   						       int total_stripes,
>   						       int real_stripes)
>   {
> @@ -5908,6 +5909,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
>   	refcount_set(&bioc->refs, 1);
>   
>   	bioc->fs_info = fs_info;
> +	bioc->logical = logical;
>   	bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
>   	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
>   
> @@ -6513,7 +6515,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>   		goto out;
>   	}
>   
> -	bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes, tgtdev_indexes);
> +	bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes,
> +				      tgtdev_indexes);
>   	if (!bioc) {
>   		ret = -ENOMEM;
>   		goto out;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 7e51f2238f72..5d7547b5fa87 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -368,12 +368,10 @@ struct btrfs_fs_devices {
>   
>   struct btrfs_io_stripe {
>   	struct btrfs_device *dev;
> -	union {
> -		/* Block mapping */
> -		u64 physical;
> -		/* For the endio handler */
> -		struct btrfs_io_context *bioc;
> -	};
> +	/* Block mapping */
> +	u64 physical;
> +	/* For the endio handler */
> +	struct btrfs_io_context *bioc;
>   };
>   
>   struct btrfs_discard_stripe {
> @@ -409,6 +407,8 @@ struct btrfs_io_context {
>   	int mirror_num;
>   	int num_tgtdevs;
>   	int *tgtdev_map;
> +	u64 logical;
> +	u64 size;
>   	/*
>   	 * logical block numbers for the start of each stripe
>   	 * The last one or two are p/q.  These are sorted,
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index d862477f79f3..ed49150e6e6f 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1687,6 +1687,10 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
>   	u64 *logical = NULL;
>   	int nr, stripe_len;
>   
> +	/* Filesystems with a stripe tree have their own l2p mapping */
> +	if (btrfs_stripe_tree_root(fs_info))
> +		return;
> +
>   	/* Zoned devices should not have partitions. So, we can assume it is 0 */
>   	ASSERT(!bdev_is_partition(ordered->bdev));
>   	if (WARN_ON(!ordered->bdev))

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-13  6:57   ` Christoph Hellwig
@ 2023-02-13 10:43     ` Johannes Thumshirn
  2023-02-13 13:12       ` Johannes Thumshirn
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-13 10:43 UTC (permalink / raw)
  To: hch; +Cc: linux-btrfs

On 13.02.23 07:57, Christoph Hellwig wrote:
>> --- a/fs/btrfs/bio.h
>> +++ b/fs/btrfs/bio.h
>> @@ -58,6 +58,8 @@ struct btrfs_bio {
>>  	atomic_t pending_ios;
>>  	struct work_struct end_io_work;
>>  
>> +	struct work_struct raid_stripe_work;
> 
> You should be able to reused end_io_work here, as it is only used
> for reads currently.

OK, then lets do that.

>>  
>> +static bool delayed_ref_needs_rst_update(struct btrfs_fs_info *fs_info,
>> +					 struct btrfs_delayed_ref_head *head)
>> +{
>> +	struct extent_map *em;
>> +	struct map_lookup *map;
>> +	bool ret = false;
>> +
>> +	if (!btrfs_stripe_tree_root(fs_info))
>> +		return ret;
>> +
>> +	em = btrfs_get_chunk_map(fs_info, head->bytenr, head->num_bytes);
>> +	if (!em)
>> +		return ret;
>> +
>> +	map = em->map_lookup;
>> +
>> +	if (btrfs_need_stripe_tree_update(fs_info, map->type))
> 
> This just seems very expensive.  Is there no way to propafate
> this information without doign a chunk map lookup every time?


Yes, I thought I had already done that by using
btrfs_delayed_ref_node::must_insert_stripe, but obviously forgot to delete
the lookup here.

> 
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -1687,6 +1687,10 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
>>  	u64 *logical = NULL;
>>  	int nr, stripe_len;
>>  
>> +	/* Filesystems with a stripe tree have their own l2p mapping */
>> +	if (btrfs_stripe_tree_root(fs_info))
>> +		return;
> 
> I don't think we should even be able to readch this, as the call to
> btrfs_rewrite_logical_zoned is guarded by having a valid
> ordered->physical… and that is only set in btrfs_simple_end_io.
> So this could just be an assert.
> 
> 

Let me check, but yeah only having an ASSERT() here would be even better.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-13  7:40   ` Qu Wenruo
@ 2023-02-13 10:49     ` Johannes Thumshirn
  2023-02-13 11:12       ` Qu Wenruo
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-13 10:49 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 13.02.23 08:40, Qu Wenruo wrote:
> So the in-memory stripe tree entry insersion is delated.
> 
> Could the following race happen?
> 
>               T1                  |              T2
> ---------------------------------+----------------------------------
> write_pages()                    |
> btrfs_orig_write_end_io()        |
> |- INIT_WORK();                  |
> `- queue_work();                 |
>                                   | btrfs_invalidate_folio()
>                                   | `- the pages are no longer cached
>                                   |
>                                   | btrfs_do_readpage()
>                                   | |- do whatever the rst lookup
> workqueue                        |
> `- btrfs_raid_stripe_update()    |
>     `- btrfs_add_ordered_stripe() |
> 
> In above case, T2 read will fail as it can not grab the RST mapping.
> 
> I really believe the in-memory rst update should not be delayed into a 
> workqueue, but done inside the write endio function.

I haven't yet thought about that race, but doing memory allocations from
inside an endio function doesn't sound appealing to me.

An obvious solution to this would of course be to bump the refcount on the 
btrfs_io_context (which I have forgotten here, thanks for catching it).

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 02/13] btrfs: add raid stripe tree definitions
  2023-02-13  6:50   ` Christoph Hellwig
@ 2023-02-13 11:10     ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-13 11:10 UTC (permalink / raw)
  To: hch, Josef Bacik, David Sterba; +Cc: linux-btrfs

On 13.02.23 07:50, Christoph Hellwig wrote:
> On Wed, Feb 08, 2023 at 02:57:39AM -0800, Johannes Thumshirn wrote:
>> Add definitions for the raid stripe tree. This tree will hold information
>> about the on-disk layout of the stripes in a RAID set.
>>
>> Each stripe extent has a 1:1 relationship with an on-disk extent item and
>> is doing the logical to per-drive physical address translation for the
>> extent item in question.
> 
> So this basially removes the need to trak the physical address in
> the chunk tree.  Is there any way to stop maintaining it at all?
> If not, why?  
> 
> 

Isn't the chunk tree only storing the physical start of a 
chunk/block-group? 

What we /could/ do is change the absolute physical addresses in the stripe
tree to offsets from the chunk start. On the upside that would give us the
ability to use u32 instead of u64 and thus shrink the on-disk format, but
on the flip-side we'd need to obtain the chunk start addresses and calculate
the offsets on each endio. Classic time-memory tradeoff I guess.
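
Purely to illustrate the size argument (hypothetical, not what this series
puts on disk), a chunk-relative stride could look like:

/* Hypothetical chunk-relative variant of the on-disk raid stride. */
struct btrfs_raid_stride_rel {
	/* The id of the device this stride lives on. */
	__le64 devid;
	/* Offset of this stride from the start of its chunk on that device. */
	__le32 chunk_offset;
} __attribute__ ((__packed__));

vs. the full __le64 physical address the current format stores, at the cost
of a chunk lookup plus an addition on every read/endio.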

But then the chunk tree is needed to bootstrap the FS as well. And the RST
is an optional incompatible feature, so the code would get uglier if we had
to distinguish between these two cases.

Josef, David? What are your thoughts on this?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-13 10:49     ` Johannes Thumshirn
@ 2023-02-13 11:12       ` Qu Wenruo
  2023-02-13 11:37         ` Johannes Thumshirn
  0 siblings, 1 reply; 48+ messages in thread
From: Qu Wenruo @ 2023-02-13 11:12 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-btrfs



On 2023/2/13 18:49, Johannes Thumshirn wrote:
> On 13.02.23 08:40, Qu Wenruo wrote:
>> So the in-memory stripe tree entry insersion is delated.
>>
>> Could the following race happen?
>>
>>                T1                  |              T2
>> ---------------------------------+----------------------------------
>> write_pages()                    |
>> btrfs_orig_write_end_io()        |
>> |- INIT_WORK();                  |
>> `- queue_work();                 |
>>                                    | btrfs_invalidate_folio()
>>                                    | `- the pages are no longer cached
>>                                    |
>>                                    | btrfs_do_readpage()
>>                                    | |- do whatever the rst lookup
>> workqueue                        |
>> `- btrfs_raid_stripe_update()    |
>>      `- btrfs_add_ordered_stripe() |
>>
>> In above case, T2 read will fail as it can not grab the RST mapping.
>>
>> I really believe the in-memory rst update should not be delayed into a
>> workqueue, but done inside the write endio function.
> 
> I haven't yet thought about that race, but doing memory allocations from
> inside an endio function doesn't sound appealing to me.

Another solution is to always pre-allocate the memory for the
in-memory structure.

> 
> An obvious solution to this would of cause be to bump the refcount on the
> btrfs_io_context (which I have forgotten here thanks for catching it).

I'm not sure if the bioc refcount is involved.

Once the writeback flag is cleared, the race can happen at any time.

For non-RST we prevent this with the following check:

- btrfs_lock_and_flush_ordered_range() at read time
   This ensures we wait for any ordered extent to finish before reading.

But the delayed rst tree update is not guaranteed to happen before
btrfs_finish_ordered_io(), thus the race can happen.

So my idea is to put the RST tree update into btrfs_finish_ordered_io(),
although this also means we need to pass some extra arguments for ordered extents.
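
As rough pseudocode (all helper names below are made up, just to show the
intended ordering, not actual kernel functions):

/* Sketch only: do the RST update before the ordered extent completes. */
static void finish_ordered_io(struct ordered_extent *ordered)
{
	/* 1. Make the in-memory RST mapping visible first. */
	rst_add_ordered_stripe(ordered);

	/* 2. Only then complete the ordered extent.  Readers go through
	 * btrfs_lock_and_flush_ordered_range() and wait for this step,
	 * so by the time they run, the RST lookup can succeed. */
	complete_ordered_extent(ordered);
}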

Thanks,
Qu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 02/13] btrfs: add raid stripe tree definitions
  2023-02-08 10:57 ` [PATCH v5 02/13] btrfs: add raid stripe tree definitions Johannes Thumshirn
  2023-02-08 19:42   ` Josef Bacik
  2023-02-13  6:50   ` Christoph Hellwig
@ 2023-02-13 11:34   ` Anand Jain
  2 siblings, 0 replies; 48+ messages in thread
From: Anand Jain @ 2023-02-13 11:34 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-btrfs

On 08/02/2023 18:57, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
> 
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

LGTM

Reviewed-by: Anand Jain <anand.jain@oracle.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk
  2023-02-08 10:57 ` [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
  2023-02-08 19:43   ` Josef Bacik
@ 2023-02-13 11:35   ` Anand Jain
  1 sibling, 0 replies; 48+ messages in thread
From: Anand Jain @ 2023-02-13 11:35 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-btrfs

On 08/02/2023 18:57, Johannes Thumshirn wrote:
> If we find a raid-stripe-tree on mount, read it from disk.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-13 11:12       ` Qu Wenruo
@ 2023-02-13 11:37         ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-13 11:37 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 13.02.23 12:12, Qu Wenruo wrote:
> 
> 
> On 2023/2/13 18:49, Johannes Thumshirn wrote:
>> On 13.02.23 08:40, Qu Wenruo wrote:
>>> So the in-memory stripe tree entry insertion is delayed.
>>>
>>> Could the following race happen?
>>>
>>>                T1                  |              T2
>>> ---------------------------------+----------------------------------
>>> write_pages()                    |
>>> btrfs_orig_write_end_io()        |
>>> |- INIT_WORK();                  |
>>> `- queue_work();                 |
>>>                                    | btrfs_invalidate_folio()
>>>                                    | `- the pages are no longer cached
>>>                                    |
>>>                                    | btrfs_do_readpage()
>>>                                    | |- do whatever the rst lookup
>>> workqueue                        |
>>> `- btrfs_raid_stripe_update()    |
>>>      `- btrfs_add_ordered_stripe() |
>>>
>>> In the above case, the T2 read will fail as it cannot grab the RST mapping.
>>>
>>> I really believe the in-memory rst update should not be delayed into a
>>> workqueue, but done inside the write endio function.
>>
>> I haven't yet thought about that race, but doing memory allocations from
>> inside an endio function doesn't sound appealing to me.
> 
> Another solution is to always pre-allocate the memory for the
> in-memory structure.
> 
>>
>> An obvious solution to this would of course be to bump the refcount on the
>> btrfs_io_context (which I have forgotten here, thanks for catching it).
> 
> I'm not sure if the bioc refcount is involved.
> 
> Once the writeback flag is cleared, the race can happen at any time.
> 
> For non-RST we prevent this with the following check:
> 
> - btrfs_lock_and_flush_ordered_range() at read time
>    This ensures we wait for any ordered extent to finish before reading.
> 
> But the delayed rst tree update is not guaranteed to happen before
> btrfs_finish_ordered_io(), thus the race can happen.
> 
> So my idea is to put the RST tree update into btrfs_finish_ordered_io(),
> although this also means we need to pass some extra arguments for ordered extents.

Ah OK, gotcha. I think this race can happen. Let me think about it.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents
  2023-02-13 10:43     ` Johannes Thumshirn
@ 2023-02-13 13:12       ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-13 13:12 UTC (permalink / raw)
  To: hch; +Cc: linux-btrfs

On 13.02.23 11:43, Johannes Thumshirn wrote:
>>> +static bool delayed_ref_needs_rst_update(struct btrfs_fs_info *fs_info,
>>> +					 struct btrfs_delayed_ref_head *head)
>>> +{
>>> +	struct extent_map *em;
>>> +	struct map_lookup *map;
>>> +	bool ret = false;
>>> +
>>> +	if (!btrfs_stripe_tree_root(fs_info))
>>> +		return ret;
>>> +
>>> +	em = btrfs_get_chunk_map(fs_info, head->bytenr, head->num_bytes);
>>> +	if (!em)
>>> +		return ret;
>>> +
>>> +	map = em->map_lookup;
>>> +
>>> +	if (btrfs_need_stripe_tree_update(fs_info, map->type))
>> This just seems very expensive.  Is there no way to propagate
>> this information without doing a chunk map lookup every time?
> 
> Yes, I thought I had already done that by using btrfs_delayed_ref_node::must_insert_stripe,
> but I obviously forgot to delete the lookup here.
> 


Actually, I was confused. We're only doing the chunk map lookup once per
delayed_ref head now, instead of once per delayed_ref. I've already changed the code.
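
Hand-waving a bit, the idea looks like this (the iterator and field layout
below are simplified placeholders, not the real structures):

static void mark_refs_for_rst(struct btrfs_fs_info *fs_info,
			      struct btrfs_delayed_ref_head *head)
{
	/* One chunk-map lookup per delayed ref head ... */
	bool needs_rst = delayed_ref_needs_rst_update(fs_info, head);
	struct btrfs_delayed_ref_node *ref;

	/* ... propagated as a flag to every ref hanging off that head. */
	for_each_delayed_ref(head, ref)		/* made-up iterator */
		ref->must_insert_stripe = needs_rst;
}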

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 05/13] btrfs: delete stripe extent on extent deletion
  2023-02-08 20:00   ` Josef Bacik
@ 2023-02-13 15:21     ` Johannes Thumshirn
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-13 15:21 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 08.02.23 21:00, Josef Bacik wrote:
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 50b3a2c3c0dd..f08ee7d9211c 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -3238,6 +3238,14 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>>  			}
>>  		}
>>  
>> +		if (is_data) {
>> +			ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
>> +			if (ret) {
>> +				btrfs_abort_transaction(trans, ret);
>> +				return ret;
>> +			}
>> +		}
>> +
> We're still holding the path open, so now we have a lockdep thing of extent root
> -> RST, which will for sure bite us in the ass in the future.  Push this part
> just under this part of the code, after the btrfs_release_path().
> 
> Also since we have a path here, add it to the arguments for
> btrfs_delete_raid_extent() so we don't have to allocate a new path. 

I'm not sure what you mean here.

With the change below, I'm getting this splat:

bash-5.1# rm /mnt/test/test                                                                                                                                                           
[   46.519048] ------------[ cut here ]------------                                                                                                                                   
[   46.520012] WARNING: CPU: 0 PID: 80 at fs/btrfs/ctree.c:2050 btrfs_search_slot+0x99a/0xd50                                                                                         
[   46.521652] CPU: 0 PID: 80 Comm: kworker/u8:8 Not tainted 6.2.0-rc7+ #657                                                                                                          
[   46.522883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[   46.524575] Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space        
[   46.526016] RIP: 0010:btrfs_search_slot+0x99a/0xd50
[   46.526911] Code: 00 e9 63 fb ff ff c7 44 24 44 08 00 00 00 89 c8 c7 44 24 28 08 00 00 00 e9 60 f7 ff ff 0f 0b 49 83 3c 24 00 0f 84 e9 f6 ff ff <0f> 0b e9 e2 f6 ff ff 0f 0b e8 38 2c db ff 85 c0 0f 85 5f f7 ff ff
[   46.530527] RSP: 0018:ffffc90000b5faf8 EFLAGS: 00010286
[   46.531479] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffff88810235b700
[   46.532817] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81ec6182
[   46.534250] RBP: ffff8881056b8000 R08: 00000000ffffffff R09: 0000000000000001
[   46.535524] R10: 0000000000001000 R11: 00000000000000d0 R12: ffff88810a75e230
[   46.536968] R13: ffff88810524e000 R14: ffff88810a75e230 R15: 000000000000002d
[   46.538273] FS:  0000000000000000(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
[   46.539963] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   46.541085] CR2: 000055aea08c9410 CR3: 00000001122fa000 CR4: 0000000000350eb0
[   46.542395] Call Trace:
[   46.542868]  <TASK>
[   46.543448]  ? set_extent_buffer_dirty+0x17/0x170
[   46.544409]  ? btrfs_mark_buffer_dirty+0xd7/0x100
[   46.545270]  ? btrfs_del_items+0x394/0x4c0
[   46.546031]  btrfs_delete_raid_extent+0x4a/0x80
[   46.547015]  __btrfs_free_extent+0x446/0x7e0
[   46.547812]  __btrfs_run_delayed_refs+0x331/0x1210
[   46.548794]  btrfs_run_delayed_refs+0x89/0x1b0
[   46.549661]  flush_space+0x392/0x5e0
[   46.550418]  ? _raw_spin_unlock+0x12/0x30
[   46.551156]  ? btrfs_get_alloc_profile+0xbd/0x1a0
[   46.552095]  btrfs_preempt_reclaim_metadata_space+0x96/0x1e0
[   46.553191]  process_one_work+0x1d9/0x3e0
[   46.554038]  worker_thread+0x4a/0x3c0
[   46.554726]  ? __pfx_worker_thread+0x10/0x10
[   46.555514]  kthread+0xf5/0x120
[   46.556173]  ? __pfx_kthread+0x10/0x10
[   46.557020]  ret_from_fork+0x2c/0x50
[   46.557695]  </TASK>
[   46.558107] ---[ end trace 0000000000000000 ]---




diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 10827b7db13c..c040c1c70075 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3238,20 +3238,23 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 
-		if (is_data) {
-			ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
-			if (ret) {
-				btrfs_abort_transaction(trans, ret);
-				return ret;
-			}
-		}
-
 		ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
 				      num_to_del);
 		if (ret) {
 			btrfs_abort_transaction(trans, ret);
 			goto out;
 		}
+
+		if (is_data) {
+			ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes,
+						path);
+			if (ret) {
+				btrfs_abort_transaction(trans, ret);
+				goto out;
+			}
+		}
+
+
 		btrfs_release_path(path);
 
 		ret = do_free_extent_accounting(trans, bytenr, num_bytes, is_data);
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 8799a7abaf38..a60cdfaa9359 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -157,11 +157,11 @@ void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 }
 
 int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
-			     u64 length)
+			u64 length, struct btrfs_path *path)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
-	struct btrfs_path *path;
+	//	struct btrfs_path *path;
 	struct btrfs_key stripe_key;
 	int ret;
 
@@ -172,9 +172,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
 	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
 	stripe_key.offset = length;
 
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
+	/* path = btrfs_alloc_path(); */
+	/* if (!path) */
+	/* 	return -ENOMEM; */
 
 	ret = btrfs_search_slot(trans, stripe_root, &stripe_key, path, -1, 1);
 	if (ret < 0)
@@ -182,7 +182,7 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
 
 	ret = btrfs_del_item(trans, stripe_root, path);
 out:
-	btrfs_free_path(path);
+	/* btrfs_free_path(path); */
 	return ret;
 
 }
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 371409351d60..fc415ec77c36 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -27,7 +27,7 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
 				 u32 stripe_index,
 				 struct btrfs_io_stripe *stripe);
 int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
-			     u64 length);
+			u64 length, struct btrfs_path *path);
 int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_ordered_stripe *stripe);
 int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
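
For what it's worth, a guess at the splat above: the path handed in still
holds the nodes and locks from the extent tree search (path->nodes[0] != NULL),
and btrfs_search_slot() warns when given a path that hasn't been released.
A minimal sketch of reusing the path under that assumption:

	/* Drop the extent tree locks/references before reusing the path
	 * for the stripe tree lookup. */
	btrfs_release_path(path);

	ret = btrfs_search_slot(trans, stripe_root, &stripe_key, path, -1, 1);
	if (ret < 0)
		goto out;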


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-10  8:44   ` Johannes Thumshirn
  2023-02-10 10:33     ` Johannes Thumshirn
@ 2023-02-13 16:42     ` Phillip Susi
  2023-02-13 17:44       ` Johannes Thumshirn
  1 sibling, 1 reply; 48+ messages in thread
From: Phillip Susi @ 2023-02-13 16:42 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs


Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:

> No. With zoned drives we're writing using the Zone Append command [1].
> This has several advantages, one being that you can issue IO at a high
> queue depth and don't need any locking. But it has one downside for
> the RAID application: you don't have any control over the
> LBA where the data lands, only the zone.

Can they be reordered in the queue?  As long as they are issued in the
same order on both drives and can't get reordered, I would think that
the write pointer on both drives would remain in sync.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-13 16:42     ` Phillip Susi
@ 2023-02-13 17:44       ` Johannes Thumshirn
  2023-02-13 17:56         ` Phillip Susi
  2023-02-14  5:51         ` Christoph Hellwig
  0 siblings, 2 replies; 48+ messages in thread
From: Johannes Thumshirn @ 2023-02-13 17:44 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-btrfs

On 13.02.23 17:47, Phillip Susi wrote:
> 
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:
> 
>> No. With zoned drives we're writing using the Zone Append command [1].
>> This has several advantages, one being that you can issue IO at a high
>> queue depth and don't need any locking. But it has one downside for
>> the RAID application: you don't have any control over the
>> LBA where the data lands, only the zone.
> 
> Can they be reordered in the queue?  As long as they are issued in the
> same order on both drives and can't get reordered, I would think that
> the write pointer on both drives would remain in sync.
> 

There is no guarantee for that, no. The block layer can theoretically
re-order all WRITEs. This is why btrfs also needs the mq-deadline IO
scheduler, as metadata is written as WRITE with QD=1 (protected by
btrfs_meta_io_lock() inside btrfs and the zone write lock in the
IO scheduler).

I unfortunately can't remember the exact reasons why the block layer
can't be made to preserve IO ordering. I'd have to defer
that question to Christoph.
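
To illustrate why the per-device locations can diverge at all, here is a
minimal, purely illustrative completion handler (not code from this series):
with REQ_OP_ZONE_APPEND the device picks the write location and the block
layer reports it back in the completed bio, so two mirrors can legitimately
return different offsets for the same data.

/* Illustrative only: read back the location the device chose. */
static void zone_append_end_io(struct bio *bio)
{
	/* For REQ_OP_ZONE_APPEND, bi_sector is updated on completion to
	 * the sector the data was actually written to. */
	sector_t written = bio->bi_iter.bi_sector;

	pr_debug("zone append landed at sector %llu\n",
		 (unsigned long long)written);
	bio_put(bio);
}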

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-13 17:44       ` Johannes Thumshirn
@ 2023-02-13 17:56         ` Phillip Susi
  2023-02-14  5:52           ` Christoph Hellwig
  2023-02-14  5:51         ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Phillip Susi @ 2023-02-13 17:56 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs


Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:

> There is no guarantee for that, no. The block layer can theoretically
> re-order all WRITEs. This is why btrfs also needs the mq-deadline IO

Unless you submit barriers to prevent that, right?  Why not do that?

> scheduler, as metadata is written as WRITE with QD=1 (protected by
> btrfs_meta_io_lock() inside btrfs and the zone write lock in the
> IO scheduler).
>
> I unfortunately can't remember the exact reasons why the block layer
> can't be made to preserve IO ordering. I'd have to defer
> that question to Christoph.

I would think that to prevent fragmentation, you would want to try to
flush a large portion of data from a particular file in order, then move
to another file.  If you have large streaming writes to two files at the
same time and the allocator decides to put them in the same zone, and
they are just submitted to the stack in any order, isn't this
likely to lead to a lot of fragmentation?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-13 17:44       ` Johannes Thumshirn
  2023-02-13 17:56         ` Phillip Susi
@ 2023-02-14  5:51         ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-02-14  5:51 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Phillip Susi, linux-btrfs

On Mon, Feb 13, 2023 at 05:44:38PM +0000, Johannes Thumshirn wrote:
> I unfortunately can't remember the exact reasons why the block layer
> can't be made to preserve IO ordering. I'd have to defer
> that question to Christoph.

The block layer can avoid reordering, but it's very costly and limits
you to a single queue instead of the multiple queues that blk-mq has.

Similarly, the protocol and device can reorder, or more typically just
not preserve order (e.g. multiple queues, multiple connections, error
handling).


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/13] btrfs: introduce RAID stripe tree
  2023-02-13 17:56         ` Phillip Susi
@ 2023-02-14  5:52           ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-02-14  5:52 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Johannes Thumshirn, linux-btrfs

On Mon, Feb 13, 2023 at 12:56:09PM -0500, Phillip Susi wrote:
> 
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:
> 
> > There is no guarantee for that, no. The block layer can theoretically
> > re-order all WRITEs. This is why btrfs also needs the mq-deadline IO
> 
> Unless you submit barriers to prevent that, right?  Why not do that?

There has been no such thing as a "barrier" since 2.6.10 or so.  And that's
a good thing, as they are extremely costly.

> I would think that to prevent fragmentation, you would want to try to
> flush a large portion of data from a particular file in order, then move
> to another file.  If you have large streaming writes to two files at the
> same time and the allocator decides to put them in the same zone, and
> they are just submitted to the stack in any order, isn't this
> likely to lead to a lot of fragmentation?

If you submit small chunks of different files to the same block group
you're always going to get fragmentation, zones or not.  Zones will
make it even worse due to the lack of preallocations or sparse use
of the block groups, though.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2023-02-14  5:53 UTC | newest]

Thread overview: 48+ messages
2023-02-08 10:57 [PATCH v5 00/13] btrfs: introduce RAID stripe tree Johannes Thumshirn
2023-02-08 10:57 ` [PATCH v5 01/13] btrfs: re-add trans parameter to insert_delayed_ref Johannes Thumshirn
2023-02-08 19:41   ` Josef Bacik
2023-02-08 10:57 ` [PATCH v5 02/13] btrfs: add raid stripe tree definitions Johannes Thumshirn
2023-02-08 19:42   ` Josef Bacik
2023-02-13  6:50   ` Christoph Hellwig
2023-02-13 11:10     ` Johannes Thumshirn
2023-02-13 11:34   ` Anand Jain
2023-02-08 10:57 ` [PATCH v5 03/13] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
2023-02-08 19:43   ` Josef Bacik
2023-02-13 11:35   ` Anand Jain
2023-02-08 10:57 ` [PATCH v5 04/13] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2023-02-08 19:47   ` Josef Bacik
2023-02-08 19:52   ` Josef Bacik
2023-02-13  6:57   ` Christoph Hellwig
2023-02-13 10:43     ` Johannes Thumshirn
2023-02-13 13:12       ` Johannes Thumshirn
2023-02-13  7:40   ` Qu Wenruo
2023-02-13 10:49     ` Johannes Thumshirn
2023-02-13 11:12       ` Qu Wenruo
2023-02-13 11:37         ` Johannes Thumshirn
2023-02-08 10:57 ` [PATCH v5 05/13] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
2023-02-08 20:00   ` Josef Bacik
2023-02-13 15:21     ` Johannes Thumshirn
2023-02-08 10:57 ` [PATCH v5 06/13] btrfs: lookup physical address from stripe extent Johannes Thumshirn
2023-02-08 20:16   ` Josef Bacik
2023-02-08 10:57 ` [PATCH v5 07/13] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
2023-02-08 10:57 ` [PATCH v5 08/13] btrfs: zoned: allow zoned RAID Johannes Thumshirn
2023-02-08 20:18   ` Josef Bacik
2023-02-08 10:57 ` [PATCH v5 09/13] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
2023-02-08 20:19   ` Josef Bacik
2023-02-08 10:57 ` [PATCH v5 10/13] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
2023-02-08 10:57 ` [PATCH v5 11/13] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
2023-02-08 20:20   ` Josef Bacik
2023-02-08 10:57 ` [PATCH v5 12/13] btrfs: consult raid-stripe-tree when scrubbing Johannes Thumshirn
2023-02-08 20:21   ` Josef Bacik
2023-02-08 10:57 ` [PATCH v5 13/13] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
2023-02-08 20:23   ` Josef Bacik
2023-02-09  0:42 ` [PATCH v5 00/13] btrfs: introduce RAID stripe tree Qu Wenruo
2023-02-09  8:47   ` Johannes Thumshirn
2023-02-09 15:57 ` Phillip Susi
2023-02-10  8:44   ` Johannes Thumshirn
2023-02-10 10:33     ` Johannes Thumshirn
2023-02-13 16:42     ` Phillip Susi
2023-02-13 17:44       ` Johannes Thumshirn
2023-02-13 17:56         ` Phillip Susi
2023-02-14  5:52           ` Christoph Hellwig
2023-02-14  5:51         ` Christoph Hellwig
