All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v3 00/11] btrfs: raid-stripe-tree draft patches
@ 2022-10-17 11:55 Johannes Thumshirn
  2022-10-17 11:55 ` [RFC v3 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
                   ` (11 more replies)
  0 siblings, 12 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Here's a yet another draft of my btrfs zoned RAID patches. It's based on
Christoph's bio splitting series for btrfs.

Updates of the raid-stripe-tree are done at delayed-ref time to safe on
bandwidth while for reading we do the stripe-tree lookup on bio mapping time,
i.e. when the logical to physical translation happens for regular btrfs RAID
as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
it's contents are the respective physical device id and position.

For an example 1M write (split into 126K segments due to zone-append)
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows:

rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
btrfs-progs v5.16.1 
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
leaf 805847040 flags 0x1(WRITTEN) backref revision 1
checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
        item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
                        stripe 0 devid 1 offset 939524096
                        stripe 1 devid 2 offset 536870912
        item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
                        stripe 0 devid 1 offset 939651072
                        stripe 1 devid 2 offset 536997888
        item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
                        stripe 0 devid 1 offset 939778048
                        stripe 1 devid 2 offset 537124864
        item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
                        stripe 0 devid 1 offset 939905024
                        stripe 1 devid 2 offset 537251840
        item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
                        stripe 0 devid 1 offset 940032000
                        stripe 1 devid 2 offset 537378816
        item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
                        stripe 0 devid 1 offset 940158976
                        stripe 1 devid 2 offset 537505792
        item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
                        stripe 0 devid 1 offset 940285952
                        stripe 1 devid 2 offset 537632768
        item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
                        stripe 0 devid 1 offset 940412928
                        stripe 1 devid 2 offset 537759744
        item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
                        stripe 0 devid 1 offset 940539904
                        stripe 1 devid 2 offset 537886720
total bytes 26843545600
bytes used 1245184
uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Johannes Thumshirn (11):
  btrfs: add raid stripe tree definitions
  btrfs: read raid-stripe-tree from disk
  btrfs: add support for inserting raid stripe extents
  btrfs: delete stripe extent on extent deletion
  btrfs: lookup physical address from stripe extent
  btrfs: add raid stripe tree pretty printer
  btrfs: zoned: allow zoned RAID1
  btrfs: allow zoned RAID0 and 10
  btrfs: fix striping with RST
  btrfs: check for leaks of ordered stripes on umount
  btrfs: add tracepoints for ordered stripes

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/block-rsv.c            |   1 +
 fs/btrfs/ctree.h                |  33 +++
 fs/btrfs/disk-io.c              |  17 ++
 fs/btrfs/extent-tree.c          |  56 +++++
 fs/btrfs/inode.c                |   6 +
 fs/btrfs/print-tree.c           |  21 ++
 fs/btrfs/raid-stripe-tree.c     | 394 ++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  60 +++++
 fs/btrfs/super.c                |   1 +
 fs/btrfs/volumes.c              |  66 +++++-
 fs/btrfs/volumes.h              |  14 +-
 fs/btrfs/zoned.c                |  43 ++++
 include/trace/events/btrfs.h    |  50 ++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  20 +-
 16 files changed, 768 insertions(+), 17 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

-- 
2.37.3


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC v3 01/11] btrfs: add raid stripe tree definitions
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:21   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Add definitions for the raid stripe tree. This tree will hold information
about the on-disk layout of the stripes in a RAID set.

Each stripe extent has a 1:1 relationship with an on-disk extent item and
is doing the logical to per-drive physical address translation for the
extent item in question.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/ctree.h                | 29 +++++++++++++++++++++++++++++
 include/uapi/linux/btrfs_tree.h | 20 ++++++++++++++++++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 87bb4218276b..80ead27299dc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1992,6 +1992,35 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
 
+BTRFS_SETGET_FUNCS(stripe_extent_devid, struct btrfs_stripe_extent, devid, 64);
+BTRFS_SETGET_FUNCS(stripe_extent_physical, struct btrfs_stripe_extent, physical, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_devid, struct btrfs_stripe_extent, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_physical, struct btrfs_stripe_extent, physical, 64);
+
+static inline struct btrfs_stripe_extent *btrfs_stripe_extent_nr(
+					 struct btrfs_dp_stripe *dps, int nr)
+{
+	unsigned long offset = (unsigned long)dps;
+
+	offset += offsetof(struct btrfs_dp_stripe, extents);
+	offset += nr * sizeof(struct btrfs_stripe_extent);
+	return (struct btrfs_stripe_extent *)offset;
+}
+
+static inline u64 btrfs_stripe_extent_devid_nr(const struct extent_buffer *eb,
+					       struct btrfs_dp_stripe *dps,
+					       int nr)
+{
+	return btrfs_stripe_extent_devid(eb, btrfs_stripe_extent_nr(dps, nr));
+}
+
+static inline u64 btrfs_stripe_extent_physical_nr(const struct extent_buffer *eb,
+						  struct btrfs_dp_stripe *dps,
+						  int nr)
+{
+	return btrfs_stripe_extent_physical(eb, btrfs_stripe_extent_nr(dps, nr));
+}
+
 /* struct btrfs_dev_extent */
 BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent,
 		   chunk_tree, 64);
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 5f32a2a495dc..047e1d0b2ff6 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -4,9 +4,8 @@
 
 #include <linux/btrfs.h>
 #include <linux/types.h>
-#ifdef __KERNEL__
 #include <linux/stddef.h>
-#else
+#ifndef __KERNEL__
 #include <stddef.h>
 #endif
 
@@ -56,6 +55,9 @@
 /* Holds the block group items for extent tree v2. */
 #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
 
+/* tracks RAID stripes in block groups. */
+#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -264,6 +266,8 @@
  */
 #define BTRFS_QGROUP_RELATION_KEY       246
 
+#define BTRFS_RAID_STRIPE_KEY		247
+
 /*
  * Obsolete name, see BTRFS_TEMPORARY_ITEM_KEY.
  */
@@ -488,6 +492,18 @@ struct btrfs_free_space_header {
 	__le64 num_bitmaps;
 } __attribute__ ((__packed__));
 
+struct btrfs_stripe_extent {
+	/* btrfs device-id this raid extent lives on */
+	__le64 devid;
+	/* physical location on disk */
+	__le64 physical;
+} __attribute__ ((__packed__));
+
+struct btrfs_dp_stripe {
+	/* array of stripe extents this stripe is composed of */
+	__DECLARE_FLEX_ARRAY(struct btrfs_stripe_extent, extents);
+} __attribute__ ((__packed__));
+
 #define BTRFS_HEADER_FLAG_WRITTEN	(1ULL << 0)
 #define BTRFS_HEADER_FLAG_RELOC		(1ULL << 1)
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 02/11] btrfs: read raid-stripe-tree from disk
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
  2022-10-17 11:55 ` [RFC v3 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-17 11:55 ` [RFC v3 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

If we find a raid-stripe-tree on mount, read it from disk.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-rsv.c       |  1 +
 fs/btrfs/ctree.h           |  1 +
 fs/btrfs/disk-io.c         | 12 ++++++++++++
 include/uapi/linux/btrfs.h |  1 +
 4 files changed, 15 insertions(+)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 6ce704d3bdd2..6794443cb0e8 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -425,6 +425,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
 	case BTRFS_EXTENT_TREE_OBJECTID:
 	case BTRFS_FREE_SPACE_TREE_OBJECTID:
 	case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
 		root->block_rsv = &fs_info->delayed_refs_rsv;
 		break;
 	case BTRFS_ROOT_TREE_OBJECTID:
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 80ead27299dc..430f224743a9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -680,6 +680,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *data_reloc_root;
 	struct btrfs_root *block_group_root;
+	struct btrfs_root *stripe_root;
 
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7bb00a010c74..a166b2602c40 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1396,6 +1396,9 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
 
 		return btrfs_grab_root(root) ? root : ERR_PTR(-ENOENT);
 	}
+	if (objectid == BTRFS_RAID_STRIPE_TREE_OBJECTID)
+		return btrfs_grab_root(fs_info->stripe_root) ?
+			fs_info->stripe_root : ERR_PTR(-ENOENT);
 	return NULL;
 }
 
@@ -1474,6 +1477,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->fs_root);
 	btrfs_put_root(fs_info->data_reloc_root);
 	btrfs_put_root(fs_info->block_group_root);
+	btrfs_put_root(fs_info->stripe_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
 	kfree(fs_info->super_copy);
@@ -2008,6 +2012,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
 	free_root_extent_buffers(info->fs_root);
 	free_root_extent_buffers(info->data_reloc_root);
 	free_root_extent_buffers(info->block_group_root);
+	free_root_extent_buffers(info->stripe_root);
 	if (free_chunk_root)
 		free_root_extent_buffers(info->chunk_root);
 }
@@ -2457,6 +2462,13 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 		fs_info->uuid_root = root;
 	}
 
+	location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
+	root = btrfs_read_tree_root(tree_root, &location);
+	if (!IS_ERR(root)) {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->stripe_root = root;
+	}
+
 	return 0;
 out:
 	btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index c413fca6f581..74508dee2878 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -316,6 +316,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_STRIPE_TREE	(1ULL << 14)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 03/11] btrfs: add support for inserting raid stripe extents
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
  2022-10-17 11:55 ` [RFC v3 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
  2022-10-17 11:55 ` [RFC v3 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:24   ` Josef Bacik
  2022-10-20 15:30   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed,
this is done to a) support zone-append and b) rule out the possibility of
a RAID-write-hole.

This is done by creating in-memory ordered stripe extents, just like the
in memory ordered extents, on I/O completion and the on-disk raid stripe
extents get created once we're running the delayed_refs for the extent
item this stripe extent is tied to.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/ctree.h            |   3 +
 fs/btrfs/disk-io.c          |   3 +
 fs/btrfs/extent-tree.c      |  48 +++++++++
 fs/btrfs/inode.c            |   6 ++
 fs/btrfs/raid-stripe-tree.c | 189 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  49 ++++++++++
 fs/btrfs/volumes.c          |  35 ++++++-
 fs/btrfs/volumes.h          |  14 +--
 fs/btrfs/zoned.c            |   4 +
 10 files changed, 345 insertions(+), 8 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea..4484831ac624 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o raid-stripe-tree.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 430f224743a9..1f75ab8702bb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1102,6 +1102,9 @@ struct btrfs_fs_info {
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
 
+	struct mutex stripe_update_lock;
+	struct rb_root stripe_update_tree;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a166b2602c40..190caabf5fb7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2973,6 +2973,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 
 	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
 	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
+
+	mutex_init(&fs_info->stripe_update_lock);
+	fs_info->stripe_update_tree = RB_ROOT;
 }
 
 static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bcd0e72cded3..061296bcdfb4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -36,6 +36,7 @@
 #include "rcu-string.h"
 #include "zoned.h"
 #include "dev-replace.h"
+#include "raid-stripe-tree.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1491,6 +1492,50 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
+					    struct btrfs_delayed_ref_node *node)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	int ret;
+
+	if (!fs_info->stripe_root)
+		return 0;
+
+	em = btrfs_get_chunk_map(fs_info, node->bytenr, node->num_bytes);
+	if (!em) {
+		btrfs_err(fs_info,
+			  "cannot get chunk map for address %llu",
+			  node->bytenr);
+		return -EINVAL;
+	}
+
+	map = em->map_lookup;
+
+	if (btrfs_need_stripe_tree_update(fs_info, map->type)) {
+		struct btrfs_ordered_stripe *stripe;
+
+		stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
+		if (!stripe) {
+			btrfs_err(fs_info,
+				  "cannot get stripe extent for address %llu (%llu)",
+				  node->bytenr, node->num_bytes);
+			free_extent_map(em);
+			return -EINVAL;
+		}
+		ASSERT(stripe->logical == node->bytenr);
+		ret = btrfs_insert_raid_extent(trans, stripe);
+		/* once for us */
+		btrfs_put_ordered_stripe(fs_info, stripe);
+		/* once for the tree */
+		btrfs_put_ordered_stripe(fs_info, stripe);
+	}
+	free_extent_map(em);
+
+	return ret;
+}
+
 static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 				struct btrfs_delayed_ref_node *node,
 				struct btrfs_delayed_extent_op *extent_op,
@@ -1521,6 +1566,9 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 						 flags, ref->objectid,
 						 ref->offset, &ins,
 						 node->ref_mod);
+		if (ret)
+			return ret;
+		ret = add_stripe_entry_for_delayed_ref(trans, node);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
 		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
 					     ref->objectid, ref->offset,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5f2d9d4f6d43..5414ba573022 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -55,6 +55,7 @@
 #include "zoned.h"
 #include "subpage.h"
 #include "inode-item.h"
+#include "raid-stripe-tree.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -9550,6 +9551,11 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
 	if (qgroup_released < 0)
 		return ERR_PTR(qgroup_released);
 
+	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
+						    start, len);
+	if (ret)
+		goto free_qgroup;
+
 	if (trans) {
 		ret = insert_reserved_file_extent(trans, inode,
 						  file_offset, &stack_fi,
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
new file mode 100644
index 000000000000..d8a69060b54b
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/btrfs_tree.h>
+
+#include "ctree.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "raid-stripe-tree.h"
+#include "volumes.h"
+#include "misc.h"
+#include "print-tree.h"
+
+static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
+{
+	struct btrfs_ordered_stripe *stripe =
+		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+	const u64 *logical = key;
+
+	if (*logical < stripe->logical)
+		return -1;
+	if (*logical >= stripe->logical + stripe->num_bytes)
+		return 1;
+	return 0;
+}
+
+static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
+{
+	struct btrfs_ordered_stripe *stripe =
+		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
+	return ordered_stripe_cmp(&stripe->logical, rbb);
+}
+
+int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
+{
+	struct btrfs_fs_info *fs_info = bioc->fs_info;
+	struct btrfs_ordered_stripe *stripe;
+	struct btrfs_io_stripe *tmp;
+	u64 logical = bioc->logical;
+	u64 length = bioc->size;
+	struct rb_node *node;
+	size_t size;
+
+	size = bioc->num_stripes * sizeof(struct btrfs_io_stripe);
+	stripe = kzalloc(sizeof(struct btrfs_ordered_stripe), GFP_NOFS);
+	if (!stripe)
+		return -ENOMEM;
+
+	spin_lock_init(&stripe->lock);
+	tmp = kmemdup(bioc->stripes, size, GFP_NOFS);
+	if (!tmp) {
+		kfree(stripe);
+		return -ENOMEM;
+	}
+
+	stripe->logical = logical;
+	stripe->num_bytes = length;
+	stripe->num_stripes = bioc->num_stripes;
+	spin_lock(&stripe->lock);
+	stripe->stripes = tmp;
+	spin_unlock(&stripe->lock);
+	refcount_set(&stripe->ref, 1);
+
+	mutex_lock(&fs_info->stripe_update_lock);
+	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
+	       ordered_stripe_less);
+	mutex_unlock(&fs_info->stripe_update_lock);
+	if (node) {
+		btrfs_err(fs_info, "logical: %llu, length: %llu already exists",
+			  logical, length);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *fs_info,
+							 u64 logical)
+{
+	struct rb_root *root = &fs_info->stripe_update_tree;
+	struct btrfs_ordered_stripe *stripe = NULL;
+	struct rb_node *node;
+
+	mutex_lock(&fs_info->stripe_update_lock);
+	node = rb_find(&logical, root, ordered_stripe_cmp);
+	if (node) {
+		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+		refcount_inc(&stripe->ref);
+	}
+	mutex_unlock(&fs_info->stripe_update_lock);
+
+	return stripe;
+}
+
+void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
+				 struct btrfs_ordered_stripe *stripe)
+{
+	mutex_lock(&fs_info->stripe_update_lock);
+	if (refcount_dec_and_test(&stripe->ref)) {
+		struct rb_node *node = &stripe->rb_node;
+
+		rb_erase(node, &fs_info->stripe_update_tree);
+		RB_CLEAR_NODE(node);
+
+		spin_lock(&stripe->lock);
+		kfree(stripe->stripes);
+		spin_unlock(&stripe->lock);
+		kfree(stripe);
+	}
+	mutex_unlock(&fs_info->stripe_update_lock);
+}
+
+int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
+					  u64 start, u64 len)
+{
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_ordered_stripe *stripe;
+	u64 map_length = len;
+	int ret;
+
+	if (!fs_info->stripe_root)
+		return 0;
+
+	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, start, &map_length,
+			      &bioc, 0);
+	if (ret)
+		return ret;
+
+	bioc->size = len;
+
+	stripe = btrfs_lookup_ordered_stripe(fs_info, start);
+	if (!stripe) {
+		ret = btrfs_add_ordered_stripe(bioc);
+		if (ret)
+			return ret;
+	} else {
+		spin_lock(&stripe->lock);
+		memcpy(stripe->stripes, bioc->stripes,
+		       bioc->num_stripes * sizeof(struct btrfs_io_stripe));
+		spin_unlock(&stripe->lock);
+		btrfs_put_ordered_stripe(fs_info, stripe);
+	}
+
+	return 0;
+}
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+			     struct btrfs_ordered_stripe *stripe)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key stripe_key;
+	struct btrfs_root *stripe_root = fs_info->stripe_root;
+	struct btrfs_dp_stripe *raid_stripe;
+	size_t item_size;
+	int ret;
+
+	item_size = stripe->num_stripes * sizeof(struct btrfs_stripe_extent);
+
+	raid_stripe = kzalloc(item_size, GFP_NOFS);
+	if (!raid_stripe) {
+		btrfs_abort_transaction(trans, -ENOMEM);
+		btrfs_end_transaction(trans);
+		return -ENOMEM;
+	}
+
+	spin_lock(&stripe->lock);
+	for (int i = 0; i < stripe->num_stripes; i++) {
+		u64 devid = stripe->stripes[i].dev->devid;
+		u64 physical = stripe->stripes[i].physical;
+		struct btrfs_stripe_extent *stripe_extent =
+						&raid_stripe->extents[i];
+
+		btrfs_set_stack_stripe_extent_devid(stripe_extent, devid);
+		btrfs_set_stack_stripe_extent_physical(stripe_extent, physical);
+	}
+	spin_unlock(&stripe->lock);
+
+	stripe_key.objectid = stripe->logical;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = stripe->num_bytes;
+
+	ret = btrfs_insert_item(trans, stripe_root, &stripe_key, raid_stripe,
+				item_size);
+	if (ret)
+		btrfs_abort_transaction(trans, ret);
+
+	kfree(raid_stripe);
+
+	return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
new file mode 100644
index 000000000000..fdcaad405742
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_RAID_STRIPE_TREE_H
+#define BTRFS_RAID_STRIPE_TREE_H
+
+struct btrfs_io_context;
+
+struct btrfs_ordered_stripe {
+	struct rb_node rb_node;
+
+	u64 logical;
+	u64 num_bytes;
+	int num_stripes;
+	struct btrfs_io_stripe *stripes;
+	spinlock_t lock;
+	refcount_t ref;
+};
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+			     struct btrfs_ordered_stripe *stripe);
+int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
+					  u64 start, u64 len);
+struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
+						 struct btrfs_fs_info *fs_info,
+						 u64 logical);
+int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
+void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
+					    struct btrfs_ordered_stripe *stripe);
+
+static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
+						 u64 map_type)
+{
+	u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
+	u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	if (!fs_info->stripe_root)
+		return false;
+
+	// for now
+	if (type != BTRFS_BLOCK_GROUP_DATA)
+		return false;
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
+		return true;
+
+	return false;
+}
+
+#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f5be296b863e..261bf6dd17bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -33,6 +33,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "zoned.h"
+#include "raid-stripe-tree.h"
 
 static struct bio_set btrfs_bioset;
 static struct bio_set btrfs_clone_bioset;
@@ -45,6 +46,8 @@ struct btrfs_failed_bio {
 	atomic_t repair_count;
 };
 
+static void btrfs_raid_stripe_update(struct work_struct *work);
+
 #define BTRFS_BLOCK_GROUP_STRIPE_MASK	(BTRFS_BLOCK_GROUP_RAID0 | \
 					 BTRFS_BLOCK_GROUP_RAID10 | \
 					 BTRFS_BLOCK_GROUP_RAID56_MASK)
@@ -5887,6 +5890,7 @@ static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
 }
 
 static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
+						       u64 logical,
 						       int total_stripes,
 						       int real_stripes)
 {
@@ -5907,6 +5911,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 	refcount_set(&bioc->refs, 1);
 
 	bioc->fs_info = fs_info;
+	bioc->logical = logical;
 	bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
 	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
 
@@ -6512,7 +6517,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 		goto out;
 	}
 
-	bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes, tgtdev_indexes);
+	bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes,
+				      tgtdev_indexes);
 	if (!bioc) {
 		ret = -ENOMEM;
 		goto out;
@@ -6921,6 +6927,21 @@ static void btrfs_raid56_end_io(struct bio *bio)
 	btrfs_put_bioc(bioc);
 }
 
+static void btrfs_raid_stripe_update(struct work_struct *work)
+{
+	struct btrfs_bio *bbio =
+		container_of(work, struct btrfs_bio, raid_stripe_work);
+	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
+	struct btrfs_io_context *bioc = stripe->bioc;
+	int ret;
+
+	ret = btrfs_add_ordered_stripe(bioc);
+	if (ret)
+		bbio->bio.bi_status = errno_to_blk_status(ret);
+	btrfs_orig_bbio_end_io(bbio);
+	btrfs_put_bioc(bioc);
+}
+
 static void btrfs_orig_write_end_io(struct bio *bio)
 {
 	struct btrfs_io_stripe *stripe = bio->bi_private;
@@ -6943,6 +6964,15 @@ static void btrfs_orig_write_end_io(struct bio *bio)
 	else
 		bio->bi_status = BLK_STS_OK;
 
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+
+	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
+		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
+		schedule_work(&bbio->raid_stripe_work);
+		return;
+	}
+
 	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
 }
@@ -6954,6 +6984,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
 	if (bio->bi_status) {
 		atomic_inc(&stripe->bioc->error);
 		btrfs_log_dev_io_error(bio, stripe->dev);
+	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	}
 
 	/* Pass on control to the original bio this one was cloned from */
@@ -7013,6 +7045,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 	bio->bi_private = &bioc->stripes[dev_nr];
 	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
 	bioc->stripes[dev_nr].bioc = bioc;
+	bioc->size = bio->bi_iter.bi_size;
 	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 3b1fe04ff078..1b4a6e46eec3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -374,6 +374,8 @@ struct btrfs_bio {
 	atomic_t pending_ios;
 	struct work_struct end_io_work;
 
+	struct work_struct raid_stripe_work;
+
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.
@@ -403,12 +405,10 @@ static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
 
 struct btrfs_io_stripe {
 	struct btrfs_device *dev;
-	union {
-		/* Block mapping */
-		u64 physical;
-		/* For the endio handler */
-		struct btrfs_io_context *bioc;
-	};
+	/* Block mapping */
+	u64 physical;
+	/* For the endio handler */
+	struct btrfs_io_context *bioc;
 };
 
 struct btrfs_discard_stripe {
@@ -444,6 +444,8 @@ struct btrfs_io_context {
 	int mirror_num;
 	int num_tgtdevs;
 	int *tgtdev_map;
+	u64 logical;
+	u64 size;
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 5d28479b7da4..286b99f04ae2 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1659,6 +1659,10 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 	u64 *logical = NULL;
 	int nr, stripe_len;
 
+	/* Filesystems with a stripe tree have their own l2p mapping */
+	if (fs_info->stripe_root)
+		return;
+
 	/* Zoned devices should not have partitions. So, we can assume it is 0 */
 	ASSERT(!bdev_is_partition(ordered->bdev));
 	if (WARN_ON(!ordered->bdev))
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 04/11] btrfs: delete stripe extent on extent deletion
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-17 11:55 ` [RFC v3 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

As each stripe extent is tied to an extent item, delete the stripe extent
once the corresponding extent item is deleted.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent-tree.c      |  8 ++++++++
 fs/btrfs/raid-stripe-tree.c | 31 +++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  2 ++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 061296bcdfb4..d6f52e101d5a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3211,6 +3211,14 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 
+		if (is_data) {
+			ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
+			if (ret) {
+				btrfs_abort_transaction(trans, ret);
+				return ret;
+			}
+		}
+
 		ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
 				      num_to_del);
 		if (ret) {
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index d8a69060b54b..5750857c2a75 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -109,6 +109,37 @@ void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 	mutex_unlock(&fs_info->stripe_update_lock);
 }
 
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+			     u64 length)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *stripe_root = fs_info->stripe_root;
+	struct btrfs_path *path;
+	struct btrfs_key stripe_key;
+	int ret;
+
+	if (!stripe_root)
+		return 0;
+
+	stripe_key.objectid = start;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(trans, stripe_root, &stripe_key, path, -1, 1);
+	if (ret < 0)
+		goto out;
+
+	ret = btrfs_del_item(trans, stripe_root, path);
+out:
+	btrfs_free_path(path);
+	return ret;
+
+}
+
 int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
 					  u64 start, u64 len)
 {
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index fdcaad405742..3456251d0739 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -16,6 +16,8 @@ struct btrfs_ordered_stripe {
 	refcount_t ref;
 };
 
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+			     u64 length);
 int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_ordered_stripe *stripe);
 int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 05/11] btrfs: lookup physical address from stripe extent
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:34   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 06/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Lookup the physical address from the raid stripe tree when a read on an
RAID volume formatted with the raid stripe tree was attempted.

If the requested logical address was not found in the stripe tree, it may
still be in the in-memory ordered stripe tree, so fallback to searching
the ordered stripe tree in this case.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c | 142 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |   3 +
 fs/btrfs/volumes.c          |  30 ++++++--
 3 files changed, 168 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 5750857c2a75..91e67600e01a 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -218,3 +218,145 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
 
 	return ret;
 }
+
+static bool btrfs_physical_from_ordered_stripe(struct btrfs_fs_info *fs_info,
+					      u64 logical, u64 *length,
+					      int num_stripes,
+					      struct btrfs_io_stripe *stripe)
+{
+	struct btrfs_ordered_stripe *os;
+	u64 offset;
+	u64 found_end;
+	u64 end;
+	int i;
+
+	os = btrfs_lookup_ordered_stripe(fs_info, logical);
+	if (!os)
+		return false;
+
+	end = logical + *length;
+	found_end = os->logical + os->num_bytes;
+	if (end > found_end)
+		*length -= end - found_end;
+
+	for (i = 0; i < num_stripes; i++) {
+		if (os->stripes[i].dev != stripe->dev)
+			continue;
+
+		offset = logical - os->logical;
+		ASSERT(offset >= 0);
+		stripe->physical = os->stripes[i].physical + offset;
+		btrfs_put_ordered_stripe(fs_info, os);
+		break;
+	}
+
+	return true;
+}
+
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 *length, u64 map_type,
+				 struct btrfs_io_stripe *stripe)
+{
+	struct btrfs_root *stripe_root = fs_info->stripe_root;
+	int num_stripes = btrfs_bg_type_to_factor(map_type);
+	struct btrfs_dp_stripe *raid_stripe;
+	struct btrfs_key stripe_key;
+	struct btrfs_key found_key;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	u64 offset;
+	u64 found_logical;
+	u64 found_length;
+	u64 end;
+	u64 found_end;
+	int slot;
+	int ret;
+	int i;
+
+	/*
+	 * If we still have the stripe in the ordered stripe tree get it from
+	 * there
+	 */
+	if (btrfs_physical_from_ordered_stripe(fs_info, logical, length,
+					       num_stripes, stripe))
+		return 0;
+
+	stripe_key.objectid = logical;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	if (ret) {
+		if (path->slots[0] != 0)
+			path->slots[0]--;
+	}
+
+	end = logical + *length;
+
+	while (1) {
+		leaf = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(leaf, &found_key, slot);
+		found_logical = found_key.objectid;
+		found_length = found_key.offset;
+
+		if (found_logical > end)
+			break;
+
+		if (!in_range(logical, found_logical, found_length))
+			goto next;
+
+		offset = logical - found_logical;
+		found_end = found_logical + found_length;
+
+		/*
+		 * If we have a logically contiguous, but physically
+		 * noncontinuous range, we need to split the bio. Record the
+		 * length after which we must split the bio.
+		 */
+		if (end > found_end)
+			*length -= end - found_end;
+
+		raid_stripe = btrfs_item_ptr(leaf, slot, struct btrfs_dp_stripe);
+		for (i = 0; i < num_stripes; i++) {
+			if (btrfs_stripe_extent_devid_nr(leaf, raid_stripe, i) != 
+			    stripe->dev->devid)
+				continue;
+			stripe->physical = btrfs_stripe_extent_physical_nr(leaf,
+						   raid_stripe, i) + offset;
+			ret = 0;
+			goto out;
+		}
+
+		/*
+		 * If we're here, we haven't found the requested devid in the
+		 * stripe.
+		 */
+		ret = -ENOENT;
+		goto out;
+next:
+		ret = btrfs_next_item(stripe_root, path);
+		if (ret)
+			break;
+	}
+
+out:
+	if (ret > 0)
+		ret = -ENOENT;
+	if (ret) {
+		btrfs_err(fs_info,
+			  "cannot find raid-stripe for logical [%llu, %llu]",
+			  logical, logical + *length);
+		btrfs_print_tree(leaf, 1);
+	}
+	btrfs_free_path(path);
+
+	return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 3456251d0739..083e754f5239 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -16,6 +16,9 @@ struct btrfs_ordered_stripe {
 	refcount_t ref;
 };
 
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 *length, u64 map_type,
+				 struct btrfs_io_stripe *stripe);
 int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
 			     u64 length);
 int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 261bf6dd17bc..c67d76d93982 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6313,12 +6313,21 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
 	return U64_MAX;
 }
 
-static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
-		          u32 stripe_index, u64 stripe_offset, u64 stripe_nr)
+static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
+		      u64 logical, u64 *length, struct btrfs_io_stripe *dst,
+		      struct map_lookup *map, u32 stripe_index,
+		      u64 stripe_offset, u64 stripe_nr)
 {
 	dst->dev = map->stripes[stripe_index].dev;
+
+	if (fs_info->stripe_root && op == BTRFS_MAP_READ &&
+	    btrfs_need_stripe_tree_update(fs_info, map->type))
+		return btrfs_get_raid_extent_offset(fs_info, logical, length,
+						    map->type, dst);
+
 	dst->physical = map->stripes[stripe_index].physical +
 			stripe_offset + stripe_nr * map->stripe_len;
+	return 0;
 }
 
 static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
@@ -6507,13 +6516,14 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 			smap->dev = dev_replace->tgtdev;
 			smap->physical = physical_to_patch_in_first_stripe;
 			*mirror_num_ret = map->num_stripes + 1;
+			ret = 0;
 		} else {
-			set_io_stripe(smap, map, stripe_index, stripe_offset,
-				      stripe_nr);
 			*mirror_num_ret = mirror_num;
+			ret = set_io_stripe(fs_info, op, logical, length, smap,
+					    map, stripe_index, stripe_offset,
+					    stripe_nr);
 		}
 		*bioc_ret = NULL;
-		ret = 0;
 		goto out;
 	}
 
@@ -6525,8 +6535,14 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	}
 
 	for (i = 0; i < num_stripes; i++) {
-		set_io_stripe(&bioc->stripes[i], map, stripe_index, stripe_offset,
-			      stripe_nr);
+		ret = set_io_stripe(fs_info, op, logical, length,
+				 &bioc->stripes[i], map, stripe_index,
+				 stripe_offset, stripe_nr);
+		if (ret) {
+			btrfs_put_bioc(bioc);
+			goto out;
+		}
+
 		stripe_index++;
 	}
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 06/11] btrfs: add raid stripe tree pretty printer
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:34   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 07/11] btrfs: zoned: allow zoned RAID1 Johannes Thumshirn
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Decode raid-stripe-tree entries on btrfs_print_tree().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/print-tree.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index dd8777872143..1c781a30fd60 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -6,6 +6,7 @@
 #include "ctree.h"
 #include "disk-io.h"
 #include "print-tree.h"
+#include "raid-stripe-tree.h"
 
 struct root_name_map {
 	u64 id;
@@ -25,6 +26,7 @@ static const struct root_name_map root_map[] = {
 	{ BTRFS_FREE_SPACE_TREE_OBJECTID,	"FREE_SPACE_TREE"	},
 	{ BTRFS_BLOCK_GROUP_TREE_OBJECTID,	"BLOCK_GROUP_TREE"	},
 	{ BTRFS_DATA_RELOC_TREE_OBJECTID,	"DATA_RELOC_TREE"	},
+	{ BTRFS_RAID_STRIPE_TREE_OBJECTID,	"RAID_STRIPE_TREE"	},
 };
 
 const char *btrfs_root_name(const struct btrfs_key *key, char *buf)
@@ -184,6 +186,20 @@ static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
 	}
 }
 
+static void print_raid_stripe_key(struct extent_buffer *eb, u32 item_size,
+				  struct btrfs_dp_stripe *stripe)
+{
+	int num_stripes;
+	int i;
+
+	num_stripes = item_size / sizeof(struct btrfs_stripe_extent);
+
+	for (i = 0; i < num_stripes; i++)
+		pr_info("\t\t\tstripe %d devid %llu physical %llu\n", i,
+			btrfs_stripe_extent_devid_nr(eb, stripe, i),
+			btrfs_stripe_extent_physical_nr(eb, stripe, i));
+}
+
 /*
  * Helper to output refs and locking status of extent buffer.  Useful to debug
  * race condition related problems.
@@ -348,6 +364,11 @@ void btrfs_print_leaf(struct extent_buffer *l)
 			print_uuid_item(l, btrfs_item_ptr_offset(l, i),
 					btrfs_item_size(l, i));
 			break;
+		case BTRFS_RAID_STRIPE_KEY:
+			print_raid_stripe_key(l, btrfs_item_size(l, i),
+					      btrfs_item_ptr(l, i,
+							     struct btrfs_dp_stripe));
+			break;
 		}
 	}
 }
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 07/11] btrfs: zoned: allow zoned RAID1
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 06/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:35   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 08/11] btrfs: allow zoned RAID0 and 10 Johannes Thumshirn
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

When we have a raid-stripe-tree, we can do RAID1 on zoned devices for data
block-groups. For meta-data block-groups, we don't actually need
anything special, as all meta-data I/O is protected by the
btrfs_zoned_meta_io_lock() already.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/zoned.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 286b99f04ae2..f4ce39169468 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1486,6 +1486,45 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		cache->zone_capacity = min(caps[0], caps[1]);
 		break;
 	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
+		if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+		    !fs_info->stripe_root) {
+			btrfs_err(fs_info,
+				  "zoned: data RAID1 needs stripe_root");
+			ret = -EIO;
+			goto out;
+
+		}
+
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV ||
+			    alloc_offsets[i] == WP_CONVENTIONAL)
+				continue;
+
+			if (i == 0)
+				continue;
+
+			if (alloc_offsets[0] != alloc_offsets[i]) {
+				btrfs_err(fs_info,
+					  "zoned: write pointer offset mismatch of zones in RAID profile");
+				ret = -EIO;
+				goto out;
+			}
+			if (test_bit(0, active) != test_bit(i, active)) {
+				if (!btrfs_zone_activate(cache)) {
+					ret = -EIO;
+					goto out;
+				}
+			} else {
+				if (test_bit(0, active))
+					set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+						&cache->runtime_flags);
+			}
+			cache->zone_capacity = min(caps[0], caps[i]);
+		}
+		cache->alloc_offset = alloc_offsets[0];
+		break;
 	case BTRFS_BLOCK_GROUP_RAID0:
 	case BTRFS_BLOCK_GROUP_RAID10:
 	case BTRFS_BLOCK_GROUP_RAID5:
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 08/11] btrfs: allow zoned RAID0 and 10
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 07/11] btrfs: zoned: allow zoned RAID1 Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:36   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 09/11] btrfs: fix striping with RST Johannes Thumshirn
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.h | 7 ++++++-
 fs/btrfs/zoned.c            | 4 ++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 083e754f5239..d1885b428cd4 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -41,13 +41,18 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
 	if (!fs_info->stripe_root)
 		return false;
 
-	// for now
 	if (type != BTRFS_BLOCK_GROUP_DATA)
 		return false;
 
 	if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
 		return true;
 
+	if (profile & BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID10)
+		return true;
+
 	return false;
 }
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index f4ce39169468..3325e7761ef7 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1488,6 +1488,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 	case BTRFS_BLOCK_GROUP_RAID1:
 	case BTRFS_BLOCK_GROUP_RAID1C3:
 	case BTRFS_BLOCK_GROUP_RAID1C4:
+	case BTRFS_BLOCK_GROUP_RAID0:
+	case BTRFS_BLOCK_GROUP_RAID10:
 		if (map->type & BTRFS_BLOCK_GROUP_DATA &&
 		    !fs_info->stripe_root) {
 			btrfs_err(fs_info,
@@ -1525,8 +1527,6 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		}
 		cache->alloc_offset = alloc_offsets[0];
 		break;
-	case BTRFS_BLOCK_GROUP_RAID0:
-	case BTRFS_BLOCK_GROUP_RAID10:
 	case BTRFS_BLOCK_GROUP_RAID5:
 	case BTRFS_BLOCK_GROUP_RAID6:
 		/* non-single profiles are not supported yet */
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 09/11] btrfs: fix striping with RST
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 08/11] btrfs: allow zoned RAID0 and 10 Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:36   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 10/11] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/volumes.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c67d76d93982..5169001b5bba 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6509,6 +6509,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	 * I/O context structure.
 	 */
 	if (smap && num_alloc_stripes == 1 &&
+	    !((map->type & BTRFS_BLOCK_GROUP_STRIPE_MASK) && op != BTRFS_MAP_READ) &&
 	    !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1) &&
 	    (!need_full_stripe(op) || !dev_replace_is_ongoing ||
 	     !dev_replace->tgtdev)) {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 10/11] btrfs: check for leaks of ordered stripes on umount
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (8 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 09/11] btrfs: fix striping with RST Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:37   ` Josef Bacik
  2022-10-17 11:55 ` [RFC v3 11/11] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
  2022-10-20 15:42 ` [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Josef Bacik
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Check if we're leaking any ordered stripes when unmounting a filesystem
with an stripe tree.

This check is gated behind CONFIG_BTRFS_DEBUG to not affect any production
type systems.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c          |  2 ++
 fs/btrfs/raid-stripe-tree.c | 29 +++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  1 +
 3 files changed, 32 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 190caabf5fb7..e479e9829c3e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -43,6 +43,7 @@
 #include "space-info.h"
 #include "zoned.h"
 #include "subpage.h"
+#include "raid-stripe-tree.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -1480,6 +1481,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->stripe_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
+	btrfs_check_ordered_stripe_leak(fs_info);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	kfree(fs_info->subpage_info);
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 91e67600e01a..9a913c4cd44e 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -30,6 +30,35 @@ static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
 	return ordered_stripe_cmp(&stripe->logical, rbb);
 }
 
+void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info)
+{
+#ifdef CONFIG_BTRFS_DEBUG
+	struct rb_node *node;
+
+	if (!fs_info->stripe_root ||
+	    RB_EMPTY_ROOT(&fs_info->stripe_update_tree))
+		return;
+
+	mutex_lock(&fs_info->stripe_update_lock);
+	while ((node = rb_first_postorder(&fs_info->stripe_update_tree))
+	       != NULL) {
+		struct btrfs_ordered_stripe *stripe =
+			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+
+		mutex_unlock(&fs_info->stripe_update_lock);
+		btrfs_err(fs_info,
+			  "ordered_stripe [%llu, %llu] leaked, refcount=%d",
+			  stripe->logical, stripe->logical + stripe->num_bytes,
+			  refcount_read(&stripe->ref));
+		while (refcount_read(&stripe->ref) > 1)
+			btrfs_put_ordered_stripe(fs_info, stripe);
+		btrfs_put_ordered_stripe(fs_info, stripe);
+		mutex_lock(&fs_info->stripe_update_lock);
+	}
+	mutex_unlock(&fs_info->stripe_update_lock);
+#endif
+}
+
 int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
 {
 	struct btrfs_fs_info *fs_info = bioc->fs_info;
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index d1885b428cd4..5ffb10bf219e 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -31,6 +31,7 @@ struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
 int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
 void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 					    struct btrfs_ordered_stripe *stripe);
+void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info);
 
 static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
 						 u64 map_type)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v3 11/11] btrfs: add tracepoints for ordered stripes
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (9 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 10/11] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
@ 2022-10-17 11:55 ` Johannes Thumshirn
  2022-10-20 15:38   ` Josef Bacik
  2022-10-20 15:42 ` [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Josef Bacik
  11 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-17 11:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

Add tracepoints to check the lifetime of btrfs_ordered_stripe entries.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c  |  3 +++
 fs/btrfs/super.c             |  1 +
 include/trace/events/btrfs.h | 50 ++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 9a913c4cd44e..0d27b236445d 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -99,6 +99,7 @@ int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
 		return -EINVAL;
 	}
 
+	trace_btrfs_ordered_stripe_add(fs_info, stripe);
 	return 0;
 }
 
@@ -114,6 +115,7 @@ struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *f
 	if (node) {
 		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
 		refcount_inc(&stripe->ref);
+		trace_btrfs_ordered_stripe_lookup(fs_info, stripe);
 	}
 	mutex_unlock(&fs_info->stripe_update_lock);
 
@@ -124,6 +126,7 @@ void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 				 struct btrfs_ordered_stripe *stripe)
 {
 	mutex_lock(&fs_info->stripe_update_lock);
+	trace_btrfs_ordered_stripe_put(fs_info, stripe);
 	if (refcount_dec_and_test(&stripe->ref)) {
 		struct rb_node *node = &stripe->rb_node;
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index eb0ae7e396ef..e071245ef0b4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -49,6 +49,7 @@
 #include "discard.h"
 #include "qgroup.h"
 #include "raid56.h"
+#include "raid-stripe-tree.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index ed50e81174bf..49510c687977 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -32,6 +32,7 @@ struct prelim_ref;
 struct btrfs_space_info;
 struct btrfs_raid_bio;
 struct raid56_bio_trace_info;
+struct btrfs_ordered_stripe;
 
 #define show_ref_type(type)						\
 	__print_symbolic(type,						\
@@ -2414,6 +2415,55 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_scrub_read_recover,
 	TP_ARGS(rbio, bio, trace_info)
 );
 
+DECLARE_EVENT_CLASS(btrfs__ordered_stripe,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info,
+		 const struct btrfs_ordered_stripe *stripe),
+
+	TP_ARGS(fs_info, stripe),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	logical		)
+		__field(	u64,	num_bytes	)
+		__field(	int,	num_stripes	)
+		__field(	int,	ref		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->logical	= stripe->logical;
+		__entry->num_bytes	= stripe->num_bytes;
+		__entry->num_stripes	= stripe->num_stripes;
+		__entry->ref		= refcount_read(&stripe->ref);
+	),
+
+	TP_printk_btrfs("logical=%llu, num_bytes=%llu, num_stripes=%d, ref=%d",
+			__entry->logical, __entry->num_bytes,
+			__entry->num_stripes, __entry->ref)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_add,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_lookup,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_put,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC v3 01/11] btrfs: add raid stripe tree definitions
  2022-10-17 11:55 ` [RFC v3 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2022-10-20 15:21   ` Josef Bacik
  2022-10-20 15:49     ` Johannes Thumshirn
  0 siblings, 1 reply; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:21 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:19AM -0700, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
> 
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/ctree.h                | 29 +++++++++++++++++++++++++++++
>  include/uapi/linux/btrfs_tree.h | 20 ++++++++++++++++++--
>  2 files changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 87bb4218276b..80ead27299dc 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1992,6 +1992,35 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
>  BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
>  BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
>  
> +BTRFS_SETGET_FUNCS(stripe_extent_devid, struct btrfs_stripe_extent, devid, 64);
> +BTRFS_SETGET_FUNCS(stripe_extent_physical, struct btrfs_stripe_extent, physical, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_devid, struct btrfs_stripe_extent, devid, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_physical, struct btrfs_stripe_extent, physical, 64);
> +
> +static inline struct btrfs_stripe_extent *btrfs_stripe_extent_nr(
> +					 struct btrfs_dp_stripe *dps, int nr)
> +{
> +	unsigned long offset = (unsigned long)dps;
> +
> +	offset += offsetof(struct btrfs_dp_stripe, extents);
> +	offset += nr * sizeof(struct btrfs_stripe_extent);
> +	return (struct btrfs_stripe_extent *)offset;
> +}
> +
> +static inline u64 btrfs_stripe_extent_devid_nr(const struct extent_buffer *eb,
> +					       struct btrfs_dp_stripe *dps,
> +					       int nr)
> +{
> +	return btrfs_stripe_extent_devid(eb, btrfs_stripe_extent_nr(dps, nr));
> +}
> +
> +static inline u64 btrfs_stripe_extent_physical_nr(const struct extent_buffer *eb,
> +						  struct btrfs_dp_stripe *dps,
> +						  int nr)
> +{
> +	return btrfs_stripe_extent_physical(eb, btrfs_stripe_extent_nr(dps, nr));
> +}
> +
>  /* struct btrfs_dev_extent */
>  BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent,
>  		   chunk_tree, 64);
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 5f32a2a495dc..047e1d0b2ff6 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -4,9 +4,8 @@
>  
>  #include <linux/btrfs.h>
>  #include <linux/types.h>
> -#ifdef __KERNEL__
>  #include <linux/stddef.h>
> -#else
> +#ifndef __KERNEL__
>  #include <stddef.h>
>  #endif
> 

Gotta act like I did something, this is unrelated and should probably be it's
own thing separate from the raid stripe tree.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 03/11] btrfs: add support for inserting raid stripe extents
  2022-10-17 11:55 ` [RFC v3 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2022-10-20 15:24   ` Josef Bacik
  2022-10-20 15:30   ` Josef Bacik
  1 sibling, 0 replies; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:24 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:21AM -0700, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
> 
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
> 
> This is done by creating in-memory ordered stripe extents, just like the
> in memory ordered extents, on I/O completion and the on-disk raid stripe
> extents get created once we're running the delayed_refs for the extent
> item this stripe extent is tied to.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/Makefile           |   2 +-
>  fs/btrfs/ctree.h            |   3 +
>  fs/btrfs/disk-io.c          |   3 +
>  fs/btrfs/extent-tree.c      |  48 +++++++++
>  fs/btrfs/inode.c            |   6 ++
>  fs/btrfs/raid-stripe-tree.c | 189 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  49 ++++++++++
>  fs/btrfs/volumes.c          |  35 ++++++-
>  fs/btrfs/volumes.h          |  14 +--
>  fs/btrfs/zoned.c            |   4 +
>  10 files changed, 345 insertions(+), 8 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 99f9995670ea..4484831ac624 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> -	   subpage.o tree-mod-log.o
> +	   subpage.o tree-mod-log.o raid-stripe-tree.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 430f224743a9..1f75ab8702bb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1102,6 +1102,9 @@ struct btrfs_fs_info {
>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>  	struct lockdep_map btrfs_ordered_extent_map;
>  
> +	struct mutex stripe_update_lock;
> +	struct rb_root stripe_update_tree;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a166b2602c40..190caabf5fb7 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2973,6 +2973,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  
>  	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
>  	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
> +
> +	mutex_init(&fs_info->stripe_update_lock);
> +	fs_info->stripe_update_tree = RB_ROOT;
>  }
>  
>  static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index bcd0e72cded3..061296bcdfb4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -36,6 +36,7 @@
>  #include "rcu-string.h"
>  #include "zoned.h"
>  #include "dev-replace.h"
> +#include "raid-stripe-tree.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -1491,6 +1492,50 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
> +					    struct btrfs_delayed_ref_node *node)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	int ret;
> +
> +	if (!fs_info->stripe_root)
> +		return 0;
> +
> +	em = btrfs_get_chunk_map(fs_info, node->bytenr, node->num_bytes);
> +	if (!em) {
> +		btrfs_err(fs_info,
> +			  "cannot get chunk map for address %llu",
> +			  node->bytenr);
> +		return -EINVAL;
> +	}
> +
> +	map = em->map_lookup;
> +
> +	if (btrfs_need_stripe_tree_update(fs_info, map->type)) {
> +		struct btrfs_ordered_stripe *stripe;
> +
> +		stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
> +		if (!stripe) {
> +			btrfs_err(fs_info,
> +				  "cannot get stripe extent for address %llu (%llu)",
> +				  node->bytenr, node->num_bytes);
> +			free_extent_map(em);
> +			return -EINVAL;
> +		}
> +		ASSERT(stripe->logical == node->bytenr);
> +		ret = btrfs_insert_raid_extent(trans, stripe);
> +		/* once for us */
> +		btrfs_put_ordered_stripe(fs_info, stripe);
> +		/* once for the tree */
> +		btrfs_put_ordered_stripe(fs_info, stripe);
> +	}
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> +
>  static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  				struct btrfs_delayed_ref_node *node,
>  				struct btrfs_delayed_extent_op *extent_op,
> @@ -1521,6 +1566,9 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  						 flags, ref->objectid,
>  						 ref->offset, &ins,
>  						 node->ref_mod);
> +		if (ret)
> +			return ret;
> +		ret = add_stripe_entry_for_delayed_ref(trans, node);
>  	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
>  		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
>  					     ref->objectid, ref->offset,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5f2d9d4f6d43..5414ba573022 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -55,6 +55,7 @@
>  #include "zoned.h"
>  #include "subpage.h"
>  #include "inode-item.h"
> +#include "raid-stripe-tree.h"
>  
>  struct btrfs_iget_args {
>  	u64 ino;
> @@ -9550,6 +9551,11 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  	if (qgroup_released < 0)
>  		return ERR_PTR(qgroup_released);
>  
> +	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
> +						    start, len);
> +	if (ret)
> +		goto free_qgroup;
> +
>  	if (trans) {
>  		ret = insert_reserved_file_extent(trans, inode,
>  						  file_offset, &stack_fi,
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..d8a69060b54b
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,189 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "print-tree.h"
> +
> +static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +	const u64 *logical = key;
> +
> +	if (*logical < stripe->logical)
> +		return -1;
> +	if (*logical >= stripe->logical + stripe->num_bytes)
> +		return 1;
> +	return 0;
> +}
> +
> +static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
> +	return ordered_stripe_cmp(&stripe->logical, rbb);
> +}
> +
> +int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
> +{
> +	struct btrfs_fs_info *fs_info = bioc->fs_info;
> +	struct btrfs_ordered_stripe *stripe;
> +	struct btrfs_io_stripe *tmp;
> +	u64 logical = bioc->logical;
> +	u64 length = bioc->size;
> +	struct rb_node *node;
> +	size_t size;
> +
> +	size = bioc->num_stripes * sizeof(struct btrfs_io_stripe);
> +	stripe = kzalloc(sizeof(struct btrfs_ordered_stripe), GFP_NOFS);
> +	if (!stripe)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&stripe->lock);
> +	tmp = kmemdup(bioc->stripes, size, GFP_NOFS);
> +	if (!tmp) {
> +		kfree(stripe);
> +		return -ENOMEM;
> +	}
> +
> +	stripe->logical = logical;
> +	stripe->num_bytes = length;
> +	stripe->num_stripes = bioc->num_stripes;
> +	spin_lock(&stripe->lock);
> +	stripe->stripes = tmp;
> +	spin_unlock(&stripe->lock);
> +	refcount_set(&stripe->ref, 1);
> +
> +	mutex_lock(&fs_info->stripe_update_lock);
> +	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
> +	       ordered_stripe_less);
> +	mutex_unlock(&fs_info->stripe_update_lock);
> +	if (node) {
> +		btrfs_err(fs_info, "logical: %llu, length: %llu already exists",
> +			  logical, length);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *fs_info,
> +							 u64 logical)
> +{
> +	struct rb_root *root = &fs_info->stripe_update_tree;
> +	struct btrfs_ordered_stripe *stripe = NULL;
> +	struct rb_node *node;
> +
> +	mutex_lock(&fs_info->stripe_update_lock);
> +	node = rb_find(&logical, root, ordered_stripe_cmp);
> +	if (node) {
> +		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +		refcount_inc(&stripe->ref);
> +	}
> +	mutex_unlock(&fs_info->stripe_update_lock);
> +
> +	return stripe;
> +}
> +
> +void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
> +				 struct btrfs_ordered_stripe *stripe)
> +{
> +	mutex_lock(&fs_info->stripe_update_lock);
> +	if (refcount_dec_and_test(&stripe->ref)) {
> +		struct rb_node *node = &stripe->rb_node;
> +
> +		rb_erase(node, &fs_info->stripe_update_tree);
> +		RB_CLEAR_NODE(node);
> +
> +		spin_lock(&stripe->lock);
> +		kfree(stripe->stripes);
> +		spin_unlock(&stripe->lock);
> +		kfree(stripe);
> +	}
> +	mutex_unlock(&fs_info->stripe_update_lock);
> +}
> +
> +int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
> +					  u64 start, u64 len)
> +{
> +	struct btrfs_io_context *bioc = NULL;
> +	struct btrfs_ordered_stripe *stripe;
> +	u64 map_length = len;
> +	int ret;
> +
> +	if (!fs_info->stripe_root)
> +		return 0;
> +
> +	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, start, &map_length,
> +			      &bioc, 0);
> +	if (ret)
> +		return ret;
> +
> +	bioc->size = len;
> +
> +	stripe = btrfs_lookup_ordered_stripe(fs_info, start);
> +	if (!stripe) {
> +		ret = btrfs_add_ordered_stripe(bioc);
> +		if (ret)
> +			return ret;
> +	} else {
> +		spin_lock(&stripe->lock);
> +		memcpy(stripe->stripes, bioc->stripes,
> +		       bioc->num_stripes * sizeof(struct btrfs_io_stripe));
> +		spin_unlock(&stripe->lock);
> +		btrfs_put_ordered_stripe(fs_info, stripe);
> +	}
> +
> +	return 0;
> +}
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> +			     struct btrfs_ordered_stripe *stripe)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_key stripe_key;
> +	struct btrfs_root *stripe_root = fs_info->stripe_root;
> +	struct btrfs_dp_stripe *raid_stripe;
> +	size_t item_size;
> +	int ret;
> +
> +	item_size = stripe->num_stripes * sizeof(struct btrfs_stripe_extent);
> +
> +	raid_stripe = kzalloc(item_size, GFP_NOFS);
> +	if (!raid_stripe) {
> +		btrfs_abort_transaction(trans, -ENOMEM);
> +		btrfs_end_transaction(trans);
> +		return -ENOMEM;
> +	}
> +
> +	spin_lock(&stripe->lock);
> +	for (int i = 0; i < stripe->num_stripes; i++) {
> +		u64 devid = stripe->stripes[i].dev->devid;
> +		u64 physical = stripe->stripes[i].physical;
> +		struct btrfs_stripe_extent *stripe_extent =
> +						&raid_stripe->extents[i];
> +
> +		btrfs_set_stack_stripe_extent_devid(stripe_extent, devid);
> +		btrfs_set_stack_stripe_extent_physical(stripe_extent, physical);
> +	}
> +	spin_unlock(&stripe->lock);
> +
> +	stripe_key.objectid = stripe->logical;
> +	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> +	stripe_key.offset = stripe->num_bytes;
> +
> +	ret = btrfs_insert_item(trans, stripe_root, &stripe_key, raid_stripe,
> +				item_size);
> +	if (ret)
> +		btrfs_abort_transaction(trans, ret);
> +
> +	kfree(raid_stripe);
> +
> +	return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> new file mode 100644
> index 000000000000..fdcaad405742
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef BTRFS_RAID_STRIPE_TREE_H
> +#define BTRFS_RAID_STRIPE_TREE_H
> +
> +struct btrfs_io_context;
> +
> +struct btrfs_ordered_stripe {
> +	struct rb_node rb_node;
> +
> +	u64 logical;
> +	u64 num_bytes;
> +	int num_stripes;
> +	struct btrfs_io_stripe *stripes;
> +	spinlock_t lock;
> +	refcount_t ref;
> +};
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> +			     struct btrfs_ordered_stripe *stripe);
> +int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
> +					  u64 start, u64 len);
> +struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
> +						 struct btrfs_fs_info *fs_info,
> +						 u64 logical);
> +int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
> +void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
> +					    struct btrfs_ordered_stripe *stripe);
> +
> +static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
> +						 u64 map_type)
> +{
> +	u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
> +	u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +
> +	if (!fs_info->stripe_root)
> +		return false;
> +
> +	// for now
> +	if (type != BTRFS_BLOCK_GROUP_DATA)
> +		return false;

Yay more nitpicking.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 03/11] btrfs: add support for inserting raid stripe extents
  2022-10-17 11:55 ` [RFC v3 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
  2022-10-20 15:24   ` Josef Bacik
@ 2022-10-20 15:30   ` Josef Bacik
  2022-10-21  8:13     ` Johannes Thumshirn
  1 sibling, 1 reply; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:30 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:21AM -0700, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
> 
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
> 
> This is done by creating in-memory ordered stripe extents, just like the
> in memory ordered extents, on I/O completion and the on-disk raid stripe
> extents get created once we're running the delayed_refs for the extent
> item this stripe extent is tied to.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/Makefile           |   2 +-
>  fs/btrfs/ctree.h            |   3 +
>  fs/btrfs/disk-io.c          |   3 +
>  fs/btrfs/extent-tree.c      |  48 +++++++++
>  fs/btrfs/inode.c            |   6 ++
>  fs/btrfs/raid-stripe-tree.c | 189 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  49 ++++++++++
>  fs/btrfs/volumes.c          |  35 ++++++-
>  fs/btrfs/volumes.h          |  14 +--
>  fs/btrfs/zoned.c            |   4 +
>  10 files changed, 345 insertions(+), 8 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 99f9995670ea..4484831ac624 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> -	   subpage.o tree-mod-log.o
> +	   subpage.o tree-mod-log.o raid-stripe-tree.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 430f224743a9..1f75ab8702bb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1102,6 +1102,9 @@ struct btrfs_fs_info {
>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>  	struct lockdep_map btrfs_ordered_extent_map;
>  
> +	struct mutex stripe_update_lock;

The mutex seems a bit heavy handed here, but I don't have a good mental model
for the mix of adds/lookups there are.  I think at the very least a spin lock is
more appropriate, and maybe an rwlock depending on how often we're doing a
add/remove vs lookups.

<snip>
> +
> +static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +	const u64 *logical = key;
> +
> +	if (*logical < stripe->logical)
> +		return -1;
> +	if (*logical >= stripe->logical + stripe->num_bytes)
> +		return 1;
> +	return 0;
> +}
> +
> +static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
> +{
> +	struct btrfs_ordered_stripe *stripe =
> +		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
> +	return ordered_stripe_cmp(&stripe->logical, rbb);
> +}

Ahhh yay we're finally not adding new rb tree insert/lookup functions that are
99% the same.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 05/11] btrfs: lookup physical address from stripe extent
  2022-10-17 11:55 ` [RFC v3 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2022-10-20 15:34   ` Josef Bacik
  2022-10-21  8:16     ` Johannes Thumshirn
  0 siblings, 1 reply; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:34 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:23AM -0700, Johannes Thumshirn wrote:
> Lookup the physical address from the raid stripe tree when a read on an
> RAID volume formatted with the raid stripe tree was attempted.
> 
> If the requested logical address was not found in the stripe tree, it may
> still be in the in-memory ordered stripe tree, so fallback to searching
> the ordered stripe tree in this case.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.c | 142 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |   3 +
>  fs/btrfs/volumes.c          |  30 ++++++--
>  3 files changed, 168 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 5750857c2a75..91e67600e01a 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -218,3 +218,145 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
>  
>  	return ret;
>  }
> +
> +static bool btrfs_physical_from_ordered_stripe(struct btrfs_fs_info *fs_info,
> +					      u64 logical, u64 *length,
> +					      int num_stripes,
> +					      struct btrfs_io_stripe *stripe)
> +{
> +	struct btrfs_ordered_stripe *os;
> +	u64 offset;
> +	u64 found_end;
> +	u64 end;
> +	int i;
> +
> +	os = btrfs_lookup_ordered_stripe(fs_info, logical);
> +	if (!os)
> +		return false;
> +
> +	end = logical + *length;
> +	found_end = os->logical + os->num_bytes;
> +	if (end > found_end)
> +		*length -= end - found_end;
> +
> +	for (i = 0; i < num_stripes; i++) {
> +		if (os->stripes[i].dev != stripe->dev)
> +			continue;
> +
> +		offset = logical - os->logical;
> +		ASSERT(offset >= 0);
> +		stripe->physical = os->stripes[i].physical + offset;
> +		btrfs_put_ordered_stripe(fs_info, os);
> +		break;
> +	}
> +
> +	return true;
> +}
> +
> +int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 *length, u64 map_type,
> +				 struct btrfs_io_stripe *stripe)
> +{
> +	struct btrfs_root *stripe_root = fs_info->stripe_root;
> +	int num_stripes = btrfs_bg_type_to_factor(map_type);
> +	struct btrfs_dp_stripe *raid_stripe;
> +	struct btrfs_key stripe_key;
> +	struct btrfs_key found_key;
> +	struct btrfs_path *path;
> +	struct extent_buffer *leaf;
> +	u64 offset;
> +	u64 found_logical;
> +	u64 found_length;
> +	u64 end;
> +	u64 found_end;
> +	int slot;
> +	int ret;
> +	int i;
> +
> +	/*
> +	 * If we still have the stripe in the ordered stripe tree get it from
> +	 * there
> +	 */
> +	if (btrfs_physical_from_ordered_stripe(fs_info, logical, length,
> +					       num_stripes, stripe))
> +		return 0;
> +
> +	stripe_key.objectid = logical;
> +	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> +	stripe_key.offset = 0;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (ret) {
> +		if (path->slots[0] != 0)
> +			path->slots[0]--;
> +	}
> +
> +	end = logical + *length;
> +
> +	while (1) {
> +		leaf = path->nodes[0];
> +		slot = path->slots[0];
> +
> +		btrfs_item_key_to_cpu(leaf, &found_key, slot);
> +		found_logical = found_key.objectid;
> +		found_length = found_key.offset;
> +

Don't we have fancy new iterators for walking through the btree?  Can that be
used here instead of this old style walk through?

> +		if (found_logical > end)
> +			break;
> +
> +		if (!in_range(logical, found_logical, found_length))
> +			goto next;
> +
> +		offset = logical - found_logical;
> +		found_end = found_logical + found_length;
> +
> +		/*
> +		 * If we have a logically contiguous, but physically
> +		 * noncontinuous range, we need to split the bio. Record the
> +		 * length after which we must split the bio.
> +		 */
> +		if (end > found_end)
> +			*length -= end - found_end;
> +
> +		raid_stripe = btrfs_item_ptr(leaf, slot, struct btrfs_dp_stripe);
> +		for (i = 0; i < num_stripes; i++) {
> +			if (btrfs_stripe_extent_devid_nr(leaf, raid_stripe, i) != 
> +			    stripe->dev->devid)
> +				continue;
> +			stripe->physical = btrfs_stripe_extent_physical_nr(leaf,
> +						   raid_stripe, i) + offset;
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		/*
> +		 * If we're here, we haven't found the requested devid in the
> +		 * stripe.
> +		 */
> +		ret = -ENOENT;
> +		goto out;
> +next:
> +		ret = btrfs_next_item(stripe_root, path);
> +		if (ret)
> +			break;
> +	}
> +
> +out:
> +	if (ret > 0)
> +		ret = -ENOENT;
> +	if (ret) {

Maybe instead

if (ret && ret != -EIO)

I have a lot of boxes, and a given percentage of them have bad disks, which ends
up with a lot of btrfs_print_tree()'s that I don't need.

> +		btrfs_err(fs_info,
> +			  "cannot find raid-stripe for logical [%llu, %llu]",
> +			  logical, logical + *length);
> +		btrfs_print_tree(leaf, 1);
> +	}
> +	btrfs_free_path(path);
> +
> +	return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> index 3456251d0739..083e754f5239 100644
> --- a/fs/btrfs/raid-stripe-tree.h
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -16,6 +16,9 @@ struct btrfs_ordered_stripe {
>  	refcount_t ref;
>  };
>  
> +int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 *length, u64 map_type,
> +				 struct btrfs_io_stripe *stripe);
>  int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
>  			     u64 length);
>  int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 261bf6dd17bc..c67d76d93982 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6313,12 +6313,21 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
>  	return U64_MAX;
>  }
>  
> -static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
> -		          u32 stripe_index, u64 stripe_offset, u64 stripe_nr)
> +static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> +		      u64 logical, u64 *length, struct btrfs_io_stripe *dst,
> +		      struct map_lookup *map, u32 stripe_index,
> +		      u64 stripe_offset, u64 stripe_nr)
>  {
>  	dst->dev = map->stripes[stripe_index].dev;
> +
> +	if (fs_info->stripe_root && op == BTRFS_MAP_READ &&
> +	    btrfs_need_stripe_tree_update(fs_info, map->type))

We already check if (fs_info->stripe_root) in here, so this can be simplified to

if (op == BTRFS_MAP_READ && btrfs_need_stripe_tree_update())

Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 06/11] btrfs: add raid stripe tree pretty printer
  2022-10-17 11:55 ` [RFC v3 06/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
@ 2022-10-20 15:34   ` Josef Bacik
  0 siblings, 0 replies; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:34 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:24AM -0700, Johannes Thumshirn wrote:
> Decode raid-stripe-tree entries on btrfs_print_tree().
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 07/11] btrfs: zoned: allow zoned RAID1
  2022-10-17 11:55 ` [RFC v3 07/11] btrfs: zoned: allow zoned RAID1 Johannes Thumshirn
@ 2022-10-20 15:35   ` Josef Bacik
  0 siblings, 0 replies; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:35 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:25AM -0700, Johannes Thumshirn wrote:
> When we have a raid-stripe-tree, we can do RAID1 on zoned devices for data
> block-groups. For meta-data block-groups, we don't actually need
> anything special, as all meta-data I/O is protected by the
> btrfs_zoned_meta_io_lock() already.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 08/11] btrfs: allow zoned RAID0 and 10
  2022-10-17 11:55 ` [RFC v3 08/11] btrfs: allow zoned RAID0 and 10 Johannes Thumshirn
@ 2022-10-20 15:36   ` Josef Bacik
  0 siblings, 0 replies; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:36 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:26AM -0700, Johannes Thumshirn wrote:
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.h | 7 ++++++-
>  fs/btrfs/zoned.c            | 4 ++--
>  2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> index 083e754f5239..d1885b428cd4 100644
> --- a/fs/btrfs/raid-stripe-tree.h
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -41,13 +41,18 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
>  	if (!fs_info->stripe_root)
>  		return false;
>  
> -	// for now

Lol.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 09/11] btrfs: fix striping with RST
  2022-10-17 11:55 ` [RFC v3 09/11] btrfs: fix striping with RST Johannes Thumshirn
@ 2022-10-20 15:36   ` Josef Bacik
  0 siblings, 0 replies; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:36 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:27AM -0700, Johannes Thumshirn wrote:
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Why you no changelog?

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 10/11] btrfs: check for leaks of ordered stripes on umount
  2022-10-17 11:55 ` [RFC v3 10/11] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
@ 2022-10-20 15:37   ` Josef Bacik
  2022-10-21  8:17     ` Johannes Thumshirn
  0 siblings, 1 reply; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:37 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:28AM -0700, Johannes Thumshirn wrote:
> Check if we're leaking any ordered stripes when unmounting a filesystem
> with an stripe tree.
> 
> This check is gated behind CONFIG_BTRFS_DEBUG to not affect any production
> type systems.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/disk-io.c          |  2 ++
>  fs/btrfs/raid-stripe-tree.c | 29 +++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  1 +
>  3 files changed, 32 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 190caabf5fb7..e479e9829c3e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -43,6 +43,7 @@
>  #include "space-info.h"
>  #include "zoned.h"
>  #include "subpage.h"
> +#include "raid-stripe-tree.h"
>  
>  #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
>  				 BTRFS_HEADER_FLAG_RELOC |\
> @@ -1480,6 +1481,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
>  	btrfs_put_root(fs_info->stripe_root);
>  	btrfs_check_leaked_roots(fs_info);
>  	btrfs_extent_buffer_leak_debug_check(fs_info);
> +	btrfs_check_ordered_stripe_leak(fs_info);
>  	kfree(fs_info->super_copy);
>  	kfree(fs_info->super_for_commit);
>  	kfree(fs_info->subpage_info);
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 91e67600e01a..9a913c4cd44e 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -30,6 +30,35 @@ static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
>  	return ordered_stripe_cmp(&stripe->logical, rbb);
>  }
>  
> +void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info)
> +{
> +#ifdef CONFIG_BTRFS_DEBUG
> +	struct rb_node *node;
> +
> +	if (!fs_info->stripe_root ||
> +	    RB_EMPTY_ROOT(&fs_info->stripe_update_tree))
> +		return;
> +
> +	mutex_lock(&fs_info->stripe_update_lock);
> +	while ((node = rb_first_postorder(&fs_info->stripe_update_tree))
> +	       != NULL) {
> +		struct btrfs_ordered_stripe *stripe =
> +			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
> +

Can we get a WARN_ON_ONCE() in here?  That way xfstests failures will get
noticed as we'll get the dmesg failures.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 11/11] btrfs: add tracepoints for ordered stripes
  2022-10-17 11:55 ` [RFC v3 11/11] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
@ 2022-10-20 15:38   ` Josef Bacik
  0 siblings, 0 replies; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:38 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:29AM -0700, Johannes Thumshirn wrote:
> Add tracepoints to check the lifetime of btrfs_ordered_stripe entries.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 00/11] btrfs: raid-stripe-tree draft patches
  2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
                   ` (10 preceding siblings ...)
  2022-10-17 11:55 ` [RFC v3 11/11] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
@ 2022-10-20 15:42 ` Josef Bacik
  2022-10-21  8:40   ` Johannes Thumshirn
  11 siblings, 1 reply; 28+ messages in thread
From: Josef Bacik @ 2022-10-20 15:42 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Oct 17, 2022 at 04:55:18AM -0700, Johannes Thumshirn wrote:
> Here's a yet another draft of my btrfs zoned RAID patches. It's based on
> Christoph's bio splitting series for btrfs.
> 
> Updates of the raid-stripe-tree are done at delayed-ref time to safe on
> bandwidth while for reading we do the stripe-tree lookup on bio mapping time,
> i.e. when the logical to physical translation happens for regular btrfs RAID
> as well.
> 
> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
> it's contents are the respective physical device id and position.
> 

So generally I'm good with this design and everything, I just have a few asks

1. I want a design doc for btrfs-dev-docs that lays out the design and how it's
   meant to be used.  The ondisk stuff, as well as the post update after the
   delayed ref runs.
2. Additionally, I would love to see it written down where exactly you want to
   use this in the future.  I know you've talked about using it for other raid
   levels, but I have a hard time paying attention to my own stuff so I'd like
   to see what the long term vision is for this, again this would probably be
   well suited for the btrfs-dev-docs update.
3. I super don't love the fact that we have mirrored extents in both places,
   especially with zoned stripping it down to 128k, this tree is going to be
   huge.  There's no way around this, but this makes the global roots thing even
   more important for scalability with zoned+RST.  I don't really think you need
   to add that bit here now, I'll make it global in my patches for extent tree
   v2.  Mostly I'm just lamenting that you're going to be ready before me and
   now you'll have to wait for the benefits of the global roots work.

Thanks,

Josef

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 01/11] btrfs: add raid stripe tree definitions
  2022-10-20 15:21   ` Josef Bacik
@ 2022-10-20 15:49     ` Johannes Thumshirn
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-20 15:49 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 20.10.22 17:21, Josef Bacik wrote:
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index 5f32a2a495dc..047e1d0b2ff6 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -4,9 +4,8 @@
>>  
>>  #include <linux/btrfs.h>
>>  #include <linux/types.h>
>> -#ifdef __KERNEL__
>>  #include <linux/stddef.h>
>> -#else
>> +#ifndef __KERNEL__
>>  #include <stddef.h>
>>  #endif
>>
> 
> Gotta act like I did something, this is unrelated and should probably be it's
> own thing separate from the raid stripe tree.  Thanks,

No it's needed for __DECLARE_FLEX_ARRAY() only.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 03/11] btrfs: add support for inserting raid stripe extents
  2022-10-20 15:30   ` Josef Bacik
@ 2022-10-21  8:13     ` Johannes Thumshirn
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-21  8:13 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 20.10.22 17:30, Josef Bacik wrote:
>>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 430f224743a9..1f75ab8702bb 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1102,6 +1102,9 @@ struct btrfs_fs_info {
>>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>>  	struct lockdep_map btrfs_ordered_extent_map;
>>  
>> +	struct mutex stripe_update_lock;
> 
> The mutex seems a bit heavy handed here, but I don't have a good mental model
> for the mix of adds/lookups there are.  I think at the very least a spin lock is
> more appropriate, and maybe an rwlock depending on how often we're doing a
> add/remove vs lookups.

spin_lock()s don't allow us sleeping for i.e. memory allocations, so I opted 
against them. What could work though are rw_semaphores, don't know how heavy
the are in comparisson to a mutex though.

> <snip>
>> +
>> +static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
>> +{
>> +	struct btrfs_ordered_stripe *stripe =
>> +		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
>> +	const u64 *logical = key;
>> +
>> +	if (*logical < stripe->logical)
>> +		return -1;
>> +	if (*logical >= stripe->logical + stripe->num_bytes)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
>> +{
>> +	struct btrfs_ordered_stripe *stripe =
>> +		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
>> +	return ordered_stripe_cmp(&stripe->logical, rbb);
>> +}
> 
> Ahhh yay we're finally not adding new rb tree insert/lookup functions that are
> 99% the same.

Yeah, I had it open-coded but then discovered rb_add() and friends. 
Switching to them also fixed a bug for me :)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 05/11] btrfs: lookup physical address from stripe extent
  2022-10-20 15:34   ` Josef Bacik
@ 2022-10-21  8:16     ` Johannes Thumshirn
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-21  8:16 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 20.10.22 17:34, Josef Bacik wrote:
>> +	ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
>> +	if (ret < 0)
>> +		goto out;
>> +	if (ret) {
>> +		if (path->slots[0] != 0)
>> +			path->slots[0]--;
>> +	}
>> +
>> +	end = logical + *length;
>> +
>> +	while (1) {
>> +		leaf = path->nodes[0];
>> +		slot = path->slots[0];
>> +
>> +		btrfs_item_key_to_cpu(leaf, &found_key, slot);
>> +		found_logical = found_key.objectid;
>> +		found_length = found_key.offset;
>> +
> 
> Don't we have fancy new iterators for walking through the btree?  Can that be
> used here instead of this old style walk through?

Possibly, let me give it a try.

[...]

>> +
>> +out:
>> +	if (ret > 0)
>> +		ret = -ENOENT;
>> +	if (ret) {
> 
> Maybe instead
> 
> if (ret && ret != -EIO)
> 
> I have a lot of boxes, and a given percentage of them have bad disks, which ends
> up with a lot of btrfs_print_tree()'s that I don't need.

Sure, will do.

[...]

>>  
>> -static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
>> -		          u32 stripe_index, u64 stripe_offset, u64 stripe_nr)
>> +static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> +		      u64 logical, u64 *length, struct btrfs_io_stripe *dst,
>> +		      struct map_lookup *map, u32 stripe_index,
>> +		      u64 stripe_offset, u64 stripe_nr)
>>  {
>>  	dst->dev = map->stripes[stripe_index].dev;
>> +
>> +	if (fs_info->stripe_root && op == BTRFS_MAP_READ &&
>> +	    btrfs_need_stripe_tree_update(fs_info, map->type))
> 
> We already check if (fs_info->stripe_root) in here, so this can be simplified to
> 
> if (op == BTRFS_MAP_READ && btrfs_need_stripe_tree_update())

I wanted to avoid the function call for all non stripe tree users, but I guess that's
a premature micro optimization that can go away, indeed.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 10/11] btrfs: check for leaks of ordered stripes on umount
  2022-10-20 15:37   ` Josef Bacik
@ 2022-10-21  8:17     ` Johannes Thumshirn
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-21  8:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 20.10.22 17:37, Josef Bacik wrote:
+void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info)
>> +{
>> +#ifdef CONFIG_BTRFS_DEBUG
>> +	struct rb_node *node;
>> +
>> +	if (!fs_info->stripe_root ||
>> +	    RB_EMPTY_ROOT(&fs_info->stripe_update_tree))
>> +		return;
>> +
>> +	mutex_lock(&fs_info->stripe_update_lock);
>> +	while ((node = rb_first_postorder(&fs_info->stripe_update_tree))
>> +	       != NULL) {
>> +		struct btrfs_ordered_stripe *stripe =
>> +			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
>> +
> 
> Can we get a WARN_ON_ONCE() in here?  That way xfstests failures will get
> noticed as we'll get the dmesg failures.  Thanks,

Sure.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v3 00/11] btrfs: raid-stripe-tree draft patches
  2022-10-20 15:42 ` [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Josef Bacik
@ 2022-10-21  8:40   ` Johannes Thumshirn
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2022-10-21  8:40 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On 20.10.22 17:42, Josef Bacik wrote:
> On Mon, Oct 17, 2022 at 04:55:18AM -0700, Johannes Thumshirn wrote:
>> Here's a yet another draft of my btrfs zoned RAID patches. It's based on
>> Christoph's bio splitting series for btrfs.
>>
>> Updates of the raid-stripe-tree are done at delayed-ref time to safe on
>> bandwidth while for reading we do the stripe-tree lookup on bio mapping time,
>> i.e. when the logical to physical translation happens for regular btrfs RAID
>> as well.
>>
>> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
>> it's contents are the respective physical device id and position.
>>
> 
> So generally I'm good with this design and everything, I just have a few asks
> 
> 1. I want a design doc for btrfs-dev-docs that lays out the design and how it's
>    meant to be used.  The ondisk stuff, as well as the post update after the
>    delayed ref runs.
> 2. Additionally, I would love to see it written down where exactly you want to
>    use this in the future.  I know you've talked about using it for other raid
>    levels, but I have a hard time paying attention to my own stuff so I'd like
>    to see what the long term vision is for this, again this would probably be
>    well suited for the btrfs-dev-docs update.

Sure I'll add a document to btrfs-dev-docs (and sent it to the list for review).

There's still a problem with the delayed-ref update to be solved resulting in the
leak checker yelling on unmount, so maybe documenting what I've done, shows me 
where I messed up.

> 3. I super don't love the fact that we have mirrored extents in both places,
>    especially with zoned stripping it down to 128k, this tree is going to be
>    huge.  There's no way around this, but this makes the global roots thing even
>    more important for scalability with zoned+RST.  I don't really think you need
>    to add that bit here now, I'll make it global in my patches for extent tree
>    v2.  Mostly I'm just lamenting that you're going to be ready before me and
>    now you'll have to wait for the benefits of the global roots work.

Well I'm pretty sure I won't be done before the global roots work is. But I agree
the extra amount of metadata for the RST is a concern to me as well. Especially for
overwrite kind of workloads it produces a lot of new extents for each CoW we write.
Combine that with zoned and we really need working reclaim, otherwise it goes all
down the drench.

Can you please remind me, with your global roots am I getting a root per metadata
or per data block group?


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-10-21  8:41 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-17 11:55 [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Johannes Thumshirn
2022-10-17 11:55 ` [RFC v3 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
2022-10-20 15:21   ` Josef Bacik
2022-10-20 15:49     ` Johannes Thumshirn
2022-10-17 11:55 ` [RFC v3 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
2022-10-17 11:55 ` [RFC v3 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2022-10-20 15:24   ` Josef Bacik
2022-10-20 15:30   ` Josef Bacik
2022-10-21  8:13     ` Johannes Thumshirn
2022-10-17 11:55 ` [RFC v3 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
2022-10-17 11:55 ` [RFC v3 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
2022-10-20 15:34   ` Josef Bacik
2022-10-21  8:16     ` Johannes Thumshirn
2022-10-17 11:55 ` [RFC v3 06/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
2022-10-20 15:34   ` Josef Bacik
2022-10-17 11:55 ` [RFC v3 07/11] btrfs: zoned: allow zoned RAID1 Johannes Thumshirn
2022-10-20 15:35   ` Josef Bacik
2022-10-17 11:55 ` [RFC v3 08/11] btrfs: allow zoned RAID0 and 10 Johannes Thumshirn
2022-10-20 15:36   ` Josef Bacik
2022-10-17 11:55 ` [RFC v3 09/11] btrfs: fix striping with RST Johannes Thumshirn
2022-10-20 15:36   ` Josef Bacik
2022-10-17 11:55 ` [RFC v3 10/11] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
2022-10-20 15:37   ` Josef Bacik
2022-10-21  8:17     ` Johannes Thumshirn
2022-10-17 11:55 ` [RFC v3 11/11] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
2022-10-20 15:38   ` Josef Bacik
2022-10-20 15:42 ` [RFC v3 00/11] btrfs: raid-stripe-tree draft patches Josef Bacik
2022-10-21  8:40   ` Johannes Thumshirn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.