* [PATCH v4 0/9] btrfs: introduce RAID stripe tree
@ 2022-12-07 14:22 Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 1/9] btrfs: add raid stripe tree definitions Johannes Thumshirn
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

Updates of the raid-stripe-tree are done at delayed-ref time to save on
bandwidth, while for reading the stripe-tree lookup is done at bio mapping
time, i.e. when the logical-to-physical translation happens for regular
btrfs RAID as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
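
As a rough user-space illustration of that layout (not the kernel code; all
struct and function names below are made up for this sketch), an entry can be
modelled as a key holding the extent's logical start and length, plus one
(devid, physical) pair per copy. The offset arithmetic in model_physical() is
an assumption about how a lookup would use such an entry; the real lookup is
added by a later patch in this series.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_RAID_STRIPE_KEY 247 /* value of BTRFS_RAID_STRIPE_KEY in patch 1 */
#define MODEL_MAX_STRIDES 4       /* arbitrary bound for this sketch */

/* Hypothetical in-memory model of one stripe tree item. */
struct model_stride {
	uint64_t devid;    /* device the copy lives on */
	uint64_t physical; /* byte offset on that device */
};

struct model_stripe_item {
	uint64_t objectid; /* key: extent's disk_bytenr (logical start) */
	uint8_t type;      /* key: MODEL_RAID_STRIPE_KEY */
	uint64_t offset;   /* key: extent's disk_num_bytes (length) */
	int num_strides;
	struct model_stride strides[MODEL_MAX_STRIDES];
};

/* Returns 1 if logical falls inside the extent described by the item. */
static int model_contains(const struct model_stripe_item *item, uint64_t logical)
{
	return logical >= item->objectid &&
	       logical - item->objectid < item->offset;
}

/*
 * Assumed translation: the distance from the extent start is added to the
 * physical start recorded for the chosen stride.
 */
static uint64_t model_physical(const struct model_stripe_item *item,
			       int stride, uint64_t logical)
{
	assert(stride >= 0 && stride < item->num_strides);
	assert(model_contains(item, logical));
	return item->strides[stride].physical + (logical - item->objectid);
}

/* Item 0 from the example dump in this cover letter. */
static const struct model_stripe_item model_item0 = {
	.objectid = 939524096ULL,
	.type = MODEL_RAID_STRIPE_KEY,
	.offset = 126976ULL,
	.num_strides = 2,
	.strides = {
		{ .devid = 1, .physical = 939524096ULL },
		{ .devid = 2, .physical = 536870912ULL },
	},
};
```

Under that assumption, a read 4096 bytes into the first extent would map to
offset 536870912 + 4096 on devid 2.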

For example, a 1M write (split into 126976-byte, i.e. 124K, segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows:

rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
btrfs-progs v5.16.1 
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
leaf 805847040 flags 0x1(WRITTEN) backref revision 1
checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
        item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
                        stripe 0 devid 1 offset 939524096
                        stripe 1 devid 2 offset 536870912
        item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
                        stripe 0 devid 1 offset 939651072
                        stripe 1 devid 2 offset 536997888
        item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
                        stripe 0 devid 1 offset 939778048
                        stripe 1 devid 2 offset 537124864
        item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
                        stripe 0 devid 1 offset 939905024
                        stripe 1 devid 2 offset 537251840
        item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
                        stripe 0 devid 1 offset 940032000
                        stripe 1 devid 2 offset 537378816
        item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
                        stripe 0 devid 1 offset 940158976
                        stripe 1 devid 2 offset 537505792
        item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
                        stripe 0 devid 1 offset 940285952
                        stripe 1 devid 2 offset 537632768
        item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
                        stripe 0 devid 1 offset 940412928
                        stripe 1 devid 2 offset 537759744
        item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
                        stripe 0 devid 1 offset 940539904
                        stripe 1 devid 2 offset 537886720
total bytes 26843545600
bytes used 1245184
uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
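
The numbers in the dump can be cross-checked with a few lines of arithmetic.
The sketch below is plain user-space C with made-up names; it only assumes
what the dump shows: eight items of 126976 bytes (124 KiB) and one of 32768
bytes, each carrying two strides of two 64-bit fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Key offsets (extent lengths in bytes) of items 0..8 in the dump. */
static const uint64_t dump_lengths[] = {
	126976, 126976, 126976, 126976,
	126976, 126976, 126976, 126976,
	32768,
};

#define DUMP_NR_ITEMS (sizeof(dump_lengths) / sizeof(dump_lengths[0]))

/* Sum of all extent lengths; should equal the size of the write. */
static uint64_t dump_total_bytes(void)
{
	uint64_t total = 0;

	for (size_t i = 0; i < DUMP_NR_ITEMS; i++)
		total += dump_lengths[i];
	return total;
}

/* Item payload: one (devid, physical) pair of 64-bit values per stride. */
static size_t dump_item_size(int num_strides)
{
	assert(num_strides > 0);
	return (size_t)num_strides * 2 * sizeof(uint64_t);
}
```

dump_total_bytes() comes to 1048576, the 1 MiB written by xfs_io, and
dump_item_size(2) comes to 32, matching the itemsize column.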

A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true


Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches

v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Johannes Thumshirn (9):
  btrfs: add raid stripe tree definitions
  btrfs: read raid-stripe-tree from disk
  btrfs: add support for inserting raid stripe extents
  btrfs: delete stripe extent on extent deletion
  btrfs: lookup physical address from stripe extent
  btrfs: add raid stripe tree pretty printer
  btrfs: zoned: allow zoned RAID
  btrfs: check for leaks of ordered stripes on umount
  btrfs: add tracepoints for ordered stripes

 fs/btrfs/Makefile               |   3 +-
 fs/btrfs/accessors.h            |  29 +++
 fs/btrfs/bio.c                  |  30 ++-
 fs/btrfs/bio.h                  |   2 +
 fs/btrfs/block-rsv.c            |   1 +
 fs/btrfs/delayed-ref.c          |   5 +-
 fs/btrfs/disk-io.c              |  24 ++
 fs/btrfs/disk-io.h              |   5 +
 fs/btrfs/extent-tree.c          |  57 +++++
 fs/btrfs/fs.h                   |   7 +-
 fs/btrfs/inode.c                |   6 +
 fs/btrfs/print-tree.c           |  21 ++
 fs/btrfs/raid-stripe-tree.c     | 402 ++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  81 +++++++
 fs/btrfs/super.c                |   1 +
 fs/btrfs/volumes.c              |  38 ++-
 fs/btrfs/volumes.h              |  12 +-
 fs/btrfs/zoned.c                |  43 ++++
 include/trace/events/btrfs.h    |  50 ++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  20 +-
 21 files changed, 818 insertions(+), 20 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

-- 
2.38.1



* [PATCH v4 1/9] btrfs: add raid stripe tree definitions
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 2/9] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

Add definitions for the raid stripe tree. This tree will hold information
about the on-disk layout of the stripes in a RAID set.

Each stripe extent has a 1:1 relationship with an on-disk extent item and
provides the logical to per-drive physical address translation for the
extent item in question.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
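The per-stride accessors in this patch boil down to simple offset arithmetic
on the item payload. The following user-space sketch (hypothetical names, a
plain byte buffer instead of an extent buffer) illustrates it, assuming the
strides array starts at offset 0 of the item and each stride is two
little-endian 64-bit fields, as declared in btrfs_tree.h below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STRIDE_SIZE 16u /* two __le64 fields: devid, physical */

/* Decode a little-endian 64-bit value regardless of host byte order. */
static uint64_t sketch_get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Byte offset of stride nr inside the item payload. */
static size_t sketch_stride_offset(int nr)
{
	assert(nr >= 0);
	/* strides[] is the only member, so it starts at offset 0. */
	return (size_t)nr * SKETCH_STRIDE_SIZE;
}

/* devid is the first field of a stride. */
static uint64_t sketch_stride_devid(const uint8_t *item, int nr)
{
	return sketch_get_le64(item + sketch_stride_offset(nr));
}

/* physical follows devid, 8 bytes into the stride. */
static uint64_t sketch_stride_physical(const uint8_t *item, int nr)
{
	return sketch_get_le64(item + sketch_stride_offset(nr) + 8);
}
```

The caller is responsible for passing a buffer that actually holds at least
nr + 1 strides; the sketch does no bounds checking.
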
 fs/btrfs/accessors.h            | 29 +++++++++++++++++++++++++++++
 include/uapi/linux/btrfs_tree.h | 20 ++++++++++++++++++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index ceadfc5d6c66..6e753b63faae 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -306,6 +306,35 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
 
+BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+
+static inline struct btrfs_raid_stride *btrfs_raid_stride_nr(
+					 struct btrfs_stripe_extent *dps, int nr)
+{
+	unsigned long offset = (unsigned long)dps;
+
+	offset += offsetof(struct btrfs_stripe_extent, strides);
+	offset += nr * sizeof(struct btrfs_raid_stride);
+	return (struct btrfs_raid_stride *)offset;
+}
+
+static inline u64 btrfs_raid_stride_devid_nr(const struct extent_buffer *eb,
+					       struct btrfs_stripe_extent *dps,
+					       int nr)
+{
+	return btrfs_raid_stride_devid(eb, btrfs_raid_stride_nr(dps, nr));
+}
+
+static inline u64 btrfs_raid_stride_physical_nr(const struct extent_buffer *eb,
+						  struct btrfs_stripe_extent *dps,
+						  int nr)
+{
+	return btrfs_raid_stride_physical(eb, btrfs_raid_stride_nr(dps, nr));
+}
+
 /* struct btrfs_dev_extent */
 BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
 BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ab38d0f411fa..64e6bf2a10d8 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -4,9 +4,8 @@
 
 #include <linux/btrfs.h>
 #include <linux/types.h>
-#ifdef __KERNEL__
 #include <linux/stddef.h>
-#else
+#ifndef __KERNEL__
 #include <stddef.h>
 #endif
 
@@ -73,6 +72,9 @@
 /* Holds the block group items for extent tree v2. */
 #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
 
+/* tracks RAID stripes in block groups. */
+#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -281,6 +283,8 @@
  */
 #define BTRFS_QGROUP_RELATION_KEY       246
 
+#define BTRFS_RAID_STRIPE_KEY		247
+
 /*
  * Obsolete name, see BTRFS_TEMPORARY_ITEM_KEY.
  */
@@ -715,6 +719,18 @@ struct btrfs_free_space_header {
 	__le64 num_bitmaps;
 } __attribute__ ((__packed__));
 
+struct btrfs_raid_stride {
+	/* btrfs device-id this raid extent lives on */
+	__le64 devid;
+	/* physical location on disk */
+	__le64 physical;
+};
+
+struct btrfs_stripe_extent {
+	/* array of raid strides this stripe is composed of */
+	__DECLARE_FLEX_ARRAY(struct btrfs_raid_stride, strides);
+};
+
 #define BTRFS_HEADER_FLAG_WRITTEN	(1ULL << 0)
 #define BTRFS_HEADER_FLAG_RELOC		(1ULL << 1)
 
-- 
2.38.1



* [PATCH v4 2/9] btrfs: read raid-stripe-tree from disk
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 1/9] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

If we find a raid-stripe-tree on mount, read it from disk.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
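The mount-time decision added to btrfs_read_roots() can be summarised in a
small user-space model. Everything here is a sketch with hypothetical names:
read_err stands in for the outcome of reading the root from disk, and
ignore_bad_roots for the IGNOREBADROOTS mount option. Only the bit value is
taken from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14) /* bit from this patch */

/* Hypothetical stand-in for the result of reading a tree root from disk. */
struct sketch_root {
	int track_dirty;
};

/*
 * Models the decision made at mount time. read_err is the (assumed) result
 * of reading the root: 0 on success or a negative errno. On success with
 * the feature enabled, *root_out points at root; otherwise it stays NULL.
 */
static int sketch_read_stripe_root(uint64_t incompat_flags, int read_err,
				   int ignore_bad_roots,
				   struct sketch_root *root,
				   struct sketch_root **root_out)
{
	assert(root && root_out);
	*root_out = NULL;

	/* No incompat bit: the filesystem has no stripe tree to read. */
	if (!(incompat_flags & SKETCH_INCOMPAT_RAID_STRIPE_TREE))
		return 0;

	if (read_err) {
		/* A bad root is fatal unless the user asked to ignore it. */
		return ignore_bad_roots ? 0 : read_err;
	}

	root->track_dirty = 1;
	*root_out = root;
	return 0;
}
```

Note that with ignore_bad_roots set, a failed read leaves the stripe root
unset, so later code has to cope with a missing root.
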
 fs/btrfs/block-rsv.c       |  1 +
 fs/btrfs/disk-io.c         | 19 +++++++++++++++++++
 fs/btrfs/disk-io.h         |  5 +++++
 fs/btrfs/fs.h              |  4 +++-
 include/uapi/linux/btrfs.h |  1 +
 5 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 5367a14d44d2..384987343a64 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -402,6 +402,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
 	case BTRFS_EXTENT_TREE_OBJECTID:
 	case BTRFS_FREE_SPACE_TREE_OBJECTID:
 	case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
 		root->block_rsv = &fs_info->delayed_refs_rsv;
 		break;
 	case BTRFS_ROOT_TREE_OBJECTID:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f10b946454d7..5784c850a3ec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1444,6 +1444,9 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
 
 		return btrfs_grab_root(root) ? root : ERR_PTR(-ENOENT);
 	}
+	if (objectid == BTRFS_RAID_STRIPE_TREE_OBJECTID)
+		return btrfs_grab_root(fs_info->stripe_root) ?
+			fs_info->stripe_root : ERR_PTR(-ENOENT);
 	return NULL;
 }
 
@@ -1522,6 +1525,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->fs_root);
 	btrfs_put_root(fs_info->data_reloc_root);
 	btrfs_put_root(fs_info->block_group_root);
+	btrfs_put_root(fs_info->stripe_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
 	kfree(fs_info->super_copy);
@@ -2054,6 +2058,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
 	free_root_extent_buffers(info->fs_root);
 	free_root_extent_buffers(info->data_reloc_root);
 	free_root_extent_buffers(info->block_group_root);
+	free_root_extent_buffers(info->stripe_root);
 	if (free_chunk_root)
 		free_root_extent_buffers(info->chunk_root);
 }
@@ -2506,6 +2511,20 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 		fs_info->uuid_root = root;
 	}
 
+	if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
+		location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
+		root = btrfs_read_tree_root(tree_root, &location);
+		if (IS_ERR(root)) {
+			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
+				ret = PTR_ERR(root);
+				goto out;
+			}
+		} else {
+			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+			fs_info->stripe_root = root;
+		}
+	}
+
 	return 0;
 out:
 	btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index bc686de7eb80..e10f942c0a37 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -106,6 +106,11 @@ static inline struct btrfs_root *btrfs_grab_root(struct btrfs_root *root)
 	return NULL;
 }
 
+static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
+{
+	return fs_info->stripe_root;
+}
+
 void btrfs_put_root(struct btrfs_root *root);
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index d5224df7468b..f08b59320645 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -202,7 +202,8 @@ enum {
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
 	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
 	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
-	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2  |	\
+	 BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE)
 #else
 #define BTRFS_FEATURE_INCOMPAT_SUPP			\
 	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF |		\
@@ -355,6 +356,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *data_reloc_root;
 	struct btrfs_root *block_group_root;
+	struct btrfs_root *stripe_root;
 
 	/* The log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4f0f9531119..593fb7930a37 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -322,6 +322,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.38.1



* [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 1/9] btrfs: add raid stripe tree definitions Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 2/9] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-12  7:22   ` Christoph Hellwig
                     ` (2 more replies)
  2022-12-07 14:22 ` [PATCH v4 4/9] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed. This
is done to a) support zone-append and b) rule out the possibility of a
RAID-write-hole.

This is done by creating in-memory ordered stripe extents, analogous to the
in-memory ordered extents, on I/O completion. The on-disk raid stripe
extents are created when the delayed_refs are run for the extent item the
stripe extent is tied to.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
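The range comparison used for the in-memory ordered stripes can be tried out
in user space. The sketch below swaps the kernel rbtree for a sorted array
and bsearch(), keeps the same three-way rule as ordered_stripe_cmp(), and
assumes the ranges do not overlap. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for an in-memory ordered stripe. */
struct sketch_ordered_stripe {
	uint64_t logical;
	uint64_t num_bytes;
};

/*
 * Same three-way rule as ordered_stripe_cmp() in this patch: a key before
 * the range sorts left, a key at or past its end sorts right, and any
 * address inside [logical, logical + num_bytes) matches.
 */
static int sketch_stripe_cmp(const void *key, const void *elem)
{
	const uint64_t *logical = key;
	const struct sketch_ordered_stripe *stripe = elem;

	if (*logical < stripe->logical)
		return -1;
	if (*logical >= stripe->logical + stripe->num_bytes)
		return 1;
	return 0;
}

/* Look up the stripe covering logical in an array sorted by start address. */
static const struct sketch_ordered_stripe *
sketch_lookup(const struct sketch_ordered_stripe *stripes, size_t nr,
	      uint64_t logical)
{
	assert(stripes || nr == 0);
	if (nr == 0)
		return NULL;
	return bsearch(&logical, stripes, nr, sizeof(*stripes),
		       sketch_stripe_cmp);
}
```

Because any address inside a range compares equal, a lookup by an address in
the middle of a stripe finds the stripe just as a lookup by its start does.
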
 fs/btrfs/Makefile           |   3 +-
 fs/btrfs/bio.c              |  30 +++++-
 fs/btrfs/bio.h              |   2 +
 fs/btrfs/delayed-ref.c      |   5 +-
 fs/btrfs/disk-io.c          |   3 +
 fs/btrfs/extent-tree.c      |  49 +++++++++
 fs/btrfs/fs.h               |   3 +
 fs/btrfs/inode.c            |   6 ++
 fs/btrfs/raid-stripe-tree.c | 195 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  69 +++++++++++++
 fs/btrfs/volumes.c          |   5 +-
 fs/btrfs/volumes.h          |  12 +--
 fs/btrfs/zoned.c            |   4 +
 13 files changed, 376 insertions(+), 10 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 555c962fdad6..63236ae2a87b 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o
+	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
+	   raid-stripe-tree.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 4ccbc120e869..b60a50165703 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -15,6 +15,7 @@
 #include "rcu-string.h"
 #include "zoned.h"
 #include "file-item.h"
+#include "raid-stripe-tree.h"
 
 static struct bio_set btrfs_bioset;
 static struct bio_set btrfs_clone_bioset;
@@ -350,6 +351,21 @@ static void btrfs_raid56_end_io(struct bio *bio)
 	btrfs_put_bioc(bioc);
 }
 
+static void btrfs_raid_stripe_update(struct work_struct *work)
+{
+	struct btrfs_bio *bbio =
+		container_of(work, struct btrfs_bio, raid_stripe_work);
+	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
+	struct btrfs_io_context *bioc = stripe->bioc;
+	int ret;
+
+	ret = btrfs_add_ordered_stripe(bioc);
+	if (ret)
+		bbio->bio.bi_status = errno_to_blk_status(ret);
+	btrfs_orig_bbio_end_io(bbio);
+	btrfs_put_bioc(bioc);
+}
+
 static void btrfs_orig_write_end_io(struct bio *bio)
 {
 	struct btrfs_io_stripe *stripe = bio->bi_private;
@@ -372,6 +388,15 @@ static void btrfs_orig_write_end_io(struct bio *bio)
 	else
 		bio->bi_status = BLK_STS_OK;
 
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+
+	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
+		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
+		schedule_work(&bbio->raid_stripe_work);
+		return;
+	}
+
 	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
 }
@@ -383,7 +408,9 @@ static void btrfs_clone_write_end_io(struct bio *bio)
 	if (bio->bi_status) {
 		atomic_inc(&stripe->bioc->error);
 		btrfs_log_dev_io_error(bio, stripe->dev);
-	}
+	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	}
 
 	/* Pass on control to the original bio this one was cloned from */
 	bio_endio(stripe->bioc->orig_bio);
@@ -442,6 +469,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 	bio->bi_private = &bioc->stripes[dev_nr];
 	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
 	bioc->stripes[dev_nr].bioc = bioc;
+	bioc->size = bio->bi_iter.bi_size;
 	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
 }
 
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 20105806c8ac..bf5fbc105148 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -58,6 +58,8 @@ struct btrfs_bio {
 	atomic_t pending_ios;
 	struct work_struct end_io_work;
 
+	struct work_struct raid_stripe_work;
+
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 573ebab886e2..0357f9327cd4 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -14,6 +14,7 @@
 #include "space-info.h"
 #include "tree-mod-log.h"
 #include "fs.h"
+#include "raid-stripe-tree.h"
 
 struct kmem_cache *btrfs_delayed_ref_head_cachep;
 struct kmem_cache *btrfs_delayed_tree_ref_cachep;
@@ -640,8 +641,10 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
 	exist->ref_mod += mod;
 
 	/* remove existing tail if its ref_mod is zero */
-	if (exist->ref_mod == 0)
+	if (exist->ref_mod == 0) {
+		btrfs_drop_ordered_stripe(trans->fs_info, exist->bytenr);
 		drop_delayed_ref(trans, root, href, exist);
+	}
 	spin_unlock(&href->lock);
 	return ret;
 inserted:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5784c850a3ec..bdef4e2e4ea3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3033,6 +3033,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 
 	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
 	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
+
+	rwlock_init(&fs_info->stripe_update_lock);
+	fs_info->stripe_update_tree = RB_ROOT;
 }
 
 static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 892d78c1853c..de479af062fd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -43,6 +43,7 @@
 #include "file-item.h"
 #include "orphan.h"
 #include "tree-checker.h"
+#include "raid-stripe-tree.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1498,6 +1499,51 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
+					    struct btrfs_delayed_ref_node *node)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	int ret = 0;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return 0;
+
+	em = btrfs_get_chunk_map(fs_info, node->bytenr, node->num_bytes);
+	if (!em) {
+		btrfs_err(fs_info,
+			  "cannot get chunk map for address %llu",
+			  node->bytenr);
+		return -EINVAL;
+	}
+
+	map = em->map_lookup;
+
+	if (btrfs_need_stripe_tree_update(fs_info, map->type)) {
+		struct btrfs_ordered_stripe *stripe;
+
+		stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
+		if (!stripe) {
+			btrfs_err(fs_info,
+				  "cannot get stripe extent for address %llu (%llu)",
+				  node->bytenr, node->num_bytes);
+			free_extent_map(em);
+			return -EINVAL;
+		}
+		ASSERT(stripe->logical == node->bytenr);
+		ASSERT(stripe->num_bytes == node->num_bytes);
+		ret = btrfs_insert_raid_extent(trans, stripe);
+		/* once for us */
+		btrfs_put_ordered_stripe(fs_info, stripe);
+		/* once for the tree */
+		btrfs_put_ordered_stripe(fs_info, stripe);
+	}
+	free_extent_map(em);
+
+	return ret;
+}
+
 static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 				struct btrfs_delayed_ref_node *node,
 				struct btrfs_delayed_extent_op *extent_op,
@@ -1528,6 +1574,9 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 						 flags, ref->objectid,
 						 ref->offset, &ins,
 						 node->ref_mod);
+		if (ret)
+			return ret;
+		ret = add_stripe_entry_for_delayed_ref(trans, node);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
 		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
 					     ref->objectid, ref->offset,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index f08b59320645..0dfe8ae2e450 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -780,6 +780,9 @@ struct btrfs_fs_info {
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
 
+	rwlock_t stripe_update_lock;
+	struct rb_root stripe_update_tree;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 373b7281f5c7..1299acf52c86 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -70,6 +70,7 @@
 #include "verity.h"
 #include "super.h"
 #include "orphan.h"
+#include "raid-stripe-tree.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -9507,6 +9508,11 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
 	if (qgroup_released < 0)
 		return ERR_PTR(qgroup_released);
 
+	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
+						    start, len);
+	if (ret)
+		goto free_qgroup;
+
 	if (trans) {
 		ret = insert_reserved_file_extent(trans, inode,
 						  file_offset, &stack_fi,
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
new file mode 100644
index 000000000000..aa60f784fb1f
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Western Digital Corporation or its affiliates.
+ */
+
+#include <linux/btrfs_tree.h>
+
+#include "ctree.h"
+#include "fs.h"
+#include "accessors.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "raid-stripe-tree.h"
+#include "volumes.h"
+#include "misc.h"
+#include "disk-io.h"
+#include "print-tree.h"
+
+static int ordered_stripe_cmp(const void *key, const struct rb_node *node)
+{
+	struct btrfs_ordered_stripe *stripe =
+		rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+	const u64 *logical = key;
+
+	if (*logical < stripe->logical)
+		return -1;
+	if (*logical >= stripe->logical + stripe->num_bytes)
+		return 1;
+	return 0;
+}
+
+static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
+{
+	struct btrfs_ordered_stripe *stripe =
+		rb_entry(rba, struct btrfs_ordered_stripe, rb_node);
+	return ordered_stripe_cmp(&stripe->logical, rbb);
+}
+
+int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
+{
+	struct btrfs_fs_info *fs_info = bioc->fs_info;
+	struct btrfs_ordered_stripe *stripe;
+	struct btrfs_io_stripe *tmp;
+	u64 logical = bioc->logical;
+	u64 length = bioc->size;
+	struct rb_node *node;
+	size_t size;
+
+	size = bioc->num_stripes * sizeof(struct btrfs_io_stripe);
+	stripe = kzalloc(sizeof(struct btrfs_ordered_stripe), GFP_NOFS);
+	if (!stripe)
+		return -ENOMEM;
+
+	spin_lock_init(&stripe->lock);
+	tmp = kmemdup(bioc->stripes, size, GFP_NOFS);
+	if (!tmp) {
+		kfree(stripe);
+		return -ENOMEM;
+	}
+
+	stripe->logical = logical;
+	stripe->num_bytes = length;
+	stripe->num_stripes = bioc->num_stripes;
+	spin_lock(&stripe->lock);
+	stripe->stripes = tmp;
+	spin_unlock(&stripe->lock);
+	refcount_set(&stripe->ref, 1);
+
+	write_lock(&fs_info->stripe_update_lock);
+	node = rb_find_add(&stripe->rb_node, &fs_info->stripe_update_tree,
+	       ordered_stripe_less);
+	write_unlock(&fs_info->stripe_update_lock);
+	if (node) {
+		btrfs_err(fs_info, "logical: %llu, length: %llu already exists",
+			  logical, length);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *fs_info,
+							 u64 logical)
+{
+	struct rb_root *root = &fs_info->stripe_update_tree;
+	struct btrfs_ordered_stripe *stripe = NULL;
+	struct rb_node *node;
+
+	read_lock(&fs_info->stripe_update_lock);
+	node = rb_find(&logical, root, ordered_stripe_cmp);
+	if (node) {
+		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+		refcount_inc(&stripe->ref);
+	}
+	read_unlock(&fs_info->stripe_update_lock);
+
+	return stripe;
+}
+
+void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
+				 struct btrfs_ordered_stripe *stripe)
+{
+	write_lock(&fs_info->stripe_update_lock);
+	if (refcount_dec_and_test(&stripe->ref)) {
+		struct rb_node *node = &stripe->rb_node;
+
+		rb_erase(node, &fs_info->stripe_update_tree);
+		RB_CLEAR_NODE(node);
+
+		spin_lock(&stripe->lock);
+		kfree(stripe->stripes);
+		spin_unlock(&stripe->lock);
+		kfree(stripe);
+	}
+	write_unlock(&fs_info->stripe_update_lock);
+}
+
+int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
+					  u64 start, u64 len)
+{
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_ordered_stripe *stripe;
+	u64 map_length = len;
+	int ret;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return 0;
+
+	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, start, &map_length,
+			      &bioc, 0);
+	if (ret)
+		return ret;
+
+	bioc->size = len;
+
+	stripe = btrfs_lookup_ordered_stripe(fs_info, start);
+	if (!stripe) {
+		ret = btrfs_add_ordered_stripe(bioc);
+		if (ret)
+			return ret;
+	} else {
+		spin_lock(&stripe->lock);
+		memcpy(stripe->stripes, bioc->stripes,
+		       bioc->num_stripes * sizeof(struct btrfs_io_stripe));
+		spin_unlock(&stripe->lock);
+		btrfs_put_ordered_stripe(fs_info, stripe);
+	}
+
+	return 0;
+}
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+			     struct btrfs_ordered_stripe *stripe)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key stripe_key;
+	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+	struct btrfs_stripe_extent *stripe_extent;
+	size_t item_size;
+	int ret;
+
+	item_size = stripe->num_stripes * sizeof(struct btrfs_raid_stride);
+
+	stripe_extent = kzalloc(item_size, GFP_NOFS);
+	if (!stripe_extent) {
+		btrfs_abort_transaction(trans, -ENOMEM);
+		btrfs_end_transaction(trans);
+		return -ENOMEM;
+	}
+
+	spin_lock(&stripe->lock);
+	for (int i = 0; i < stripe->num_stripes; i++) {
+		u64 devid = stripe->stripes[i].dev->devid;
+		u64 physical = stripe->stripes[i].physical;
+		struct btrfs_raid_stride *raid_stride =
+						&stripe_extent->strides[i];
+
+		btrfs_set_stack_raid_stride_devid(raid_stride, devid);
+		btrfs_set_stack_raid_stride_physical(raid_stride, physical);
+	}
+	spin_unlock(&stripe->lock);
+
+	stripe_key.objectid = stripe->logical;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = stripe->num_bytes;
+
+	ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
+				item_size);
+	if (ret)
+		btrfs_abort_transaction(trans, ret);
+
+	kfree(stripe_extent);
+
+	return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
new file mode 100644
index 000000000000..807d9123270c
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Western Digital Corporation or its affiliates.
+ */
+
+#ifndef BTRFS_RAID_STRIPE_TREE_H
+#define BTRFS_RAID_STRIPE_TREE_H
+
+#include "disk-io.h"
+
+struct btrfs_io_context;
+
+struct btrfs_ordered_stripe {
+	struct rb_node rb_node;
+
+	u64 logical;
+	u64 num_bytes;
+	int num_stripes;
+	struct btrfs_io_stripe *stripes;
+	spinlock_t lock;
+	refcount_t ref;
+};
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+			     struct btrfs_ordered_stripe *stripe);
+int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
+					  u64 start, u64 len);
+struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
+						 struct btrfs_fs_info *fs_info,
+						 u64 logical);
+int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
+void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
+					    struct btrfs_ordered_stripe *stripe);
+
+static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
+						 u64 map_type)
+{
+	u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
+	u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return false;
+
+	if (type != BTRFS_BLOCK_GROUP_DATA)
+		return false;
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
+		return true;
+
+	return false;
+}
+
+static inline void btrfs_drop_ordered_stripe(struct btrfs_fs_info *fs_info,
+					     u64 logical)
+{
+	struct btrfs_ordered_stripe *stripe;
+
+	if (!btrfs_stripe_tree_root(fs_info))
+		return;
+
+	stripe = btrfs_lookup_ordered_stripe(fs_info, logical);
+	if (!stripe)
+		return;
+	/* once for us */
+	btrfs_put_ordered_stripe(fs_info, stripe);
+	/* once for the tree */
+	btrfs_put_ordered_stripe(fs_info, stripe);
+}
+#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f936e80a9b15..78b721251a09 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5872,6 +5872,7 @@ static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
 }
 
 static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
+						       u64 logical,
 						       int total_stripes,
 						       int real_stripes)
 {
@@ -5895,6 +5896,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 	refcount_set(&bioc->refs, 1);
 
 	bioc->fs_info = fs_info;
+	bioc->logical = logical;
 	bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
 	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
 
@@ -6500,7 +6502,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		goto out;
 	}
 
-	bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes, tgtdev_indexes);
+	bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes,
+				      tgtdev_indexes);
 	if (!bioc) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7e51f2238f72..5d7547b5fa87 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -368,12 +368,10 @@ struct btrfs_fs_devices {
 
 struct btrfs_io_stripe {
 	struct btrfs_device *dev;
-	union {
-		/* Block mapping */
-		u64 physical;
-		/* For the endio handler */
-		struct btrfs_io_context *bioc;
-	};
+	/* Block mapping */
+	u64 physical;
+	/* For the endio handler */
+	struct btrfs_io_context *bioc;
 };
 
 struct btrfs_discard_stripe {
@@ -409,6 +407,8 @@ struct btrfs_io_context {
 	int mirror_num;
 	int num_tgtdevs;
 	int *tgtdev_map;
+	u64 logical;
+	u64 size;
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0b769dc8bcda..e5a083a9fd0f 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1682,6 +1682,10 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 	u64 *logical = NULL;
 	int nr, stripe_len;
 
+	/* Filesystems with a stripe tree have their own l2p mapping */
+	if (btrfs_stripe_tree_root(fs_info))
+		return;
+
 	/* Zoned devices should not have partitions. So, we can assume it is 0 */
 	ASSERT(!bdev_is_partition(ordered->bdev));
 	if (WARN_ON(!ordered->bdev))
-- 
2.38.1



* [PATCH v4 4/9] btrfs: delete stripe extent on extent deletion
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2022-12-07 14:22 ` [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 5/9] btrfs: lookup physical address from stripe extent Johannes Thumshirn
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

As each stripe extent is tied to an extent item, delete the stripe extent
once the corresponding extent item is deleted.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent-tree.c      |  8 ++++++++
 fs/btrfs/raid-stripe-tree.c | 31 +++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  2 ++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de479af062fd..3840ff52dcbc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3225,6 +3225,14 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 
+		if (is_data) {
+			ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
+			if (ret) {
+				btrfs_abort_transaction(trans, ret);
+				return ret;
+			}
+		}
+
 		ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
 				      num_to_del);
 		if (ret) {
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index aa60f784fb1f..4f516f71c4cf 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -115,6 +115,37 @@ void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 	write_unlock(&fs_info->stripe_update_lock);
 }
 
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+			     u64 length)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+	struct btrfs_path *path;
+	struct btrfs_key stripe_key;
+	int ret;
+
+	if (!stripe_root)
+		return 0;
+
+	stripe_key.objectid = start;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(trans, stripe_root, &stripe_key, path, -1, 1);
+	if (ret < 0)
+		goto out;
+
+	ret = btrfs_del_item(trans, stripe_root, path);
+out:
+	btrfs_free_path(path);
+	return ret;
+
+}
+
 int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
 					  u64 start, u64 len)
 {
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 807d9123270c..2b1e3fd9029e 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -21,6 +21,8 @@ struct btrfs_ordered_stripe {
 	refcount_t ref;
 };
 
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+			     u64 length);
 int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_ordered_stripe *stripe);
 int btrfs_insert_preallocated_raid_stripe(struct btrfs_fs_info *fs_info,
-- 
2.38.1



* [PATCH v4 5/9] btrfs: lookup physical address from stripe extent
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2022-12-07 14:22 ` [PATCH v4 4/9] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 6/9] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

Look up the physical address in the raid stripe tree when a read is
attempted on a RAID volume formatted with the raid stripe tree.

If the requested logical address is not found in the stripe tree, it may
still be in the in-memory ordered stripe tree, so fall back to searching
the ordered stripe tree in this case.
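
As a rough userspace illustration of the translation described above (not
the kernel code): the sketch below models both trees as flat arrays and
consults the in-memory entries before the on-disk ones, which is the order
btrfs_get_raid_extent_offset() in this patch uses. All model_* names are
hypothetical; only the arithmetic (physical = stride physical + offset into
the extent, and trimming the length at the end of the found extent) is
taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened stand-in for one stride of a stripe extent. */
struct model_stride {
	uint64_t logical;	/* start of the logical range (key objectid) */
	uint64_t num_bytes;	/* length of the range (key offset) */
	uint64_t devid;		/* device holding this copy */
	uint64_t physical;	/* physical start of the range on devid */
};

/*
 * Find the entry covering @logical on @devid. On success, store the
 * translated address in @physical and trim @length so that the I/O does
 * not run past the end of the found extent.
 */
static bool model_lookup(const struct model_stride *tbl, size_t nr,
			 uint64_t devid, uint64_t logical,
			 uint64_t *length, uint64_t *physical)
{
	assert(*length > 0);

	for (size_t i = 0; i < nr; i++) {
		const struct model_stride *s = &tbl[i];
		uint64_t end = logical + *length;
		uint64_t found_end = s->logical + s->num_bytes;

		if (s->devid != devid)
			continue;
		if (logical < s->logical || logical >= found_end)
			continue;

		/* Logically contiguous but physically split: shorten the I/O. */
		if (end > found_end)
			*length -= end - found_end;
		*physical = s->physical + (logical - s->logical);
		return true;
	}
	return false;
}

/* Try the in-memory (ordered) entries first, then the on-disk ones. */
static int model_get_raid_extent_offset(const struct model_stride *ordered,
					size_t nr_ordered,
					const struct model_stride *on_disk,
					size_t nr_on_disk,
					uint64_t devid, uint64_t logical,
					uint64_t *length, uint64_t *physical)
{
	if (model_lookup(ordered, nr_ordered, devid, logical, length, physical))
		return 0;
	if (model_lookup(on_disk, nr_on_disk, devid, logical, length, physical))
		return 0;
	return -ENOENT;
}
```

With the first item from the cover letter's dump-tree output (logical
939524096, length 126976, devid 2 at 536870912), a 200000 byte read
starting 4096 bytes into the extent maps to physical 536875008 and has its
length trimmed to 122880.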

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c | 143 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |   3 +
 fs/btrfs/volumes.c          |  31 ++++++--
 3 files changed, 170 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 4f516f71c4cf..c57dfe9f5c86 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -224,3 +224,146 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
 
 	return ret;
 }
+
+static bool btrfs_physical_from_ordered_stripe(struct btrfs_fs_info *fs_info,
+					      u64 logical, u64 *length,
+					      int num_stripes,
+					      struct btrfs_io_stripe *stripe)
+{
+	struct btrfs_ordered_stripe *os;
+	u64 offset;
+	u64 found_end;
+	u64 end;
+	int i;
+
+	os = btrfs_lookup_ordered_stripe(fs_info, logical);
+	if (!os)
+		return false;
+
+	end = logical + *length;
+	found_end = os->logical + os->num_bytes;
+	if (end > found_end)
+		*length -= end - found_end;
+
+	for (i = 0; i < num_stripes; i++) {
+		if (os->stripes[i].dev != stripe->dev)
+			continue;
+
+		offset = logical - os->logical;
+		ASSERT(offset >= 0);
+		stripe->physical = os->stripes[i].physical + offset;
+		btrfs_put_ordered_stripe(fs_info, os);
+		break;
+	}
+
+	return true;
+}
+
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 *length, u64 map_type,
+				 struct btrfs_io_stripe *stripe)
+{
+	struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+	int num_stripes = btrfs_bg_type_to_factor(map_type);
+	struct btrfs_stripe_extent *stripe_extent;
+	struct btrfs_key stripe_key;
+	struct btrfs_key found_key;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	u64 offset;
+	u64 found_logical;
+	u64 found_length;
+	u64 end;
+	u64 found_end;
+	int slot;
+	int ret;
+	int i;
+
+	/*
+	 * If we still have the stripe in the ordered stripe tree get it from
+	 * there
+	 */
+	if (btrfs_physical_from_ordered_stripe(fs_info, logical, length,
+					       num_stripes, stripe))
+		return 0;
+
+	stripe_key.objectid = logical;
+	stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+	stripe_key.offset = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	if (ret) {
+		if (path->slots[0] != 0)
+			path->slots[0]--;
+	}
+
+	end = logical + *length;
+
+	while (1) {
+		leaf = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(leaf, &found_key, slot);
+		found_logical = found_key.objectid;
+		found_length = found_key.offset;
+
+		if (found_logical > end)
+			break;
+
+		if (!in_range(logical, found_logical, found_length))
+			goto next;
+
+		offset = logical - found_logical;
+		found_end = found_logical + found_length;
+
+		/*
+		 * If we have a logically contiguous, but physically
+		 * noncontinuous range, we need to split the bio. Record the
+		 * length after which we must split the bio.
+		 */
+		if (end > found_end)
+			*length -= end - found_end;
+
+		stripe_extent =
+			btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+		for (i = 0; i < num_stripes; i++) {
+			if (btrfs_raid_stride_devid_nr(leaf,
+				       stripe_extent, i) != stripe->dev->devid)
+				continue;
+			stripe->physical = btrfs_raid_stride_physical_nr(leaf,
+						   stripe_extent, i) + offset;
+			ret = 0;
+			goto out;
+		}
+
+		/*
+		 * If we're here, we haven't found the requested devid in the
+		 * stripe.
+		 */
+		ret = -ENOENT;
+		goto out;
+next:
+		ret = btrfs_next_item(stripe_root, path);
+		if (ret)
+			break;
+	}
+
+out:
+	if (ret > 0)
+		ret = -ENOENT;
+	if (ret && ret != -EIO) {
+		btrfs_err(fs_info,
+			  "cannot find raid-stripe for logical [%llu, %llu]",
+			  logical, logical + *length);
+		btrfs_print_tree(leaf, 1);
+	}
+	btrfs_free_path(path);
+
+	return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 2b1e3fd9029e..d227299e8865 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -21,6 +21,9 @@ struct btrfs_ordered_stripe {
 	refcount_t ref;
 };
 
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 *length, u64 map_type,
+				 struct btrfs_io_stripe *stripe);
 int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
 			     u64 length);
 int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 78b721251a09..be4f5075214c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -35,6 +35,7 @@
 #include "relocation.h"
 #include "scrub.h"
 #include "super.h"
+#include "raid-stripe-tree.h"
 
 #define BTRFS_BLOCK_GROUP_STRIPE_MASK	(BTRFS_BLOCK_GROUP_RAID0 | \
 					 BTRFS_BLOCK_GROUP_RAID10 | \
@@ -6298,12 +6299,21 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
 	return U64_MAX;
 }
 
-static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
-		          u32 stripe_index, u64 stripe_offset, u64 stripe_nr)
+static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
+		      u64 logical, u64 *length, struct btrfs_io_stripe *dst,
+		      struct map_lookup *map, u32 stripe_index,
+		      u64 stripe_offset, u64 stripe_nr)
 {
 	dst->dev = map->stripes[stripe_index].dev;
+
+	if (op == BTRFS_MAP_READ &&
+	    btrfs_need_stripe_tree_update(fs_info, map->type))
+		return btrfs_get_raid_extent_offset(fs_info, logical, length,
+						    map->type, dst);
+
 	dst->physical = map->stripes[stripe_index].physical +
 			stripe_offset + stripe_nr * map->stripe_len;
+	return 0;
 }
 
 int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
@@ -6492,13 +6502,14 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 			smap->dev = dev_replace->tgtdev;
 			smap->physical = physical_to_patch_in_first_stripe;
 			*mirror_num_ret = map->num_stripes + 1;
+			ret = 0;
 		} else {
-			set_io_stripe(smap, map, stripe_index, stripe_offset,
-				      stripe_nr);
 			*mirror_num_ret = mirror_num;
+			ret = set_io_stripe(fs_info, op, logical, length, smap,
+					    map, stripe_index, stripe_offset,
+					    stripe_nr);
 		}
 		*bioc_ret = NULL;
-		ret = 0;
 		goto out;
 	}
 
@@ -6510,8 +6521,14 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	}
 
 	for (i = 0; i < num_stripes; i++) {
-		set_io_stripe(&bioc->stripes[i], map, stripe_index, stripe_offset,
-			      stripe_nr);
+		ret = set_io_stripe(fs_info, op, logical, length,
+				 &bioc->stripes[i], map, stripe_index,
+				 stripe_offset, stripe_nr);
+		if (ret) {
+			btrfs_put_bioc(bioc);
+			goto out;
+		}
+
 		stripe_index++;
 	}
 
-- 
2.38.1



* [PATCH v4 6/9] btrfs: add raid stripe tree pretty printer
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2022-12-07 14:22 ` [PATCH v4 5/9] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 7/9] btrfs: zoned: allow zoned RAID Johannes Thumshirn
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

Decode raid-stripe-tree entries on btrfs_print_tree().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/print-tree.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index b93c96213304..b70e0bfe8c84 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -9,6 +9,7 @@
 #include "print-tree.h"
 #include "accessors.h"
 #include "tree-checker.h"
+#include "raid-stripe-tree.h"
 
 struct root_name_map {
 	u64 id;
@@ -28,6 +29,7 @@ static const struct root_name_map root_map[] = {
 	{ BTRFS_FREE_SPACE_TREE_OBJECTID,	"FREE_SPACE_TREE"	},
 	{ BTRFS_BLOCK_GROUP_TREE_OBJECTID,	"BLOCK_GROUP_TREE"	},
 	{ BTRFS_DATA_RELOC_TREE_OBJECTID,	"DATA_RELOC_TREE"	},
+	{ BTRFS_RAID_STRIPE_TREE_OBJECTID,	"RAID_STRIPE_TREE"	},
 };
 
 const char *btrfs_root_name(const struct btrfs_key *key, char *buf)
@@ -187,6 +189,20 @@ static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
 	}
 }
 
+static void print_raid_stripe_key(struct extent_buffer *eb, u32 item_size,
+				  struct btrfs_stripe_extent *stripe)
+{
+	int num_stripes;
+	int i;
+
+	num_stripes = item_size / sizeof(struct btrfs_raid_stride);
+
+	for (i = 0; i < num_stripes; i++)
+		pr_info("\t\t\tstride %d devid %llu physical %llu\n", i,
+			btrfs_raid_stride_devid_nr(eb, stripe, i),
+			btrfs_raid_stride_physical_nr(eb, stripe, i));
+}
+
 /*
  * Helper to output refs and locking status of extent buffer.  Useful to debug
  * race condition related problems.
@@ -351,6 +367,11 @@ void btrfs_print_leaf(struct extent_buffer *l)
 			print_uuid_item(l, btrfs_item_ptr_offset(l, i),
 					btrfs_item_size(l, i));
 			break;
+		case BTRFS_RAID_STRIPE_KEY:
+			print_raid_stripe_key(l, btrfs_item_size(l, i),
+					      btrfs_item_ptr(l, i,
+							     struct btrfs_stripe_extent));
+			break;
 		}
 	}
 }
-- 
2.38.1



* [PATCH v4 7/9] btrfs: zoned: allow zoned RAID
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2022-12-07 14:22 ` [PATCH v4 6/9] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 8/9] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 9/9] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for
data block-groups. For meta-data block-groups, we don't actually need
anything special, as all meta-data I/O is protected by the
btrfs_zoned_meta_io_lock() already.
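
For illustration only, the gate that this patch extends can be modelled in
userspace as below. The MODEL_BG_* constants are hypothetical local names;
their bit values are assumed to mirror the BTRFS_BLOCK_GROUP_* flags in the
uapi header and should be checked against it. The boolean parameter stands
in for btrfs_stripe_tree_root() returning non-NULL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the BTRFS_BLOCK_GROUP_* bit layout. */
#define MODEL_BG_DATA		(1ULL << 0)
#define MODEL_BG_SYSTEM		(1ULL << 1)
#define MODEL_BG_METADATA	(1ULL << 2)
#define MODEL_BG_RAID0		(1ULL << 3)
#define MODEL_BG_RAID1		(1ULL << 4)
#define MODEL_BG_DUP		(1ULL << 5)
#define MODEL_BG_RAID10		(1ULL << 6)
#define MODEL_BG_RAID5		(1ULL << 7)
#define MODEL_BG_RAID6		(1ULL << 8)
#define MODEL_BG_RAID1C3	(1ULL << 9)
#define MODEL_BG_RAID1C4	(1ULL << 10)

#define MODEL_BG_TYPE_MASK \
	(MODEL_BG_DATA | MODEL_BG_SYSTEM | MODEL_BG_METADATA)
#define MODEL_BG_RAID1_MASK \
	(MODEL_BG_RAID1 | MODEL_BG_RAID1C3 | MODEL_BG_RAID1C4)

/*
 * Mirror of the checks in btrfs_need_stripe_tree_update() after this
 * patch: a stripe tree must exist, the block group must be data-only,
 * and the profile must be one of RAID1*, RAID0 or RAID10.
 */
static bool model_need_stripe_tree_update(bool has_stripe_root,
					  uint64_t map_type)
{
	uint64_t type = map_type & MODEL_BG_TYPE_MASK;

	if (!has_stripe_root)
		return false;
	if (type != MODEL_BG_DATA)
		return false;

	return map_type &
	       (MODEL_BG_RAID1_MASK | MODEL_BG_RAID0 | MODEL_BG_RAID10);
}
```

Under these assumptions a data RAID10 block group takes the stripe tree
path, while metadata RAID1 and data RAID5 block groups do not.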

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.h |  6 ++++++
 fs/btrfs/volumes.c          |  2 ++
 fs/btrfs/zoned.c            | 39 +++++++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index d227299e8865..73167c775f66 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -52,6 +52,12 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
 	if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
 		return true;
 
+	if (profile & BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID10)
+		return true;
+
 	return false;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index be4f5075214c..385dcf8b2cc4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6495,6 +6495,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	 * I/O context structure.
 	 */
 	if (smap && num_alloc_stripes == 1 &&
+	    !(btrfs_need_stripe_tree_update(fs_info, map->type) &&
+	      op != BTRFS_MAP_READ) &&
 	    !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1) &&
 	    (!need_full_stripe(op) || !dev_replace_is_ongoing ||
 	     !dev_replace->tgtdev)) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index e5a083a9fd0f..d05b1180580d 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1510,8 +1510,47 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		cache->zone_capacity = min(caps[0], caps[1]);
 		break;
 	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 	case BTRFS_BLOCK_GROUP_RAID0:
 	case BTRFS_BLOCK_GROUP_RAID10:
+		if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+		    !btrfs_stripe_tree_root(fs_info)) {
+			btrfs_err(fs_info,
+				  "zoned: data RAID1 needs stripe_root");
+			ret = -EIO;
+			goto out;
+
+		}
+
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV ||
+			    alloc_offsets[i] == WP_CONVENTIONAL)
+				continue;
+
+			if (i == 0)
+				continue;
+
+			if (alloc_offsets[0] != alloc_offsets[i]) {
+				btrfs_err(fs_info,
+					  "zoned: write pointer offset mismatch of zones in RAID profile");
+				ret = -EIO;
+				goto out;
+			}
+			if (test_bit(0, active) != test_bit(i, active)) {
+				if (!btrfs_zone_activate(cache)) {
+					ret = -EIO;
+					goto out;
+				}
+			} else {
+				if (test_bit(0, active))
+					set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+						&cache->runtime_flags);
+			}
+			cache->zone_capacity = min(caps[0], caps[i]);
+		}
+		cache->alloc_offset = alloc_offsets[0];
+		break;
 	case BTRFS_BLOCK_GROUP_RAID5:
 	case BTRFS_BLOCK_GROUP_RAID6:
 		/* non-single profiles are not supported yet */
-- 
2.38.1



* [PATCH v4 8/9] btrfs: check for leaks of ordered stripes on umount
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2022-12-07 14:22 ` [PATCH v4 7/9] btrfs: zoned: allow zoned RAID Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  2022-12-07 14:22 ` [PATCH v4 9/9] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

Check if we're leaking any ordered stripes when unmounting a filesystem
with a stripe tree.

This check is gated behind CONFIG_BTRFS_DEBUG so that it does not affect
production systems.
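
To make the notion of a leak concrete, here is a small userspace model (a
sketch, not the kernel implementation). It assumes, following the "once for
us" / "once for the tree" comments in btrfs_drop_ordered_stripe(), that the
tree holds one reference per entry and every lookup takes another. The
fixed-size array and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_STRIPES 8

struct model_stripe {
	uint64_t logical;
	int ref;		/* 0 means the slot is unused */
};

struct model_tree {
	struct model_stripe slots[MODEL_MAX_STRIPES];
};

/* Insert an entry; the tree itself holds the initial reference. */
static int model_add(struct model_tree *t, uint64_t logical)
{
	for (size_t i = 0; i < MODEL_MAX_STRIPES; i++) {
		if (t->slots[i].ref == 0) {
			t->slots[i].logical = logical;
			t->slots[i].ref = 1;
			return 0;
		}
	}
	return -1;
}

/* A successful lookup takes an extra reference for the caller. */
static struct model_stripe *model_lookup(struct model_tree *t, uint64_t logical)
{
	for (size_t i = 0; i < MODEL_MAX_STRIPES; i++) {
		struct model_stripe *s = &t->slots[i];

		if (s->ref > 0 && s->logical == logical) {
			s->ref++;
			return s;
		}
	}
	return NULL;
}

/* Drop one reference; the slot becomes free when the count hits zero. */
static void model_put(struct model_stripe *s)
{
	assert(s->ref > 0);
	s->ref--;
}

/* Counterpart of btrfs_drop_ordered_stripe(): lookup, then put twice. */
static void model_drop(struct model_tree *t, uint64_t logical)
{
	struct model_stripe *s = model_lookup(t, logical);

	if (!s)
		return;
	model_put(s);	/* once for us */
	model_put(s);	/* once for the tree */
}

/* Count entries still referenced at "umount" and release them. */
static size_t model_check_leak(struct model_tree *t)
{
	size_t leaked = 0;

	for (size_t i = 0; i < MODEL_MAX_STRIPES; i++) {
		struct model_stripe *s = &t->slots[i];

		if (s->ref == 0)
			continue;
		leaked++;
		while (s->ref > 0)
			model_put(s);
	}
	return leaked;
}
```

In this model, an entry that was looked up but never put, or never dropped
after its stripe extent was written, is what the check reports as leaked.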

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c          |  2 ++
 fs/btrfs/raid-stripe-tree.c | 30 ++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bdef4e2e4ea3..85c24d8b3bc8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -52,6 +52,7 @@
 #include "relocation.h"
 #include "scrub.h"
 #include "super.h"
+#include "raid-stripe-tree.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -1528,6 +1529,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->stripe_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
+	btrfs_check_ordered_stripe_leak(fs_info);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	kfree(fs_info->subpage_info);
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index c57dfe9f5c86..ed13b30001e0 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -36,6 +36,36 @@ static int ordered_stripe_less(struct rb_node *rba, const struct rb_node *rbb)
 	return ordered_stripe_cmp(&stripe->logical, rbb);
 }
 
+void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info)
+{
+#ifdef CONFIG_BTRFS_DEBUG
+	struct rb_node *node;
+
+	if (!btrfs_stripe_tree_root(fs_info) ||
+	    RB_EMPTY_ROOT(&fs_info->stripe_update_tree))
+		return;
+
+	WARN_ON_ONCE(1);
+	write_lock(&fs_info->stripe_update_lock);
+	while ((node = rb_first_postorder(&fs_info->stripe_update_tree))
+	       != NULL) {
+		struct btrfs_ordered_stripe *stripe =
+			rb_entry(node, struct btrfs_ordered_stripe, rb_node);
+
+		write_unlock(&fs_info->stripe_update_lock);
+		btrfs_err(fs_info,
+			  "ordered_stripe [%llu, %llu] leaked, refcount=%d",
+			  stripe->logical, stripe->logical + stripe->num_bytes,
+			  refcount_read(&stripe->ref));
+		while (refcount_read(&stripe->ref) > 1)
+			btrfs_put_ordered_stripe(fs_info, stripe);
+		btrfs_put_ordered_stripe(fs_info, stripe);
+		write_lock(&fs_info->stripe_update_lock);
+	}
+	write_unlock(&fs_info->stripe_update_lock);
+#endif
+}
+
 int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
 {
 	struct btrfs_fs_info *fs_info = bioc->fs_info;
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 73167c775f66..cf07f9c8bb9f 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -36,6 +36,7 @@ struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(
 int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc);
 void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 					    struct btrfs_ordered_stripe *stripe);
+void btrfs_check_ordered_stripe_leak(struct btrfs_fs_info *fs_info);
 
 static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
 						 u64 map_type)
-- 
2.38.1



* [PATCH v4 9/9] btrfs: add tracepoints for ordered stripes
  2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2022-12-07 14:22 ` [PATCH v4 8/9] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
@ 2022-12-07 14:22 ` Johannes Thumshirn
  8 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-07 14:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Josef Bacik, Naohiro Aota

Add tracepoints to check the lifetime of btrfs_ordered_stripe entries.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c  |  3 +++
 fs/btrfs/super.c             |  1 +
 include/trace/events/btrfs.h | 50 ++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index ed13b30001e0..b5ef3daec295 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -106,6 +106,7 @@ int btrfs_add_ordered_stripe(struct btrfs_io_context *bioc)
 		return -EINVAL;
 	}
 
+	trace_btrfs_ordered_stripe_add(fs_info, stripe);
 	return 0;
 }
 
@@ -121,6 +122,7 @@ struct btrfs_ordered_stripe *btrfs_lookup_ordered_stripe(struct btrfs_fs_info *f
 	if (node) {
 		stripe = rb_entry(node, struct btrfs_ordered_stripe, rb_node);
 		refcount_inc(&stripe->ref);
+		trace_btrfs_ordered_stripe_lookup(fs_info, stripe);
 	}
 	read_unlock(&fs_info->stripe_update_lock);
 
@@ -131,6 +133,7 @@ void btrfs_put_ordered_stripe(struct btrfs_fs_info *fs_info,
 				 struct btrfs_ordered_stripe *stripe)
 {
 	write_lock(&fs_info->stripe_update_lock);
+	trace_btrfs_ordered_stripe_put(fs_info, stripe);
 	if (refcount_dec_and_test(&stripe->ref)) {
 		struct rb_node *node = &stripe->rb_node;
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 93f52ee85f6f..532852105668 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -58,6 +58,7 @@
 #include "scrub.h"
 #include "verity.h"
 #include "super.h"
+#include "raid-stripe-tree.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0bce0b4ff2fa..93b0fbc69413 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -32,6 +32,7 @@ struct prelim_ref;
 struct btrfs_space_info;
 struct btrfs_raid_bio;
 struct raid56_bio_trace_info;
+struct btrfs_ordered_stripe;
 
 #define show_ref_type(type)						\
 	__print_symbolic(type,						\
@@ -2411,6 +2412,55 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_scrub_read_recover,
 	TP_ARGS(rbio, bio, trace_info)
 );
 
+DECLARE_EVENT_CLASS(btrfs__ordered_stripe,
+
+	TP_PROTO(const struct btrfs_fs_info *fs_info,
+		 const struct btrfs_ordered_stripe *stripe),
+
+	TP_ARGS(fs_info, stripe),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	logical		)
+		__field(	u64,	num_bytes	)
+		__field(	int,	num_stripes	)
+		__field(	int,	ref		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->logical	= stripe->logical;
+		__entry->num_bytes	= stripe->num_bytes;
+		__entry->num_stripes	= stripe->num_stripes;
+		__entry->ref		= refcount_read(&stripe->ref);
+	),
+
+	TP_printk_btrfs("logical=%llu, num_bytes=%llu, num_stripes=%d, ref=%d",
+			__entry->logical, __entry->num_bytes,
+			__entry->num_stripes, __entry->ref)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_add,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_lookup,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
+
+DEFINE_EVENT(btrfs__ordered_stripe, btrfs_ordered_stripe_put,
+
+	     TP_PROTO(const struct btrfs_fs_info *fs_info,
+		      const struct btrfs_ordered_stripe *stripe),
+
+	     TP_ARGS(fs_info, stripe)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.38.1



* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-07 14:22 ` [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2022-12-12  7:22   ` Christoph Hellwig
  2022-12-13  8:15     ` Johannes Thumshirn
  2022-12-12 19:27   ` Josef Bacik
  2022-12-13 16:14   ` Josef Bacik
  2 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2022-12-12  7:22 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs, Josef Bacik, Naohiro Aota

On Wed, Dec 07, 2022 at 06:22:12AM -0800, Johannes Thumshirn wrote:
> @@ -372,6 +388,15 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>  	else
>  		bio->bi_status = BLK_STS_OK;
>  
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
> +		schedule_work(&bbio->raid_stripe_work);

This needs to be on a specific workqueue (or maybe multiple if/when
metadata and freespace inodes are supported).  Note that end_io_work
isn't currently used for writes, so you can reuse it here.

Also note that I do have a patchset that defers all writes to a workqueue
here instead of doing separate deferrals just for ordered extents or
direct I/O, so maybe we can eventually skip the separate workqueue for
the stripe tree entirely.

>  	bio->bi_private = &bioc->stripes[dev_nr];
>  	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
>  	bioc->stripes[dev_nr].bioc = bioc;
> +	bioc->size = bio->bi_iter.bi_size;

So we could just use the saved_iter for the main bbio for the size and
logical if we also set that for writes.

Also right now the ordered extent is split for each actually submitted
bio for zone append writes, so you could just use that as well.  That
being said, I think we could actually stop doing that split with the
stripe tree, as there is no update of the L2P mapping in the chunk tree
any more.

Can someone more familiar with the btrfs internals chime in here if
there might be a way to in fact stop updating the chunk tree entirely,
as that should help reduce the write amp a bit?

> +static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
> +					    struct btrfs_delayed_ref_node *node)
> +{
> +	em = btrfs_get_chunk_map(fs_info, node->bytenr, node->num_bytes);
> +	if (!em) {
> +		btrfs_err(fs_info,
> +			  "cannot get chunk map for address %llu",
> +			  node->bytenr);
> +		return -EINVAL;
> +	}
> +
> +	map = em->map_lookup;
> +
> +	if (btrfs_need_stripe_tree_update(fs_info, map->type)) {

It seems like the chunk_map lookup is only needed to figure out if
the chunk needs a stripe tree update, which seems rather inefficient.
Can't we find some way to stash away that bit from the submission path
instead of rediscovering it here?


* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-07 14:22 ` [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
  2022-12-12  7:22   ` Christoph Hellwig
@ 2022-12-12 19:27   ` Josef Bacik
  2022-12-13  8:17     ` Johannes Thumshirn
  2022-12-13 16:14   ` Josef Bacik
  2 siblings, 1 reply; 20+ messages in thread
From: Josef Bacik @ 2022-12-12 19:27 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs, Naohiro Aota

On Wed, Dec 07, 2022 at 06:22:12AM -0800, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
> 
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
> 
> This is done by creating in-memory ordered stripe extents, just like the
> in memory ordered extents, on I/O completion and the on-disk raid stripe
> extents get created once we're running the delayed_refs for the extent
> item this stripe extent is tied to.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/Makefile           |   3 +-
>  fs/btrfs/bio.c              |  30 +++++-
>  fs/btrfs/bio.h              |   2 +
>  fs/btrfs/delayed-ref.c      |   5 +-
>  fs/btrfs/disk-io.c          |   3 +
>  fs/btrfs/extent-tree.c      |  49 +++++++++
>  fs/btrfs/fs.h               |   3 +
>  fs/btrfs/inode.c            |   6 ++
>  fs/btrfs/raid-stripe-tree.c | 195 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  69 +++++++++++++
>  fs/btrfs/volumes.c          |   5 +-
>  fs/btrfs/volumes.h          |  12 +--
>  fs/btrfs/zoned.c            |   4 +
>  13 files changed, 376 insertions(+), 10 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 555c962fdad6..63236ae2a87b 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -31,7 +31,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> -	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o
> +	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> +	   raid-stripe-tree.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 4ccbc120e869..b60a50165703 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -15,6 +15,7 @@
>  #include "rcu-string.h"
>  #include "zoned.h"
>  #include "file-item.h"
> +#include "raid-stripe-tree.h"
>  
>  static struct bio_set btrfs_bioset;
>  static struct bio_set btrfs_clone_bioset;
> @@ -350,6 +351,21 @@ static void btrfs_raid56_end_io(struct bio *bio)
>  	btrfs_put_bioc(bioc);
>  }
>  
> +static void btrfs_raid_stripe_update(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio =
> +		container_of(work, struct btrfs_bio, raid_stripe_work);
> +	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
> +	struct btrfs_io_context *bioc = stripe->bioc;
> +	int ret;
> +
> +	ret = btrfs_add_ordered_stripe(bioc);
> +	if (ret)
> +		bbio->bio.bi_status = errno_to_blk_status(ret);
> +	btrfs_orig_bbio_end_io(bbio);
> +	btrfs_put_bioc(bioc);
> +}
> +
>  static void btrfs_orig_write_end_io(struct bio *bio)
>  {
>  	struct btrfs_io_stripe *stripe = bio->bi_private;
> @@ -372,6 +388,15 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>  	else
>  		bio->bi_status = BLK_STS_OK;
>  
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
> +		schedule_work(&bbio->raid_stripe_work);
> +		return;
> +	}
> +
>  	btrfs_orig_bbio_end_io(bbio);
>  	btrfs_put_bioc(bioc);
>  }
> @@ -383,7 +408,9 @@ static void btrfs_clone_write_end_io(struct bio *bio)
>  	if (bio->bi_status) {
>  		atomic_inc(&stripe->bioc->error);
>  		btrfs_log_dev_io_error(bio, stripe->dev);
> -	}
> +	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> + 	}

Whitespace error here.  Thanks,

Josef


* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-12  7:22   ` Christoph Hellwig
@ 2022-12-13  8:15     ` Johannes Thumshirn
  2022-12-13  8:36       ` hch
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-13  8:15 UTC (permalink / raw)
  To: hch; +Cc: linux-btrfs, Josef Bacik, Naohiro Aota

On 12.12.22 08:22, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 06:22:12AM -0800, Johannes Thumshirn wrote:
>> @@ -372,6 +388,15 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>>  	else
>>  		bio->bi_status = BLK_STS_OK;
>>  
>> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
>> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> +
>> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
>> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
>> +		schedule_work(&bbio->raid_stripe_work);
> 
> This needs to be on a specific workqueue (or maybe multiple if/when
> metadata and freespace inodes are supported).  Note that end_io_work
> isn't currently used for writes, so you can reuse it here.

I'm not sure I understand your comment. It is on a specific workqueue,
namely 'raid_stripe_work'.

> 
> Also note that I do have a patchset that defers all writes to a workqueue
> here instead of doing separate deferrals just for ordered extents or
> direct I/O, so maybe we can eventually skip the separate workqueue for
> the stripe tree entirely.
> 
>>  	bio->bi_private = &bioc->stripes[dev_nr];
>>  	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
>>  	bioc->stripes[dev_nr].bioc = bioc;
>> +	bioc->size = bio->bi_iter.bi_size;
> 
> So we could just use the saved_iter for the main bbio for the size and
> logical if we also set that for writes.

That could indeed be done, saving off some bytes in btrfs_io_context.

> Also right now the ordered extent is split for each actually submitted
> bio for zone append writes, so you could just use that as well.  That
> being said, I think we could actually stop doing that split with the
> stripe tree, as there is no update of the L2P mapping in the chunk tree
> any more.
> 
> Can someone more familiar with the btrfs internals chime in here if
> there might be a way to in fact stop updating the chunk tree entirely
> as that should help reduce the write amp a bit?
> 
>> +static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
>> +					    struct btrfs_delayed_ref_node *node)
>> +{
>> +	em = btrfs_get_chunk_map(fs_info, node->bytenr, node->num_bytes);
>> +	if (!em) {
>> +		btrfs_err(fs_info,
>> +			  "cannot get chunk map for address %llu",
>> +			  node->bytenr);
>> +		return -EINVAL;
>> +	}
>> +
>> +	map = em->map_lookup;
>> +
>> +	if (btrfs_need_stripe_tree_update(fs_info, map->type)) {
> 
> It seems like the chunk_map lookup is only needed to figure out if
> the chunk needs a stripe tree update, which seems rather inefficient.
> Can't we find some way to stash away that bit from the submission path
> instead of rediscovering it here?
> 

True, but I've not yet found a good solution to this problem TBH. I could
do a call to btrfs_need_stripe_tree_update() at the start of a transaction
and then cache the value. That's the only way I can think of atm.


* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-12 19:27   ` Josef Bacik
@ 2022-12-13  8:17     ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-13  8:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Naohiro Aota

On 12.12.22 20:27, Josef Bacik wrote:
> 
> Whitespace error here.  Thanks,

Whoopsie, fixed.



* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-13  8:15     ` Johannes Thumshirn
@ 2022-12-13  8:36       ` hch
  2022-12-13  8:47         ` Johannes Thumshirn
  0 siblings, 1 reply; 20+ messages in thread
From: hch @ 2022-12-13  8:36 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: hch, linux-btrfs, Josef Bacik, Naohiro Aota

On Tue, Dec 13, 2022 at 08:15:43AM +0000, Johannes Thumshirn wrote:
> >> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
> >> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
> >> +		schedule_work(&bbio->raid_stripe_work);
> > 
> > This needs to be on a specific workqueue (or maybe multiple if/when
> > metadata and freespace inodes are supported).  Note that end_io_work
> > isn't currently used for writes, so you can reuse it here.
> 
> I'm not sure I understand your comment. It is on a specific workqueue,
> namely 'raid_stripe_work'.

No.  It is a work_struct scheduled on system_wq:

static inline bool schedule_work(struct work_struct *work)
{
	return queue_work(system_wq, work);
}
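
To illustrate the difference, here is a minimal userspace model, not
kernel code.  All `*_model` types and functions are simplified stand-ins
invented for this sketch, and `stripe_wq_model` is a hypothetical
dedicated queue.  It only shows that the schedule_work() helper always
lands on the shared system queue, whereas queue_work() lets the caller
pick the queue:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel types; not the real API. */
struct work_struct_model;
typedef void (*work_func_model_t)(struct work_struct_model *work);

struct work_struct_model {
	work_func_model_t func;
	struct work_struct_model *next;
};

struct workqueue_model {
	const char *name;
	struct work_struct_model *head, *tail;
	size_t pending;
};

/* Shared queue that every schedule_work() caller ends up on. */
static struct workqueue_model system_wq_model = { .name = "system (model)" };
/* Hypothetical dedicated queue for stripe tree updates. */
static struct workqueue_model stripe_wq_model = { .name = "stripe (model)" };

static void init_work_model(struct work_struct_model *work,
			    work_func_model_t func)
{
	assert(func != NULL);
	work->func = func;
	work->next = NULL;
}

/* Append the work item to the tail of the queue chosen by the caller. */
static void queue_work_model(struct workqueue_model *wq,
			     struct work_struct_model *work)
{
	if (wq->tail)
		wq->tail->next = work;
	else
		wq->head = work;
	wq->tail = work;
	wq->pending++;
}

/* Mirrors the helper quoted above: the target queue is hard-wired. */
static void schedule_work_model(struct work_struct_model *work)
{
	queue_work_model(&system_wq_model, work);
}

/* Run everything pending on one queue; returns the number of items run. */
static size_t flush_workqueue_model(struct workqueue_model *wq)
{
	size_t ran = 0;

	while (wq->head) {
		struct work_struct_model *work = wq->head;

		wq->head = work->next;
		if (!wq->head)
			wq->tail = NULL;
		work->next = NULL;
		wq->pending--;
		work->func(work);
		ran++;
	}
	return ran;
}

static int stripe_updates_done_model;

static void stripe_update_model(struct work_struct_model *work)
{
	(void)work;
	stripe_updates_done_model++;
}
```

In this model, flushing `stripe_wq_model` runs only the stripe updates
and leaves whatever else sits on the shared queue untouched, which is
the kind of isolation a dedicated queue is meant to give.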

> > 
> > It seems like the chunk_map lookup is only needed to figure out if
> > the chunk needs a stripe tree update, which seems rather inefficient.
> > Can't we find some way to stash away that bit from the submission path
> > instead of rediscovering it here?
> > 
> 
> True, but I've not yet found a good solution to this problem TBH. I could
> do a call to btrfs_need_stripe_tree_update() at the start of a transaction
> and then cache the value. That's the only way I can think of atm.

So to me the btrfs_delayed_data_ref, or for better packing reasons,
btrfs_delayed_ref_node would seem like the more suitable structure as
that contains all the other relevant information.

But in general this makes me wonder if you are really inserting the
stripe tree at the right place in the btrfs code [caveat:  I'm still
a learner on a lot of the btrfs structure..], as the whole delayed_ref
mechanism seems to be about the file offset to logical mapping,
and not logical to physical.  What is the argument for not inserting
the stripe tree from btrfs_finish_ordered_io similar to the L2P
mapping rewrite for the "normal" zoned write completion?  Nothing
in the stripe deals with file-level offsets, so piggybacking it
on the file extent insert seems a bit odd.


* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-13  8:36       ` hch
@ 2022-12-13  8:47         ` Johannes Thumshirn
  2022-12-13  8:54           ` hch
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-13  8:47 UTC (permalink / raw)
  To: hch; +Cc: linux-btrfs, Josef Bacik, Naohiro Aota

On 13.12.22 09:37, hch@infradead.org wrote:
> On Tue, Dec 13, 2022 at 08:15:43AM +0000, Johannes Thumshirn wrote:
>>>> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
>>>> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
>>>> +		schedule_work(&bbio->raid_stripe_work);
>>>
>>> This needs to be on a specific workqueue (or maybe multiple if/when
>>> metadata and freespace inodes are supported).  Note that end_io_work
>>> isn't currently used for writes, so you can reuse it here.
>>
>> I'm not sure I understand your comment. It is on a specific workqueue,
>> namely 'raid_stripe_work'.
> 
> No.  It is a work_struct scheduled on system_wq:
> 
> static inline bool schedule_work(struct work_struct *work)
> {
> 	return queue_work(system_wq, work);
> }
> 

Right, of course!

>>>
>>> It seems like the chunk_map lookup is only needed to figure out if
>>> the chunk needs a stripe tree update, which seems rather inefficient.
>>> Can't we find some way to stash away that bit from the submission path
>>> instead of rediscovering it here?
>>>
>>
>> True, but I've not yet found a good solution to this problem TBH. I could
>> do a call to btrfs_need_stripe_tree_update() at the start of a transaction
>> and then cache the value. That's the only way I can think of atm.
> 
> So to me the btrfs_delayed_data_ref, or for better packing reasons,
> btrfs_delayed_ref_node would seem like the more suitable structure as
> that contains all the other relevant information.
> 
> But in general this makes me wonder if you are really inserting the
> stripe tree at the right place in the btrfs code [caveat:  I'm still
> a learner on a lot of the btrfs structure..], as the whole delayed_ref
> mechanism seems to be about the file offset to logical mapping,
> and not logical to physical.  What is the argument for not inserting
> the stripe tree from btrfs_finish_ordered_io similar to the L2P
> mapping rewrite for the "normal" zoned write completion?  Nothing
> in the stripe deals with file-level offsets, so piggybacking it
> on the file extent insert seems a bit odd.
> 

The reasoning behind this is possibly lower write amplification, as
we can exploit the merging of delayed_refs. See this hunk for example:

@@ -640,8 +641,10 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
        exist->ref_mod += mod;
 
        /* remove existing tail if its ref_mod is zero */
-       if (exist->ref_mod == 0)
+       if (exist->ref_mod == 0) {
+               btrfs_drop_ordered_stripe(trans->fs_info, exist->bytenr);
                drop_delayed_ref(trans, root, href, exist);
+       }
        spin_unlock(&href->lock);
        return ret;
 inserted:
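
As a simplified userspace model of that effect (all `*_model` names are
made up for this sketch, and tying a pending stripe to a positive
ref_mod is an assumption of the model, not how the patch tracks it): an
add and a drop for the same bytenr cancel out before anything is
written, so the pending stripe entry can be dropped along with the ref.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REFS_MODEL 16

struct ref_model {
	uint64_t bytenr;
	int ref_mod;
	int in_use;
	int has_stripe;	/* an in-memory ordered stripe is tied to this ref */
};

struct ref_head_model {
	struct ref_model refs[MAX_REFS_MODEL];
	size_t pending_stripes;	/* stripe entries still to be written */
};

static struct ref_model *find_ref_model(struct ref_head_model *head,
					uint64_t bytenr)
{
	for (size_t i = 0; i < MAX_REFS_MODEL; i++) {
		if (head->refs[i].in_use && head->refs[i].bytenr == bytenr)
			return &head->refs[i];
	}
	return NULL;
}

/* Returns 0 on success, -1 if the fixed-size table is full. */
static int insert_ref_model(struct ref_head_model *head, uint64_t bytenr,
			    int mod)
{
	struct ref_model *exist;

	assert(head != NULL);
	exist = find_ref_model(head, bytenr);
	if (exist) {
		/* Merge into the existing ref instead of queueing a new one. */
		exist->ref_mod += mod;
		if (exist->ref_mod == 0) {
			/* Add and drop cancelled: nothing left to write. */
			if (exist->has_stripe)
				head->pending_stripes--;
			exist->in_use = 0;
		}
		return 0;
	}

	for (size_t i = 0; i < MAX_REFS_MODEL; i++) {
		struct ref_model *ref = &head->refs[i];

		if (ref->in_use)
			continue;
		ref->bytenr = bytenr;
		ref->ref_mod = mod;
		ref->in_use = 1;
		ref->has_stripe = mod > 0;
		if (ref->has_stripe)
			head->pending_stripes++;
		return 0;
	}
	return -1;
}
```

Every cancelled pair is one stripe tree insert (and its later delete)
that never has to hit the disk.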




* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-13  8:47         ` Johannes Thumshirn
@ 2022-12-13  8:54           ` hch
  2022-12-13  9:01             ` Johannes Thumshirn
  0 siblings, 1 reply; 20+ messages in thread
From: hch @ 2022-12-13  8:54 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: hch, linux-btrfs, Josef Bacik, Naohiro Aota

On Tue, Dec 13, 2022 at 08:47:48AM +0000, Johannes Thumshirn wrote:
> The reasoning behind this is possibly lower write amplification, as
> we can exploit the merging of delayed_refs. See this hunk for example:

I'm curious how much of that merging happens, and how much would also
be handled by not splitting the ordered_extent to start with as
suggested in my previous mail.  I'll have to defer to the people
more familiar with btrfs again, but if you need to delay the tree
update to reduce the write amp, shouldn't it use its own set of
delayed refs instead of piggybacking on the rather unrelated file
to logical mapping ones?


* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-13  8:54           ` hch
@ 2022-12-13  9:01             ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-13  9:01 UTC (permalink / raw)
  To: hch; +Cc: linux-btrfs, Josef Bacik, Naohiro Aota

On 13.12.22 09:54, hch@infradead.org wrote:
> On Tue, Dec 13, 2022 at 08:47:48AM +0000, Johannes Thumshirn wrote:
>> The reasoning behind this is possibly lower write amplification, as
>> we can exploit the merging of delayed_refs. See this hunk for example:
> 
> I'm curious how much of that merging happens, and how much would also
> be handled by not splitting the ordered_extent to start with as

What I've found is that, for a 2 minute fio randrw run, the number of
merges is in the upper two-digit to three-digit range (it obviously
depends on the pattern fio spits out).

I discovered that the hard way when I added the leak check on unmount out
of curiosity.

> suggested in my previous mail.  I'll have to defer to the people
> more familiar with btrfs again, but if you need to delay the tree
> update to reduce the write amp, shouldn't it use its own set of
> delayed refs instead of piggybacking on the rather unrelated file
> to logical mapping ones?
> 

Let me think about that. I need to contemplate the pros and cons
of this myself first. I agree it's abusing the data refs but I haven't
had a better solution.


* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-07 14:22 ` [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
  2022-12-12  7:22   ` Christoph Hellwig
  2022-12-12 19:27   ` Josef Bacik
@ 2022-12-13 16:14   ` Josef Bacik
  2022-12-13 17:48     ` Johannes Thumshirn
  2 siblings, 1 reply; 20+ messages in thread
From: Josef Bacik @ 2022-12-13 16:14 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs, Naohiro Aota

On Wed, Dec 07, 2022 at 06:22:12AM -0800, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
> 
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
> 
> This is done by creating in-memory ordered stripe extents, just like the
> in memory ordered extents, on I/O completion and the on-disk raid stripe
> extents get created once we're running the delayed_refs for the extent
> item this stripe extent is tied to.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/Makefile           |   3 +-
>  fs/btrfs/bio.c              |  30 +++++-
>  fs/btrfs/bio.h              |   2 +
>  fs/btrfs/delayed-ref.c      |   5 +-
>  fs/btrfs/disk-io.c          |   3 +
>  fs/btrfs/extent-tree.c      |  49 +++++++++
>  fs/btrfs/fs.h               |   3 +
>  fs/btrfs/inode.c            |   6 ++
>  fs/btrfs/raid-stripe-tree.c | 195 ++++++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h |  69 +++++++++++++
>  fs/btrfs/volumes.c          |   5 +-
>  fs/btrfs/volumes.h          |  12 +--
>  fs/btrfs/zoned.c            |   4 +
>  13 files changed, 376 insertions(+), 10 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 555c962fdad6..63236ae2a87b 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -31,7 +31,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> -	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o
> +	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> +	   raid-stripe-tree.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 4ccbc120e869..b60a50165703 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -15,6 +15,7 @@
>  #include "rcu-string.h"
>  #include "zoned.h"
>  #include "file-item.h"
> +#include "raid-stripe-tree.h"
>  
>  static struct bio_set btrfs_bioset;
>  static struct bio_set btrfs_clone_bioset;
> @@ -350,6 +351,21 @@ static void btrfs_raid56_end_io(struct bio *bio)
>  	btrfs_put_bioc(bioc);
>  }
>  
> +static void btrfs_raid_stripe_update(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio =
> +		container_of(work, struct btrfs_bio, raid_stripe_work);
> +	struct btrfs_io_stripe *stripe = bbio->bio.bi_private;
> +	struct btrfs_io_context *bioc = stripe->bioc;
> +	int ret;
> +
> +	ret = btrfs_add_ordered_stripe(bioc);
> +	if (ret)
> +		bbio->bio.bi_status = errno_to_blk_status(ret);
> +	btrfs_orig_bbio_end_io(bbio);
> +	btrfs_put_bioc(bioc);
> +}
> +
>  static void btrfs_orig_write_end_io(struct bio *bio)
>  {
>  	struct btrfs_io_stripe *stripe = bio->bi_private;
> @@ -372,6 +388,15 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>  	else
>  		bio->bi_status = BLK_STS_OK;
>  
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> +	if (btrfs_need_stripe_tree_update(bioc->fs_info, bioc->map_type)) {
> +		INIT_WORK(&bbio->raid_stripe_work, btrfs_raid_stripe_update);
> +		schedule_work(&bbio->raid_stripe_work);
> +		return;
> +	}
> +
>  	btrfs_orig_bbio_end_io(bbio);
>  	btrfs_put_bioc(bioc);
>  }
> @@ -383,7 +408,9 @@ static void btrfs_clone_write_end_io(struct bio *bio)
>  	if (bio->bi_status) {
>  		atomic_inc(&stripe->bioc->error);
>  		btrfs_log_dev_io_error(bio, stripe->dev);
> -	}
> +	} else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> + 	}
>  
>  	/* Pass on control to the original bio this one was cloned from */
>  	bio_endio(stripe->bioc->orig_bio);
> @@ -442,6 +469,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
>  	bio->bi_private = &bioc->stripes[dev_nr];
>  	bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
>  	bioc->stripes[dev_nr].bioc = bioc;
> +	bioc->size = bio->bi_iter.bi_size;
>  	btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
>  }
>  
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 20105806c8ac..bf5fbc105148 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -58,6 +58,8 @@ struct btrfs_bio {
>  	atomic_t pending_ios;
>  	struct work_struct end_io_work;
>  
> +	struct work_struct raid_stripe_work;
> +
>  	/*
>  	 * This member must come last, bio_alloc_bioset will allocate enough
>  	 * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 573ebab886e2..0357f9327cd4 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -14,6 +14,7 @@
>  #include "space-info.h"
>  #include "tree-mod-log.h"
>  #include "fs.h"
> +#include "raid-stripe-tree.h"
>  
>  struct kmem_cache *btrfs_delayed_ref_head_cachep;
>  struct kmem_cache *btrfs_delayed_tree_ref_cachep;
> @@ -640,8 +641,10 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>  	exist->ref_mod += mod;
>  
>  	/* remove existing tail if its ref_mod is zero */
> -	if (exist->ref_mod == 0)
> +	if (exist->ref_mod == 0) {
> +		btrfs_drop_ordered_stripe(trans->fs_info, exist->bytenr);
>  		drop_delayed_ref(trans, root, href, exist);
> +	}
>  	spin_unlock(&href->lock);
>  	return ret;
>  inserted:
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5784c850a3ec..bdef4e2e4ea3 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3033,6 +3033,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  
>  	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
>  	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
> +
> +	rwlock_init(&fs_info->stripe_update_lock);
> +	fs_info->stripe_update_tree = RB_ROOT;
>  }
>  
>  static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 892d78c1853c..de479af062fd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -43,6 +43,7 @@
>  #include "file-item.h"
>  #include "orphan.h"
>  #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -1498,6 +1499,51 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static int add_stripe_entry_for_delayed_ref(struct btrfs_trans_handle *trans,
> +					    struct btrfs_delayed_ref_node *node)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	int ret;
> +
> +	if (!btrfs_stripe_tree_root(fs_info))
> +		return 0;
> +
> +	em = btrfs_get_chunk_map(fs_info, node->bytenr, node->num_bytes);
> +	if (!em) {
> +		btrfs_err(fs_info,
> +			  "cannot get chunk map for address %llu",
> +			  node->bytenr);
> +		return -EINVAL;
> +	}
> +
> +	map = em->map_lookup;
> +
> +	if (btrfs_need_stripe_tree_update(fs_info, map->type)) {
> +		struct btrfs_ordered_stripe *stripe;
> +
> +		stripe = btrfs_lookup_ordered_stripe(fs_info, node->bytenr);
> +		if (!stripe) {
> +			btrfs_err(fs_info,
> +				  "cannot get stripe extent for address %llu (%llu)",
> +				  node->bytenr, node->num_bytes);
> +			free_extent_map(em);
> +			return -EINVAL;
> +		}
> +		ASSERT(stripe->logical == node->bytenr);
> +		ASSERT(stripe->num_bytes == node->num_bytes);
> +		ret = btrfs_insert_raid_extent(trans, stripe);
> +		/* once for us */
> +		btrfs_put_ordered_stripe(fs_info, stripe);
> +		/* once for the tree */
> +		btrfs_put_ordered_stripe(fs_info, stripe);
> +	}
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> +
>  static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  				struct btrfs_delayed_ref_node *node,
>  				struct btrfs_delayed_extent_op *extent_op,
> @@ -1528,6 +1574,9 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  						 flags, ref->objectid,
>  						 ref->offset, &ins,
>  						 node->ref_mod);
> +		if (ret)
> +			return ret;
> +		ret = add_stripe_entry_for_delayed_ref(trans, node);
>  	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
>  		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
>  					     ref->objectid, ref->offset,
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index f08b59320645..0dfe8ae2e450 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -780,6 +780,9 @@ struct btrfs_fs_info {
>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>  	struct lockdep_map btrfs_ordered_extent_map;
>  
> +	rwlock_t stripe_update_lock;
> +	struct rb_root stripe_update_tree;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 373b7281f5c7..1299acf52c86 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -70,6 +70,7 @@
>  #include "verity.h"
>  #include "super.h"
>  #include "orphan.h"
> +#include "raid-stripe-tree.h"
>  
>  struct btrfs_iget_args {
>  	u64 ino;
> @@ -9507,6 +9508,11 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
>  	if (qgroup_released < 0)
>  		return ERR_PTR(qgroup_released);
>  
> +	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
> +						    start, len);
> +	if (ret)
> +		goto free_qgroup;
> +

Sorry I didn't notice this until I was looking at later patches.  We insert the
preallocated raid stripe here, but if we fail at any point along the way we
still have this raid stripe entry sitting around.  Will this accidentally get
inserted later and mess everything up?  There are failures down this way that
don't result in an abort, but probably should, or at the very least drop this
raid stripe if we haven't modified anything yet.  Thanks,

Josef


* Re: [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents
  2022-12-13 16:14   ` Josef Bacik
@ 2022-12-13 17:48     ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2022-12-13 17:48 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Naohiro Aota

On 13.12.22 17:14, Josef Bacik wrote:
>> +	ret = btrfs_insert_preallocated_raid_stripe(inode->root->fs_info,
>> +						    start, len);
>> +	if (ret)
>> +		goto free_qgroup;
>> +
> 
> Sorry I didn't notice this until I was looking at later patches.  We insert the
> preallocated raid stripe here, but if we fail at any point along the way we
> still have this raid stripe entry sitting around.  Will this accidentally get
> inserted later and mess everything up?  There are failures down this way that
> don't result in an abort, but probably should, or at the very least drop this
> raid stripe if we haven't modified anything yet.

Good catch!




end of thread, other threads:[~2022-12-13 17:49 UTC | newest]

Thread overview: 20+ messages
2022-12-07 14:22 [PATCH v4 0/9] btrfs: introduce RAID stripe tree Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 1/9] btrfs: add raid stripe tree definitions Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 2/9] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 3/9] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2022-12-12  7:22   ` Christoph Hellwig
2022-12-13  8:15     ` Johannes Thumshirn
2022-12-13  8:36       ` hch
2022-12-13  8:47         ` Johannes Thumshirn
2022-12-13  8:54           ` hch
2022-12-13  9:01             ` Johannes Thumshirn
2022-12-12 19:27   ` Josef Bacik
2022-12-13  8:17     ` Johannes Thumshirn
2022-12-13 16:14   ` Josef Bacik
2022-12-13 17:48     ` Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 4/9] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 5/9] btrfs: lookup physical address from stripe extent Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 6/9] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 7/9] btrfs: zoned: allow zoned RAID Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 8/9] btrfs: check for leaks of ordered stripes on umount Johannes Thumshirn
2022-12-07 14:22 ` [PATCH v4 9/9] btrfs: add tracepoints for ordered stripes Johannes Thumshirn
