From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH RFC] btrfs: place holder for RAID56J profiles
Date: Sun, 15 May 2022 18:55:55 +0800	[thread overview]
Message-ID: <beae638a06ad08d16a4eaf799b0aaf01fc1d5573.1652612110.git.wqu@suse.com> (raw)

[BACKGROUND]
Btrfs RAID56 has a long-standing write-hole bug. We have several ideas
to fix it, ranging from extent allocator behavior changes (completely
avoiding partial stripe writes) to a write-ahead journal.

This patch will introduce two new chunk profiles:
- BTRFS_BLOCK_GROUP_RAID5J
- BTRFS_BLOCK_GROUP_RAID6J

The suffix "J" stands for journal, which is the mechanism I'm proposing
to solve the write-hole problem.

The journal is the tried-and-true solution, mostly emulating the
behavior of dm/md-raid56.

[ALTERNATIVE]
There is a more advanced and more flexible proposal from Johannes,
called the stripe tree, which can also support zoned devices.

Although that solution introduces an extra inter-chunk mapping, and no
longer fully respects the strict stripe rotation requirement, I still
believe it can be the ultimate solution for the write hole.

But on the other hand, I think there is still value in such a
journal-based, tried-and-true solution.

[DETAILS]
For a write-ahead journal to work, we need extra space for it.
We could go the free-space-tree way, but that can become a new rabbit
hole if our metadata is also RAID56J based.

So here we introduce a new member, btrfs_chunk::per_dev_reserved, which
shares the same on-disk space as io_align, to indicate how many bytes
are reserved for each device extent of the chunk.

An example RAID5J chunk would look like:

	btrfs_chunk:	start (key.offset) = L length = 1GiB
			type = RAID5J 	num_stripes = 3
			per_dev_reserved = 1MiB
	stripe[0]:	devid = 1	physical = P1
	stripe[1]:	devid = 2	physical = P2
	stripe[2]:	devid = 3	physical = P3

	dev extent:	devid = 1	physical = P1 - 1MiB
			length = 512MiB + 1MiB

	dev extent:	devid = 2	physical = P2 - 1MiB
			length = 512MiB + 1MiB

	dev extent:	devid = 3	physical = P3 - 1MiB
			length = 512MiB + 1MiB

Then on devid 1, the physical range [P1 - 1MiB, P1) is reserved for the
journal; the same applies to devid 2 and devid 3.

Now btrfs_stripe::offset will be where the real data starts, excluding
the per-device reservation.

This keeps the changes to the chunk mapping code minimal, at the cost
of more complex dev extent handling.

[CODE CHANGES]
This new feature touches the following code sites:

- Call sites checking the RAID5/RAID6 flags manually
  Mostly to make these checks also cover the journaled variants, including:
  * scrub_nr_raid_mirrors()
  * btrfs_reduce_alloc_profile()
  * clear_incompat_bg_bits()
  * btrfs_check_chunk_valid()

  The other call sites should already be using btrfs_raid_array[], thus
  need no change.

- New feature/profiles related code
  Covering the new chunk types, the sysfs interface, and updated member
  comments.

- Chunk and dev extent handling
  The involved code handles dev extents together with their reservation
  part, including:
  * Dev extents verification against chunk items
  * Chunk allocation
  * Chunk removal
  * Chunk read
  * Chunk scrubbing
    As chunk scrubbing is currently done by iterating the dev extents,
    it also needs the extra handling.

[RFC]
Although this patch passes fstests (with RAID56J added to the involved
test cases; so far the test case that exposed the most bugs is
btrfs/011, unsurprisingly), the journal part is just a no-op.

Thus the patch is still just a simulator for RAID56J, to make sure the
per-dev reservation code works correctly.

Furthermore, while the per-dev reservation is currently hardcoded to
1MiB, the code can handle different reservation sizes without problem,
so we can determine the final size in the real journal code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c          |  23 ++++++-
 fs/btrfs/ctree.h                |   7 +-
 fs/btrfs/scrub.c                |  15 ++--
 fs/btrfs/sysfs.c                |   2 +
 fs/btrfs/tree-checker.c         |   4 ++
 fs/btrfs/volumes.c              | 118 +++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h              |   7 +-
 fs/btrfs/zoned.c                |   2 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  30 +++++++-
 10 files changed, 187 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index ede389f2602d..b80dbd2e5ac1 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -79,8 +79,12 @@ static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
 	}
 	allowed &= flags;
 
-	if (allowed & BTRFS_BLOCK_GROUP_RAID6)
+	if (allowed & BTRFS_BLOCK_GROUP_RAID6J)
+		allowed = BTRFS_BLOCK_GROUP_RAID6J;
+	else if (allowed & BTRFS_BLOCK_GROUP_RAID6)
 		allowed = BTRFS_BLOCK_GROUP_RAID6;
+	else if (allowed & BTRFS_BLOCK_GROUP_RAID5J)
+		allowed = BTRFS_BLOCK_GROUP_RAID5J;
 	else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
 		allowed = BTRFS_BLOCK_GROUP_RAID5;
 	else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
@@ -836,6 +840,7 @@ static void clear_incompat_bg_bits(struct btrfs_fs_info *fs_info, u64 flags)
 {
 	bool found_raid56 = false;
 	bool found_raid1c34 = false;
+	bool found_journal = false;
 
 	if ((flags & BTRFS_BLOCK_GROUP_RAID56_MASK) ||
 	    (flags & BTRFS_BLOCK_GROUP_RAID1C3) ||
@@ -849,6 +854,14 @@ static void clear_incompat_bg_bits(struct btrfs_fs_info *fs_info, u64 flags)
 				found_raid56 = true;
 			if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID6]))
 				found_raid56 = true;
+			if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID5J])) {
+				found_raid56 = true;
+				found_journal = true;
+			}
+			if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID6J])) {
+				found_raid56 = true;
+				found_journal = true;
+			}
 			if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID1C3]))
 				found_raid1c34 = true;
 			if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID1C4]))
@@ -859,6 +872,8 @@ static void clear_incompat_bg_bits(struct btrfs_fs_info *fs_info, u64 flags)
 			btrfs_clear_fs_incompat(fs_info, RAID56);
 		if (!found_raid1c34)
 			btrfs_clear_fs_incompat(fs_info, RAID1C34);
+		if (!found_journal)
+			btrfs_clear_fs_incompat(fs_info, RAID56_JOURNAL);
 	}
 }
 
@@ -2397,8 +2412,10 @@ static int insert_dev_extents(struct btrfs_trans_handle *trans,
 		device = map->stripes[i].dev;
 		dev_offset = map->stripes[i].physical;
 
-		ret = insert_dev_extent(trans, device, chunk_offset, dev_offset,
-				       stripe_size);
+		ASSERT(dev_offset > map->per_dev_reserved);
+		ret = insert_dev_extent(trans, device, chunk_offset,
+					dev_offset - map->per_dev_reserved,
+					stripe_size + map->per_dev_reserved);
 		if (ret)
 			break;
 	}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e49b1a0c071..7025105b4023 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -327,7 +327,8 @@ static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
 	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
 	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
-	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2  |	\
+	 BTRFS_FEATURE_INCOMPAT_RAID56_JOURNAL)
 #else
 #define BTRFS_FEATURE_INCOMPAT_SUPP			\
 	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF |		\
@@ -1710,6 +1711,8 @@ static inline void btrfs_set_device_total_bytes(const struct extent_buffer *eb,
 BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
 BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
 BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
+BTRFS_SETGET_FUNCS(chunk_per_dev_reserved, struct btrfs_chunk, per_dev_reserved,
+		   32);
 BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
 BTRFS_SETGET_FUNCS(device_start_offset, struct btrfs_dev_item,
 		   start_offset, 64);
@@ -1727,6 +1730,8 @@ BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
 			 bytes_used, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_device_io_align, struct btrfs_dev_item,
 			 io_align, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_per_dev_reserved, struct btrfs_chunk,
+			 per_dev_reserved, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_device_io_width, struct btrfs_dev_item,
 			 io_width, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_device_sector_size, struct btrfs_dev_item,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 84346faa4ff1..a8f2f3d854a9 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1208,12 +1208,13 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 static inline int scrub_nr_raid_mirrors(struct btrfs_io_context *bioc)
 {
-	if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
+	if (bioc->map_type & (BTRFS_BLOCK_GROUP_RAID5 |
+			      BTRFS_BLOCK_GROUP_RAID5J))
 		return 2;
-	else if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID6)
+	if (bioc->map_type & (BTRFS_BLOCK_GROUP_RAID6 |
+			      BTRFS_BLOCK_GROUP_RAID6J))
 		return 3;
-	else
-		return (int)bioc->num_stripes;
+	return (int)bioc->num_stripes;
 }
 
 static inline void scrub_stripe_index_and_offset(u64 logical, u64 map_type,
@@ -3632,12 +3633,16 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 
 		return ret;
 	}
+	map = em->map_lookup;
+	/* Exclude per dev reservation from @dev_offset and @dev_extent_len. */
+	dev_offset += map->per_dev_reserved;
+	dev_extent_len -= map->per_dev_reserved;
+
 	if (em->start != bg->start)
 		goto out;
 	if (em->len < dev_extent_len)
 		goto out;
 
-	map = em->map_lookup;
 	for (i = 0; i < map->num_stripes; ++i) {
 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
 		    map->stripes[i].physical == dev_offset) {
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 92a1fa8e3da6..082afc93c763 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -283,6 +283,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(raid56_journal, RAID56_JOURNAL);
 #ifdef CONFIG_BTRFS_DEBUG
 /* Remove once support for zoned allocation is feature complete */
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
@@ -317,6 +318,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 #ifdef CONFIG_BTRFS_DEBUG
 	BTRFS_FEAT_ATTR_PTR(zoned),
 	BTRFS_FEAT_ATTR_PTR(extent_tree_v2),
+	BTRFS_FEAT_ATTR_PTR(raid56_journal),
 #endif
 #ifdef CONFIG_FS_VERITY
 	BTRFS_FEAT_ATTR_PTR(verity),
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 9e0e0ae2288c..e8aadf0570d2 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -903,6 +903,10 @@ int btrfs_check_chunk_valid(struct extent_buffer *leaf,
 		      num_stripes < btrfs_raid_array[BTRFS_RAID_RAID5].devs_min) ||
 		     (type & BTRFS_BLOCK_GROUP_RAID6 &&
 		      num_stripes < btrfs_raid_array[BTRFS_RAID_RAID6].devs_min) ||
+		     (type & BTRFS_BLOCK_GROUP_RAID5J &&
+		      num_stripes < btrfs_raid_array[BTRFS_RAID_RAID5J].devs_min) ||
+		     (type & BTRFS_BLOCK_GROUP_RAID6J &&
+		      num_stripes < btrfs_raid_array[BTRFS_RAID_RAID6J].devs_min) ||
 		     (type & BTRFS_BLOCK_GROUP_DUP &&
 		      num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
 		     ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index cddbbd8eb310..f8ac402ef5c9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -34,6 +34,14 @@
 #include "discard.h"
 #include "zoned.h"
 
+/*
+ * The extra space for journal based profiles (raid56j).
+ *
+ * Each device will have this amount of bytes reserved before the real
+ * stripe begins.
+ */
+#define JOURNAL_RESERVED		(SZ_1M)
+
 #define BTRFS_BLOCK_GROUP_STRIPE_MASK	(BTRFS_BLOCK_GROUP_RAID0 | \
 					 BTRFS_BLOCK_GROUP_RAID10 | \
 					 BTRFS_BLOCK_GROUP_RAID56_MASK)
@@ -156,6 +164,32 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
 	},
+	[BTRFS_RAID_RAID5J] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 2,
+		.tolerated_failures = 1,
+		.devs_increment	= 1,
+		.ncopies	= 1,
+		.nparity        = 1,
+		.raid_name	= "raid5j",
+		.bg_flag	= BTRFS_BLOCK_GROUP_RAID5J,
+		.mindev_error	= BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
+	},
+	[BTRFS_RAID_RAID6J] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 3,
+		.tolerated_failures = 2,
+		.devs_increment	= 1,
+		.ncopies	= 1,
+		.nparity        = 2,
+		.raid_name	= "raid6j",
+		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6J,
+		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
+	},
 };
 
 /*
@@ -182,6 +216,11 @@ const char *btrfs_bg_type_to_raid_name(u64 flags)
 	return btrfs_raid_array[index].raid_name;
 }
 
+static bool bg_type_is_journal(u64 type)
+{
+	return !!(type & BTRFS_BLOCK_GROUP_JOURNAL_MASK);
+}
+
 /*
  * Fill @buf with textual description of @bg_flags, no more than @size_buf
  * bytes including terminating null byte.
@@ -5029,6 +5068,15 @@ static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 	btrfs_set_fs_incompat(info, RAID56);
 }
 
+static void check_journal_incompat_flag(struct btrfs_fs_info *fs_info,
+					u64 type)
+{
+	if (!(type & BTRFS_BLOCK_GROUP_JOURNAL_MASK))
+		return;
+
+	btrfs_set_fs_incompat(fs_info, RAID56_JOURNAL);
+}
+
 static void check_raid1c34_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4)))
@@ -5063,9 +5111,14 @@ struct alloc_chunk_ctl {
 	u64 max_stripe_size;
 	u64 max_chunk_size;
 	u64 dev_extent_min;
+
+	/* The real stripe size, excluding the per-device reservation. */
 	u64 stripe_size;
 	u64 chunk_size;
 	int ndevs;
+
+	/* How many bytes needs to be reserved before the real stripe starts. */
+	u32 per_dev_reserved;
 };
 
 static void init_alloc_chunk_ctl_policy_regular(
@@ -5136,6 +5189,7 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 				 struct alloc_chunk_ctl *ctl)
 {
 	int index = btrfs_bg_flags_to_raid_index(ctl->type);
+	bool is_journal = bg_type_is_journal(ctl->type);
 
 	ctl->sub_stripes = btrfs_raid_array[index].sub_stripes;
 	ctl->dev_stripes = btrfs_raid_array[index].dev_stripes;
@@ -5148,6 +5202,11 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 	ctl->nparity = btrfs_raid_array[index].nparity;
 	ctl->ndevs = 0;
 
+	if (is_journal)
+		ctl->per_dev_reserved = JOURNAL_RESERVED;
+	else
+		ctl->per_dev_reserved = 0;
+
 	switch (fs_devices->chunk_alloc_policy) {
 	case BTRFS_CHUNK_ALLOC_REGULAR:
 		init_alloc_chunk_ctl_policy_regular(fs_devices, ctl);
@@ -5276,6 +5335,11 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
 
 	/* Align to BTRFS_STRIPE_LEN */
 	ctl->stripe_size = round_down(ctl->stripe_size, BTRFS_STRIPE_LEN);
+
+	/* Not enough space for per-dev reservation. */
+	if (ctl->stripe_size <= ctl->per_dev_reserved)
+		return -ENOSPC;
+	ctl->stripe_size -= ctl->per_dev_reserved;
 	ctl->chunk_size = ctl->stripe_size * data_stripes;
 
 	return 0;
@@ -5294,6 +5358,9 @@ static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl,
 	 */
 	ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min);
 
+	/* No support for RAID56J at all. */
+	ASSERT(!bg_type_is_journal(ctl->type));
+
 	ctl->stripe_size = zone_size;
 	ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
 	data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
@@ -5371,8 +5438,15 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
 		for (j = 0; j < ctl->dev_stripes; ++j) {
 			int s = i * ctl->dev_stripes + j;
 			map->stripes[s].dev = devices_info[i].dev;
+
+			/*
+			 * devices_info[] contains the stripes for the physical
+			 * stripe, our real data only starts after the reserved
+			 * space.
+			 */
 			map->stripes[s].physical = devices_info[i].dev_offset +
-						   j * ctl->stripe_size;
+						   j * ctl->stripe_size +
+						   ctl->per_dev_reserved;
 		}
 	}
 	map->stripe_len = BTRFS_STRIPE_LEN;
@@ -5380,6 +5454,7 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
 	map->io_width = BTRFS_STRIPE_LEN;
 	map->type = type;
 	map->sub_stripes = ctl->sub_stripes;
+	map->per_dev_reserved = ctl->per_dev_reserved;
 
 	trace_btrfs_chunk_alloc(info, map, start, ctl->chunk_size);
 
@@ -5414,7 +5489,8 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
 		struct btrfs_device *dev = map->stripes[i].dev;
 
 		btrfs_device_set_bytes_used(dev,
-					    dev->bytes_used + ctl->stripe_size);
+					    dev->bytes_used + ctl->stripe_size +
+					    ctl->per_dev_reserved);
 		if (list_empty(&dev->post_commit_list))
 			list_add_tail(&dev->post_commit_list,
 				      &trans->transaction->dev_update_list);
@@ -5425,6 +5501,7 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
 
 	free_extent_map(em);
 	check_raid56_incompat_flag(info, type);
+	check_journal_incompat_flag(info, type);
 	check_raid1c34_incompat_flag(info, type);
 
 	return block_group;
@@ -5517,6 +5594,7 @@ int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
 	struct btrfs_stripe *stripe;
 	struct extent_map *em;
 	struct map_lookup *map;
+	bool is_journal = bg_type_is_journal(bg->flags);
 	size_t item_size;
 	int i;
 	int ret;
@@ -5555,6 +5633,11 @@ int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
 	map = em->map_lookup;
 	item_size = btrfs_chunk_item_size(map->num_stripes);
 
+	if (is_journal)
+		ASSERT(map->per_dev_reserved);
+	else
+		ASSERT(map->per_dev_reserved == 0);
+
 	chunk = kzalloc(item_size, GFP_NOFS);
 	if (!chunk) {
 		ret = -ENOMEM;
@@ -5586,7 +5669,11 @@ int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_chunk_stripe_len(chunk, map->stripe_len);
 	btrfs_set_stack_chunk_type(chunk, map->type);
 	btrfs_set_stack_chunk_num_stripes(chunk, map->num_stripes);
-	btrfs_set_stack_chunk_io_align(chunk, map->stripe_len);
+	if (!is_journal)
+		btrfs_set_stack_chunk_io_align(chunk, map->stripe_len);
+	else
+		btrfs_set_stack_chunk_per_dev_reserved(chunk,
+						       map->per_dev_reserved);
 	btrfs_set_stack_chunk_io_width(chunk, map->stripe_len);
 	btrfs_set_stack_chunk_sector_size(chunk, fs_info->sectorsize);
 	btrfs_set_stack_chunk_sub_stripes(chunk, map->sub_stripes);
@@ -5740,9 +5827,11 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	if (!(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK))
 		/* Non-raid56, use their ncopies from btrfs_raid_array[]. */
 		ret = btrfs_raid_array[index].ncopies;
-	else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
+	else if (map->type & (BTRFS_BLOCK_GROUP_RAID5 |
+			      BTRFS_BLOCK_GROUP_RAID5J))
 		ret = 2;
-	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
+	else if (map->type & (BTRFS_BLOCK_GROUP_RAID6 |
+			      BTRFS_BLOCK_GROUP_RAID6J))
 		/*
 		 * There could be two corrupted data stripes, we need
 		 * to loop retry in order to rebuild the correct data.
@@ -6558,7 +6647,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 				em->start + (tmp + i) * map->stripe_len;
 
 		bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
-		if (map->type & BTRFS_BLOCK_GROUP_RAID6)
+		if (map->type & (BTRFS_BLOCK_GROUP_RAID6 |
+				 BTRFS_BLOCK_GROUP_RAID6J))
 			bioc->raid_map[(i + rot + 1) % num_stripes] =
 				RAID6_Q_STRIPE;
 
@@ -7061,6 +7151,7 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
 	struct extent_map_tree *map_tree = &fs_info->mapping_tree;
 	struct map_lookup *map;
 	struct extent_map *em;
+	bool is_journal;
 	u64 logical;
 	u64 length;
 	u64 devid;
@@ -7092,6 +7183,7 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
 			return ret;
 	}
 
+	is_journal = bg_type_is_journal(type);
 	read_lock(&map_tree->lock);
 	em = lookup_extent_mapping(map_tree, logical, 1);
 	read_unlock(&map_tree->lock);
@@ -7123,7 +7215,14 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
 
 	map->num_stripes = num_stripes;
 	map->io_width = btrfs_chunk_io_width(leaf, chunk);
-	map->io_align = btrfs_chunk_io_align(leaf, chunk);
+	if (!is_journal) {
+		map->io_align = btrfs_chunk_io_align(leaf, chunk);
+		map->per_dev_reserved = 0;
+	} else {
+		map->io_align = map->io_width;
+		map->per_dev_reserved = btrfs_chunk_per_dev_reserved(leaf,
+								     chunk);
+	}
 	map->stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
 	map->type = type;
 	map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
@@ -8030,7 +8129,7 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 
 	map = em->map_lookup;
 	stripe_len = btrfs_calc_stripe_length(em);
-	if (physical_len != stripe_len) {
+	if (physical_len != stripe_len + map->per_dev_reserved) {
 		btrfs_err(fs_info,
 "dev extent physical offset %llu on devid %llu length doesn't match chunk %llu, have %llu expect %llu",
 			  physical_offset, devid, em->start, physical_len,
@@ -8041,7 +8140,8 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 
 	for (i = 0; i < map->num_stripes; i++) {
 		if (map->stripes[i].dev->devid == devid &&
-		    map->stripes[i].physical == physical_offset) {
+		    map->stripes[i].physical == physical_offset +
+						map->per_dev_reserved) {
 			found = true;
 			if (map->verified_stripes >= map->num_stripes) {
 				btrfs_err(fs_info,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2bfe14d75a15..b1902a1e2e55 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -47,6 +47,8 @@ enum btrfs_raid_types {
 	BTRFS_RAID_RAID6   = BTRFS_BG_FLAG_TO_INDEX(BTRFS_BLOCK_GROUP_RAID6),
 	BTRFS_RAID_RAID1C3 = BTRFS_BG_FLAG_TO_INDEX(BTRFS_BLOCK_GROUP_RAID1C3),
 	BTRFS_RAID_RAID1C4 = BTRFS_BG_FLAG_TO_INDEX(BTRFS_BLOCK_GROUP_RAID1C4),
+	BTRFS_RAID_RAID5J  = BTRFS_BG_FLAG_TO_INDEX(BTRFS_BLOCK_GROUP_RAID5J),
+	BTRFS_RAID_RAID6J  = BTRFS_BG_FLAG_TO_INDEX(BTRFS_BLOCK_GROUP_RAID6J),
 
 	BTRFS_NR_RAID_TYPES
 };
@@ -462,8 +464,11 @@ extern const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES];
 
 struct map_lookup {
 	u64 type;
-	int io_align;
+	u32 io_align;
+	u32 per_dev_reserved;
 	int io_width;
+
+	/* This is the real stripe length, excluding above per_dev_reserved. */
 	u32 stripe_len;
 	int num_stripes;
 	int sub_stripes;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b60767492b3c..85dd49aa41fd 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1459,6 +1459,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 	case BTRFS_BLOCK_GROUP_RAID10:
 	case BTRFS_BLOCK_GROUP_RAID5:
 	case BTRFS_BLOCK_GROUP_RAID6:
+	case BTRFS_BLOCK_GROUP_RAID5J:
+	case BTRFS_BLOCK_GROUP_RAID6J:
 		/* non-single profiles are not supported yet */
 	default:
 		btrfs_err(fs_info, "zoned: profile %s not yet supported",
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index d956b2993970..cb66aff71508 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -310,6 +310,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_RAID56_JOURNAL	(1ULL << 14)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index d4117152d907..46991a27013b 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -438,6 +438,10 @@ struct btrfs_dev_item {
 
 struct btrfs_stripe {
 	__le64 devid;
+	/*
+	 * Where the real stripe starts on the device, excluding the per-dev
+	 * reserved bytes.
+	 */
 	__le64 offset;
 	__u8 dev_uuid[BTRFS_UUID_SIZE];
 } __attribute__ ((__packed__));
@@ -452,8 +456,19 @@ struct btrfs_chunk {
 	__le64 stripe_len;
 	__le64 type;
 
-	/* optimal io alignment for this chunk */
-	__le32 io_align;
+	union {
+		/*
+		 * For non-journaled profiles, optimal io alignment for this
+		 * chunk, not really utilized though.
+		 */
+		__le32 io_align;
+
+		/*
+		 * For journaled profiles, per-device-extent reserved bytes
+		 * before the real data starts.
+		 */
+		__le32 per_dev_reserved;
+	};
 
 	/* optimal io width for this chunk */
 	__le32 io_width;
@@ -877,6 +892,8 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
 #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
+#define BTRFS_BLOCK_GROUP_RAID5J	(1ULL << 11)
+#define BTRFS_BLOCK_GROUP_RAID6J	(1ULL << 12)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
@@ -890,10 +907,17 @@ struct btrfs_dev_replace_item {
 					 BTRFS_BLOCK_GROUP_RAID1C4 | \
 					 BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6 |   \
+					 BTRFS_BLOCK_GROUP_RAID5J |   \
+					 BTRFS_BLOCK_GROUP_RAID6J |   \
 					 BTRFS_BLOCK_GROUP_DUP |     \
 					 BTRFS_BLOCK_GROUP_RAID10)
 #define BTRFS_BLOCK_GROUP_RAID56_MASK	(BTRFS_BLOCK_GROUP_RAID5 |   \
-					 BTRFS_BLOCK_GROUP_RAID6)
+					 BTRFS_BLOCK_GROUP_RAID6 |   \
+					 BTRFS_BLOCK_GROUP_RAID5J |  \
+					 BTRFS_BLOCK_GROUP_RAID6J)
+
+#define BTRFS_BLOCK_GROUP_JOURNAL_MASK	(BTRFS_BLOCK_GROUP_RAID5J | \
+					 BTRFS_BLOCK_GROUP_RAID6J)
 
 #define BTRFS_BLOCK_GROUP_RAID1_MASK	(BTRFS_BLOCK_GROUP_RAID1 |   \
 					 BTRFS_BLOCK_GROUP_RAID1C3 | \
-- 
2.36.1

