linux-btrfs.vger.kernel.org archive mirror
* [PATCH v5 00/15] btrfs-progs: zoned block device support
@ 2019-12-04  8:24 Naohiro Aota
  2019-12-04  8:24 ` [PATCH v5 01/15] btrfs-progs: utils: Introduce queue_param helper function Naohiro Aota
                   ` (15 more replies)
  0 siblings, 16 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:24 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This is the userland part of zoned block device support for btrfs.

Since the log-structured superblock feature changed the location of the
superblock magic, the current util-linux (libblkid) can no longer detect
an HMZONED btrfs. You need to apply a to-be-posted patch to util-linux to
make it "zone aware".

Naohiro Aota (15):
  btrfs-progs: utils: Introduce queue_param helper function
  btrfs-progs: introduce raid parameters variables
  btrfs-progs: build: Check zoned block device support
  btrfs-progs: add new HMZONED feature flag
  btrfs-progs: Introduce zone block device helper functions
  btrfs-progs: load and check zone information
  btrfs-progs: support discarding zoned device
  btrfs-progs: support zero out on zoned block device
  btrfs-progs: implement log-structured superblock for HMZONED mode
  btrfs-progs: align device extent allocation to zone boundary
  btrfs-progs: do sequential allocation in HMZONED mode
  btrfs-progs: redirty clean extent buffers in seq
  btrfs-progs: mkfs: Zoned block device support
  btrfs-progs: device-add: support HMZONED device
  btrfs-progs: introduce support for device replace HMZONED device

 Makefile                  |    3 +-
 cmds/device.c             |   32 +-
 cmds/inspect-dump-super.c |    4 +-
 cmds/replace.c            |   12 +-
 common/device-scan.c      |   19 +-
 common/device-utils.c     |  109 +++-
 common/device-utils.h     |    4 +
 common/fsfeatures.c       |    8 +
 common/fsfeatures.h       |    2 +-
 common/hmzoned.c          | 1009 +++++++++++++++++++++++++++++++++++++
 common/hmzoned.h          |  139 +++++
 configure.ac              |   13 +
 ctree.h                   |   11 +-
 disk-io.c                 |   14 +-
 extent-tree.c             |   24 +
 kerncompat.h              |    9 +
 mkfs/common.c             |   38 +-
 mkfs/common.h             |    1 +
 mkfs/main.c               |   90 ++--
 transaction.c             |    7 +
 volumes.c                 |  116 ++++-
 volumes.h                 |    4 +
 22 files changed, 1582 insertions(+), 86 deletions(-)
 create mode 100644 common/hmzoned.c
 create mode 100644 common/hmzoned.h

-- 
2.24.0



* [PATCH v5 01/15] btrfs-progs: utils: Introduce queue_param helper function
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
@ 2019-12-04  8:24 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 02/15] btrfs-progs: introduce raid parameters variables Naohiro Aota
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:24 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Introduce the queue_param helper function to get a device request queue
parameter.  This helper will be used later to query information about a
zoned device.

Furthermore, rewrite is_ssd() using the helper function.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
[Naohiro] fixed error return value
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
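For readers following along, here is a minimal sketch of the last step of
queue_param(): reading one attribute file. It is not the code in this patch.
The helper name read_attr_file() is hypothetical, and it makes two choices
the patch does not: it NUL-terminates the buffer (read() alone does not, which
matters if a caller later hands the buffer to strtoul()) and it reports
failures as a negative errno instead of 0.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of this patch: read a sysfs-style
 * attribute file into buf, NUL-terminate it and strip one trailing
 * newline.  Returns the string length, or a negative errno on failure.
 */
static ssize_t read_attr_file(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	assert(buf != NULL && len > 0);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* Leave room for the terminator. */
	n = read(fd, buf, len - 1);
	if (n < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	close(fd);

	buf[n] = '\0';
	if (n > 0 && buf[n - 1] == '\n')
		buf[--n] = '\0';

	return n;
}
```

With a path such as /sys/block/<disk>/queue/rotational, a caller could then
compare the returned string against "0" or "1" with strcmp().
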
 common/device-utils.c | 46 +++++++++++++++++++++++++++++++++++++++++++
 common/device-utils.h |  1 +
 mkfs/main.c           | 40 ++-----------------------------------
 3 files changed, 49 insertions(+), 38 deletions(-)

diff --git a/common/device-utils.c b/common/device-utils.c
index b03d62faaf21..7fa9386f4677 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -252,3 +252,49 @@ u64 get_partition_size(const char *dev)
 	return result;
 }
 
+/*
+ * Get a device request queue parameter.
+ */
+int queue_param(const char *file, const char *param, char *buf, size_t len)
+{
+	blkid_probe probe;
+	char wholedisk[PATH_MAX];
+	char sysfs_path[PATH_MAX];
+	dev_t devno;
+	int fd;
+	int ret;
+
+	probe = blkid_new_probe_from_filename(file);
+	if (!probe)
+		return 0;
+
+	/* Device number of this disk (possibly a partition) */
+	devno = blkid_probe_get_devno(probe);
+	if (!devno) {
+		blkid_free_probe(probe);
+		return 0;
+	}
+
+	/* Get whole disk name (not full path) for this devno */
+	ret = blkid_devno_to_wholedisk(devno,
+			wholedisk, sizeof(wholedisk), NULL);
+	if (ret) {
+		blkid_free_probe(probe);
+		return 0;
+	}
+
+	snprintf(sysfs_path, PATH_MAX, "/sys/block/%s/queue/%s",
+		 wholedisk, param);
+
+	blkid_free_probe(probe);
+
+	fd = open(sysfs_path, O_RDONLY);
+	if (fd < 0)
+		return 0;
+
+	len = read(fd, buf, len);
+	close(fd);
+
+	return len;
+}
+
diff --git a/common/device-utils.h b/common/device-utils.h
index 70d19cae3e50..d1799323d002 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -29,5 +29,6 @@ u64 disk_size(const char *path);
 u64 btrfs_device_size(int fd, struct stat *st);
 int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		u64 max_block_count, unsigned opflags);
+int queue_param(const char *file, const char *param, char *buf, size_t len);
 
 #endif
diff --git a/mkfs/main.c b/mkfs/main.c
index 316ea82e45c6..14e9ae7aeb6d 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -432,49 +432,13 @@ static int zero_output_file(int out_fd, u64 size)
 
 static int is_ssd(const char *file)
 {
-	blkid_probe probe;
-	char wholedisk[PATH_MAX];
-	char sysfs_path[PATH_MAX];
-	dev_t devno;
-	int fd;
 	char rotational;
 	int ret;
 
-	probe = blkid_new_probe_from_filename(file);
-	if (!probe)
+	ret = queue_param(file, "rotational", &rotational, 1);
+	if (ret < 1)
 		return 0;
 
-	/* Device number of this disk (possibly a partition) */
-	devno = blkid_probe_get_devno(probe);
-	if (!devno) {
-		blkid_free_probe(probe);
-		return 0;
-	}
-
-	/* Get whole disk name (not full path) for this devno */
-	ret = blkid_devno_to_wholedisk(devno,
-			wholedisk, sizeof(wholedisk), NULL);
-	if (ret) {
-		blkid_free_probe(probe);
-		return 0;
-	}
-
-	snprintf(sysfs_path, PATH_MAX, "/sys/block/%s/queue/rotational",
-		 wholedisk);
-
-	blkid_free_probe(probe);
-
-	fd = open(sysfs_path, O_RDONLY);
-	if (fd < 0) {
-		return 0;
-	}
-
-	if (read(fd, &rotational, 1) < 1) {
-		close(fd);
-		return 0;
-	}
-	close(fd);
-
 	return rotational == '0';
 }
 
-- 
2.24.0



* [PATCH v5 02/15] btrfs-progs: introduce raid parameters variables
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
  2019-12-04  8:24 ` [PATCH v5 01/15] btrfs-progs: utils: Introduce queue_param helper function Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 03/15] btrfs-progs: build: Check zoned block device support Naohiro Aota
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Userland btrfs_alloc_chunk() and its kernel-side counterpart
__btrfs_alloc_chunk() have diverged so much that it is difficult to use
the kernel code as is.

This commit introduces some RAID parameter variables and reads them from
btrfs_raid_array in the same way as the kernel does.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 volumes.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 143164f02ac0..8bfffa5586eb 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1014,6 +1014,18 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int max_stripes = 0;
 	int min_stripes = 1;
 	int sub_stripes = 1;
+	int dev_stripes __attribute__((unused));
+				/* stripes per dev */
+	int devs_max;		/* max devs to use */
+	int devs_min __attribute__((unused));
+				/* min devs needed */
+	int devs_increment __attribute__((unused));
+				/* ndevs has to be a multiple of this */
+	int ncopies __attribute__((unused));
+				/* how many copies the data has */
+	int nparity __attribute__((unused));
+				/* number of stripes worth of bytes to
+				   store parity information */
 	int looped = 0;
 	int ret;
 	int index;
@@ -1025,6 +1037,18 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		return -ENOSPC;
 	}
 
+	index = btrfs_bg_flags_to_raid_index(type);
+
+	sub_stripes = btrfs_raid_array[index].sub_stripes;
+	dev_stripes = btrfs_raid_array[index].dev_stripes;
+	devs_max = btrfs_raid_array[index].devs_max;
+	if (!devs_max)
+		devs_max = BTRFS_MAX_DEVS(info);
+	devs_min = btrfs_raid_array[index].devs_min;
+	devs_increment = btrfs_raid_array[index].devs_increment;
+	ncopies = btrfs_raid_array[index].ncopies;
+	nparity = btrfs_raid_array[index].nparity;
+
 	if (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
 		if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
 			calc_size = SZ_8M;
@@ -1085,7 +1109,6 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		if (num_stripes < 4)
 			return -ENOSPC;
 		num_stripes &= ~(u32)1;
-		sub_stripes = 2;
 		min_stripes = 4;
 	}
 	if (type & (BTRFS_BLOCK_GROUP_RAID5)) {
-- 
2.24.0



* [PATCH v5 03/15] btrfs-progs: build: Check zoned block device support
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
  2019-12-04  8:24 ` [PATCH v5 01/15] btrfs-progs: utils: Introduce queue_param helper function Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 02/15] btrfs-progs: introduce raid parameters variables Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 04/15] btrfs-progs: add new HMZONED feature flag Naohiro Aota
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

If the kernel supports zoned block devices, the file
/usr/include/linux/blkzoned.h will be present. Check this and define
BTRFS_ZONED if the file is present.

If it is present, enable the HMZONED feature; if not, disable it.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 configure.ac | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/configure.ac b/configure.ac
index cf792eb5488b..c637f72a8fe6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -206,6 +206,18 @@ else
 AC_DEFINE([HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE], [0], [We did not define FIEMAP_EXTENT_SHARED])
 fi
 
+AC_CHECK_HEADER(linux/blkzoned.h, [blkzoned_found=yes], [blkzoned_found=no])
+AC_ARG_ENABLE([zoned],
+  AS_HELP_STRING([--disable-zoned], [disable zoned block device support]),
+  [], [enable_zoned=$blkzoned_found]
+)
+
+AS_IF([test "x$enable_zoned" = xyes], [
+	AC_CHECK_HEADER(linux/blkzoned.h, [],
+		[AC_MSG_ERROR([Couldn't find linux/blkzoned.h])])
+	AC_DEFINE([BTRFS_ZONED], [1], [enable zoned block device support])
+])
+
 dnl Define <NAME>_LIBS= and <NAME>_CFLAGS= by pkg-config
 dnl
 dnl The default PKG_CHECK_MODULES() action-if-not-found is end the
@@ -307,6 +319,7 @@ AC_MSG_RESULT([
 	btrfs-restore zstd: ${enable_zstd}
 	Python bindings:    ${enable_python}
 	Python interpreter: ${PYTHON}
+	zoned device:       ${enable_zoned}
 
 	Type 'make' to compile.
 ])
-- 
2.24.0



* [PATCH v5 04/15] btrfs-progs: add new HMZONED feature flag
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (2 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 03/15] btrfs-progs: build: Check zoned block device support Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 05/15] btrfs-progs: Introduce zone block device helper functions Naohiro Aota
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

With this feature enabled, a zoned block device aware btrfs allocates block
groups aligned to the device zones and always writes to sequential zones at
the zone write pointer position.

Enabling this feature also disables conversion from ext4 volumes.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
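As an illustration of how the new bit is meant to be consumed, here is a
small sketch of testing the incompat flags of a superblock. The bit value is
taken from the ctree.h hunk below; the helper name incompat_has_hmzoned() is
hypothetical and does not exist in btrfs-progs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit value taken from the ctree.h hunk in this patch. */
#define BTRFS_FEATURE_INCOMPAT_HMZONED	(1ULL << 12)

/* Compile-time sanity check: bit 12 is 0x1000. */
static_assert(BTRFS_FEATURE_INCOMPAT_HMZONED == 4096, "HMZONED is bit 12");

/* Hypothetical helper: true if the HMZONED incompat bit is set. */
static bool incompat_has_hmzoned(uint64_t incompat_flags)
{
	return (incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED) != 0;
}
```

Because this is an incompat bit, a tool that does not list it in its
supported mask is expected to refuse the filesystem rather than ignore the
flag.
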
 cmds/inspect-dump-super.c | 1 +
 common/fsfeatures.c       | 8 ++++++++
 common/fsfeatures.h       | 2 +-
 ctree.h                   | 4 +++-
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/cmds/inspect-dump-super.c b/cmds/inspect-dump-super.c
index f22633b99390..ddb2120fb397 100644
--- a/cmds/inspect-dump-super.c
+++ b/cmds/inspect-dump-super.c
@@ -229,6 +229,7 @@ static struct readable_flag_entry incompat_flags_array[] = {
 	DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES),
 	DEF_INCOMPAT_FLAG_ENTRY(METADATA_UUID),
 	DEF_INCOMPAT_FLAG_ENTRY(RAID1C34),
+	DEF_INCOMPAT_FLAG_ENTRY(HMZONED)
 };
 static const int incompat_flags_num = sizeof(incompat_flags_array) /
 				      sizeof(struct readable_flag_entry);
diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index ac12d57b25a3..929a076e7b69 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -92,6 +92,14 @@ static const struct btrfs_fs_feature {
 		NULL, 0,
 		NULL, 0,
 		"RAID1 with 3 or 4 copies" },
+#ifdef BTRFS_ZONED
+	{ "hmzoned", BTRFS_FEATURE_INCOMPAT_HMZONED,
+		"hmzoned",
+		NULL, 0,
+		NULL, 0,
+		NULL, 0,
+		"support Host-Managed Zoned devices" },
+#endif
 	/* Keep this one last */
 	{ "list-all", BTRFS_FEATURE_LIST_ALL, NULL }
 };
diff --git a/common/fsfeatures.h b/common/fsfeatures.h
index 3cc9452a3327..0918ee1aa113 100644
--- a/common/fsfeatures.h
+++ b/common/fsfeatures.h
@@ -25,7 +25,7 @@
 		| BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
 
 /*
- * Avoid multi-device features (RAID56) and mixed block groups
+ * Avoid multi-device features (RAID56), mixed block groups, and hmzoned device
  */
 #define BTRFS_CONVERT_ALLOWED_FEATURES				\
 	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF			\
diff --git a/ctree.h b/ctree.h
index 3e50d0863bde..34fd7d00cabf 100644
--- a/ctree.h
+++ b/ctree.h
@@ -493,6 +493,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID    (1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 12)
 
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 
@@ -517,7 +518,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
 	 BTRFS_FEATURE_INCOMPAT_RAID1C34 |		\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID |		\
+	 BTRFS_FEATURE_INCOMPAT_HMZONED)
 
 /*
  * A leaf is full of items. offset and size tell us where to find
-- 
2.24.0



* [PATCH v5 05/15] btrfs-progs: Introduce zone block device helper functions
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (3 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 04/15] btrfs-progs: add new HMZONED feature flag Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 06/15] btrfs-progs: load and check zone information Naohiro Aota
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This patch introduces several zone-related functions:
btrfs_get_zone_info(), which gets zone information from the specified
device and stores it in zinfo, and zone_is_sequential(), which checks
whether a zone is a sequential write required zone.

btrfs_get_zone_info() intentionally works with "struct
btrfs_zoned_device_info" instead of "struct btrfs_device". We need to load
zone information in btrfs_prepare_device(), but there is no "struct
btrfs_device" at that time.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
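The zone arithmetic used by zone_size(), report_zones() and
zone_is_sequential() can be sketched in isolation as follows. The function
names here are hypothetical and only mirror the expressions in the diff:
chunk_sectors is converted from 512-byte sectors to bytes, the zone count
is rounded up for a partial last zone, and a byte offset maps to a zone
index by division.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT	9	/* chunk_sectors is in 512-byte sectors */

/* Convert a sector count to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT;
}

/* Number of zones covering dev_bytes, counting a partial last zone. */
static uint64_t zone_count(uint64_t dev_bytes, uint64_t zone_bytes)
{
	/* The mask trick below is only valid for a power-of-2 zone size. */
	assert(zone_bytes != 0 && (zone_bytes & (zone_bytes - 1)) == 0);

	return dev_bytes / zone_bytes +
	       ((dev_bytes & (zone_bytes - 1)) ? 1 : 0);
}

/* Index of the zone that contains byte offset bytenr. */
static uint64_t zone_index(uint64_t bytenr, uint64_t zone_bytes)
{
	assert(zone_bytes != 0);

	return bytenr / zone_bytes;
}
```

For example, a chunk_sectors value of 524288 corresponds to 256 MiB zones,
and a 1 GiB device with that zone size has four zones.
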
 Makefile         |   3 +-
 common/hmzoned.c | 219 +++++++++++++++++++++++++++++++++++++++++++++++
 common/hmzoned.h |  63 ++++++++++++++
 3 files changed, 284 insertions(+), 1 deletion(-)
 create mode 100644 common/hmzoned.c
 create mode 100644 common/hmzoned.h

diff --git a/Makefile b/Makefile
index b00eafe44a8d..a67bf7ce7833 100644
--- a/Makefile
+++ b/Makefile
@@ -146,7 +146,8 @@ objects = dir-item.o inode-map.o \
 	  inode.o file.o find-root.o common/help.o send-dump.o \
 	  common/fsfeatures.o \
 	  common/format-output.o \
-	  common/device-utils.o
+	  common/device-utils.o \
+	  common/hmzoned.o
 cmds_objects = cmds/subvolume.o cmds/filesystem.o cmds/device.o cmds/scrub.o \
 	       cmds/inspect.o cmds/balance.o cmds/send.o cmds/receive.o \
 	       cmds/quota.o cmds/qgroup.o cmds/replace.o check/main.o \
diff --git a/common/hmzoned.c b/common/hmzoned.c
new file mode 100644
index 000000000000..e11e56210709
--- /dev/null
+++ b/common/hmzoned.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *      Naohiro Aota    <naohiro.aota@wdc.com>
+ *      Damien Le Moal  <damien.lemoal@wdc.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#include <sys/ioctl.h>
+
+#include "common/utils.h"
+#include "common/device-utils.h"
+#include "common/messages.h"
+#include "mkfs/common.h"
+#include "common/hmzoned.h"
+
+#define BTRFS_REPORT_NR_ZONES	8192
+
+enum btrfs_zoned_model zoned_model(const char *file)
+{
+	char model[32];
+	int ret;
+
+	ret = queue_param(file, "zoned", model, sizeof(model));
+	if (ret <= 0)
+		return ZONED_NONE;
+
+	if (strncmp(model, "host-aware", 10) == 0)
+		return ZONED_HOST_AWARE;
+	if (strncmp(model, "host-managed", 12) == 0)
+		return ZONED_HOST_MANAGED;
+
+	return ZONED_NONE;
+}
+
+size_t zone_size(const char *file)
+{
+	char chunk[32];
+	int ret;
+
+	ret = queue_param(file, "chunk_sectors", chunk, sizeof(chunk));
+	if (ret <= 0)
+		return 0;
+
+	return strtoul((const char *)chunk, NULL, 10) << 9;
+}
+
+#ifdef BTRFS_ZONED
+bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo, u64 bytenr)
+{
+	unsigned int zno;
+
+	if (!zinfo || zinfo->model == ZONED_NONE)
+		return false;
+
+	zno = bytenr / zinfo->zone_size;
+
+	/*
+	 * Only sequential write required zones on host-managed
+	 * devices cannot be written randomly.
+	 */
+	return zinfo->zones[zno].type == BLK_ZONE_TYPE_SEQWRITE_REQ;
+}
+
+static int report_zones(int fd, const char *file, u64 block_count,
+			struct btrfs_zoned_device_info *zinfo)
+{
+	size_t zone_bytes = zone_size(file);
+	size_t rep_size;
+	u64 sector = 0;
+	struct blk_zone_report *rep;
+	struct blk_zone *zone;
+	unsigned int i, n = 0;
+	int ret;
+
+	/*
+	 * Zones are guaranteed (by the kernel) to be a power of 2 number of
+	 * sectors. Check this here and make sure that zones are not too
+	 * small.
+	 */
+	if (!zone_bytes || !is_power_of_2(zone_bytes)) {
+		error("Illegal zone size %zu (not a power of 2)", zone_bytes);
+		exit(1);
+	}
+	if (zone_bytes < BTRFS_MKFS_SYSTEM_GROUP_SIZE) {
+		error("Illegal zone size %zu (smaller than %d)", zone_bytes,
+		      BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+		exit(1);
+	}
+
+	/* Allocate the zone information array */
+	zinfo->zone_size = zone_bytes;
+	zinfo->nr_zones = block_count / zone_bytes;
+	if (block_count & (zone_bytes - 1))
+		zinfo->nr_zones++;
+	zinfo->zones = calloc(zinfo->nr_zones, sizeof(struct blk_zone));
+	if (!zinfo->zones) {
+		error("No memory for zone information");
+		exit(1);
+	}
+
+	/* Allocate a zone report */
+	rep_size = sizeof(struct blk_zone_report) +
+		sizeof(struct blk_zone) * BTRFS_REPORT_NR_ZONES;
+	rep = malloc(rep_size);
+	if (!rep) {
+		error("No memory for zones report");
+		exit(1);
+	}
+
+	/* Get zone information */
+	zone = (struct blk_zone *)(rep + 1);
+	while (n < zinfo->nr_zones) {
+		memset(rep, 0, rep_size);
+		rep->sector = sector;
+		rep->nr_zones = BTRFS_REPORT_NR_ZONES;
+
+		ret = ioctl(fd, BLKREPORTZONE, rep);
+		if (ret != 0) {
+			error("ioctl BLKREPORTZONE failed (%s)",
+			      strerror(errno));
+			exit(1);
+		}
+
+		if (!rep->nr_zones)
+			break;
+
+		for (i = 0; i < rep->nr_zones; i++) {
+			if (n >= zinfo->nr_zones)
+				break;
+			memcpy(&zinfo->zones[n], &zone[i],
+			       sizeof(struct blk_zone));
+			n++;
+		}
+
+		sector = zone[rep->nr_zones - 1].start +
+			zone[rep->nr_zones - 1].len;
+	}
+
+	free(rep);
+
+	return 0;
+}
+
+#endif
+
+int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
+			struct btrfs_zoned_device_info **zinfo_ret)
+{
+#ifdef BTRFS_ZONED
+	struct btrfs_zoned_device_info *zinfo;
+#endif
+	struct stat st;
+	enum btrfs_zoned_model model;
+	int ret;
+
+	*zinfo_ret = NULL;
+
+	ret = fstat(fd, &st);
+	if (ret < 0) {
+		error("unable to stat %s", file);
+		return -ENOENT;
+	}
+
+	if (!S_ISBLK(st.st_mode))
+		return 0;
+
+	/* Check zone model */
+	model = zoned_model(file);
+	if (model == ZONED_NONE)
+		return 0;
+
+	if (model == ZONED_HOST_MANAGED && !hmzoned) {
+		error(
+"%s: host-managed zoned block device (enable zone block device support with -O hmzoned)",
+		      file);
+		return -EIO;
+	}
+
+	/* Treat host-aware devices as regular devices */
+	if (!hmzoned)
+		return 0;
+
+#ifdef BTRFS_ZONED
+	zinfo = malloc(sizeof(*zinfo));
+	if (!zinfo) {
+		error("No memory for zone information");
+		exit(1);
+	}
+
+	memset(zinfo, 0, sizeof(struct btrfs_zoned_device_info));
+	zinfo->model = model;
+
+	/* Get zone information */
+	ret = report_zones(fd, file, btrfs_device_size(fd, &st), zinfo);
+	if (ret != 0) {
+		kfree(zinfo);
+		return ret;
+	}
+	*zinfo_ret = zinfo;
+#else
+	error("%s: Unsupported host-%s zoned block device", file,
+	      model == ZONED_HOST_MANAGED ? "managed" : "aware");
+	if (model == ZONED_HOST_MANAGED)
+		return -EOPNOTSUPP;
+
+	error("%s: handling host-aware block device as a regular disk", file);
+#endif
+	return 0;
+}
diff --git a/common/hmzoned.h b/common/hmzoned.h
new file mode 100644
index 000000000000..098952061bfb
--- /dev/null
+++ b/common/hmzoned.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *      Naohiro Aota    <naohiro.aota@wdc.com>
+ *      Damien Le Moal  <damien.lemoal@wdc.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef __BTRFS_HMZONED_H__
+#define __BTRFS_HMZONED_H__
+
+#ifdef BTRFS_ZONED
+#include <linux/blkzoned.h>
+#else
+struct blk_zone {
+	int dummy;
+};
+#endif /* BTRFS_ZONED */
+
+/*
+ * Zoned block device models.
+ */
+enum btrfs_zoned_model {
+	ZONED_NONE = 0,
+	ZONED_HOST_AWARE,
+	ZONED_HOST_MANAGED,
+};
+
+/*
+ * Zone information for a zoned block device.
+ */
+struct btrfs_zoned_device_info {
+	enum btrfs_zoned_model	model;
+	u64			zone_size;
+	u32			nr_zones;
+	struct blk_zone	*zones;
+};
+
+enum btrfs_zoned_model zoned_model(const char *file);
+size_t zone_size(const char *file);
+int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
+			struct btrfs_zoned_device_info **zinfo);
+
+#ifdef BTRFS_ZONED
+bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo, u64 bytenr);
+#else
+static inline bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo,
+				      u64 bytenr)
+{
+	return true;
+}
+#endif /* BTRFS_ZONED */
+
+#endif /* __BTRFS_HMZONED_H__ */
-- 
2.24.0



* [PATCH v5 06/15] btrfs-progs: load and check zone information
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (4 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 05/15] btrfs-progs: Introduce zone block device helper functions Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 07/15] btrfs-progs: support discarding zoned device Naohiro Aota
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This patch checks if a device added to btrfs is a zoned block device. If it
is, load the zone information and the zone size for the device.

For a btrfs volume composed of multiple zoned block devices, all devices
must have the same zone size.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
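The "same zone size on all devices" rule can be sketched as a standalone
check. This is only an illustration modelled on the loop added to
btrfs_open_devices(): the function name common_zone_size() and the
convention that a zone size of 0 means "not zoned, skip" are assumptions
made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: zone_sizes[i] == 0 means device i has no zone
 * information and is skipped.  On success, store the common zone size
 * (0 if no device is zoned) in *common and return 0.  Return -EINVAL
 * if two zoned devices disagree.
 */
static int common_zone_size(const uint64_t *zone_sizes, size_t ndevs,
			    uint64_t *common)
{
	uint64_t size = 0;
	size_t i;

	assert(common != NULL);
	assert(zone_sizes != NULL || ndevs == 0);

	for (i = 0; i < ndevs; i++) {
		if (!zone_sizes[i])
			continue;
		if (!size)
			size = zone_sizes[i];	/* first zoned device */
		else if (zone_sizes[i] != size)
			return -EINVAL;
	}

	*common = size;
	return 0;
}
```

The first zoned device fixes the filesystem-wide zone size, and every later
device is compared against it.
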
 common/device-scan.c | 15 +++++++++++++++
 common/hmzoned.h     |  2 ++
 volumes.c            | 31 +++++++++++++++++++++++++++++++
 volumes.h            |  4 ++++
 4 files changed, 52 insertions(+)

diff --git a/common/device-scan.c b/common/device-scan.c
index 48dbd9e19715..548e1322bb70 100644
--- a/common/device-scan.c
+++ b/common/device-scan.c
@@ -29,6 +29,7 @@
 #include "kernel-lib/overflow.h"
 #include "common/path-utils.h"
 #include "common/device-scan.h"
+#include "common/hmzoned.h"
 #include "common/messages.h"
 #include "common/utils.h"
 #include "common-defs.h"
@@ -137,6 +138,19 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
+	ret = btrfs_get_zone_info(fd, path, fs_info->fs_devices->hmzoned,
+				  &device->zone_info);
+	if (ret)
+		goto out;
+	if (fs_info->fs_devices->hmzoned) {
+		if (device->zone_info->zone_size !=
+		    fs_info->fs_devices->zone_size) {
+			error("Device zone sizes differ");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
 	disk_super = (struct btrfs_super_block *)buf;
 	dev_item = &disk_super->dev_item;
 
@@ -197,6 +211,7 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 	return 0;
 
 out:
+	free(device->zone_info);
 	free(device);
 	free(buf);
 	return ret;
diff --git a/common/hmzoned.h b/common/hmzoned.h
index 098952061bfb..d229b946e5ed 100644
--- a/common/hmzoned.h
+++ b/common/hmzoned.h
@@ -18,6 +18,8 @@
 #ifndef __BTRFS_HMZONED_H__
 #define __BTRFS_HMZONED_H__
 
+#include <stdbool.h>
+
 #ifdef BTRFS_ZONED
 #include <linux/blkzoned.h>
 #else
diff --git a/volumes.c b/volumes.c
index 8bfffa5586eb..d92052e19330 100644
--- a/volumes.c
+++ b/volumes.c
@@ -27,6 +27,7 @@
 #include "transaction.h"
 #include "print-tree.h"
 #include "volumes.h"
+#include "common/hmzoned.h"
 #include "common/utils.h"
 #include "kernel-lib/raid56.h"
 
@@ -214,6 +215,8 @@ static int device_list_add(const char *path,
 	u64 found_transid = btrfs_super_generation(disk_super);
 	bool metadata_uuid = (btrfs_super_incompat_flags(disk_super) &
 		BTRFS_FEATURE_INCOMPAT_METADATA_UUID);
+	bool hmzoned = btrfs_super_incompat_flags(disk_super) &
+		BTRFS_FEATURE_INCOMPAT_HMZONED;
 
 	if (metadata_uuid)
 		fs_devices = find_fsid(disk_super->fsid,
@@ -238,8 +241,18 @@ static int device_list_add(const char *path,
 		fs_devices->latest_devid = devid;
 		fs_devices->latest_trans = found_transid;
 		fs_devices->lowest_devid = (u64)-1;
+		fs_devices->hmzoned = hmzoned;
 		device = NULL;
 	} else {
+		if (fs_devices->hmzoned != hmzoned) {
+			if (hmzoned)
+				error(
+			"Cannot add HMZONED device to non-HMZONED file system");
+			else
+				error(
+			"Cannot add non-HMZONED device to HMZONED file system");
+			return -EINVAL;
+		}
 		device = find_device(fs_devices, devid,
 				       disk_super->dev_item.uuid);
 	}
@@ -335,6 +348,7 @@ again:
 		/* free the memory */
 		free(device->name);
 		free(device->label);
+		free(device->zone_info);
 		free(device);
 	}
 
@@ -373,6 +387,8 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
 	struct btrfs_device *device;
 	int ret;
 
+	fs_devices->zone_size = 0;
+
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		if (!device->name) {
 			printk("no name for device %llu, skip it now\n", device->devid);
@@ -396,6 +412,21 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
 		device->fd = fd;
 		if (flags & O_RDWR)
 			device->writeable = 1;
+
+		ret = btrfs_get_zone_info(fd, device->name, fs_devices->hmzoned,
+					  &device->zone_info);
+		if (ret != 0)
+			goto fail;
+		if (!device->zone_info)
+			continue;
+		if (!fs_devices->zone_size) {
+			fs_devices->zone_size = device->zone_info->zone_size;
+		} else if (device->zone_info->zone_size !=
+			   fs_devices->zone_size) {
+			error("Device zone sizes differ");
+			ret = -EINVAL;
+			goto fail;
+		}
 	}
 	return 0;
 fail:
diff --git a/volumes.h b/volumes.h
index 41574f21dd23..d52dbcba0410 100644
--- a/volumes.h
+++ b/volumes.h
@@ -28,6 +28,7 @@ struct btrfs_device {
 	struct list_head dev_list;
 	struct btrfs_root *dev_root;
 	struct btrfs_fs_devices *fs_devices;
+	struct btrfs_zoned_device_info *zone_info;
 
 	u64 total_ios;
 
@@ -87,6 +88,9 @@ struct btrfs_fs_devices {
 
 	int seeding;
 	struct btrfs_fs_devices *seed;
+
+	u64 zone_size;
+	bool hmzoned;
 };
 
 struct btrfs_bio_stripe {
-- 
2.24.0



* [PATCH v5 07/15] btrfs-progs: support discarding zoned device
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (5 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 06/15] btrfs-progs: load and check zone information Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 08/15] btrfs-progs: support zero out on zoned block device Naohiro Aota
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

All zones of a zoned block device should be reset before writing. Support
this by introducing PREP_DEVICE_HMZONED.

This commit exports discard_blocks() and uses it from
btrfs_discard_all_zones().

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
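The body of btrfs_discard_all_zones() is not visible in this part of the
patch, so the following is only a guess at the kind of per-zone decision
such a function has to make, not a copy of it. All types and names below
are stand-ins invented for this sketch. Real code would use struct blk_zone
and the BLKRESETZONE ioctl from <linux/blkzoned.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the zone type and condition values of the kernel UAPI. */
enum zone_kind { ZONE_CONVENTIONAL, ZONE_SEQUENTIAL };
enum zone_state { ZONE_EMPTY, ZONE_WRITTEN };
enum zone_action { ACT_NONE, ACT_DISCARD, ACT_RESET };

struct zone_desc {
	enum zone_kind kind;
	enum zone_state state;
};

/* Decide what to do with one zone before mkfs writes to the device. */
static enum zone_action plan_zone(const struct zone_desc *z)
{
	assert(z != NULL);

	if (z->kind == ZONE_CONVENTIONAL)
		return ACT_DISCARD;	/* an ordinary discard is enough */
	if (z->state == ZONE_EMPTY)
		return ACT_NONE;	/* write pointer already at start */
	return ACT_RESET;		/* rewind the write pointer */
}

/* Count how many zones would need a write pointer reset. */
static size_t count_resets(const struct zone_desc *zones, size_t nr)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		if (plan_zone(&zones[i]) == ACT_RESET)
			n++;
	return n;
}
```

As the comment in the diff notes, a failed reset cannot be ignored the way
a failed discard is, because a sequential zone that is not empty cannot be
written from its start.
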
 common/device-utils.c | 32 ++++++++++++++++++++++++++++++--
 common/device-utils.h |  2 ++
 common/hmzoned.c      | 29 +++++++++++++++++++++++++++++
 common/hmzoned.h      |  6 ++++++
 4 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/common/device-utils.c b/common/device-utils.c
index 7fa9386f4677..2689f157aeea 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -29,6 +29,7 @@
 #include "common/internal.h"
 #include "common/messages.h"
 #include "common/utils.h"
+#include "common/hmzoned.h"
 
 #ifndef BLKDISCARD
 #define BLKDISCARD	_IO(0x12,119)
@@ -49,7 +50,7 @@ static int discard_range(int fd, u64 start, u64 len)
 /*
  * Discard blocks in the given range in 1G chunks, the process is interruptible
  */
-static int discard_blocks(int fd, u64 start, u64 len)
+int discard_blocks(int fd, u64 start, u64 len)
 {
 	while (len > 0) {
 		/* 1G granularity */
@@ -155,6 +156,7 @@ out:
 int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		u64 max_block_count, unsigned opflags)
 {
+	struct btrfs_zoned_device_info *zinfo = NULL;
 	u64 block_count;
 	struct stat st;
 	int i, ret;
@@ -173,7 +175,30 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 	if (max_block_count)
 		block_count = min(block_count, max_block_count);
 
-	if (opflags & PREP_DEVICE_DISCARD) {
+	ret = btrfs_get_zone_info(fd, file, opflags & PREP_DEVICE_HMZONED,
+				  &zinfo);
+	if (ret < 0)
+		return 1;
+
+	if (opflags & PREP_DEVICE_HMZONED) {
+		if (!zinfo) {
+			error("unable to load zone information of %s", file);
+			return 1;
+		}
+		if (opflags & PREP_DEVICE_VERBOSE)
+			printf("Resetting device zones %s (%u zones) ...\n",
+			       file, zinfo->nr_zones);
+		/*
+		 * We cannot ignore zone discard (reset) errors for a zoned
+		 * block device as this could result in the inability to
+		 * write to non-empty sequential zones of the device.
+		 */
+		if (btrfs_discard_all_zones(fd, zinfo)) {
+			error("failed to reset device '%s' zones", file);
+			kfree(zinfo);
+			return 1;
+		}
+	} else if (opflags & PREP_DEVICE_DISCARD) {
 		/*
 		 * We intentionally ignore errors from the discard ioctl.  It
 		 * is not necessary for the mkfs functionality but just an
@@ -198,6 +223,7 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 	if (ret < 0) {
 		errno = -ret;
 		error("failed to zero device '%s': %m", file);
+		kfree(zinfo);
 		return 1;
 	}
 
@@ -207,6 +233,8 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		return 1;
 	}
 
+	kfree(zinfo);
+
 	*block_count_ret = block_count;
 	return 0;
 }
diff --git a/common/device-utils.h b/common/device-utils.h
index d1799323d002..885a46937e0d 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -23,7 +23,9 @@
 #define	PREP_DEVICE_ZERO_END	(1U << 0)
 #define	PREP_DEVICE_DISCARD	(1U << 1)
 #define	PREP_DEVICE_VERBOSE	(1U << 2)
+#define	PREP_DEVICE_HMZONED	(1U << 3)
 
+int discard_blocks(int fd, u64 start, u64 len);
 u64 get_partition_size(const char *dev);
 u64 disk_size(const char *path);
 u64 btrfs_device_size(int fd, struct stat *st);
diff --git a/common/hmzoned.c b/common/hmzoned.c
index e11e56210709..5803b2c17a2b 100644
--- a/common/hmzoned.c
+++ b/common/hmzoned.c
@@ -16,6 +16,7 @@
  */
 
 #include <sys/ioctl.h>
+#include <unistd.h>
 
 #include "common/utils.h"
 #include "common/device-utils.h"
@@ -151,6 +152,34 @@ static int report_zones(int fd, const char *file, u64 block_count,
 	return 0;
 }
 
+/*
+ * Discard blocks in the zones of a zoned block device. Process this
+ * with zone size granularity so that blocks in conventional zones are
+ * discarded using discard_range and blocks in sequential zones are
+ * discarded through a zone reset.
+ */
+int btrfs_discard_all_zones(int fd, struct btrfs_zoned_device_info *zinfo)
+{
+	unsigned int i;
+
+	ASSERT(zinfo);
+
+	/* Zone size granularity */
+	for (i = 0; i < zinfo->nr_zones; i++) {
+		if (zinfo->zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+			discard_blocks(fd, zinfo->zones[i].start << 9,
+				       zinfo->zone_size);
+		} else if (zinfo->zones[i].cond != BLK_ZONE_COND_EMPTY) {
+			struct blk_zone_range range = {
+				zinfo->zones[i].start,
+				zinfo->zone_size >> 9 };
+			if (ioctl(fd, BLKRESETZONE, &range) < 0)
+				return errno;
+		}
+	}
+	return fsync(fd);
+}
+
 #endif
 
 int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
diff --git a/common/hmzoned.h b/common/hmzoned.h
index d229b946e5ed..631780537a77 100644
--- a/common/hmzoned.h
+++ b/common/hmzoned.h
@@ -54,12 +54,18 @@ int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
 
 #ifdef BTRFS_ZONED
 bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo, u64 bytenr);
+int btrfs_discard_all_zones(int fd, struct btrfs_zoned_device_info *zinfo);
 #else
 static inline bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo,
 				      u64 bytenr)
 {
 	return true;
 }
+static inline int btrfs_discard_all_zones(int fd,
+					  struct btrfs_zoned_device_info *zinfo)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* BTRFS_ZONED */
 
 #endif /* __BTRFS_HMZONED_H__ */
-- 
2.24.0



* [PATCH v5 08/15] btrfs-progs: support zero out on zoned block device
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (6 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 07/15] btrfs-progs: support discarding zoned device Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 09/15] btrfs-progs: implement log-structured superblock for HMZONED mode Naohiro Aota
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

If we zero out a region in a sequential write required zone, we cannot
write to the region again until we reset the zone. Thus, we must not zero
out any region in a sequential write required zone.

zero_dev_clamped() is modified to take the zone information, and it calls
zero_zone_blocks() if the device is host managed, to avoid writing to
sequential write required zones.
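
The zone-by-zone walk can be sketched as follows. This is a minimal sketch
under stated assumptions, not the code in this patch: all names are
hypothetical, the zone size is assumed to be a power of two, and the zone
type lookup is replaced by a caller-supplied predicate. Instead of writing
zeros it only counts the bytes that would be written.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many bytes of [ofst, ofst + len) lie in the zone that
 * contains ofst. zone_size is assumed to be a power of two.
 */
static size_t sketch_chunk_in_zone(uint64_t ofst, size_t len, size_t zone_size)
{
	size_t to_zone_end = zone_size - (size_t)(ofst & (zone_size - 1));

	return len < to_zone_end ? len : to_zone_end;
}

/*
 * Count the bytes of [start, start + len) that would actually be
 * zeroed, given a predicate telling whether the zone at an offset is
 * sequential write required.
 */
static size_t sketch_zeroable_bytes(uint64_t start, size_t len,
				    size_t zone_size,
				    int (*is_sequential)(uint64_t ofst))
{
	size_t total = 0;

	while (len > 0) {
		size_t count = sketch_chunk_in_zone(start, len, zone_size);

		if (!is_sequential(start))
			total += count;	/* the real code writes zeros here */
		start += count;
		len -= count;
	}
	return total;
}

/* Example layout: the first 4096 bytes are conventional, the rest is not. */
static int sketch_only_zone0_conventional(uint64_t ofst)
{
	return ofst >= 4096;
}
```

With a 4096-byte zone, a 200-byte request starting at offset 4000 is split
into 96 bytes in the conventional zone and 104 bytes in the following
sequential zone, and only the first part is zeroed.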

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/device-utils.c | 14 +++++++++-----
 common/device-utils.h |  1 +
 common/hmzoned.c      | 28 ++++++++++++++++++++++++++++
 common/hmzoned.h      |  8 ++++++++
 4 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/common/device-utils.c b/common/device-utils.c
index 2689f157aeea..2ac8e7d9802a 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -67,7 +67,7 @@ int discard_blocks(int fd, u64 start, u64 len)
 	return 0;
 }
 
-static int zero_blocks(int fd, off_t start, size_t len)
+int zero_blocks(int fd, off_t start, size_t len)
 {
 	char *buf = malloc(len);
 	int ret = 0;
@@ -86,7 +86,8 @@ static int zero_blocks(int fd, off_t start, size_t len)
 #define ZERO_DEV_BYTES SZ_2M
 
 /* don't write outside the device by clamping the region to the device size */
-static int zero_dev_clamped(int fd, off_t start, ssize_t len, u64 dev_size)
+static int zero_dev_clamped(int fd, struct btrfs_zoned_device_info *zinfo,
+			    off_t start, ssize_t len, u64 dev_size)
 {
 	off_t end = max(start, start + len);
 
@@ -99,6 +100,9 @@ static int zero_dev_clamped(int fd, off_t start, ssize_t len, u64 dev_size)
 	start = min_t(u64, start, dev_size);
 	end = min_t(u64, end, dev_size);
 
+	if (zinfo && zinfo->model == ZONED_HOST_MANAGED)
+		return zero_zone_blocks(fd, zinfo, start, end - start);
+
 	return zero_blocks(fd, start, end - start);
 }
 
@@ -212,12 +216,12 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		}
 	}
 
-	ret = zero_dev_clamped(fd, 0, ZERO_DEV_BYTES, block_count);
+	ret = zero_dev_clamped(fd, zinfo, 0, ZERO_DEV_BYTES, block_count);
 	for (i = 0 ; !ret && i < BTRFS_SUPER_MIRROR_MAX; i++)
-		ret = zero_dev_clamped(fd, btrfs_sb_offset(i),
+		ret = zero_dev_clamped(fd, zinfo, btrfs_sb_offset(i),
 				       BTRFS_SUPER_INFO_SIZE, block_count);
 	if (!ret && (opflags & PREP_DEVICE_ZERO_END))
-		ret = zero_dev_clamped(fd, block_count - ZERO_DEV_BYTES,
+		ret = zero_dev_clamped(fd, zinfo, block_count - ZERO_DEV_BYTES,
 				       ZERO_DEV_BYTES, block_count);
 
 	if (ret < 0) {
diff --git a/common/device-utils.h b/common/device-utils.h
index 885a46937e0d..7d5b622b8957 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -26,6 +26,7 @@
 #define	PREP_DEVICE_HMZONED	(1U << 3)
 
 int discard_blocks(int fd, u64 start, u64 len);
+int zero_blocks(int fd, off_t start, size_t len);
 u64 get_partition_size(const char *dev);
 u64 disk_size(const char *path);
 u64 btrfs_device_size(int fd, struct stat *st);
diff --git a/common/hmzoned.c b/common/hmzoned.c
index 5803b2c17a2b..484877743948 100644
--- a/common/hmzoned.c
+++ b/common/hmzoned.c
@@ -180,6 +180,34 @@ int btrfs_discard_all_zones(int fd, struct btrfs_zoned_device_info *zinfo)
 	return fsync(fd);
 }
 
+int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
+		     size_t len)
+{
+	size_t zone_len = zinfo->zone_size;
+	off_t ofst = start;
+	size_t count;
+	int ret;
+
+	/* Make sure that zero_blocks does not write sequential zones */
+	while (len > 0) {
+		/* Limit zero_blocks to a single zone */
+		count = min_t(size_t, len, zone_len);
+		if (count > zone_len - (ofst & (zone_len - 1)))
+			count = zone_len - (ofst & (zone_len - 1));
+
+		if (!zone_is_sequential(zinfo, ofst)) {
+			ret = zero_blocks(fd, ofst, count);
+			if (ret != 0)
+				return ret;
+		}
+
+		len -= count;
+		ofst += count;
+	}
+
+	return 0;
+}
+
 #endif
 
 int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
diff --git a/common/hmzoned.h b/common/hmzoned.h
index 631780537a77..a902717335b0 100644
--- a/common/hmzoned.h
+++ b/common/hmzoned.h
@@ -55,6 +55,8 @@ int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
 #ifdef BTRFS_ZONED
 bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo, u64 bytenr);
 int btrfs_discard_all_zones(int fd, struct btrfs_zoned_device_info *zinfo);
+int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
+		     size_t len);
 #else
 static inline bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo,
 				      u64 bytenr)
@@ -66,6 +68,12 @@ static inline int btrfs_discard_all_zones(int fd,
 {
 	return -EOPNOTSUPP;
 }
+static inline int zero_zone_blocks(int fd,
+				   struct btrfs_zoned_device_info *zinfo,
+				   off_t start, size_t len)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* BTRFS_ZONED */
 
 #endif /* __BTRFS_HMZONED_H__ */
-- 
2.24.0



* [PATCH v5 09/15] btrfs-progs: implement log-structured superblock for HMZONED mode
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (7 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 08/15] btrfs-progs: support zero out on zoned block device Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 10/15] btrfs-progs: align device extent allocation to zone boundary Naohiro Aota
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

The superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite data in a
sequential write required zone, we cannot place the superblock in such a
zone. One easy solution is to place the superblock and its copies only in
conventional zones. However, this method has two downsides. The first is a
reduced number of superblock copies: the second copy of the superblock is
located at 256GB, which is in a sequential write required zone on typical
devices on the market today, so the number of superblocks (primary plus
copies) is limited to two. The second downside is that we cannot support
devices which have no conventional zones at all.

To solve these two problems, we employ superblock log writing. It uses two
zones as a circular buffer to write updated superblocks. Once the first
zone is filled up, we start writing into the second zone and reset the
first one. We can determine the position of the latest superblock by
reading the write pointer information from the device.
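
How the two zone conditions select the log head can be sketched as a pure
function. This is an illustrative sketch only: the enum and function names
are hypothetical, and -EINVAL stands in for whatever error the real code
reports for a state the log should never reach.

```c
#include <errno.h>

/* Hypothetical stand-ins for the empty / in-use / full zone conditions. */
enum sketch_log_cond { LOG_EMPTY, LOG_IN_USE, LOG_FULL };

/*
 * Return 0 or 1 for the zone whose write pointer marks the log head,
 * -ENOENT if no superblock has been written yet, or -EINVAL for a
 * combination the log should never reach.
 */
static int sketch_log_head_zone(enum sketch_log_cond z0,
				enum sketch_log_cond z1)
{
	if (z0 == LOG_EMPTY && z1 == LOG_EMPTY)
		return -ENOENT;
	if (z0 == LOG_FULL && z1 == LOG_FULL)
		return -EINVAL;
	if (z0 != LOG_FULL && (z1 == LOG_EMPTY || z1 == LOG_FULL))
		return 0;
	if (z0 == LOG_FULL)
		return 1;
	return -EINVAL;
}
```

The latest superblock is then the one stored just before the selected write
pointer.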

The following zones are reserved as the circular buffer on HMZONED btrfs.

- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever is lower, and
  the zone next to it
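
As a worked example of the list above, the first log zone of each copy can
be computed as below. This is a sketch with hypothetical names; it assumes
the usual btrfs superblock offsets of 64KiB, 64MiB and 256GiB and a mirror
index in the range 0..2.

```c
#include <stdint.h>

#define SKETCH_SB_MIRROR_MAX 3

/* Byte offsets of the fixed superblock locations: 64KiB, 64MiB, 256GiB. */
static uint64_t sketch_sb_offset(int mirror)
{
	static const uint64_t offsets[SKETCH_SB_MIRROR_MAX] = {
		64ULL << 10, 64ULL << 20, 256ULL << 30,
	};

	return offsets[mirror];
}

/*
 * First of the two log zones reserved for the given superblock copy.
 * The caller must pass 0 <= mirror < SKETCH_SB_MIRROR_MAX.
 */
static uint32_t sketch_sb_zone_number(uint64_t zone_size, int mirror)
{
	uint64_t zone;

	switch (mirror) {
	case 0:
		return 0;
	case 1:
		return 16;
	default:
		/* Zone holding the 256GiB offset, capped at zone 1024. */
		zone = sketch_sb_offset(mirror) / zone_size;
		return zone < 1024 ? (uint32_t)zone : 1024;
	}
}
```

With 256MiB zones the second copy lands in zones 1024 and 1025; with 1GiB
zones it lands in zones 256 and 257.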

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 cmds/inspect-dump-super.c |   3 +-
 common/device-scan.c      |   4 +-
 common/device-utils.c     |  17 ++-
 common/hmzoned.c          | 227 ++++++++++++++++++++++++++++++++++++++
 common/hmzoned.h          |  23 ++++
 disk-io.c                 |  14 +--
 kerncompat.h              |   7 ++
 7 files changed, 281 insertions(+), 14 deletions(-)

diff --git a/cmds/inspect-dump-super.c b/cmds/inspect-dump-super.c
index ddb2120fb397..e49dec560ca7 100644
--- a/cmds/inspect-dump-super.c
+++ b/cmds/inspect-dump-super.c
@@ -34,6 +34,7 @@
 #include "cmds/commands.h"
 #include "crypto/crc32c.h"
 #include "common/help.h"
+#include "common/hmzoned.h"
 
 static int check_csum_sblock(void *sb, int csum_size, u16 csum_type)
 {
@@ -491,7 +492,7 @@ static int load_and_dump_sb(char *filename, int fd, u64 sb_bytenr, int full,
 
 	sb = (struct btrfs_super_block *)super_block_data;
 
-	ret = pread64(fd, super_block_data, BTRFS_SUPER_INFO_SIZE, sb_bytenr);
+	ret = sbread(fd, super_block_data, sb_bytenr);
 	if (ret != BTRFS_SUPER_INFO_SIZE) {
 		/* check if the disk if too short for further superblock */
 		if (ret == 0 && errno == 0)
diff --git a/common/device-scan.c b/common/device-scan.c
index 548e1322bb70..7760ce50ad72 100644
--- a/common/device-scan.c
+++ b/common/device-scan.c
@@ -202,7 +202,7 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_device_bytes_used(dev_item, device->bytes_used);
 	memcpy(&dev_item->uuid, device->uuid, BTRFS_UUID_SIZE);
 
-	ret = pwrite(fd, buf, sectorsize, BTRFS_SUPER_INFO_OFFSET);
+	ret = sbwrite(fd, buf, BTRFS_SUPER_INFO_OFFSET);
 	BUG_ON(ret != sectorsize);
 
 	free(buf);
@@ -279,7 +279,7 @@ int btrfs_device_already_in_root(struct btrfs_root *root, int fd,
 		ret = -ENOMEM;
 		goto out;
 	}
-	ret = pread(fd, buf, BTRFS_SUPER_INFO_SIZE, super_offset);
+	ret = sbread(fd, buf, super_offset);
 	if (ret != BTRFS_SUPER_INFO_SIZE)
 		goto brelse;
 
diff --git a/common/device-utils.c b/common/device-utils.c
index 2ac8e7d9802a..d7bbac0e1730 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -231,10 +231,19 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		return 1;
 	}
 
-	ret = btrfs_wipe_existing_sb(fd);
-	if (ret < 0) {
-		error("cannot wipe superblocks on %s", file);
-		return 1;
+	if (!zinfo) {
+		ret = btrfs_wipe_existing_sb(fd);
+		if (ret < 0) {
+			error("cannot wipe superblocks on %s", file);
+			return 1;
+		}
+	} else {
+		ret = btrfs_wipe_sb_zones(fd, zinfo);
+		if (ret < 0) {
+			error("cannot wipe superblock log zones on %s", file);
+			kfree(zinfo);
+			return 1;
+		}
 	}
 
 	kfree(zinfo);
diff --git a/common/hmzoned.c b/common/hmzoned.c
index 484877743948..5080bd7dea5b 100644
--- a/common/hmzoned.c
+++ b/common/hmzoned.c
@@ -18,6 +18,7 @@
 #include <sys/ioctl.h>
 #include <unistd.h>
 
+#include "disk-io.h"
 #include "common/utils.h"
 #include "common/device-utils.h"
 #include "common/messages.h"
@@ -56,6 +57,24 @@ size_t zone_size(const char *file)
 }
 
 #ifdef BTRFS_ZONED
+static u32 sb_zone_number(u64 zone_size, int mirror)
+{
+	ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
+
+	switch (mirror) {
+	case 0:
+		return 0;
+	case 1:
+		return 16;
+	case 2:
+		return min(btrfs_sb_offset(mirror) / zone_size, 1024ULL);
+	default:
+		BUG();
+	}
+
+	return 0;
+}
+
 bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo, u64 bytenr)
 {
 	unsigned int zno;
@@ -180,6 +199,39 @@ int btrfs_discard_all_zones(int fd, struct btrfs_zoned_device_info *zinfo)
 	return fsync(fd);
 }
 
+int btrfs_wipe_sb_zones(int fd, struct btrfs_zoned_device_info *zinfo)
+{
+	struct blk_zone_range range;
+	int i;
+
+	if (!zinfo)
+		return 0;
+
+	if (zinfo->model == ZONED_NONE)
+		return 0;
+
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+		u32 sb_pos = sb_zone_number(zinfo->zone_size, i);
+
+		if (sb_pos + 1 >= zinfo->nr_zones)
+			break;
+
+		range.sector = (sb_pos * zinfo->zone_size) >> SECTOR_SHIFT;
+		range.nr_sectors = (2 * zinfo->zone_size) >> SECTOR_SHIFT;
+
+		if (ioctl(fd, BLKRESETZONE, &range) < 0) {
+			error("failed to reset zone %u: %s",
+			      sb_pos, strerror(errno));
+			return 1;
+		}
+	}
+
+	if (fsync(fd))
+		return 1;
+
+	return 0;
+}
+
 int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
 		     size_t len)
 {
@@ -208,6 +260,181 @@ int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
 	return 0;
 }
 
+static int sb_write_pointer(struct blk_zone *zones, u64 *wp_ret)
+{
+	bool empty[2];
+	bool full[2];
+	sector_t sector;
+
+	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	}
+
+	empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY;
+	empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY;
+	full[0] = zones[0].cond == BLK_ZONE_COND_FULL;
+	full[1] = zones[1].cond == BLK_ZONE_COND_FULL;
+
+	/*
+	 * Possible state of log buffer zones
+	 *
+	 *   E I F
+	 * E * x 0
+	 * I 0 x 0
+	 * F 1 1 x
+	 *
+	 * Row: zones[0]
+	 * Col: zones[1]
+	 * State:
+	 *   E: Empty, I: In-Use, F: Full
+	 * Log position:
+	 *   *: Special case, no superblock is written
+	 *   0: Use write pointer of zones[0]
+	 *   1: Use write pointer of zones[1]
+	 *   x: Invalid state
+	 */
+
+	if (empty[0] && empty[1]) {
+		/* special case to distinguish no superblock to read */
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	} else if (full[0] && full[1]) {
+		/* cannot determine which zone has the newer superblock */
+		return -EUCLEAN;
+	} else if (!full[0] && (empty[1] || full[1])) {
+		sector = zones[0].wp;
+	} else if (full[0]) {
+		sector = zones[1].wp;
+	} else {
+		return -EUCLEAN;
+	}
+	*wp_ret = sector << SECTOR_SHIFT;
+	return 0;
+}
+
+size_t btrfs_sb_io(int fd, void *buf, off_t offset, int rw)
+{
+	size_t count = BTRFS_SUPER_INFO_SIZE;
+	struct blk_zone_report *rep;
+	struct blk_zone *zones;
+	const u64 sb_size_sector = BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT;
+	u64 mapped;
+	u32 zone_num;
+	int reset_target;
+	u32 zone_size_sector;
+	size_t rep_size;
+	int mirror = -1;
+	int i;
+	int ret;
+	size_t ret_sz;
+
+	ASSERT(rw == READ || rw == WRITE);
+
+	ret = ioctl(fd, BLKGETZONESZ, &zone_size_sector);
+	if (ret) {
+		error("ioctl BLKGETZONESZ failed (%s)", strerror(errno));
+		exit(1);
+	}
+
+	if (zone_size_sector == 0) {
+		if (rw == READ)
+			return pread64(fd, buf, count, offset);
+		return pwrite64(fd, buf, count, offset);
+	}
+
+	ASSERT(IS_ALIGNED(zone_size_sector, sb_size_sector));
+
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+		if (offset == btrfs_sb_offset(i)) {
+			mirror = i;
+			break;
+		}
+	}
+	ASSERT(mirror != -1);
+
+	zone_num = sb_zone_number(zone_size_sector * 512, mirror);
+
+	rep_size = sizeof(struct blk_zone_report) + sizeof(struct blk_zone) * 2;
+	rep = malloc(rep_size);
+	if (!rep) {
+		error("No memory for zones report");
+		exit(1);
+	}
+
+	memset(rep, 0, rep_size);
+	rep->sector = zone_num * zone_size_sector;
+	rep->nr_zones = 2;
+
+	ret = ioctl(fd, BLKREPORTZONE, rep);
+	if (ret) {
+		error("ioctl BLKREPORTZONE failed (%s)", strerror(errno));
+		exit(1);
+	}
+	if (rep->nr_zones != 2) {
+		if (errno == ENOENT || errno == 0)
+			return (rw == WRITE ? count : 0);
+		error("failed to read zone info of %u and %u: %s", zone_num,
+		      zone_num + 1, strerror(errno));
+		free(rep);
+		return 0;
+	}
+
+	zones = (struct blk_zone *)(rep + 1);
+
+	ret = sb_write_pointer(zones, &mapped);
+	if (ret != -ENOENT && ret)
+		return -EIO;
+	if (rw == READ) {
+		if (ret != -ENOENT) {
+			if (mapped == zones[0].start << SECTOR_SHIFT)
+				mapped = (zones[1].start + zones[1].len)
+					<< SECTOR_SHIFT;
+			mapped -= BTRFS_SUPER_INFO_SIZE;
+		}
+		return pread64(fd, buf, count, mapped);
+	}
+
+	ret_sz = pwrite64(fd, buf, count, mapped);
+	if (zone_size_sector == 0)
+		return ret_sz;
+
+	if (ret_sz != count)
+		return ret_sz;
+	if (fsync(fd)) {
+		error("failed to synchronize superblock: %s", strerror(errno));
+		exit(1);
+	}
+
+	reset_target = -1;
+	mapped += BTRFS_SUPER_INFO_SIZE;
+	if (mapped == (zones[0].start + zones[0].len) << SECTOR_SHIFT &&
+	    zones[1].cond != BLK_ZONE_COND_EMPTY)
+		reset_target = 1;
+	else if (mapped == (zones[1].start + zones[1].len) << SECTOR_SHIFT &&
+		 zones[0].cond != BLK_ZONE_COND_EMPTY)
+		reset_target = 0;
+
+	if (reset_target != -1) {
+		struct blk_zone_range range = {
+			zone_size_sector * (zone_num + reset_target),
+			zone_size_sector,
+		};
+		if (ioctl(fd, BLKRESETZONE, &range) < 0) {
+			error("failed to reset zone %u: %s",
+			      zone_num + reset_target, strerror(errno));
+			exit(1);
+		}
+		if (fsync(fd)) {
+			error("failed to synchronize zone reset: %s",
+			      strerror(errno));
+			exit(1);
+		}
+	}
+
+	return ret_sz;
+}
+
 #endif
 
 int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
diff --git a/common/hmzoned.h b/common/hmzoned.h
index a902717335b0..920f992dbb93 100644
--- a/common/hmzoned.h
+++ b/common/hmzoned.h
@@ -57,6 +57,16 @@ bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo, u64 bytenr);
 int btrfs_discard_all_zones(int fd, struct btrfs_zoned_device_info *zinfo);
 int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
 		     size_t len);
+size_t btrfs_sb_io(int fd, void *buf, off_t offset, int rw);
+static inline size_t sbread(int fd, void *buf, off_t offset)
+{
+	return btrfs_sb_io(fd, buf, offset, READ);
+}
+static inline size_t sbwrite(int fd, void *buf, off_t offset)
+{
+	return btrfs_sb_io(fd, buf, offset, WRITE);
+}
+int btrfs_wipe_sb_zones(int fd, struct btrfs_zoned_device_info *zinfo);
 #else
 static inline bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo,
 				      u64 bytenr)
@@ -74,6 +84,19 @@ static inline int zero_zone_blocks(int fd,
 {
 	return -EOPNOTSUPP;
 }
+static inline u64 btrfs_map_sb_offset_for_zoned(int fd, u64 offset)
+{
+	return offset;
+}
+#define sbread(fd, buf, offset) \
+	pread64(fd, buf, BTRFS_SUPER_INFO_SIZE, offset)
+#define sbwrite(fd, buf, offset) \
+	pwrite64(fd, buf, BTRFS_SUPER_INFO_SIZE, offset)
+static inline int btrfs_wipe_sb_zones(int fd,
+				      struct btrfs_zoned_device_info *zinfo)
+{
+	return 0;
+}
 #endif /* BTRFS_ZONED */
 
 #endif /* __BTRFS_HMZONED_H__ */
diff --git a/disk-io.c b/disk-io.c
index 659f8b93a7ca..92f781ce4abe 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -35,6 +35,7 @@
 #include "common/rbtree-utils.h"
 #include "common/device-scan.h"
 #include "crypto/hash.h"
+#include "common/hmzoned.h"
 
 /* specified errno for check_tree_block */
 #define BTRFS_BAD_BYTENR		(-1)
@@ -1553,7 +1554,7 @@ int btrfs_read_dev_super(int fd, struct btrfs_super_block *sb, u64 sb_bytenr,
 	u64 bytenr;
 
 	if (sb_bytenr != BTRFS_SUPER_INFO_OFFSET) {
-		ret = pread64(fd, buf, BTRFS_SUPER_INFO_SIZE, sb_bytenr);
+		ret = sbread(fd, buf, sb_bytenr);
 		/* real error */
 		if (ret < 0)
 			return -errno;
@@ -1581,7 +1582,8 @@ int btrfs_read_dev_super(int fd, struct btrfs_super_block *sb, u64 sb_bytenr,
 
 	for (i = 0; i < max_super; i++) {
 		bytenr = btrfs_sb_offset(i);
-		ret = pread64(fd, buf, BTRFS_SUPER_INFO_SIZE, bytenr);
+		ret = sbread(fd, buf, bytenr);
+
 		if (ret < BTRFS_SUPER_INFO_SIZE)
 			break;
 
@@ -1653,9 +1655,8 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 		 * super_copy is BTRFS_SUPER_INFO_SIZE bytes and is
 		 * zero filled, we can use it directly
 		 */
-		ret = pwrite64(device->fd, fs_info->super_copy,
-				BTRFS_SUPER_INFO_SIZE,
-				fs_info->super_bytenr);
+		ret = sbwrite(device->fd, fs_info->super_copy,
+			      fs_info->super_bytenr);
 		if (ret != BTRFS_SUPER_INFO_SIZE) {
 			errno = EIO;
 			error(
@@ -1688,8 +1689,7 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 		 * super_copy is BTRFS_SUPER_INFO_SIZE bytes and is
 		 * zero filled, we can use it directly
 		 */
-		ret = pwrite64(device->fd, fs_info->super_copy,
-				BTRFS_SUPER_INFO_SIZE, bytenr);
+		ret = sbwrite(device->fd, fs_info->super_copy, bytenr);
 		if (ret != BTRFS_SUPER_INFO_SIZE) {
 			errno = EIO;
 			error(
diff --git a/kerncompat.h b/kerncompat.h
index 01fd93a7b540..c38643437747 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -76,6 +76,10 @@
 #define ULONG_MAX       (~0UL)
 #endif
 
+#ifndef SECTOR_SHIFT
+#define SECTOR_SHIFT 9
+#endif
+
 #define __token_glue(a,b,c)	___token_glue(a,b,c)
 #define ___token_glue(a,b,c)	a ## b ## c
 #ifdef DEBUG_BUILD_CHECKS
@@ -162,6 +166,7 @@ typedef long long s64;
 typedef int s32;
 #endif
 
+typedef u64 sector_t;
 
 struct vma_shared { int prio_tree_node; };
 struct vm_area_struct {
@@ -362,6 +367,8 @@ typedef u32 __bitwise __be32;
 typedef u64 __bitwise __le64;
 typedef u64 __bitwise __be64;
 
+#define U64_MAX UINT64_MAX
+
 /* Macros to generate set/get funcs for the struct fields
  * assume there is a lefoo_to_cpu for every type, so lets make a simple
  * one for u8:
-- 
2.24.0



* [PATCH v5 10/15] btrfs-progs: align device extent allocation to zone boundary
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (8 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 09/15] btrfs-progs: implement log-structured superblock for HMZONED mode Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 11/15] btrfs-progs: do sequential allocation in HMZONED mode Naohiro Aota
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

In HMZONED mode, align the device extents to zone boundaries so that a zone
reset affects only that device extent and does not change the state of
blocks in the neighboring device extents. Also, check that a region to be
allocated covers only empty zones and does not overlap any of the
superblock zones.
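
The two checks can be sketched with plain integer arithmetic. This is only
an illustration with hypothetical names: it assumes a power-of-two zone
size and uses an exact half-open interval test, which may differ at the
edges from the check implemented in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round pos up to the next multiple of zone_size (a power of two). */
static uint64_t sketch_zone_align(uint64_t pos, uint64_t zone_size)
{
	return (pos + zone_size - 1) & ~(zone_size - 1);
}

/*
 * Check whether zones [begin, end) overlap the two log zones
 * [sb_zone, sb_zone + 2) reserved for one superblock copy.
 */
static bool sketch_overlaps_sb_zones(uint64_t begin, uint64_t end,
				     uint64_t sb_zone)
{
	return begin < sb_zone + 2 && sb_zone < end;
}
```

For example, with 256MiB zones a search start of 1MiB is rounded up to
256MiB, and a region covering zones 2 to 15 stays clear of the log zones at
0, 1, 16 and 17.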

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/hmzoned.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
 common/hmzoned.h | 23 +++++++++++++++
 kerncompat.h     |  2 ++
 volumes.c        | 76 +++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 163 insertions(+), 8 deletions(-)

diff --git a/common/hmzoned.c b/common/hmzoned.c
index 5080bd7dea5b..2cbf2fc88cb0 100644
--- a/common/hmzoned.c
+++ b/common/hmzoned.c
@@ -24,6 +24,8 @@
 #include "common/messages.h"
 #include "mkfs/common.h"
 #include "common/hmzoned.h"
+#include "volumes.h"
+#include "disk-io.h"
 
 #define BTRFS_REPORT_NR_ZONES	8192
 
@@ -435,6 +437,74 @@ size_t btrfs_sb_io(int fd, void *buf, off_t offset, int rw)
 	return ret_sz;
 }
 
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_is_sequential(zinfo, pos))
+		return true;
+
+	zno = pos / zinfo->zone_size;
+	return zinfo->zones[zno].cond == BLK_ZONE_COND_EMPTY;
+}
+
+/*
+ * btrfs_check_allocatable_zones - check if the specified region is
+ *                                 suitable for allocation
+ * @device:	the device to allocate a region
+ * @pos:	the position of the region
+ * @num_bytes:	the size of the region
+ *
+ * On a non-zoned device, any location is suitable for allocation. On a
+ * zoned device, check that
+ * 1) the region does not cover non-empty sequential zones,
+ * 2) all zones in the region have the same zone type,
+ * 3) the region does not contain a superblock location
+ */
+bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
+				   u64 num_bytes)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u64 nzones, begin, end;
+	u64 sb_pos;
+	bool is_sequential;
+	int i;
+
+	if (!zinfo || zinfo->model == ZONED_NONE)
+		return true;
+
+	nzones = num_bytes / zinfo->zone_size;
+	begin = pos / zinfo->zone_size;
+	end = begin + nzones;
+
+	ASSERT(IS_ALIGNED(pos, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+
+	if (end > zinfo->nr_zones)
+		return false;
+
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+		sb_pos = sb_zone_number(zinfo->zone_size, i);
+		if (!(end < sb_pos || sb_pos + 1 < begin))
+			return false;
+	}
+
+	is_sequential = btrfs_dev_is_sequential(device, pos);
+
+	while (num_bytes) {
+		if (is_sequential && !btrfs_dev_is_empty_zone(device, pos))
+			return false;
+		if (is_sequential != btrfs_dev_is_sequential(device, pos))
+			return false;
+
+		pos += zinfo->zone_size;
+		num_bytes -= zinfo->zone_size;
+	}
+
+	return true;
+}
+
 #endif
 
 int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
diff --git a/common/hmzoned.h b/common/hmzoned.h
index 920f992dbb93..3444e2c1b0f5 100644
--- a/common/hmzoned.h
+++ b/common/hmzoned.h
@@ -19,6 +19,7 @@
 #define __BTRFS_HMZONED_H__
 
 #include <stdbool.h>
+#include "volumes.h"
 
 #ifdef BTRFS_ZONED
 #include <linux/blkzoned.h>
@@ -67,6 +68,8 @@ static inline size_t sbwrite(int fd, void *buf, off_t offset)
 	return btrfs_sb_io(fd, buf, offset, WRITE);
 }
 int btrfs_wipe_sb_zones(int fd, struct btrfs_zoned_device_info *zinfo);
+bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
+				   u64 num_bytes);
 #else
 static inline bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo,
 				      u64 bytenr)
@@ -97,6 +100,26 @@ static inline int btrfs_wipe_sb_zones(int fd,
 {
 	return 0;
 }
+static inline bool btrfs_check_allocatable_zones(struct btrfs_device *device,
+						 u64 pos, u64 num_bytes)
+{
+	return true;
+}
+
 #endif /* BTRFS_ZONED */
 
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	return zone_is_sequential(device->zone_info, pos);
+}
+static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+
+	if (!zinfo || zinfo->model == ZONED_NONE)
+		return pos;
+
+	return ALIGN(pos, zinfo->zone_size);
+}
+
 #endif /* __BTRFS_HMZONED_H__ */
diff --git a/kerncompat.h b/kerncompat.h
index c38643437747..58cdcf921c5e 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -28,6 +28,7 @@
 #include <assert.h>
 #include <stddef.h>
 #include <linux/types.h>
+#include <linux/kernel.h>
 #include <stdint.h>
 
 #include <features.h>
@@ -354,6 +355,7 @@ static inline void assert_trace(const char *assertion, const char *filename,
 
 /* Alignment check */
 #define IS_ALIGNED(x, a)                (((x) & ((typeof(x))(a) - 1)) == 0)
+#define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
 
 static inline int is_power_of_2(unsigned long n)
 {
diff --git a/volumes.c b/volumes.c
index d92052e19330..148169d5b2a2 100644
--- a/volumes.c
+++ b/volumes.c
@@ -496,6 +496,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	int slot;
 	struct extent_buffer *l;
 	u64 min_search_start;
+	u64 zone_size = 0;
 
 	/*
 	 * We don't want to overwrite the superblock on the drive nor any area
@@ -504,6 +505,14 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	 */
 	min_search_start = max(root->fs_info->alloc_start, (u64)SZ_1M);
 	search_start = max(search_start, min_search_start);
+	/*
+	 * For a zoned block device, skip the first zone of the device
+	 * entirely.
+	 */
+	if (device->zone_info)
+		zone_size = device->zone_info->zone_size;
+	search_start = max_t(u64, search_start, zone_size);
+	search_start = btrfs_zone_align(device, search_start);
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -512,6 +521,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	max_hole_start = search_start;
 	max_hole_size = 0;
 
+again:
 	if (search_start >= search_end) {
 		ret = -ENOSPC;
 		goto out;
@@ -556,6 +566,13 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 			goto next;
 
 		if (key.offset > search_start) {
+			if (!btrfs_check_allocatable_zones(device, search_start,
+							   num_bytes)) {
+				search_start += zone_size;
+				btrfs_release_path(path);
+				goto again;
+			}
+
 			hole_size = key.offset - search_start;
 
 			/*
@@ -598,6 +615,13 @@ next:
 	 * search_end may be smaller than search_start.
 	 */
 	if (search_end > search_start) {
+		if (!btrfs_check_allocatable_zones(device, search_start,
+						   num_bytes)) {
+			search_start += zone_size;
+			btrfs_release_path(path);
+			goto again;
+		}
+
 		hole_size = search_end - search_start;
 
 		if (hole_size > max_hole_size) {
@@ -613,6 +637,7 @@ next:
 		ret = 0;
 
 out:
+	ASSERT(zone_size == 0 || IS_ALIGNED(max_hole_start, zone_size));
 	btrfs_free_path(path);
 	*start = max_hole_start;
 	if (len)
@@ -641,6 +666,11 @@ int btrfs_insert_dev_extent(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	struct btrfs_key key;
 
+	/* Check alignment to zone for a zoned block device */
+	ASSERT(!device->zone_info ||
+	       device->zone_info->model != ZONED_HOST_MANAGED ||
+	       IS_ALIGNED(start, device->zone_info->zone_size));
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -1045,17 +1075,13 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int max_stripes = 0;
 	int min_stripes = 1;
 	int sub_stripes = 1;
-	int dev_stripes __attribute__((unused));
-				/* stripes per dev */
+	int dev_stripes;	/* stripes per dev */
 	int devs_max;		/* max devs to use */
-	int devs_min __attribute__((unused));
-				/* min devs needed */
+	int devs_min;		/* min devs needed */
 	int devs_increment __attribute__((unused));
 				/* ndevs has to be a multiple of this */
-	int ncopies __attribute__((unused));
-				/* how many copies to data has */
-	int nparity __attribute__((unused));
-				/* number of stripes worth of bytes to
+	int ncopies;		/* how many copies to data has */
+	int nparity;		/* number of stripes worth of bytes to
 				   store parity information */
 	int looped = 0;
 	int ret;
@@ -1063,6 +1089,8 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int stripe_len = BTRFS_STRIPE_LEN;
 	struct btrfs_key key;
 	u64 offset;
+	bool hmzoned = info->fs_devices->hmzoned;
+	u64 zone_size = info->fs_devices->zone_size;
 
 	if (list_empty(dev_list)) {
 		return -ENOSPC;
@@ -1163,13 +1191,40 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 				    btrfs_super_stripesize(info->super_copy));
 	}
 
+	if (hmzoned) {
+		calc_size = zone_size;
+		max_chunk_size = max(max_chunk_size, zone_size);
+		max_chunk_size = round_down(max_chunk_size, zone_size);
+	}
+
 	/* we don't want a chunk larger than 10% of the FS */
 	percent_max = div_factor(btrfs_super_total_bytes(info->super_copy), 1);
 	max_chunk_size = min(percent_max, max_chunk_size);
 
+	if (hmzoned) {
+		int min_num_stripes = devs_min * dev_stripes;
+		int min_data_stripes = (min_num_stripes - nparity) / ncopies;
+		u64 min_chunk_size = min_data_stripes * zone_size;
+
+		max_chunk_size = max(round_down(max_chunk_size,
+						zone_size),
+				     min_chunk_size);
+	}
+
 again:
 	if (chunk_bytes_by_type(type, calc_size, num_stripes, sub_stripes) >
 	    max_chunk_size) {
+		if (hmzoned) {
+			/*
+			 * calc_size is fixed in HMZONED. Reduce
+			 * num_stripes instead.
+			 */
+			num_stripes = max_chunk_size * ncopies / calc_size;
+			if (num_stripes < min_stripes)
+				return -ENOSPC;
+			goto again;
+		}
+
 		calc_size = max_chunk_size;
 		calc_size /= num_stripes;
 		calc_size /= stripe_len;
@@ -1180,6 +1235,9 @@ again:
 
 	calc_size /= stripe_len;
 	calc_size *= stripe_len;
+
+	ASSERT(!hmzoned || calc_size == zone_size);
+
 	INIT_LIST_HEAD(&private_devs);
 	cur = dev_list->next;
 	index = 0;
@@ -1261,6 +1319,8 @@ again:
 		if (ret < 0)
 			goto out_chunk_map;
 
+		ASSERT(!zone_size || IS_ALIGNED(dev_offset, zone_size));
+
 		device->bytes_used += calc_size;
 		ret = btrfs_update_device(trans, device);
 		if (ret < 0)
-- 
2.24.0
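
The zone alignment used by the device extent search in the patch above can be illustrated in isolation. The following is only a sketch, not the btrfs-progs code: zone_align_up() and first_search_start() are hypothetical names, a zone_size of 0 stands in for "not a zoned device", and the zone size is assumed to be a power of two (which is what a mask-based round-up such as __ALIGN_KERNEL relies on).

```c
#include <assert.h>
#include <stdint.h>

/* Round pos up to the next multiple of zone_size (power of two assumed). */
static uint64_t zone_align_up(uint64_t pos, uint64_t zone_size)
{
	/* zone_size == 0 stands in for a regular device: leave pos alone. */
	if (zone_size == 0)
		return pos;
	assert((zone_size & (zone_size - 1)) == 0);
	return (pos + zone_size - 1) & ~(zone_size - 1);
}

/*
 * First candidate offset for a device extent: never inside the first
 * zone of the device, and always on a zone boundary.
 */
static uint64_t first_search_start(uint64_t search_start, uint64_t zone_size)
{
	if (search_start < zone_size)
		search_start = zone_size;
	return zone_align_up(search_start, zone_size);
}
```

With a 256MiB zone, a search that would normally start at 1MiB is pushed to 256MiB; on a regular device the start offset is returned unchanged.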



* [PATCH v5 11/15] btrfs-progs: do sequential allocation in HMZONED mode
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (9 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 10/15] btrfs-progs: align device extent allocation to zone boundary Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 12/15] btrfs-progs: redirty clean extent buffers in seq Naohiro Aota
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

On HMZONED drives, writes must always be sequential and directed at a block
group zone write pointer position. Thus, block allocation in a block group
must also be done sequentially using an allocation pointer equal to the
block group zone write pointer plus the number of blocks allocated but not
yet written.
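
The allocation pointer described above can be sketched roughly as follows. This is a simplified model, not the code in this patch: struct seq_block_group and seq_alloc() are hypothetical names, and returning -ENOSPC is an assumption made for the sketch (the patch instead moves on to another block group when the current one has no room).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the block group fields used here. */
struct seq_block_group {
	uint64_t start;        /* logical start of the block group */
	uint64_t length;       /* size of the block group in bytes */
	uint64_t alloc_offset; /* bytes handed out so far, relative to start */
};

/*
 * Hand out num bytes at the allocation pointer. Returns 0 and stores the
 * logical address in *start_ret, or -ENOSPC if the group has no room left.
 */
static int seq_alloc(struct seq_block_group *bg, uint64_t num,
		     uint64_t *start_ret)
{
	assert(bg->alloc_offset <= bg->length);
	if (bg->length - bg->alloc_offset < num)
		return -ENOSPC;
	*start_ret = bg->start + bg->alloc_offset;
	bg->alloc_offset += num;
	return 0;
}
```

Because every allocation starts exactly where the previous one ended, writing the allocated blocks in allocation order keeps the device write pointer and the allocation pointer in step.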

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/hmzoned.c | 406 +++++++++++++++++++++++++++++++++++++++++++++++
 common/hmzoned.h |   7 +
 ctree.h          |   6 +
 extent-tree.c    |  16 ++
 4 files changed, 435 insertions(+)

diff --git a/common/hmzoned.c b/common/hmzoned.c
index 2cbf2fc88cb0..f268f360d8f7 100644
--- a/common/hmzoned.c
+++ b/common/hmzoned.c
@@ -29,6 +29,11 @@
 
 #define BTRFS_REPORT_NR_ZONES	8192
 
+/* Invalid allocation pointer value for missing devices */
+#define WP_MISSING_DEV ((u64)-1)
+/* Pseudo write pointer value for conventional zone */
+#define WP_CONVENTIONAL ((u64)-2)
+
 enum btrfs_zoned_model zoned_model(const char *file)
 {
 	char model[32];
@@ -505,6 +510,407 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 	return true;
 }
 
+static int emulate_write_pointer(struct btrfs_fs_info *fs_info,
+				 struct btrfs_block_group_cache *cache,
+				 u64 *offset_ret)
+{
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_key search_key;
+	struct btrfs_key found_key;
+	int slot;
+	int ret;
+	u64 length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	search_key.objectid = cache->key.objectid + cache->key.offset;
+	search_key.type = 0;
+	search_key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	ASSERT(ret != 0);
+	slot = path->slots[0];
+	leaf = path->nodes[0];
+	ASSERT(slot != 0);
+	slot--;
+	btrfs_item_key_to_cpu(leaf, &found_key, slot);
+
+	if (found_key.objectid < cache->key.objectid) {
+		*offset_ret = 0;
+	} else if (found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+		struct btrfs_key extent_item_key;
+
+		if (found_key.objectid != cache->key.objectid) {
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		length = 0;
+
+		/* metadata may have METADATA_ITEM_KEY */
+		if (slot == 0) {
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0)
+				goto out;
+			if (ret == 0) {
+				slot = btrfs_header_nritems(leaf) - 1;
+				btrfs_item_key_to_cpu(leaf, &extent_item_key,
+						      slot);
+			}
+		} else {
+			btrfs_item_key_to_cpu(leaf, &extent_item_key, slot - 1);
+			ret = 0;
+		}
+
+		if (ret == 0 &&
+		    extent_item_key.objectid == cache->key.objectid) {
+			if (extent_item_key.type == BTRFS_METADATA_ITEM_KEY)
+				length = fs_info->nodesize;
+			else if (extent_item_key.type == BTRFS_EXTENT_ITEM_KEY)
+				length = extent_item_key.offset;
+			else {
+				ret = -EUCLEAN;
+				goto out;
+			}
+		}
+
+		*offset_ret = length;
+	} else if (found_key.type == BTRFS_EXTENT_ITEM_KEY ||
+		   found_key.type == BTRFS_METADATA_ITEM_KEY) {
+
+		if (found_key.type == BTRFS_EXTENT_ITEM_KEY)
+			length = found_key.offset;
+		else
+			length = fs_info->nodesize;
+
+		if (!(found_key.objectid >= cache->key.objectid &&
+		       found_key.objectid + length <=
+		       cache->key.objectid + cache->key.offset)) {
+			ret = -EUCLEAN;
+			goto out;
+		}
+		*offset_ret = found_key.objectid + length - cache->key.objectid;
+	} else {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static u64 offset_in_dev_extent(struct map_lookup *map, u64 *alloc_offsets,
+				u64 logical, int idx)
+{
+	u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+	u64 stripe_nr = logical / map->stripe_len;
+	u64 full_stripes_cnt = stripe_nr / map->num_stripes;
+	u32 rest_stripes_cnt = stripe_nr % map->num_stripes;
+	u64 stripe_start, offset;
+	int data_stripes = map->num_stripes / map->sub_stripes;
+	int stripe_idx;
+	int i;
+
+	ASSERT(profile == BTRFS_BLOCK_GROUP_RAID0 ||
+	       profile == BTRFS_BLOCK_GROUP_RAID10);
+
+	stripe_idx = idx / map->sub_stripes;
+
+	if (stripe_idx < rest_stripes_cnt)
+		return map->stripe_len * (full_stripes_cnt + 1);
+
+	for (i = idx + map->sub_stripes; i < map->num_stripes;
+	     i += map->sub_stripes) {
+		if (alloc_offsets[i] != WP_CONVENTIONAL &&
+		    alloc_offsets[i] > map->stripe_len * full_stripes_cnt)
+			return map->stripe_len * (full_stripes_cnt + 1);
+	}
+
+	stripe_start = (full_stripes_cnt * data_stripes + stripe_idx) *
+		map->stripe_len;
+	if (stripe_start >= logical)
+		return full_stripes_cnt * map->stripe_len;
+	offset = min_t(u64, logical - stripe_start, map->stripe_len);
+
+	return full_stripes_cnt * map->stripe_len + offset;
+}
+
+int btrfs_load_block_group_zone_info(struct btrfs_fs_info *fs_info,
+				     struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_device *device;
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+	struct cache_extent *ce;
+	struct map_lookup *map;
+	u64 logical = cache->key.objectid;
+	u64 length = cache->key.offset;
+	u64 physical = 0;
+	int ret = 0;
+	int i, j;
+	u64 zone_size = fs_info->fs_devices->zone_size;
+	u64 *alloc_offsets = NULL;
+	u64 emulated_offset = 0;
+	u32 num_sequential = 0, num_conventional = 0;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	/* Sanity check */
+	if (logical == BTRFS_BLOCK_RESERVED_1M_FOR_SUPER) {
+		if (length + SZ_1M != zone_size) {
+			error("unaligned initial system block group");
+			return -EIO;
+		}
+	} else if (!IS_ALIGNED(length, zone_size)) {
+		error("unaligned block group at %llu + %llu", logical, length);
+		return -EIO;
+	}
+
+	/* Get the chunk mapping */
+	ce = search_cache_extent(&map_tree->cache_tree, logical);
+	if (!ce) {
+		error("failed to find block group at %llu", logical);
+		return -ENOENT;
+	}
+	map = container_of(ce, struct map_lookup, ce);
+
+	/*
+	 * Get the zone type: if the group is mapped to a non-sequential zone,
+	 * there is no need for the allocation offset (fit allocation is OK).
+	 */
+	alloc_offsets = calloc(map->num_stripes, sizeof(*alloc_offsets));
+	if (!alloc_offsets) {
+		error("failed to allocate alloc_offsets");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < map->num_stripes; i++) {
+		bool is_sequential;
+		struct blk_zone zone;
+
+		device = map->stripes[i].dev;
+		physical = map->stripes[i].physical;
+
+		if (device->fd == -1) {
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		}
+
+		is_sequential = btrfs_dev_is_sequential(device, physical);
+		if (is_sequential)
+			num_sequential++;
+		else
+			num_conventional++;
+
+		if (!is_sequential) {
+			alloc_offsets[i] = WP_CONVENTIONAL;
+			continue;
+		}
+
+		/*
+		 * The group is mapped to a sequential zone. Get the zone write
+		 * pointer to determine the allocation offset within the zone.
+		 */
+		WARN_ON(!IS_ALIGNED(physical, zone_size));
+		zone = device->zone_info->zones[physical / zone_size];
+
+		switch (zone.cond) {
+		case BLK_ZONE_COND_OFFLINE:
+		case BLK_ZONE_COND_READONLY:
+			error("Offline/readonly zone %llu",
+			      physical / fs_info->fs_devices->zone_size);
+			ret = -EIO;
+			goto out;
+		case BLK_ZONE_COND_EMPTY:
+			alloc_offsets[i] = 0;
+			break;
+		case BLK_ZONE_COND_FULL:
+			alloc_offsets[i] = zone_size;
+			break;
+		default:
+			/* Partially used zone */
+			alloc_offsets[i] = ((zone.wp - zone.start) << 9);
+			break;
+		}
+	}
+
+	if (num_conventional > 0) {
+		ret = emulate_write_pointer(fs_info, cache, &emulated_offset);
+		if (ret || map->num_stripes == num_conventional) {
+			if (!ret)
+				cache->alloc_offset = emulated_offset;
+			goto out;
+		}
+	}
+
+	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case 0: /* single */
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+		cache->alloc_offset = WP_MISSING_DEV;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV ||
+			    alloc_offsets[i] == WP_CONVENTIONAL)
+				continue;
+			if (cache->alloc_offset == WP_MISSING_DEV)
+				cache->alloc_offset = alloc_offsets[i];
+			if (alloc_offsets[i] == cache->alloc_offset)
+				continue;
+
+			error("write pointer mismatch: block group %llu",
+			      logical);
+			ret = -EIO;
+			goto out;
+		}
+		if (num_conventional && emulated_offset > cache->alloc_offset)
+			ret = -EIO;
+		break;
+	case BTRFS_BLOCK_GROUP_RAID0:
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV) {
+				error(
+			"cannot recover write pointer: block group %llu",
+				      logical);
+				ret = -EIO;
+				goto out;
+			}
+
+			if (alloc_offsets[i] == WP_CONVENTIONAL)
+				alloc_offsets[i] =
+					offset_in_dev_extent(map, alloc_offsets,
+							     emulated_offset,
+							     i);
+
+			/* sanity check */
+			if (i > 0) {
+				if ((alloc_offsets[i] % BTRFS_STRIPE_LEN != 0 &&
+				     alloc_offsets[i - 1] %
+					     BTRFS_STRIPE_LEN != 0) ||
+				    (alloc_offsets[i - 1] < alloc_offsets[i]) ||
+				    (alloc_offsets[i - 1] - alloc_offsets[i] >
+						BTRFS_STRIPE_LEN)) {
+					error(
+				"write pointer mismatch at %d: block group %llu",
+					      i, logical);
+					ret = -EIO;
+					goto out;
+				}
+			}
+
+			cache->alloc_offset += alloc_offsets[i];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID10:
+		/*
+		 * Pass1: check write pointer of RAID1 level: each pointer
+		 * should be equal.
+		 */
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i * map->sub_stripes;
+			u64 offset = WP_MISSING_DEV;
+			int fill = 0, num_conventional = 0;
+
+			for (j = 0; j < map->sub_stripes; j++) {
+				if (alloc_offsets[base+j] == WP_MISSING_DEV) {
+					fill++;
+					continue;
+				}
+				if (alloc_offsets[base+j] == WP_CONVENTIONAL) {
+					fill++;
+					num_conventional++;
+					continue;
+				}
+				if (offset == WP_MISSING_DEV)
+					offset = alloc_offsets[base + j];
+				if (alloc_offsets[base + j] == offset)
+					continue;
+
+				error(
+				"write pointer mismatch: block group %llu",
+				      logical);
+				ret = -EIO;
+				goto out;
+			}
+			if (!fill)
+				continue;
+			/* this RAID0 stripe is free on conventional zones */
+			if (num_conventional == map->sub_stripes)
+				offset = WP_CONVENTIONAL;
+			/* fill WP_MISSING_DEV or WP_CONVENTIONAL */
+			for (j = 0; j < map->sub_stripes; j++)
+				alloc_offsets[base + j] = offset;
+		}
+
+		/* Pass2: check write pointer of RAID0 level */
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i * map->sub_stripes;
+
+			if (alloc_offsets[base] == WP_MISSING_DEV) {
+				error(
+			"cannot recover write pointer: block group %llu",
+				      logical);
+				ret = -EIO;
+				goto out;
+			}
+
+			if (alloc_offsets[base] == WP_CONVENTIONAL)
+				alloc_offsets[base] =
+					offset_in_dev_extent(map, alloc_offsets,
+							     emulated_offset,
+							     base);
+
+			/* sanity check */
+			if (i > 0) {
+				int prev = base - map->sub_stripes;
+
+				if ((alloc_offsets[base] %
+					     BTRFS_STRIPE_LEN != 0 &&
+				     alloc_offsets[prev] %
+					     BTRFS_STRIPE_LEN != 0) ||
+				    (alloc_offsets[prev] <
+					     alloc_offsets[base]) ||
+				    (alloc_offsets[prev] - alloc_offsets[base] >
+						BTRFS_STRIPE_LEN)) {
+					error(
+				"write pointer mismatch: block group %llu",
+					      logical);
+					ret = -EIO;
+					goto out;
+				}
+			}
+
+			cache->alloc_offset += alloc_offsets[base];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/* RAID5/6 is not supported yet */
+	default:
+		error("Unsupported profile %llu",
+		      map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	/* an extent is allocated after the write pointer */
+	if (num_conventional && emulated_offset > cache->alloc_offset)
+		ret = -EIO;
+
+	free(alloc_offsets);
+	return ret;
+}
+
 #endif
 
 int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
diff --git a/common/hmzoned.h b/common/hmzoned.h
index 3444e2c1b0f5..a6b16d0ed35a 100644
--- a/common/hmzoned.h
+++ b/common/hmzoned.h
@@ -70,6 +70,8 @@ static inline size_t sbwrite(int fd, void *buf, off_t offset)
 int btrfs_wipe_sb_zones(int fd, struct btrfs_zoned_device_info *zinfo);
 bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
+int btrfs_load_block_group_zone_info(struct btrfs_fs_info *fs_info,
+				     struct btrfs_block_group_cache *cache);
 #else
 static inline bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo,
 				      u64 bytenr)
@@ -105,6 +107,11 @@ static inline bool btrfs_check_allocatable_zones(struct btrfs_device *device,
 {
 	return true;
 }
+static inline int btrfs_load_block_group_zone_info(
+	struct btrfs_fs_info *fs_info, struct btrfs_block_group_cache *cache)
+{
+	return 0;
+}
 
 #endif /* BTRFS_ZONED */
 
diff --git a/ctree.h b/ctree.h
index 34fd7d00cabf..fe72bd8921b0 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1119,6 +1119,12 @@ struct btrfs_block_group_cache {
          */
         u32 bitmap_low_thresh;
 
+	/*
+	 * Allocation offset for the block group to implement
+	 * sequential allocation. This is used only with HMZONED mode
+	 * enabled.
+	 */
+	u64 alloc_offset;
 };
 
 struct btrfs_device;
diff --git a/extent-tree.c b/extent-tree.c
index 53be4f4c7369..89a8b935b602 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -30,6 +30,7 @@
 #include "volumes.h"
 #include "free-space-cache.h"
 #include "free-space-tree.h"
+#include "common/hmzoned.h"
 #include "common/utils.h"
 
 #define PENDING_EXTENT_INSERT 0
@@ -258,6 +259,14 @@ again:
 	if (cache->ro || !block_group_bits(cache, data))
 		goto new_group;
 
+	if (root->fs_info->fs_devices->hmzoned) {
+		if (cache->key.offset - cache->alloc_offset < num)
+			goto new_group;
+		*start_ret = cache->key.objectid + cache->alloc_offset;
+		cache->alloc_offset += num;
+		return 0;
+	}
+
 	while(1) {
 		ret = find_first_extent_bit(&root->fs_info->free_space_cache,
 					    last, &start, &end, EXTENT_DIRTY);
@@ -2720,6 +2729,10 @@ static int read_one_block_group(struct btrfs_fs_info *fs_info,
 	}
 	cache->space_info = space_info;
 
+	ret = btrfs_load_block_group_zone_info(fs_info, cache);
+	if (ret)
+		return ret;
+
 	set_extent_bits(block_group_cache, cache->key.objectid,
 			cache->key.objectid + cache->key.offset - 1,
 			bit | EXTENT_LOCKED);
@@ -2785,6 +2798,9 @@ btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
 	cache->key.objectid = chunk_offset;
 	cache->key.offset = size;
 
+	ret = btrfs_load_block_group_zone_info(fs_info, cache);
+	BUG_ON(ret);
+
 	cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	cache->used = bytes_used;
 	cache->flags = type;
-- 
2.24.0
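
The way the patch above turns a zone's condition and write pointer into a per-stripe allocation offset can be shown on its own. This is a sketch under stated assumptions: enum zone_cond and struct zone_desc are hypothetical stand-ins for the kernel's zone report structure, and the function name is made up for illustration. As in the patch, the write pointer is taken to be in 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the zone conditions the patch distinguishes. */
enum zone_cond { ZC_EMPTY, ZC_OPEN, ZC_FULL, ZC_READONLY, ZC_OFFLINE };

struct zone_desc {
	uint64_t start; /* zone start, in 512-byte sectors */
	uint64_t wp;    /* write pointer, in 512-byte sectors */
	enum zone_cond cond;
};

/*
 * Store the byte offset of the write pointer within the zone in *offset.
 * Returns 0 on success, -1 if the zone cannot be written at all.
 */
static int zone_alloc_offset(const struct zone_desc *z, uint64_t zone_size,
			     uint64_t *offset)
{
	switch (z->cond) {
	case ZC_OFFLINE:
	case ZC_READONLY:
		return -1;
	case ZC_EMPTY:
		*offset = 0;
		return 0;
	case ZC_FULL:
		*offset = zone_size;
		return 0;
	default:
		/* Partially written zone: convert sectors to bytes. */
		assert(z->wp >= z->start);
		*offset = (z->wp - z->start) << 9;
		return 0;
	}
}
```

Empty and full zones are handled without looking at the write pointer at all, so only the partially written case depends on the reported value.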



* [PATCH v5 12/15] btrfs-progs: redirty clean extent buffers in seq
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (10 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 11/15] btrfs-progs: do sequential allocation in HMZONED mode Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 13/15] btrfs-progs: mkfs: Zoned block device support Naohiro Aota
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Tree-manipulating operations such as merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that the pages in
the node are not written out needlessly. On HMZONED drives, however, this
optimization blocks the following IOs, because cancelling the write-out of
the freed blocks breaks the sequential write order expected by the device.

This patch checks whether the next dirty extent buffer is contiguous with
the previously written one. If it is not, the patch redirties the extent
buffers between the previous one and the next one, so that all dirty
buffers are written sequentially.
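
The gap check can be sketched in isolation as follows. This is a simplified model, not the code in this patch: struct write_cursor and next_write_needs_fill() are hypothetical names, and the caller is assumed to dirty the block at the returned address and then retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block-group write cursor. */
struct write_cursor {
	uint64_t group_start;  /* logical start of the block group */
	uint64_t write_offset; /* bytes written so far, relative to start */
};

/*
 * Decide what to do with the next dirty range [start, end].
 * Returns true and sets *redirty to the address of the block that has to
 * be dirtied first when there is a gap. Returns false and advances the
 * cursor when the range continues the sequential stream.
 */
static bool next_write_needs_fill(struct write_cursor *wc, uint64_t start,
				  uint64_t end, uint64_t *redirty)
{
	uint64_t expected = wc->group_start + wc->write_offset;

	assert(end >= start);
	if (expected < start) {
		*redirty = expected;
		return true;
	}
	wc->write_offset += end + 1 - start;
	return false;
}
```

Each retry fills one block at the front of the gap, so the loop converges once the dirty range begins exactly at the write cursor.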

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/hmzoned.c | 30 ++++++++++++++++++++++++++++++
 common/hmzoned.h |  7 +++++++
 ctree.h          |  1 +
 transaction.c    |  7 +++++++
 4 files changed, 45 insertions(+)

diff --git a/common/hmzoned.c b/common/hmzoned.c
index f268f360d8f7..53c9e1cfd472 100644
--- a/common/hmzoned.c
+++ b/common/hmzoned.c
@@ -907,10 +907,40 @@ out:
 	if (num_conventional && emulated_offset > cache->alloc_offset)
 		ret = -EIO;
 
+	if (!ret)
+		cache->write_offset = cache->alloc_offset;
+
 	free(alloc_offsets);
 	return ret;
 }
 
+bool btrfs_redirty_extent_buffer_for_hmzoned(struct btrfs_fs_info *fs_info,
+					     u64 start, u64 end)
+{
+	u64 next;
+	struct btrfs_block_group_cache *cache;
+	struct extent_buffer *eb;
+
+	if (!fs_info->fs_devices->hmzoned)
+		return false;
+
+	cache = btrfs_lookup_first_block_group(fs_info, start);
+	BUG_ON(!cache);
+
+	if (cache->key.objectid + cache->write_offset < start) {
+		next = cache->key.objectid + cache->write_offset;
+		BUG_ON(next + fs_info->nodesize > start);
+		eb = btrfs_find_create_tree_block(fs_info, next);
+		btrfs_mark_buffer_dirty(eb);
+		free_extent_buffer(eb);
+		return true;
+	}
+
+	cache->write_offset += (end + 1 - start);
+
+	return false;
+}
+
 #endif
 
 int btrfs_get_zone_info(int fd, const char *file, bool hmzoned,
diff --git a/common/hmzoned.h b/common/hmzoned.h
index a6b16d0ed35a..ee2fab311967 100644
--- a/common/hmzoned.h
+++ b/common/hmzoned.h
@@ -72,6 +72,8 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
 int btrfs_load_block_group_zone_info(struct btrfs_fs_info *fs_info,
 				     struct btrfs_block_group_cache *cache);
+bool btrfs_redirty_extent_buffer_for_hmzoned(struct btrfs_fs_info *fs_info,
+					     u64 start, u64 end);
 #else
 static inline bool zone_is_sequential(struct btrfs_zoned_device_info *zinfo,
 				      u64 bytenr)
@@ -112,6 +114,11 @@ static inline int btrfs_load_block_group_zone_info(
 {
 	return 0;
 }
+static inline bool btrfs_redirty_extent_buffer_for_hmzoned(
+	struct btrfs_fs_info *fs_info, u64 start, u64 end)
+{
+	return false;
+}
 
 #endif /* BTRFS_ZONED */
 
diff --git a/ctree.h b/ctree.h
index fe72bd8921b0..7418627cade3 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1125,6 +1125,7 @@ struct btrfs_block_group_cache {
 	 * enabled.
 	 */
 	u64 alloc_offset;
+	u64 write_offset;
 };
 
 struct btrfs_device;
diff --git a/transaction.c b/transaction.c
index 45bb9e1f9de6..7b37f12f118f 100644
--- a/transaction.c
+++ b/transaction.c
@@ -18,6 +18,7 @@
 #include "disk-io.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "common/hmzoned.h"
 
 #include "common/messages.h"
 
@@ -136,10 +137,16 @@ int __commit_transaction(struct btrfs_trans_handle *trans,
 	int ret;
 
 	while(1) {
+again:
 		ret = find_first_extent_bit(tree, 0, &start, &end,
 					    EXTENT_DIRTY);
 		if (ret)
 			break;
+
+		if (btrfs_redirty_extent_buffer_for_hmzoned(fs_info, start,
+							    end))
+			goto again;
+
 		while(start <= end) {
 			eb = find_first_extent_buffer(tree, start);
 			BUG_ON(!eb || eb->start != start);
-- 
2.24.0



* [PATCH v5 13/15] btrfs-progs: mkfs: Zoned block device support
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (11 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 12/15] btrfs-progs: redirty clean extent buffers in seq Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 14/15] btrfs-progs: device-add: support HMZONED device Naohiro Aota
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This patch makes the size of the temporary system group chunk equal to the
device zone size. It also enables PREP_DEVICE_HMZONED if the user enables
the HMZONED feature.

The HMZONED feature is enabled with the option "-O hmzoned". For now, this
feature is incompatible with the source directory setup.
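
The resulting placement of the temporary system block group can be sketched as below. This is an illustration only: the struct and function names are hypothetical, SKETCH_RESERVED_1M stands in for BTRFS_BLOCK_RESERVED_1M_FOR_SUPER (assumed to be 1MiB, as the name suggests), and the non-zoned group size is passed in by the caller rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for BTRFS_BLOCK_RESERVED_1M_FOR_SUPER (assumed to be 1MiB). */
#define SKETCH_RESERVED_1M (1024ULL * 1024)

struct sys_group_layout {
	uint64_t offset; /* where the temporary system group starts */
	uint64_t size;   /* how large it is */
};

/*
 * In HMZONED mode the first two zones are left to the superblock, and
 * the system group covers exactly the zone that follows them.
 */
static struct sys_group_layout sys_group_layout(int hmzoned,
						uint64_t zone_size,
						uint64_t default_size)
{
	struct sys_group_layout l = { SKETCH_RESERVED_1M, default_size };

	if (hmzoned) {
		l.offset = zone_size * 2;
		l.size = zone_size;
	}
	return l;
}

/* Address of the i-th initial tree block (i >= 1), packed back to back. */
static uint64_t mkfs_block_addr(const struct sys_group_layout *l,
				uint32_t nodesize, int i)
{
	assert(i >= 1);
	return l->offset + (uint64_t)nodesize * (uint64_t)(i - 1);
}
```

Packing the initial tree blocks back to back from the start of the zone means they can be written in address order, which is what a sequential-write-required zone expects.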

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 extent-tree.c |  8 ++++++++
 mkfs/common.c | 38 +++++++++++++++++++++++++-------------
 mkfs/common.h |  1 +
 mkfs/main.c   | 50 +++++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 77 insertions(+), 20 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 89a8b935b602..23b8bf44f4fe 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -32,6 +32,7 @@
 #include "free-space-tree.h"
 #include "common/hmzoned.h"
 #include "common/utils.h"
+#include "mkfs/common.h"
 
 #define PENDING_EXTENT_INSERT 0
 #define PENDING_EXTENT_DELETE 1
@@ -2799,6 +2800,13 @@ btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
 	cache->key.offset = size;
 
 	ret = btrfs_load_block_group_zone_info(fs_info, cache);
+	if (ret == -ENOENT &&
+	    cache->key.objectid == fs_info->fs_devices->zone_size * 2) {
+		/* Write pointer for initial SYSTEM block group */
+		cache->write_offset = cache->alloc_offset =
+			fs_info->nodesize * (MKFS_BLOCK_COUNT - 1);
+		ret = 0;
+	}
 	BUG_ON(ret);
 
 	cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
diff --git a/mkfs/common.c b/mkfs/common.c
index 469b88d6a8d3..c7406a3bd230 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -25,6 +25,7 @@
 #include "common/utils.h"
 #include "common/path-utils.h"
 #include "common/device-utils.h"
+#include "common/hmzoned.h"
 #include "mkfs/common.h"
 
 static u64 reference_root_table[] = {
@@ -155,6 +156,13 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	int skinny_metadata = !!(cfg->features &
 				 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA);
 	u64 num_bytes;
+	u64 system_group_offset = BTRFS_BLOCK_RESERVED_1M_FOR_SUPER;
+	u64 system_group_size =  BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+
+	if ((cfg->features & BTRFS_FEATURE_INCOMPAT_HMZONED)) {
+		system_group_offset = cfg->zone_size * 2;
+		system_group_size = cfg->zone_size;
+	}
 
 	buf = malloc(sizeof(*buf) + max(cfg->sectorsize, cfg->nodesize));
 	if (!buf)
@@ -186,7 +194,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 
 	cfg->blocks[MKFS_SUPER_BLOCK] = BTRFS_SUPER_INFO_OFFSET;
 	for (i = 1; i < MKFS_BLOCK_COUNT; i++) {
-		cfg->blocks[i] = BTRFS_BLOCK_RESERVED_1M_FOR_SUPER +
+		cfg->blocks[i] = system_group_offset +
 			cfg->nodesize * (i - 1);
 	}
 
@@ -204,7 +212,10 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_super_stripesize(&super, cfg->stripesize);
 	btrfs_set_super_csum_type(&super, cfg->csum_type);
 	btrfs_set_super_chunk_root_generation(&super, 1);
-	btrfs_set_super_cache_generation(&super, -1);
+	if (cfg->features & BTRFS_FEATURE_INCOMPAT_HMZONED)
+		btrfs_set_super_cache_generation(&super, 0);
+	else
+		btrfs_set_super_cache_generation(&super, -1);
 	btrfs_set_super_incompat_flags(&super, cfg->features);
 	if (cfg->label)
 		__strncpy_null(super.label, cfg->label, BTRFS_LABEL_SIZE - 1);
@@ -320,8 +331,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_device_id(buf, dev_item, 1);
 	btrfs_set_device_generation(buf, dev_item, 0);
 	btrfs_set_device_total_bytes(buf, dev_item, num_bytes);
-	btrfs_set_device_bytes_used(buf, dev_item,
-				    BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+	btrfs_set_device_bytes_used(buf, dev_item, system_group_size);
 	btrfs_set_device_io_align(buf, dev_item, cfg->sectorsize);
 	btrfs_set_device_io_width(buf, dev_item, cfg->sectorsize);
 	btrfs_set_device_sector_size(buf, dev_item, cfg->sectorsize);
@@ -342,14 +352,14 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 
 	/* then we have chunk 0 */
 	btrfs_set_disk_key_objectid(&disk_key, BTRFS_FIRST_CHUNK_TREE_OBJECTID);
-	btrfs_set_disk_key_offset(&disk_key, BTRFS_BLOCK_RESERVED_1M_FOR_SUPER);
+	btrfs_set_disk_key_offset(&disk_key, system_group_offset);
 	btrfs_set_disk_key_type(&disk_key, BTRFS_CHUNK_ITEM_KEY);
 	btrfs_set_item_key(buf, &disk_key, nritems);
 	btrfs_set_item_offset(buf, btrfs_item_nr(nritems), itemoff);
 	btrfs_set_item_size(buf, btrfs_item_nr(nritems), item_size);
 
 	chunk = btrfs_item_ptr(buf, nritems, struct btrfs_chunk);
-	btrfs_set_chunk_length(buf, chunk, BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+	btrfs_set_chunk_length(buf, chunk, system_group_size);
 	btrfs_set_chunk_owner(buf, chunk, BTRFS_EXTENT_TREE_OBJECTID);
 	btrfs_set_chunk_stripe_len(buf, chunk, BTRFS_STRIPE_LEN);
 	btrfs_set_chunk_type(buf, chunk, BTRFS_BLOCK_GROUP_SYSTEM);
@@ -359,7 +369,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_chunk_num_stripes(buf, chunk, 1);
 	btrfs_set_stripe_devid_nr(buf, chunk, 0, 1);
 	btrfs_set_stripe_offset_nr(buf, chunk, 0,
-				   BTRFS_BLOCK_RESERVED_1M_FOR_SUPER);
+				   system_group_offset);
 	nritems++;
 
 	write_extent_buffer(buf, super.dev_item.uuid,
@@ -398,7 +408,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 		sizeof(struct btrfs_dev_extent);
 
 	btrfs_set_disk_key_objectid(&disk_key, 1);
-	btrfs_set_disk_key_offset(&disk_key, BTRFS_BLOCK_RESERVED_1M_FOR_SUPER);
+	btrfs_set_disk_key_offset(&disk_key, system_group_offset);
 	btrfs_set_disk_key_type(&disk_key, BTRFS_DEV_EXTENT_KEY);
 	btrfs_set_item_key(buf, &disk_key, nritems);
 	btrfs_set_item_offset(buf, btrfs_item_nr(nritems), itemoff);
@@ -410,14 +420,13 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_dev_extent_chunk_objectid(buf, dev_extent,
 					BTRFS_FIRST_CHUNK_TREE_OBJECTID);
 	btrfs_set_dev_extent_chunk_offset(buf, dev_extent,
-					  BTRFS_BLOCK_RESERVED_1M_FOR_SUPER);
+					  system_group_offset);
 
 	write_extent_buffer(buf, chunk_tree_uuid,
 		    (unsigned long)btrfs_dev_extent_chunk_tree_uuid(dev_extent),
 		    BTRFS_UUID_SIZE);
 
-	btrfs_set_dev_extent_length(buf, dev_extent,
-				    BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+	btrfs_set_dev_extent_length(buf, dev_extent, system_group_size);
 	nritems++;
 
 	btrfs_set_header_bytenr(buf, cfg->blocks[MKFS_DEV_TREE]);
@@ -464,13 +473,16 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	buf->len = BTRFS_SUPER_INFO_SIZE;
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, BTRFS_SUPER_INFO_SIZE,
-			cfg->blocks[MKFS_SUPER_BLOCK]);
+	ret = sbwrite(fd, buf->data, cfg->blocks[MKFS_SUPER_BLOCK]);
 	if (ret != BTRFS_SUPER_INFO_SIZE) {
 		ret = (ret < 0 ? -errno : -EIO);
 		goto out;
 	}
 
+	ret = fsync(fd);
+	if (ret)
+		goto out;
+
 	ret = 0;
 
 out:
diff --git a/mkfs/common.h b/mkfs/common.h
index 1ca71a4fcce5..b7742dedbae1 100644
--- a/mkfs/common.h
+++ b/mkfs/common.h
@@ -55,6 +55,7 @@ struct btrfs_mkfs_config {
 	u64 num_bytes;
 	/* checksum algorithm to use */
 	enum btrfs_csum_type csum_type;
+	u64 zone_size;
 
 	/* Output fields, set during creation */
 
diff --git a/mkfs/main.c b/mkfs/main.c
index 14e9ae7aeb6d..0aa73cce728b 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -48,6 +48,7 @@
 #include "crypto/crc32c.h"
 #include "common/fsfeatures.h"
 #include "common/box.h"
+#include "common/hmzoned.h"
 
 static int verbose = 1;
 
@@ -68,8 +69,16 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 	u64 bytes_used;
 	u64 chunk_start = 0;
 	u64 chunk_size = 0;
+	u64 system_group_offset = BTRFS_BLOCK_RESERVED_1M_FOR_SUPER;
+	u64 system_group_size = BTRFS_MKFS_SYSTEM_GROUP_SIZE;
 	int ret;
 
+	if (fs_info->fs_devices->hmzoned) {
+		/* Two zones are reserved for superblock */
+		system_group_offset = fs_info->fs_devices->zone_size * 2;
+		system_group_size = fs_info->fs_devices->zone_size;
+	}
+
 	if (mixed)
 		flags |= BTRFS_BLOCK_GROUP_DATA;
 
@@ -89,9 +98,8 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 	 */
 	ret = btrfs_make_block_group(trans, fs_info, bytes_used,
 				     BTRFS_BLOCK_GROUP_SYSTEM,
-				     BTRFS_BLOCK_RESERVED_1M_FOR_SUPER,
-				     BTRFS_MKFS_SYSTEM_GROUP_SIZE);
-	allocation->system += BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+				     system_group_offset, system_group_size);
+	allocation->system += system_group_size;
 	if (ret)
 		return ret;
 
@@ -789,6 +797,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	int metadata_profile_opt = 0;
 	int discard = 1;
 	int ssd = 0;
+	int hmzoned = 0;
 	int force_overwrite = 0;
 	char *source_dir = NULL;
 	bool source_dir_set = false;
@@ -803,6 +812,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	struct mkfs_allocation allocation = { 0 };
 	struct btrfs_mkfs_config mkfs_cfg;
 	enum btrfs_csum_type csum_type = BTRFS_CSUM_TYPE_CRC32;
+	u64 system_group_size;
 
 	crc32c_optimization_init();
 
@@ -934,6 +944,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	if (dev_cnt == 0)
 		print_usage(1);
 
+	hmzoned = features & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
 	if (source_dir_set && dev_cnt > 1) {
 		error("the option -r is limited to a single device");
 		goto error;
@@ -943,6 +955,11 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		goto error;
 	}
 
+	if (source_dir_set && hmzoned) {
+		error("The -r and hmzoned feature are incompatible");
+		exit(1);
+	}
+
 	if (*fs_uuid) {
 		uuid_t dummy_uuid;
 
@@ -974,6 +991,16 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 
 	file = argv[optind++];
 	ssd = is_ssd(file);
+	if (hmzoned) {
+		if (zoned_model(file) == ZONED_NONE) {
+			error("%s: not a zoned block device", file);
+			exit(1);
+		}
+		if (!zone_size(file)) {
+			error("%s: zone size undefined", file);
+			exit(1);
+		}
+	}
 
 	/*
 	* Set default profiles according to number of added devices.
@@ -1130,7 +1157,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	ret = btrfs_prepare_device(fd, file, &dev_block_count, block_count,
 			(zero_end ? PREP_DEVICE_ZERO_END : 0) |
 			(discard ? PREP_DEVICE_DISCARD : 0) |
-			(verbose ? PREP_DEVICE_VERBOSE : 0));
+			(verbose ? PREP_DEVICE_VERBOSE : 0) |
+			(hmzoned ? PREP_DEVICE_HMZONED : 0));
 	if (ret)
 		goto error;
 	if (block_count && block_count > dev_block_count) {
@@ -1141,9 +1169,11 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	}
 
 	/* To create the first block group and chunk 0 in make_btrfs */
-	if (dev_block_count < BTRFS_MKFS_SYSTEM_GROUP_SIZE) {
+	system_group_size = hmzoned ?
+		zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+	if (dev_block_count < system_group_size) {
 		error("device is too small to make filesystem, must be at least %llu",
-				(unsigned long long)BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+				(unsigned long long)system_group_size);
 		goto error;
 	}
 
@@ -1161,6 +1191,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	mkfs_cfg.stripesize = stripesize;
 	mkfs_cfg.features = features;
 	mkfs_cfg.csum_type = csum_type;
+	mkfs_cfg.zone_size = zone_size(file);
 
 	ret = make_btrfs(fd, &mkfs_cfg);
 	if (ret) {
@@ -1244,7 +1275,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 				block_count,
 				(verbose ? PREP_DEVICE_VERBOSE : 0) |
 				(zero_end ? PREP_DEVICE_ZERO_END : 0) |
-				(discard ? PREP_DEVICE_DISCARD : 0));
+				(discard ? PREP_DEVICE_DISCARD : 0) |
+				(hmzoned ? PREP_DEVICE_HMZONED : 0));
 		if (ret) {
 			goto error;
 		}
@@ -1341,6 +1373,10 @@ raid_groups:
 			btrfs_group_profile_str(metadata_profile),
 			pretty_size(allocation.system));
 		printf("SSD detected:       %s\n", ssd ? "yes" : "no");
+		printf("Zoned device:       %s\n", hmzoned ? "yes" : "no");
+		if (hmzoned)
+			printf("Zone size:          %s\n",
+			       pretty_size(fs_info->fs_devices->zone_size));
 		btrfs_parse_features_to_string(features_buf, features);
 		printf("Incompat features:  %s\n", features_buf);
 		printf("Checksum:           %s",
-- 
2.24.0



* [PATCH v5 14/15] btrfs-progs: device-add: support HMZONED device
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (12 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 13/15] btrfs-progs: mkfs: Zoned block device support Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:25 ` [PATCH v5 15/15] btrfs-progs: introduce support for device replace " Naohiro Aota
  2019-12-04  8:30 ` [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This patch checks whether the target file system is flagged as HMZONED. If it
is, the device to be added is flagged PREP_DEVICE_HMZONED. It also adds checks
to prevent mixing non-zoned devices and zoned devices.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 cmds/device.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/cmds/device.c b/cmds/device.c
index 24158308a41b..f85820cb1cc0 100644
--- a/cmds/device.c
+++ b/cmds/device.c
@@ -36,6 +36,7 @@
 #include "common/path-utils.h"
 #include "common/device-utils.h"
 #include "common/device-scan.h"
+#include "common/hmzoned.h"
 #include "mkfs/common.h"
 
 static const char * const device_cmd_group_usage[] = {
@@ -61,6 +62,9 @@ static int cmd_device_add(const struct cmd_struct *cmd,
 	int discard = 1;
 	int force = 0;
 	int last_dev;
+	int res;
+	int hmzoned;
+	struct btrfs_ioctl_feature_flags feature_flags;
 
 	optind = 0;
 	while (1) {
@@ -96,12 +100,35 @@ static int cmd_device_add(const struct cmd_struct *cmd,
 	if (fdmnt < 0)
 		return 1;
 
+	res = ioctl(fdmnt, BTRFS_IOC_GET_FEATURES, &feature_flags);
+	if (res) {
+		error("error getting feature flags '%s': %m", mntpnt);
+		return 1;
+	}
+	hmzoned = feature_flags.incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
 	for (i = optind; i < last_dev; i++){
 		struct btrfs_ioctl_vol_args ioctl_args;
-		int	devfd, res;
+		int	devfd;
 		u64 dev_block_count = 0;
 		char *path;
 
+		if (hmzoned && zoned_model(argv[i]) == ZONED_NONE) {
+			error(
+		"cannot add non-zoned device to HMZONED file system '%s'",
+			      argv[i]);
+			ret++;
+			continue;
+		}
+
+		if (!hmzoned && zoned_model(argv[i]) == ZONED_HOST_MANAGED) {
+			error(
+	"cannot add host managed zoned device to non-HMZONED file system '%s'",
+			      argv[i]);
+			ret++;
+			continue;
+		}
+
 		res = test_dev_for_mkfs(argv[i], force);
 		if (res) {
 			ret++;
@@ -117,7 +144,8 @@ static int cmd_device_add(const struct cmd_struct *cmd,
 
 		res = btrfs_prepare_device(devfd, argv[i], &dev_block_count, 0,
 				PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE |
-				(discard ? PREP_DEVICE_DISCARD : 0));
+				(discard ? PREP_DEVICE_DISCARD : 0) |
+				(hmzoned ? PREP_DEVICE_HMZONED : 0));
 		close(devfd);
 		if (res) {
 			ret++;
-- 
2.24.0



* [PATCH v5 15/15] btrfs-progs: introduce support for device replace HMZONED device
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (13 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 14/15] btrfs-progs: device-add: support HMZONED device Naohiro Aota
@ 2019-12-04  8:25 ` Naohiro Aota
  2019-12-04  8:30 ` [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
  15 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:25 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This patch checks whether the target file system is flagged as HMZONED. If it
is, the replacement target device is prepared with PREP_DEVICE_HMZONED.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 cmds/replace.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/cmds/replace.c b/cmds/replace.c
index 2321aa156fe2..670df68a93f7 100644
--- a/cmds/replace.c
+++ b/cmds/replace.c
@@ -119,6 +119,7 @@ static const char *const cmd_replace_start_usage[] = {
 static int cmd_replace_start(const struct cmd_struct *cmd,
 			     int argc, char **argv)
 {
+	struct btrfs_ioctl_feature_flags feature_flags;
 	struct btrfs_ioctl_dev_replace_args start_args = {0};
 	struct btrfs_ioctl_dev_replace_args status_args = {0};
 	int ret;
@@ -126,6 +127,7 @@ static int cmd_replace_start(const struct cmd_struct *cmd,
 	int c;
 	int fdmnt = -1;
 	int fddstdev = -1;
+	int hmzoned;
 	char *path;
 	char *srcdev;
 	char *dstdev = NULL;
@@ -166,6 +168,13 @@ static int cmd_replace_start(const struct cmd_struct *cmd,
 	if (fdmnt < 0)
 		goto leave_with_error;
 
+	ret = ioctl(fdmnt, BTRFS_IOC_GET_FEATURES, &feature_flags);
+	if (ret) {
+		error("ioctl(GET_FEATURES) on '%s' returns error: %m", path);
+		goto leave_with_error;
+	}
+	hmzoned = feature_flags.incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
 	/* check for possible errors before backgrounding */
 	status_args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS;
 	status_args.result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT;
@@ -260,7 +269,8 @@ static int cmd_replace_start(const struct cmd_struct *cmd,
 	strncpy((char *)start_args.start.tgtdev_name, dstdev,
 		BTRFS_DEVICE_PATH_NAME_MAX);
 	ret = btrfs_prepare_device(fddstdev, dstdev, &dstdev_block_count, 0,
-			PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE);
+			PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE |
+			(hmzoned ? PREP_DEVICE_HMZONED : 0));
 	if (ret)
 		goto leave_with_error;
 
-- 
2.24.0



* [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs
  2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
                   ` (14 preceding siblings ...)
  2019-12-04  8:25 ` [PATCH v5 15/15] btrfs-progs: introduce support for device replace " Naohiro Aota
@ 2019-12-04  8:30 ` Naohiro Aota
  2019-12-04 12:15   ` Vyacheslav Dubeyko
  2019-12-05 14:51   ` Karel Zak
  15 siblings, 2 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-04  8:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This is a proof-of-concept patch to make libblkid zone-aware. It can
probe the magic located at some offset from the beginning of some
specific zone of a device.
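
To illustrate the offset computation: for a zone-relative magic entry, the
1 KiB block to read starts at zonenum * zone size + kboff_inzone, adjusted by
the KiB part of sboff. The following is only a sketch under the assumption of
512-byte sectors; the helper name zone_magic_block_offset() is hypothetical
and not part of libblkid. It widens to 64 bits before shifting, because a
32-bit sector count shifted left by 9 would wrap for zones of 4 GiB or more.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as assumed by the patch */

/*
 * Hypothetical helper (not libblkid API): byte offset of the 1 KiB block
 * that contains the magic of a zone-relative entry.
 */
static uint64_t zone_magic_block_offset(uint32_t zone_size_sector,
					uint64_t zonenum,
					uint64_t kboff_inzone,
					uint32_t sboff)
{
	/* Widen first so the shift cannot overflow 32 bits. */
	uint64_t zone_bytes = (uint64_t)zone_size_sector << SECTOR_SHIFT;
	uint64_t kboff = (zonenum * zone_bytes) >> 10;

	kboff += kboff_inzone;
	/* Same rounding as the non-zoned path: whole KiB of sboff only. */
	return (kboff + (sboff >> 10)) << 10;
}
```

The in-block remainder (sboff & 0x3ff) is still added by the caller when it
compares the magic, as in the existing code.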

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 libblkid/src/blkidP.h            |   4 +
 libblkid/src/probe.c             |  25 +++++-
 libblkid/src/superblocks/btrfs.c | 132 ++++++++++++++++++++++++++++++-
 3 files changed, 157 insertions(+), 4 deletions(-)

diff --git a/libblkid/src/blkidP.h b/libblkid/src/blkidP.h
index f9bbe008406f..5bb6771ee9c6 100644
--- a/libblkid/src/blkidP.h
+++ b/libblkid/src/blkidP.h
@@ -148,6 +148,10 @@ struct blkid_idmag
 
 	long		kboff;		/* kilobyte offset of superblock */
 	unsigned int	sboff;		/* byte offset within superblock */
+
+	int		is_zone;
+	long		zonenum;
+	long		kboff_inzone;
 };
 
 /*
diff --git a/libblkid/src/probe.c b/libblkid/src/probe.c
index f6dd5573d5dd..56e42ac28559 100644
--- a/libblkid/src/probe.c
+++ b/libblkid/src/probe.c
@@ -94,6 +94,7 @@
 #ifdef HAVE_LINUX_CDROM_H
 #include <linux/cdrom.h>
 #endif
+#include <linux/blkzoned.h>
 #ifdef HAVE_SYS_STAT_H
 #include <sys/stat.h>
 #endif
@@ -1009,8 +1010,25 @@ int blkid_probe_get_idmag(blkid_probe pr, const struct blkid_idinfo *id,
 	/* try to detect by magic string */
 	while(mag && mag->magic) {
 		unsigned char *buf;
-
-		off = (mag->kboff + (mag->sboff >> 10)) << 10;
+		uint64_t kboff;
+
+		if (!mag->is_zone)
+			kboff = mag->kboff;
+		else {
+			uint32_t zone_size_sector;
+			int ret;
+
+			ret = ioctl(pr->fd, BLKGETZONESZ, &zone_size_sector);
+			if (ret == EOPNOTSUPP)
+				goto next;
+			if (ret)
+				return -errno;
+			if (zone_size_sector == 0)
+				goto next;
+			kboff = (mag->zonenum * (zone_size_sector << 9)) >> 10;
+			kboff += mag->kboff_inzone;
+		}
+		off = (kboff + (mag->sboff >> 10)) << 10;
 		buf = blkid_probe_get_buffer(pr, off, 1024);
 
 		if (!buf && errno)
@@ -1020,13 +1038,14 @@ int blkid_probe_get_idmag(blkid_probe pr, const struct blkid_idinfo *id,
 				buf + (mag->sboff & 0x3ff), mag->len)) {
 
 			DBG(LOWPROBE, ul_debug("\tmagic sboff=%u, kboff=%ld",
-				mag->sboff, mag->kboff));
+				mag->sboff, kboff));
 			if (offset)
 				*offset = off + (mag->sboff & 0x3ff);
 			if (res)
 				*res = mag;
 			return BLKID_PROBE_OK;
 		}
+next:
 		mag++;
 	}
 
diff --git a/libblkid/src/superblocks/btrfs.c b/libblkid/src/superblocks/btrfs.c
index f0fde700d896..4254220ef423 100644
--- a/libblkid/src/superblocks/btrfs.c
+++ b/libblkid/src/superblocks/btrfs.c
@@ -9,6 +9,9 @@
 #include <unistd.h>
 #include <string.h>
 #include <stdint.h>
+#include <stdbool.h>
+
+#include <linux/blkzoned.h>
 
 #include "superblocks.h"
 
@@ -59,11 +62,131 @@ struct btrfs_super_block {
 	uint8_t label[256];
 } __attribute__ ((__packed__));
 
+#define BTRFS_SUPER_INFO_SIZE 4096
+#define SECTOR_SHIFT 9
+
+#define READ 0
+#define WRITE 1
+
+typedef uint64_t u64;
+typedef uint64_t sector_t;
+
+static int sb_write_pointer(struct blk_zone *zones, u64 *wp_ret)
+{
+	bool empty[2];
+	bool full[2];
+	sector_t sector;
+
+	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	}
+
+	empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY;
+	empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY;
+	full[0] = zones[0].cond == BLK_ZONE_COND_FULL;
+	full[1] = zones[1].cond == BLK_ZONE_COND_FULL;
+
+	/*
+	 * Possible state of log buffer zones
+	 *
+	 *   E I F
+	 * E * x 0
+	 * I 0 x 0
+	 * F 1 1 x
+	 *
+	 * Row: zones[0]
+	 * Col: zones[1]
+	 * State:
+	 *   E: Empty, I: In-Use, F: Full
+	 * Log position:
+	 *   *: Special case, no superblock is written
+	 *   0: Use write pointer of zones[0]
+	 *   1: Use write pointer of zones[1]
+	 *   x: Invalid state
+	 */
+
+	if (empty[0] && empty[1]) {
+		/* special case to distinguish no superblock to read */
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	} else if (full[0] && full[1]) {
+		/* cannot determine which zone has the newer superblock */
+		return -EUCLEAN;
+	} else if (!full[0] && (empty[1] || full[1])) {
+		sector = zones[0].wp;
+	} else if (full[0]) {
+		sector = zones[1].wp;
+	} else {
+		return -EUCLEAN;
+	}
+	*wp_ret = sector << SECTOR_SHIFT;
+	return 0;
+}
+
+static int sb_log_offset(uint32_t zone_size_sector, blkid_probe pr,
+			 uint64_t *offset_ret)
+{
+	uint32_t zone_num = 0;
+	struct blk_zone_report *rep;
+	struct blk_zone *zones;
+	size_t rep_size;
+	int ret;
+	uint64_t wp;
+
+	rep_size = sizeof(struct blk_zone_report) + sizeof(struct blk_zone) * 2;
+	rep = malloc(rep_size);
+	if (!rep)
+		return -errno;
+
+	memset(rep, 0, rep_size);
+	rep->sector = zone_num * zone_size_sector;
+	rep->nr_zones = 2;
+
+	ret = ioctl(pr->fd, BLKREPORTZONE, rep);
+	if (ret)
+		return -errno;
+	if (rep->nr_zones != 2) {
+		free(rep);
+		return 1;
+	}
+
+	zones = (struct blk_zone *)(rep + 1);
+
+	ret = sb_write_pointer(zones, &wp);
+	if (ret != -ENOENT && ret)
+		return -EIO;
+	if (ret != -ENOENT) {
+		if (wp == zones[0].start << SECTOR_SHIFT)
+			wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT;
+		wp -= BTRFS_SUPER_INFO_SIZE;
+	}
+	*offset_ret = wp;
+
+	return 0;
+}
+
 static int probe_btrfs(blkid_probe pr, const struct blkid_idmag *mag)
 {
 	struct btrfs_super_block *bfs;
+	uint32_t zone_size_sector;
+	int ret;
+
+	ret = ioctl(pr->fd, BLKGETZONESZ, &zone_size_sector);
+	if (ret)
+		return errno;
+	if (zone_size_sector != 0) {
+		uint64_t offset = 0;
 
-	bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
+		ret = sb_log_offset(zone_size_sector, pr, &offset);
+		if (ret)
+			return ret;
+		bfs = (struct btrfs_super_block*)
+			blkid_probe_get_buffer(pr, offset,
+					       sizeof(struct btrfs_super_block));
+	} else {
+		bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
+	}
 	if (!bfs)
 		return errno ? -errno : 1;
 
@@ -88,6 +211,13 @@ const struct blkid_idinfo btrfs_idinfo =
 	.magics		=
 	{
 	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40, .kboff = 64 },
+	  /* for HMZONED btrfs */
+	  { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
+	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
+	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zone = 1, .zonenum = 1, .kboff_inzone = 0 },
 	  { NULL }
 	}
 };
-- 
2.24.0



* Re: [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs
  2019-12-04  8:30 ` [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
@ 2019-12-04 12:15   ` Vyacheslav Dubeyko
  2019-12-06  7:03     ` Naohiro Aota
  2019-12-05 14:51   ` Karel Zak
  1 sibling, 1 reply; 22+ messages in thread
From: Vyacheslav Dubeyko @ 2019-12-04 12:15 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel

On Wed, 2019-12-04 at 17:30 +0900, Naohiro Aota wrote:
> This is a proof-of-concept patch to make libblkid zone-aware. It can
> probe the magic located at some offset from the beginning of some
> specific zone of a device.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  libblkid/src/blkidP.h            |   4 +
>  libblkid/src/probe.c             |  25 +++++-
>  libblkid/src/superblocks/btrfs.c | 132
> ++++++++++++++++++++++++++++++-
>  3 files changed, 157 insertions(+), 4 deletions(-)
> 
> diff --git a/libblkid/src/blkidP.h b/libblkid/src/blkidP.h
> index f9bbe008406f..5bb6771ee9c6 100644
> --- a/libblkid/src/blkidP.h
> +++ b/libblkid/src/blkidP.h
> @@ -148,6 +148,10 @@ struct blkid_idmag
>  
>  	long		kboff;		/* kilobyte offset of
> superblock */
>  	unsigned int	sboff;		/* byte offset within
> superblock */
> +
> +	int		is_zone;
> +	long		zonenum;
> +	long		kboff_inzone;
>  };

Maybe it makes sense to add comments for the new fields? What do you
think?

>  
>  /*
> diff --git a/libblkid/src/probe.c b/libblkid/src/probe.c
> index f6dd5573d5dd..56e42ac28559 100644
> --- a/libblkid/src/probe.c
> +++ b/libblkid/src/probe.c
> @@ -94,6 +94,7 @@
>  #ifdef HAVE_LINUX_CDROM_H
>  #include <linux/cdrom.h>
>  #endif
> +#include <linux/blkzoned.h>
>  #ifdef HAVE_SYS_STAT_H
>  #include <sys/stat.h>
>  #endif
> @@ -1009,8 +1010,25 @@ int blkid_probe_get_idmag(blkid_probe pr,
> const struct blkid_idinfo *id,
>  	/* try to detect by magic string */
>  	while(mag && mag->magic) {
>  		unsigned char *buf;
> -
> -		off = (mag->kboff + (mag->sboff >> 10)) << 10;
> +		uint64_t kboff;
> +
> +		if (!mag->is_zone)
> +			kboff = mag->kboff;
> +		else {
> +			uint32_t zone_size_sector;
> +			int ret;
> +
> +			ret = ioctl(pr->fd, BLKGETZONESZ,
> &zone_size_sector);
> +			if (ret == EOPNOTSUPP)

-EOPNOTSUPP? Or is this a libblkid peculiarity?

> +				goto next;
> +			if (ret)
> +				return -errno;
> +			if (zone_size_sector == 0)
> +				goto next;
> +			kboff = (mag->zonenum * (zone_size_sector <<
> 9)) >> 10;
> +			kboff += mag->kboff_inzone;
> +		}
> +		off = (kboff + (mag->sboff >> 10)) << 10;
>  		buf = blkid_probe_get_buffer(pr, off, 1024);
>  
>  		if (!buf && errno)
> @@ -1020,13 +1038,14 @@ int blkid_probe_get_idmag(blkid_probe pr,
> const struct blkid_idinfo *id,
>  				buf + (mag->sboff & 0x3ff), mag->len))
> {
>  
>  			DBG(LOWPROBE, ul_debug("\tmagic sboff=%u,
> kboff=%ld",
> -				mag->sboff, mag->kboff));
> +				mag->sboff, kboff));
>  			if (offset)
>  				*offset = off + (mag->sboff & 0x3ff);
>  			if (res)
>  				*res = mag;
>  			return BLKID_PROBE_OK;
>  		}
> +next:
>  		mag++;
>  	}
>  
> diff --git a/libblkid/src/superblocks/btrfs.c
> b/libblkid/src/superblocks/btrfs.c
> index f0fde700d896..4254220ef423 100644
> --- a/libblkid/src/superblocks/btrfs.c
> +++ b/libblkid/src/superblocks/btrfs.c
> @@ -9,6 +9,9 @@
>  #include <unistd.h>
>  #include <string.h>
>  #include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <linux/blkzoned.h>
>  
>  #include "superblocks.h"
>  
> @@ -59,11 +62,131 @@ struct btrfs_super_block {
>  	uint8_t label[256];
>  } __attribute__ ((__packed__));
>  
> +#define BTRFS_SUPER_INFO_SIZE 4096

I believe that 4K is a very widely used constant.
Are you sure an additional constant needs to be
introduced? In particular, it looks slightly
strange to see a BTRFS-specific constant here.
Maybe the constant should be generalized?

> +#define SECTOR_SHIFT 9

Are you sure that libblkid doesn't already have such a constant?

> +
> +#define READ 0
> +#define WRITE 1
> +
> +typedef uint64_t u64;
> +typedef uint64_t sector_t;

I see the point of introducing the sector_t type.
But is it really necessary to introduce the u64 type?

> +
> +static int sb_write_pointer(struct blk_zone *zones, u64 *wp_ret)
> +{
> +	bool empty[2];
> +	bool full[2];
> +	sector_t sector;
> +
> +	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> +		*wp_ret = zones[0].start << SECTOR_SHIFT;
> +		return -ENOENT;
> +	}
> +
> +	empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY;
> +	empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY;
> +	full[0] = zones[0].cond == BLK_ZONE_COND_FULL;
> +	full[1] = zones[1].cond == BLK_ZONE_COND_FULL;
> +
> +	/*
> +	 * Possible state of log buffer zones
> +	 *
> +	 *   E I F
> +	 * E * x 0
> +	 * I 0 x 0
> +	 * F 1 1 x
> +	 *
> +	 * Row: zones[0]
> +	 * Col: zones[1]
> +	 * State:
> +	 *   E: Empty, I: In-Use, F: Full
> +	 * Log position:
> +	 *   *: Special case, no superblock is written
> +	 *   0: Use write pointer of zones[0]
> +	 *   1: Use write pointer of zones[1]
> +	 *   x: Invalid state
> +	 */
> +
> +	if (empty[0] && empty[1]) {
> +		/* special case to distinguish no superblock to read */
> +		*wp_ret = zones[0].start << SECTOR_SHIFT;


So, even if we return an error, somebody will still check
the *wp_ret value? That looks slightly unexpected.

> +		return -ENOENT;
> +	} else if (full[0] && full[1]) {
> +		/* cannot determine which zone has the newer superblock
> */
> +		return -EUCLEAN;
> +	} else if (!full[0] && (empty[1] || full[1])) {
> +		sector = zones[0].wp;
> +	} else if (full[0]) {
> +		sector = zones[1].wp;
> +	} else {
> +		return -EUCLEAN;
> +	}
> +	*wp_ret = sector << SECTOR_SHIFT;
> +	return 0;
> +}
> +
> +static int sb_log_offset(uint32_t zone_size_sector, blkid_probe pr,
> +			 uint64_t *offset_ret)
> +{
> +	uint32_t zone_num = 0;
> +	struct blk_zone_report *rep;
> +	struct blk_zone *zones;
> +	size_t rep_size;
> +	int ret;
> +	uint64_t wp;
> +
> +	rep_size = sizeof(struct blk_zone_report) + sizeof(struct
> blk_zone) * 2;
> +	rep = malloc(rep_size);
> +	if (!rep)
> +		return -errno;
> +
> +	memset(rep, 0, rep_size);
> +	rep->sector = zone_num * zone_size_sector;
> +	rep->nr_zones = 2;
> +
> +	ret = ioctl(pr->fd, BLKREPORTZONE, rep);
> +	if (ret)
> +		return -errno;

So the valid case is when ioctl returns 0? Am I correct?


> +	if (rep->nr_zones != 2) {
> +		free(rep);
> +		return 1;
> +	}
> +
> +	zones = (struct blk_zone *)(rep + 1);
> +
> +	ret = sb_write_pointer(zones, &wp);
> +	if (ret != -ENOENT && ret)
> +		return -EIO;


If ret is positive then we could return the error. Am I correct?


> +	if (ret != -ENOENT) {
> +		if (wp == zones[0].start << SECTOR_SHIFT)
> +			wp = (zones[1].start + zones[1].len) <<
> SECTOR_SHIFT;
> +		wp -= BTRFS_SUPER_INFO_SIZE;
> +	}
> +	*offset_ret = wp;
> +
> +	return 0;
> +}
> +
>  static int probe_btrfs(blkid_probe pr, const struct blkid_idmag
> *mag)
>  {
>  	struct btrfs_super_block *bfs;
> +	uint32_t zone_size_sector;
> +	int ret;
> +
> +	ret = ioctl(pr->fd, BLKGETZONESZ, &zone_size_sector);
> +	if (ret)
> +		return errno;

You returned -errno for another ioctls above. Is everything correct
here?

> +	if (zone_size_sector != 0) {
> +		uint64_t offset = 0;
>  
> -	bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
> +		ret = sb_log_offset(zone_size_sector, pr, &offset);
> +		if (ret)
> +			return ret;

What about a positive value of ret? I suppose it needs to return ret
only if we have an error. Am I correct?

Thanks,
Viacheslav Dubeyko.

> +		bfs = (struct btrfs_super_block*)
> +			blkid_probe_get_buffer(pr, offset,
> +					       sizeof(struct
> btrfs_super_block));
> +	} else {
> +		bfs = blkid_probe_get_sb(pr, mag, struct
> btrfs_super_block);
> +	}
>  	if (!bfs)
>  		return errno ? -errno : 1;
>  
> @@ -88,6 +211,13 @@ const struct blkid_idinfo btrfs_idinfo =
>  	.magics		=
>  	{
>  	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40, .kboff = 64
> },
> +	  /* for HMZONED btrfs */
> +	  { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
> +	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
> +	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
> +	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
> +	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
> +	    .is_zone = 1, .zonenum = 1, .kboff_inzone = 0 },
>  	  { NULL }
>  	}
>  };



* Re: [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs
  2019-12-04  8:30 ` [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
  2019-12-04 12:15   ` Vyacheslav Dubeyko
@ 2019-12-05 14:51   ` Karel Zak
  2019-12-06  7:06     ` Naohiro Aota
  1 sibling, 1 reply; 22+ messages in thread
From: Karel Zak @ 2019-12-05 14:51 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik,
	Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On Wed, Dec 04, 2019 at 05:30:23PM +0900, Naohiro Aota wrote:
>  	while(mag && mag->magic) {
>  		unsigned char *buf;
> -
> -		off = (mag->kboff + (mag->sboff >> 10)) << 10;
> +		uint64_t kboff;
> +
> +		if (!mag->is_zone)
> +			kboff = mag->kboff;
> +		else {
> +			uint32_t zone_size_sector;
> +			int ret;
> +
> +			ret = ioctl(pr->fd, BLKGETZONESZ, &zone_size_sector);

I guess this ioctl always returns the same number, right?

If so, then you don't want to call it every time libmount compares
a magic string. It would be better to call it only once from
blkid_probe_set_device() and save zone_size_sector in struct
blkid_probe.
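
A rough sketch of that idea follows. It is an illustration only: struct
probe_ctx is a hypothetical stand-in for struct blkid_probe, and
fake_query_zone_size() stands in for ioctl(fd, BLKGETZONESZ, ...) so the
example does not need a real zoned device.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct blkid_probe; not the libblkid layout. */
struct probe_ctx {
	int fd;
	uint32_t zone_size_sector;	/* 0: not zoned, or size unknown */
};

/* Stand-in for ioctl(fd, BLKGETZONESZ, &out); counts how often it runs. */
static unsigned int fake_query_calls;

static int fake_query_zone_size(int fd, uint32_t *out)
{
	(void)fd;
	fake_query_calls++;
	*out = 524288;	/* 256 MiB zones, in 512-byte sectors */
	return 0;
}

/* Query the zone size once, when the device is attached to the probe. */
static void probe_set_device(struct probe_ctx *pr, int fd,
			     int (*query)(int, uint32_t *))
{
	pr->fd = fd;
	pr->zone_size_sector = 0;
	if (query && query(fd, &pr->zone_size_sector) != 0)
		pr->zone_size_sector = 0;	/* treat failure as "not zoned" */
}

/* Magic comparison then reads the cached value instead of calling ioctl. */
static uint32_t probe_zone_size(const struct probe_ctx *pr)
{
	return pr->zone_size_sector;
}
```

With this shape, the magic loop would only read the cached field, however
many magic entries are compared.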

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs
  2019-12-04 12:15   ` Vyacheslav Dubeyko
@ 2019-12-06  7:03     ` Naohiro Aota
  2019-12-06 15:22       ` David Sterba
  0 siblings, 1 reply; 22+ messages in thread
From: Naohiro Aota @ 2019-12-06  7:03 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik,
	Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On Wed, Dec 04, 2019 at 03:15:32PM +0300, Vyacheslav Dubeyko wrote:
>On Wed, 2019-12-04 at 17:30 +0900, Naohiro Aota wrote:
>> This is a proof-of-concept patch to make libblkid zone-aware. It can
>> probe the magic located at some offset from the beginning of some
>> specific zone of a device.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>  libblkid/src/blkidP.h            |   4 +
>>  libblkid/src/probe.c             |  25 +++++-
>>  libblkid/src/superblocks/btrfs.c | 132
>> ++++++++++++++++++++++++++++++-
>>  3 files changed, 157 insertions(+), 4 deletions(-)
>>
>> diff --git a/libblkid/src/blkidP.h b/libblkid/src/blkidP.h
>> index f9bbe008406f..5bb6771ee9c6 100644
>> --- a/libblkid/src/blkidP.h
>> +++ b/libblkid/src/blkidP.h
>> @@ -148,6 +148,10 @@ struct blkid_idmag
>>
>>  	long		kboff;		/* kilobyte offset of
>> superblock */
>>  	unsigned int	sboff;		/* byte offset within
>> superblock */
>> +
>> +	int		is_zone;
>> +	long		zonenum;
>> +	long		kboff_inzone;
>>  };
>
>Maybe it makes sense to add comments for the new fields? What do you
>think?

I agree. This is still a prototype for testing HMZONED btrfs, so
I'll add comments and clean up the code in a later version.

>>
>>  /*
>> diff --git a/libblkid/src/probe.c b/libblkid/src/probe.c
>> index f6dd5573d5dd..56e42ac28559 100644
>> --- a/libblkid/src/probe.c
>> +++ b/libblkid/src/probe.c
>> @@ -94,6 +94,7 @@
>>  #ifdef HAVE_LINUX_CDROM_H
>>  #include <linux/cdrom.h>
>>  #endif
>> +#include <linux/blkzoned.h>
>>  #ifdef HAVE_SYS_STAT_H
>>  #include <sys/stat.h>
>>  #endif
>> @@ -1009,8 +1010,25 @@ int blkid_probe_get_idmag(blkid_probe pr,
>> const struct blkid_idinfo *id,
>>  	/* try to detect by magic string */
>>  	while(mag && mag->magic) {
>>  		unsigned char *buf;
>> -
>> -		off = (mag->kboff + (mag->sboff >> 10)) << 10;
>> +		uint64_t kboff;
>> +
>> +		if (!mag->is_zone)
>> +			kboff = mag->kboff;
>> +		else {
>> +			uint32_t zone_size_sector;
>> +			int ret;
>> +
>> +			ret = ioctl(pr->fd, BLKGETZONESZ,
>> &zone_size_sector);
>> +			if (ret == EOPNOTSUPP)
>
>-EOPNOTSUPP? Or is this a libblkid peculiarity?
>

My bad... Userland code should check errno here. I'll fix it.
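
For reference, a sketch of what an errno-based check could look like. In
userland, ioctl() returns -1 on failure and the reason is in errno, so
comparing the return value with EOPNOTSUPP never matches. The names below
are hypothetical, and treating ENOTTY like EOPNOTSUPP is an assumption
(ENOTTY is what ioctl() commonly reports when a device does not implement
a request).

```c
#include <errno.h>

/* Hypothetical result codes for this sketch only. */
enum zone_query_result {
	ZONE_QUERY_OK,		/* zone size was read */
	ZONE_QUERY_SKIP,	/* device has no zone support: try next magic */
	ZONE_QUERY_FAIL		/* real error: propagate -errno to the caller */
};

/*
 * ret: return value of ioctl(fd, BLKGETZONESZ, ...)
 * err: errno captured immediately after the call
 */
static enum zone_query_result classify_zone_query(int ret, int err)
{
	if (ret == 0)
		return ZONE_QUERY_OK;
	/* Assumption: both values mean "request not supported here". */
	if (err == EOPNOTSUPP || err == ENOTTY)
		return ZONE_QUERY_SKIP;
	return ZONE_QUERY_FAIL;
}
```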

>> +				goto next;
>> +			if (ret)
>> +				return -errno;
>> +			if (zone_size_sector == 0)
>> +				goto next;
>> +			kboff = (mag->zonenum * (zone_size_sector <<
>> 9)) >> 10;
>> +			kboff += mag->kboff_inzone;
>> +		}
>> +		off = (kboff + (mag->sboff >> 10)) << 10;
>>  		buf = blkid_probe_get_buffer(pr, off, 1024);
>>
>>  		if (!buf && errno)
>> @@ -1020,13 +1038,14 @@ int blkid_probe_get_idmag(blkid_probe pr,
>> const struct blkid_idinfo *id,
>>  				buf + (mag->sboff & 0x3ff), mag->len))
>> {
>>
>>  			DBG(LOWPROBE, ul_debug("\tmagic sboff=%u,
>> kboff=%ld",
>> -				mag->sboff, mag->kboff));
>> +				mag->sboff, kboff));
>>  			if (offset)
>>  				*offset = off + (mag->sboff & 0x3ff);
>>  			if (res)
>>  				*res = mag;
>>  			return BLKID_PROBE_OK;
>>  		}
>> +next:
>>  		mag++;
>>  	}
>>
>> diff --git a/libblkid/src/superblocks/btrfs.c
>> b/libblkid/src/superblocks/btrfs.c
>> index f0fde700d896..4254220ef423 100644
>> --- a/libblkid/src/superblocks/btrfs.c
>> +++ b/libblkid/src/superblocks/btrfs.c
>> @@ -9,6 +9,9 @@
>>  #include <unistd.h>
>>  #include <string.h>
>>  #include <stdint.h>
>> +#include <stdbool.h>
>> +
>> +#include <linux/blkzoned.h>
>>
>>  #include "superblocks.h"
>>
>> @@ -59,11 +62,131 @@ struct btrfs_super_block {
>>  	uint8_t label[256];
>>  } __attribute__ ((__packed__));
>>
>> +#define BTRFS_SUPER_INFO_SIZE 4096
>
>I believe that 4K is very widely used constant.
>Are you sure that it needs to introduce some
>additional constant? Especially, it looks slightly
>strange to see the BTRFS specialized constant.
>Maybe, it needs to generalize the constant?

I don't think so...

I think it is better to define BTRFS_SUPER_INFO_SIZE here. The constant
is already defined in btrfs-progs, and it is the key value for
calculating the location of the last superblock. I think it's OK to
define a btrfs-local constant in the btrfs.c file...

>> +#define SECTOR_SHIFT 9
>
>Are you sure that libblkid hasn't such constant?
>
>> +
>> +#define READ 0
>> +#define WRITE 1
>> +
>> +typedef uint64_t u64;
>> +typedef uint64_t sector_t;
>
>I see the point to introduce the sector_t type.
>But is it really necessary to introduce the u64 type?
>

The definitions from SECTOR_SHIFT to sector_t were mainly introduced to
unify the code between btrfs-progs, util-linux and the btrfs kernel
code, to ease development at least at this early stage. So, in a later
version, I'll drop some of these definitions: maybe use
DEFAULT_SECTOR_SIZE instead of SECTOR_SHIFT, and just use uint64_t
instead of u64.

>> +
>> +static int sb_write_pointer(struct blk_zone *zones, u64 *wp_ret)
>> +{
>> +	bool empty[2];
>> +	bool full[2];
>> +	sector_t sector;
>> +
>> +	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>> +		*wp_ret = zones[0].start << SECTOR_SHIFT;
>> +		return -ENOENT;
>> +	}
>> +
>> +	empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY;
>> +	empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY;
>> +	full[0] = zones[0].cond == BLK_ZONE_COND_FULL;
>> +	full[1] = zones[1].cond == BLK_ZONE_COND_FULL;
>> +
>> +	/*
>> +	 * Possible state of log buffer zones
>> +	 *
>> +	 *   E I F
>> +	 * E * x 0
>> +	 * I 0 x 0
>> +	 * F 1 1 x
>> +	 *
>> +	 * Row: zones[0]
>> +	 * Col: zones[1]
>> +	 * State:
>> +	 *   E: Empty, I: In-Use, F: Full
>> +	 * Log position:
>> +	 *   *: Special case, no superblock is written
>> +	 *   0: Use write pointer of zones[0]
>> +	 *   1: Use write pointer of zones[1]
>> +	 *   x: Invalid state
>> +	 */
>> +
>> +	if (empty[0] && empty[1]) {
>> +		/* special case to distinguish no superblock to read */
>> +		*wp_ret = zones[0].start << SECTOR_SHIFT;
>
>
>So, even if we return the error then somebody will check
>the *wp_ret value? Looks slightly unexpected.

I admit it is confusing. The error is returned to distinguish 1) the
case where both zones are empty from 2) the case where both zones have
been written and the log has wrapped around to the head. In both cases
the write position is at the beginning of the first zone, but the read
position differs: it is the beginning of the zones (or invalid) in
case 1, and (nearly) the end of the zones in case 2.

Since libblkid only reads superblocks, we can drop setting the *wp_ret
value here.
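
For illustration only, the state table from the comment in
sb_write_pointer() can be sketched as a lookup. The enum names and
pick_log_zone() are hypothetical; the patch itself returns 0, -ENOENT or
-EUCLEAN and uses the kernel's zone condition values.

```c
#include <assert.h>

/* Simplified zone conditions for this sketch (the kernel reports more). */
enum zone_state { Z_EMPTY = 0, Z_IN_USE = 1, Z_FULL = 2 };

/* Hypothetical outcomes standing in for 0, -ENOENT and -EUCLEAN. */
enum log_pos {
	LOG_NONE,    /* '*': both zones empty, no superblock written yet */
	LOG_ZONE0,   /* '0': use the write pointer of zones[0] */
	LOG_ZONE1,   /* '1': use the write pointer of zones[1] */
	LOG_INVALID, /* 'x': state that should never occur */
};

static enum log_pos pick_log_zone(enum zone_state z0, enum zone_state z1)
{
	/* Rows: zones[0]; columns: zones[1]; order E, I, F as in the comment. */
	static const enum log_pos table[3][3] = {
		/* E */ { LOG_NONE,  LOG_INVALID, LOG_ZONE0   },
		/* I */ { LOG_ZONE0, LOG_INVALID, LOG_ZONE0   },
		/* F */ { LOG_ZONE1, LOG_ZONE1,   LOG_INVALID },
	};

	assert((unsigned int)z0 <= Z_FULL && (unsigned int)z1 <= Z_FULL);
	return table[z0][z1];
}
```

The wrapped-around case (2 above) is the row E, column F entry: zones[0]
has been reset and zones[1] is full, so the write pointer of zones[0] is
used and points at the start of the first zone.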

>> +		return -ENOENT;
>> +	} else if (full[0] && full[1]) {
>> +		/* cannot determine which zone has the newer superblock
>> */
>> +		return -EUCLEAN;
>> +	} else if (!full[0] && (empty[1] || full[1])) {
>> +		sector = zones[0].wp;
>> +	} else if (full[0]) {
>> +		sector = zones[1].wp;
>> +	} else {
>> +		return -EUCLEAN;
>> +	}
>> +	*wp_ret = sector << SECTOR_SHIFT;
>> +	return 0;
>> +}
>> +
>> +static int sb_log_offset(uint32_t zone_size_sector, blkid_probe pr,
>> +			 uint64_t *offset_ret)
>> +{
>> +	uint32_t zone_num = 0;
>> +	struct blk_zone_report *rep;
>> +	struct blk_zone *zones;
>> +	size_t rep_size;
>> +	int ret;
>> +	uint64_t wp;
>> +
>> +	rep_size = sizeof(struct blk_zone_report) + sizeof(struct
>> blk_zone) * 2;
>> +	rep = malloc(rep_size);
>> +	if (!rep)
>> +		return -errno;
>> +
>> +	memset(rep, 0, rep_size);
>> +	rep->sector = zone_num * zone_size_sector;
>> +	rep->nr_zones = 2;
>> +
>> +	ret = ioctl(pr->fd, BLKREPORTZONE, rep);
>> +	if (ret)
>> +		return -errno;
>
>So, the valid case if ioctl returns 0? Am I correct?

Yes.

>
>> +	if (rep->nr_zones != 2) {
>> +		free(rep);
>> +		return 1;
>> +	}
>> +
>> +	zones = (struct blk_zone *)(rep + 1);
>> +
>> +	ret = sb_write_pointer(zones, &wp);
>> +	if (ret != -ENOENT && ret)
>> +		return -EIO;
>
>
>If ret is positive then we could return the error. Am I correct?

Right. But sb_write_pointer() only returns 0 or a negative error value.

>
>> +	if (ret != -ENOENT) {
>> +		if (wp == zones[0].start << SECTOR_SHIFT)
>> +			wp = (zones[1].start + zones[1].len) <<
>> SECTOR_SHIFT;
>> +		wp -= BTRFS_SUPER_INFO_SIZE;
>> +	}
>> +	*offset_ret = wp;
>> +
>> +	return 0;
>> +}
>> +
>>  static int probe_btrfs(blkid_probe pr, const struct blkid_idmag
>> *mag)
>>  {
>>  	struct btrfs_super_block *bfs;
>> +	uint32_t zone_size_sector;
>> +	int ret;
>> +
>> +	ret = ioctl(pr->fd, BLKGETZONESZ, &zone_size_sector);
>> +	if (ret)
>> +		return errno;
>
>You returned -errno for another ioctls above. Is everything correct
>here?

My mistake. I need to return "-errno" here.

>> +	if (zone_size_sector != 0) {
>> +		uint64_t offset = 0;
>>
>> -	bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
>> +		ret = sb_log_offset(zone_size_sector, pr, &offset);
>> +		if (ret)
>> +			return ret;
>
>What about a positive value of ret? I suppose it needs to return ret
>only if we have an error. Am I correct?

sb_log_offset() can return 0 on success, a negative value on error, and
1 when the device has fewer than two zones. In the last case, we can
return the value "1" as is to indicate that there is no magic number
on this device. I should replace "1" with BLKID_PROBE_NONE to make it
clear.
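
For illustration only, the three-way return convention described above
can be sketched as follows. PROBE_OK and PROBE_NONE are local stand-ins
(assumed to be 0 and 1, like BLKID_PROBE_OK and BLKID_PROBE_NONE), and
log_offset_status() is a hypothetical helper, not a function in the patch.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for BLKID_PROBE_OK / BLKID_PROBE_NONE (assumed 0 and 1). */
enum { PROBE_OK = 0, PROBE_NONE = 1 };

/*
 * nr_reported: number of zones the zone report returned.
 * wp_status:   result of the write pointer lookup, 0 or a negative errno.
 *
 * Returns PROBE_NONE when fewer than two zones were reported, -EIO when
 * the zone states are inconsistent, and PROBE_OK otherwise. -ENOENT
 * (both zones empty) is not treated as a failure here.
 */
static int log_offset_status(unsigned int nr_reported, int wp_status)
{
	assert(wp_status <= 0);
	if (nr_reported != 2)
		return PROBE_NONE;
	if (wp_status != 0 && wp_status != -ENOENT)
		return -EIO;
	return PROBE_OK;
}
```

With this shape the caller's "if (ret) return ret;" passes both errors
and the "no magic here" result through unchanged.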

>Thanks,
>Viacheslav Dubeyko.
>
>> +		bfs = (struct btrfs_super_block*)
>> +			blkid_probe_get_buffer(pr, offset,
>> +					       sizeof(struct
>> btrfs_super_block));
>> +	} else {
>> +		bfs = blkid_probe_get_sb(pr, mag, struct
>> btrfs_super_block);
>> +	}
>>  	if (!bfs)
>>  		return errno ? -errno : 1;
>>
>> @@ -88,6 +211,13 @@ const struct blkid_idinfo btrfs_idinfo =
>>  	.magics		=
>>  	{
>>  	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40, .kboff = 64
>> },
>> +	  /* for HMZONED btrfs */
>> +	  { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
>> +	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
>> +	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
>> +	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
>> +	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
>> +	    .is_zone = 1, .zonenum = 1, .kboff_inzone = 0 },
>>  	  { NULL }
>>  	}
>>  };
>


* Re: [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs
  2019-12-05 14:51   ` Karel Zak
@ 2019-12-06  7:06     ` Naohiro Aota
  0 siblings, 0 replies; 22+ messages in thread
From: Naohiro Aota @ 2019-12-06  7:06 UTC (permalink / raw)
  To: Karel Zak
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik,
	Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On Thu, Dec 05, 2019 at 03:51:02PM +0100, Karel Zak wrote:
>On Wed, Dec 04, 2019 at 05:30:23PM +0900, Naohiro Aota wrote:
>>  	while(mag && mag->magic) {
>>  		unsigned char *buf;
>> -
>> -		off = (mag->kboff + (mag->sboff >> 10)) << 10;
>> +		uint64_t kboff;
>> +
>> +		if (!mag->is_zone)
>> +			kboff = mag->kboff;
>> +		else {
>> +			uint32_t zone_size_sector;
>> +			int ret;
>> +
>> +			ret = ioctl(pr->fd, BLKGETZONESZ, &zone_size_sector);
>
>I guess this ioctl returns always the same number, right?
>
>If yes, then you don't want to call it every time libmount compares
>any magic string. It would be better to call it only once from
>blkid_probe_set_device() and save zone_size_sector to struct
>blkid_probe.

Exactly. That should save a lot of time! I'll update the code that
way. Thanks.
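
For illustration only, a minimal sketch of caching the zone size once and
reusing it for every magic comparison. struct probe_ctx, struct zone_magic
and both functions are hypothetical; the real struct blkid_probe and
struct blkid_idmag look different. The sketch stores the size in bytes as
a 64-bit value, which also avoids wrap-around when a 32-bit sector count
is shifted left by 9 for zones of 4 GiB or more.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slice of a probe context. */
struct probe_ctx {
	uint64_t zone_size; /* zone size in bytes, 0 on non-zoned devices */
};

/* Hypothetical subset of the zone-aware magic description. */
struct zone_magic {
	uint32_t zonenum;      /* zone index holding the magic */
	uint32_t kboff_inzone; /* KiB offset inside that zone */
	uint32_t sboff;        /* byte offset of the magic string */
};

/* Called once at device setup with the value from BLKGETZONESZ. */
static void probe_ctx_set_zone_size(struct probe_ctx *pr,
				    uint32_t zone_size_sector)
{
	/* Widen before shifting so large zones do not overflow 32 bits. */
	pr->zone_size = (uint64_t)zone_size_sector << 9;
}

/* Byte offset of the 1 KiB buffer that contains the magic string. */
static uint64_t zoned_magic_buf_offset(const struct probe_ctx *pr,
				       const struct zone_magic *mag)
{
	uint64_t kboff;

	assert(pr->zone_size != 0);
	kboff = ((uint64_t)mag->zonenum * pr->zone_size) >> 10;
	kboff += mag->kboff_inzone;
	return (kboff + (mag->sboff >> 10)) << 10;
}
```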

>    Karel
>
>-- 
> Karel Zak  <kzak@redhat.com>
> http://karelzak.blogspot.com
>


* Re: [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs
  2019-12-06  7:03     ` Naohiro Aota
@ 2019-12-06 15:22       ` David Sterba
  0 siblings, 0 replies; 22+ messages in thread
From: David Sterba @ 2019-12-06 15:22 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Vyacheslav Dubeyko, linux-btrfs, David Sterba, Chris Mason,
	Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On Fri, Dec 06, 2019 at 04:03:20PM +0900, Naohiro Aota wrote:
> >> +#define BTRFS_SUPER_INFO_SIZE 4096
> >
> >I believe that 4K is very widely used constant.
> >Are you sure that it needs to introduce some
> >additional constant? Especially, it looks slightly
> >strange to see the BTRFS specialized constant.
> >Maybe, it needs to generalize the constant?
> 
> I don't think so...
> 
> I think it is better to define BTRFS_SUPER_INFO_SIZE here. This is an
> already defined constant in btrfs-progs and this is key value to
> calculate the last superblock location. I think it's OK to define
> btrfs local constant in btrfs.c file...

I agree, the named constant makes the meaning more clear. In the code
where it's used:

> >> +	if (ret != -ENOENT) {
> >> +		if (wp == zones[0].start << SECTOR_SHIFT)
> >> +			wp = (zones[1].start + zones[1].len) <<
> >> SECTOR_SHIFT;
> >> +		wp -= BTRFS_SUPER_INFO_SIZE;
> >> +	}

If there's just

		wp -= 4096;

it's a magic constant out of nowhere. As pointed out, it's defined only
in btrfs.c so it does not pollute namespace in libblkid.
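
For illustration only, the quoted computation can be written as a small
standalone function. struct zone_span and latest_sb_offset() are
hypothetical names; SUPER_INFO_SIZE stands in for BTRFS_SUPER_INFO_SIZE.
The sketch covers only the case where a superblock has been written (the
-ENOENT case is handled separately in the patch).

```c
#include <assert.h>
#include <stdint.h>

#define SUPER_INFO_SIZE 4096u /* same value as BTRFS_SUPER_INFO_SIZE */
#define SECTOR_SHIFT 9

/* Hypothetical zone description, in 512-byte sectors. */
struct zone_span {
	uint64_t start;
	uint64_t len;
};

/*
 * Given the write position wp (in bytes) of the two-zone superblock log,
 * return the byte offset of the most recently written superblock.
 * If wp sits at the start of zones[0], the log has wrapped around, so the
 * latest copy is the last one before the end of zones[1].
 */
static uint64_t latest_sb_offset(const struct zone_span zones[2], uint64_t wp)
{
	assert(wp >= (zones[0].start << SECTOR_SHIFT));
	if (wp == zones[0].start << SECTOR_SHIFT)
		wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT;
	return wp - SUPER_INFO_SIZE;
}
```

For two 256 MiB zones starting at sector 0, a write position of 8192
gives offset 4096, and a wrapped write position of 0 gives the last 4 KiB
of the second zone.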


Thread overview: 22+ messages
2019-12-04  8:24 [PATCH v5 00/15] btrfs-progs: zoned block device support Naohiro Aota
2019-12-04  8:24 ` [PATCH v5 01/15] btrfs-progs: utils: Introduce queue_param helper function Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 02/15] btrfs-progs: introduce raid parameters variables Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 03/15] btrfs-progs: build: Check zoned block device support Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 04/15] btrfs-progs: add new HMZONED feature flag Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 05/15] btrfs-progs: Introduce zone block device helper functions Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 06/15] btrfs-progs: load and check zone information Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 07/15] btrfs-progs: support discarding zoned device Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 08/15] btrfs-progs: support zero out on zoned block device Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 09/15] btrfs-progs: implement log-structured superblock for HMZONED mode Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 10/15] btrfs-progs: align device extent allocation to zone boundary Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 11/15] btrfs-progs: do sequential allocation in HMZONED mode Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 12/15] btrfs-progs: redirty clean extent buffers in seq Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 13/15] btrfs-progs: mkfs: Zoned block device support Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 14/15] btrfs-progs: device-add: support HMZONED device Naohiro Aota
2019-12-04  8:25 ` [PATCH v5 15/15] btrfs-progs: introduce support for device replace " Naohiro Aota
2019-12-04  8:30 ` [PATCH] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
2019-12-04 12:15   ` Vyacheslav Dubeyko
2019-12-06  7:03     ` Naohiro Aota
2019-12-06 15:22       ` David Sterba
2019-12-05 14:51   ` Karel Zak
2019-12-06  7:06     ` Naohiro Aota
