All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device
@ 2021-10-05  6:22 Naohiro Aota
  2021-10-05  6:22 ` [PATCH v2 1/7] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

As discussed in the Zoned Storage page [1],  the kernel page cache does not
guarantee that cached dirty pages will be flushed to a block device in
sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
device to ensure the write ordering.

[1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions

As a writng buffer is embedded in some other struct (e.g., "char data[]" in
struct extent_buffer), it is difficult to allocate the struct so that the
writng buffer is aligned.

This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
which allocates an aligned bounce buffer, copy the buffer contents, and
proceeds the IO. And, it now opens a zoned device with O_DIRECT.

Since the allocation and copying are costly, it is better to do them only
when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
the file is opened with O_DIRECT or not every time doing an IO.

As zoned device forces to use zoned btrfs, I decided to use the zoned flag
to determine if it is direct-IO or not. This can cause a false-positive (to
use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
emulated zoned mode on a non-zoned device or a regular file. Considering
the emulated zoned mode is mostly for debugging or testing, I believe this
is acceptable.

* Changes
v2
  - Rebased on the latest "devel" branch
  - Add patch to fix segfault in several cases
  - drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES

Patches 1 to 3 are preparation to fix some issues in the current code.

Patches 4 and 5 wraps pread/pwrite with newly introduced function
btrfs_pread/btrfs_pwrite.

Patch 6 deals with the zoned flag while reading the initial trees.

Patch 7 finally opens a zoned device with O_DIRECT.

Naohiro Aota (7):
  btrfs-progs: mkfs: do not set zone size on non-zoned mode
  btrfs-progs: set eb->fs_info properly
  btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
  btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
  btrfs-progs: introduce btrfs_pread wrapper for pread
  btrfs-progs: temporally set zoned flag for initial tree reading
  btrfs-progs: use direct-io for zoned device

 common/device-utils.c     | 76 ++++++++++++++++++++++++++++++++++++---
 common/device-utils.h     | 29 ++++++++++++++-
 common/fsfeatures.h       |  3 +-
 convert/main.c            |  1 +
 kernel-shared/disk-io.c   | 19 +++++++++-
 kernel-shared/extent_io.c | 14 +++++---
 kernel-shared/volumes.c   |  6 ++++
 kernel-shared/zoned.c     |  6 ++--
 mkfs/common.c             | 14 +++++---
 mkfs/main.c               | 12 +++++--
 mkfs/rootdir.c            |  1 +
 11 files changed, 158 insertions(+), 23 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/7] btrfs-progs: mkfs: do not set zone size on non-zoned mode
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
@ 2021-10-05  6:22 ` Naohiro Aota
  2021-10-05  6:23 ` [PATCH v2 2/7] btrfs-progs: set eb->fs_info properly Naohiro Aota
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota, Johannes Thumshirn

Since zone_size() returns an emulated zone size even for non-zoned device,
we cannot use cfg.zone_size to determine the device is zoned or not. Set
zone_size = 0 on non-zoned mode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 mkfs/main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index 8a4c9523d601..314d608a5cc5 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1357,7 +1357,10 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	mkfs_cfg.features = features;
 	mkfs_cfg.runtime_features = runtime_features;
 	mkfs_cfg.csum_type = csum_type;
-	mkfs_cfg.zone_size = zone_size(file);
+	if (zoned)
+		mkfs_cfg.zone_size = zone_size(file);
+	else
+		mkfs_cfg.zone_size = 0;
 
 	ret = make_btrfs(fd, &mkfs_cfg);
 	if (ret) {
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/7] btrfs-progs: set eb->fs_info properly
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
  2021-10-05  6:22 ` [PATCH v2 1/7] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
@ 2021-10-05  6:23 ` Naohiro Aota
  2021-10-05  6:23 ` [PATCH v2 3/7] btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES Naohiro Aota
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

Several extent_buffer initialization misses fs_info setting. This is OK
before the following patch ("btrfs-progs: use direct-io for zoned device")
as eb->fs_info is not always necessary. But, after that patch, we will use
fs_info to determine it is zoned or not and that causes segfault in such
cases.

Properly set fs_info when initializing extent_buffers to fix the issue.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 convert/main.c          | 1 +
 kernel-shared/volumes.c | 2 ++
 mkfs/rootdir.c          | 1 +
 3 files changed, 4 insertions(+)

diff --git a/convert/main.c b/convert/main.c
index b705946b1312..223eebad2e72 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -397,6 +397,7 @@ static int migrate_one_reserved_range(struct btrfs_trans_handle *trans,
 		}
 		eb->start = key.objectid;
 		eb->len = key.offset;
+		eb->fs_info = root->fs_info;
 
 		/* Write the data */
 		ret = write_and_map_eb(root->fs_info, eb);
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index b2a6b04f8e3d..2ef2a8618d1f 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -2567,6 +2567,7 @@ static int split_eb_for_raid56(struct btrfs_fs_info *info,
 		eb->flags = 0;
 		eb->fd = -1;
 		eb->dev_bytenr = (u64)-1;
+		eb->fs_info = info;
 
 		this_eb_start = raid_map[i];
 
@@ -2638,6 +2639,7 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 		new_eb->fd = multi->stripes[i].dev->fd;
 		multi->stripes[i].dev->total_ios++;
 		new_eb->len = stripe_len;
+		new_eb->fs_info = info;
 
 		if (raid_map[i] == BTRFS_RAID5_P_STRIPE)
 			p_eb = new_eb;
diff --git a/mkfs/rootdir.c b/mkfs/rootdir.c
index c2e14daf6663..16ff257ac408 100644
--- a/mkfs/rootdir.c
+++ b/mkfs/rootdir.c
@@ -397,6 +397,7 @@ again:
 
 		eb->start = first_block + bytes_read;
 		eb->len = sectorsize;
+		eb->fs_info = root->fs_info;
 
 		/*
 		 * we're doing the csum before we record the extent, but
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 3/7] btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
  2021-10-05  6:22 ` [PATCH v2 1/7] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
  2021-10-05  6:23 ` [PATCH v2 2/7] btrfs-progs: set eb->fs_info properly Naohiro Aota
@ 2021-10-05  6:23 ` Naohiro Aota
  2021-10-05  6:23 ` [PATCH v2 4/7] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

Since we cannot create ext*/reiserfs on a zoned device, it is useless to
allow ZONED feature when converting a file system. Drop ZONED flag from
BTRFS_CONVERT_ALLOWED_FEATURES.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/fsfeatures.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/common/fsfeatures.h b/common/fsfeatures.h
index 163588e52933..9e39c667b900 100644
--- a/common/fsfeatures.h
+++ b/common/fsfeatures.h
@@ -39,8 +39,7 @@
 	| BTRFS_FEATURE_INCOMPAT_BIG_METADATA			\
 	| BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF			\
 	| BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA		\
-	| BTRFS_FEATURE_INCOMPAT_NO_HOLES			\
-	| BTRFS_FEATURE_INCOMPAT_ZONED)
+	| BTRFS_FEATURE_INCOMPAT_NO_HOLES)
 
 #define BTRFS_FEATURE_LIST_ALL		(1ULL << 63)
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 4/7] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (2 preceding siblings ...)
  2021-10-05  6:23 ` [PATCH v2 3/7] btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES Naohiro Aota
@ 2021-10-05  6:23 ` Naohiro Aota
  2021-10-06 14:10   ` David Sterba
  2021-10-05  6:23 ` [PATCH v2 5/7] btrfs-progs: introduce btrfs_pread wrapper for pread Naohiro Aota
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota, Johannes Thumshirn

Wrap pwrite with btrfs_pwrite(). It simply calls pwrite() on non-zoned
btrfs (= opened without O_DIRECT). On zoned mode (= opened with O_DIRECT),
it allocates an aligned bounce buffer, copy the contents and use it for
direct-IO writing.

Writes in device_zero_blocks() and btrfs_wipe_existing_sb() are a little
tricky. We don't have fs_info on our hands, so use zinfo to determine it is
a zoned device or not.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 common/device-utils.c     | 76 ++++++++++++++++++++++++++++++++++++---
 common/device-utils.h     | 19 +++++++++-
 kernel-shared/extent_io.c |  7 ++--
 kernel-shared/zoned.c     |  4 +--
 mkfs/common.c             | 14 +++++---
 5 files changed, 106 insertions(+), 14 deletions(-)

diff --git a/common/device-utils.c b/common/device-utils.c
index 366154b98218..3e58d6d0a412 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -26,6 +26,7 @@
 #include <dirent.h>
 #include <blkid/blkid.h>
 #include <linux/limits.h>
+#include <linux/fs.h>
 #include <limits.h>
 #include "kernel-lib/sizes.h"
 #include "kernel-shared/disk-io.h"
@@ -95,7 +96,7 @@ int device_discard_blocks(int fd, u64 start, u64 len)
 /*
  * Write zeros to the given range [start, start + len)
  */
-int device_zero_blocks(int fd, off_t start, size_t len)
+int device_zero_blocks(int fd, off_t start, size_t len, const bool direct)
 {
 	char *buf = malloc(len);
 	int ret = 0;
@@ -104,7 +105,7 @@ int device_zero_blocks(int fd, off_t start, size_t len)
 	if (!buf)
 		return -ENOMEM;
 	memset(buf, 0, len);
-	written = pwrite(fd, buf, len, start);
+	written = btrfs_pwrite(fd, buf, len, start, direct);
 	if (written != len)
 		ret = -EIO;
 	free(buf);
@@ -134,7 +135,7 @@ static int zero_dev_clamped(int fd, struct btrfs_zoned_device_info *zinfo,
 	if (zinfo && zinfo->model == ZONED_HOST_MANAGED)
 		return zero_zone_blocks(fd, zinfo, start, end - start);
 
-	return device_zero_blocks(fd, start, end - start);
+	return device_zero_blocks(fd, start, end - start, false);
 }
 
 /*
@@ -176,8 +177,10 @@ static int btrfs_wipe_existing_sb(int fd, struct btrfs_zoned_device_info *zinfo)
 		len = sizeof(buf);
 
 	if (!zone_is_sequential(zinfo, offset)) {
+		const bool direct = zinfo && zinfo->model == ZONED_HOST_MANAGED;
+
 		memset(buf, 0, len);
-		ret = pwrite(fd, buf, len, offset);
+		ret = btrfs_pwrite(fd, buf, len, offset, direct);
 		if (ret < 0) {
 			error("cannot wipe existing superblock: %m");
 			ret = -1;
@@ -510,3 +513,68 @@ out:
 	close(sysfs_fd);
 	return ret;
 }
+
+ssize_t btrfs_direct_pio(int rw, int fd, void *buf, size_t count, off_t offset)
+{
+	int alignment;
+	size_t iosize;
+	void *bounce_buf = NULL;
+	struct stat stat_buf;
+	unsigned long req;
+	int ret;
+	ssize_t ret_rw;
+
+	ASSERT(rw == READ || rw == WRITE);
+
+	if (fstat(fd, &stat_buf) == -1) {
+		error("fstat failed (%m)");
+		return 0;
+	}
+
+	if ((stat_buf.st_mode & S_IFMT) == S_IFBLK)
+		req = BLKSSZGET;
+	else
+		req = FIGETBSZ;
+
+	if (ioctl(fd, req, &alignment)) {
+		error("failed to get block size: %m");
+		return 0;
+	}
+
+	if (IS_ALIGNED((size_t)buf, alignment) && IS_ALIGNED(count, alignment)) {
+		if (rw == WRITE)
+			return pwrite(fd, buf, count, offset);
+		else
+			return pread(fd, buf, count, offset);
+	}
+
+	/* Cannot do anything if the write size is not aligned */
+	if (rw == WRITE && !IS_ALIGNED(count, alignment)) {
+		error("%lu is not aligned to %d", count, alignment);
+		return 0;
+	}
+
+	iosize = round_up(count, alignment);
+
+	ret = posix_memalign(&bounce_buf, alignment, iosize);
+	if (ret) {
+		error("failed to allocate bounce buffer: %m");
+		errno = ret;
+		return 0;
+	}
+
+	if (rw == WRITE) {
+		ASSERT(iosize == count);
+		memcpy(bounce_buf, buf, count);
+		ret_rw = pwrite(fd, bounce_buf, iosize, offset);
+	} else {
+		ret_rw = pread(fd, bounce_buf, iosize, offset);
+		if (ret_rw >= count) {
+			ret_rw = count;
+			memcpy(buf, bounce_buf, count);
+		}
+	}
+
+	free(bounce_buf);
+	return ret_rw;
+}
diff --git a/common/device-utils.h b/common/device-utils.h
index 099520bf9737..767dab4370e1 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -17,6 +17,8 @@
 #ifndef __DEVICE_UTILS_H__
 #define __DEVICE_UTILS_H__
 
+#include <stdbool.h>
+#include <unistd.h>
 #include "kerncompat.h"
 #include "sys/stat.h"
 
@@ -35,7 +37,7 @@
  * Generic block device helpers
  */
 int device_discard_blocks(int fd, u64 start, u64 len);
-int device_zero_blocks(int fd, off_t start, size_t len);
+int device_zero_blocks(int fd, off_t start, size_t len, const bool direct);
 u64 device_get_partition_size(const char *dev);
 u64 device_get_partition_size_fd(int fd);
 int device_get_queue_param(const char *file, const char *param, char *buf, size_t len);
@@ -47,5 +49,20 @@ u64 device_get_zone_size(int fd, const char *name);
 u64 btrfs_device_size(int fd, struct stat *st);
 int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		u64 max_block_count, unsigned opflags);
+ssize_t btrfs_direct_pio(int rw, int fd, void *buf, size_t count, off_t offset);
+
+#ifdef BTRFS_ZONED
+static inline ssize_t btrfs_pwrite(int fd, void *buf, size_t count,
+				   off_t offset, bool direct)
+{
+	if (!direct)
+		return pwrite(fd, buf, count, offset);
+
+	return btrfs_direct_pio(WRITE, fd, buf, count, offset);
+}
+#else
+#define btrfs_pwrite(fd, buf, count, offset, direct) \
+	({ (void)(direct); pwrite(fd, buf, count, offset); })
+#endif
 
 #endif
diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index d3d79bc8f748..b5984949f431 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -29,6 +29,7 @@
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/volumes.h"
 #include "common/utils.h"
+#include "common/device-utils.h"
 #include "common/internal.h"
 
 void extent_io_tree_init(struct extent_io_tree *tree)
@@ -809,7 +810,8 @@ out:
 int write_extent_to_disk(struct extent_buffer *eb)
 {
 	int ret;
-	ret = pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr);
+	ret = btrfs_pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr,
+			   eb->fs_info->zoned);
 	if (ret < 0)
 		goto out;
 	if (ret != eb->len) {
@@ -932,7 +934,8 @@ int write_data_to_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
 			this_len = min(this_len, bytes_left);
 			dev_nr++;
 
-			ret = pwrite(device->fd, buf + total_write, this_len, dev_bytenr);
+			ret = btrfs_pwrite(device->fd, buf + total_write,
+					   this_len, dev_bytenr, info->zoned);
 			if (ret != this_len) {
 				if (ret < 0) {
 					fprintf(stderr, "Error writing to "
diff --git a/kernel-shared/zoned.c b/kernel-shared/zoned.c
index 8d94f98a7fce..c2cce3b5f366 100644
--- a/kernel-shared/zoned.c
+++ b/kernel-shared/zoned.c
@@ -424,7 +424,7 @@ int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
 			count = zone_len - (ofst & (zone_len - 1));
 
 		if (!zone_is_sequential(zinfo, ofst)) {
-			ret = device_zero_blocks(fd, ofst, count);
+			ret = device_zero_blocks(fd, ofst, count, true);
 			if (ret != 0)
 				return ret;
 		}
@@ -595,7 +595,7 @@ size_t btrfs_sb_io(int fd, void *buf, off_t offset, int rw)
 	if (rw == READ)
 		ret_sz = pread64(fd, buf, count, mapped);
 	else
-		ret_sz = pwrite64(fd, buf, count, mapped);
+		ret_sz = btrfs_pwrite(fd, buf, count, mapped, true);
 
 	if (ret_sz != count)
 		return ret_sz;
diff --git a/mkfs/common.c b/mkfs/common.c
index 20a7d1155972..5c8d6ac13a3b 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -54,7 +54,7 @@ static int btrfs_write_empty_tree(int fd, struct btrfs_mkfs_config *cfg,
 	btrfs_set_header_nritems(buf, 0);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, block);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize, block, cfg->zone_size);
 	if (ret != cfg->nodesize)
 		return ret < 0 ? -errno : -EIO;
 	return 0;
@@ -134,7 +134,8 @@ static int btrfs_create_tree_root(int fd, struct btrfs_mkfs_config *cfg,
 			     cfg->csum_type);
 
 	/* write back root tree */
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_ROOT_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_ROOT_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize)
 		return (ret < 0 ? -errno : -EIO);
 
@@ -422,7 +423,8 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_header_nritems(buf, nritems);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_EXTENT_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_EXTENT_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize) {
 		ret = (ret < 0 ? -errno : -EIO);
 		goto out;
@@ -510,7 +512,8 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_header_nritems(buf, nritems);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_CHUNK_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_CHUNK_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize) {
 		ret = (ret < 0 ? -errno : -EIO);
 		goto out;
@@ -550,7 +553,8 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_header_nritems(buf, nritems);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_DEV_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_DEV_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize) {
 		ret = (ret < 0 ? -errno : -EIO);
 		goto out;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 5/7] btrfs-progs: introduce btrfs_pread wrapper for pread
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (3 preceding siblings ...)
  2021-10-05  6:23 ` [PATCH v2 4/7] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
@ 2021-10-05  6:23 ` Naohiro Aota
  2021-10-05  6:23 ` [PATCH v2 6/7] btrfs-progs: temporally set zoned flag for initial tree reading Naohiro Aota
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota, Johannes Thumshirn

Wrap pread with btrfs_pread as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 common/device-utils.h     | 10 ++++++++++
 kernel-shared/disk-io.c   |  4 +++-
 kernel-shared/extent_io.c |  7 ++++---
 kernel-shared/zoned.c     |  2 +-
 4 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/common/device-utils.h b/common/device-utils.h
index 767dab4370e1..f79e746840fc 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -60,9 +60,19 @@ static inline ssize_t btrfs_pwrite(int fd, void *buf, size_t count,
 
 	return btrfs_direct_pio(WRITE, fd, buf, count, offset);
 }
+static inline ssize_t btrfs_pread(int fd, void *buf, size_t count, off_t offset,
+				  bool direct)
+{
+	if (!direct)
+		return pread(fd, buf, count, offset);
+
+	return btrfs_direct_pio(READ, fd, buf, count, offset);
+}
 #else
 #define btrfs_pwrite(fd, buf, count, offset, direct) \
 	({ (void)(direct); pwrite(fd, buf, count, offset); })
+#define btrfs_pread(fd, buf, count, offset, direct) \
+	({ (void)(direct); pread(fd, buf, count, offset); })
 #endif
 
 #endif
diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index 1cda6f3a98af..740500f9fdc9 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -35,6 +35,7 @@
 #include "kernel-shared/print-tree.h"
 #include "common/rbtree-utils.h"
 #include "common/device-scan.h"
+#include "common/device-utils.h"
 #include "crypto/hash.h"
 
 /* specified errno for check_tree_block */
@@ -476,7 +477,8 @@ int read_extent_data(struct btrfs_fs_info *fs_info, char *data, u64 logical,
 		goto err;
 	}
 
-	ret = pread64(device->fd, data, *len, multi->stripes[0].physical);
+	ret = btrfs_pread(device->fd, data, *len, multi->stripes[0].physical,
+			  fs_info->zoned);
 	if (ret != *len)
 		ret = -EIO;
 	else
diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index b5984949f431..af09ade4025f 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -793,7 +793,8 @@ int read_extent_from_disk(struct extent_buffer *eb,
 			  unsigned long offset, unsigned long len)
 {
 	int ret;
-	ret = pread(eb->fd, eb->data + offset, len, eb->dev_bytenr);
+	ret = btrfs_pread(eb->fd, eb->data + offset, len, eb->dev_bytenr,
+			  eb->fs_info->zoned);
 	if (ret < 0) {
 		ret = -errno;
 		goto out;
@@ -850,8 +851,8 @@ int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
 			return -EIO;
 		}
 
-		ret = pread(device->fd, buf + total_read, read_len,
-			    multi->stripes[0].physical);
+		ret = btrfs_pread(device->fd, buf + total_read, read_len,
+				  multi->stripes[0].physical, info->zoned);
 		kfree(multi);
 		if (ret < 0) {
 			fprintf(stderr, "Error reading %llu, %d\n", offset,
diff --git a/kernel-shared/zoned.c b/kernel-shared/zoned.c
index c2cce3b5f366..f5d2299fc744 100644
--- a/kernel-shared/zoned.c
+++ b/kernel-shared/zoned.c
@@ -593,7 +593,7 @@ size_t btrfs_sb_io(int fd, void *buf, off_t offset, int rw)
 		return ret;
 
 	if (rw == READ)
-		ret_sz = pread64(fd, buf, count, mapped);
+		ret_sz = btrfs_pread(fd, buf, count, mapped, true);
 	else
 		ret_sz = btrfs_pwrite(fd, buf, count, mapped, true);
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 6/7] btrfs-progs: temporally set zoned flag for initial tree reading
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (4 preceding siblings ...)
  2021-10-05  6:23 ` [PATCH v2 5/7] btrfs-progs: introduce btrfs_pread wrapper for pread Naohiro Aota
@ 2021-10-05  6:23 ` Naohiro Aota
  2021-10-05  6:23 ` [PATCH v2 7/7] btrfs-progs: use direct-io for zoned device Naohiro Aota
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota, Johannes Thumshirn

Functions to read data/metadata e.g. read_extent_from_disk() now depend on
the fs_info->zoned flag to determine if they do direct-IO or not.

The flag (and zone_size) is not known before reading the chunk tree and it
set to 0 while in the initial chunk tree setup process. That will cause
btrfs_pread() to fail because it does not align the buffer.

Use fcntl() to find out the file descriptor is opened with O_DIRECT or not,
and if it is, set the zoned flag to 1 temporally for this initial process.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 kernel-shared/disk-io.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index 740500f9fdc9..dd48599a5f1f 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -1302,10 +1302,22 @@ static struct btrfs_fs_info *__open_ctree_fd(int fp, struct open_ctree_flags *oc
 	if (ret)
 		goto out_devices;
 
+	/*
+	 * fs_info->zone_size (and zoned) are not known before reading the
+	 * chunk tree, so it's 0 at this point. But, fs_info->zoned == 0
+	 * will cause btrfs_pread() not to use an aligned bounce buffer,
+	 * causing EINVAL when the file is opened with O_DIRECT. Temporally
+	 * set zoned = 1 in that case.
+	 */
+	if (fcntl(fp, F_GETFL) & O_DIRECT)
+		fs_info->zoned = 1;
+
 	ret = btrfs_setup_chunk_tree_and_device_map(fs_info, ocf->chunk_tree_bytenr);
 	if (ret)
 		goto out_chunk;
 
+	fs_info->zoned = 0;
+
 	/* Chunk tree root is unable to read, return directly */
 	if (!fs_info->chunk_root)
 		return fs_info;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 7/7] btrfs-progs: use direct-io for zoned device
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (5 preceding siblings ...)
  2021-10-05  6:23 ` [PATCH v2 6/7] btrfs-progs: temporally set zoned flag for initial tree reading Naohiro Aota
@ 2021-10-05  6:23 ` Naohiro Aota
  2021-10-06 14:28 ` [PATCH v2 0/7] btrfs-progs: use direct-IO " David Sterba
  2021-10-06 21:02 ` David Sterba
  8 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

We need to use direct-IO for zoned devices to preserve the write
ordering.  Instead of detecting if the device is zoned or not, we simply
use direct-IO for any kind of device (even if emulated zoned mode on a
regular device).

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 kernel-shared/disk-io.c | 3 +++
 kernel-shared/volumes.c | 4 ++++
 mkfs/main.c             | 7 ++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index dd48599a5f1f..aabeba7821ed 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -1382,6 +1382,9 @@ struct btrfs_fs_info *open_ctree_fs_info(struct open_ctree_flags *ocf)
 	if (!(ocf->flags & OPEN_CTREE_WRITES))
 		oflags = O_RDONLY;
 
+	if ((oflags & O_RDWR) && zoned_model(ocf->filename) == ZONED_HOST_MANAGED)
+		oflags |= O_DIRECT;
+
 	fp = open(ocf->filename, oflags);
 	if (fp < 0) {
 		error("cannot open '%s': %m", ocf->filename);
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index 2ef2a8618d1f..bfa60812ef97 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -455,6 +455,10 @@ int btrfs_open_devices(struct btrfs_fs_info *fs_info,
 			continue;
 		}
 
+		if ((flags & O_RDWR) &&
+		    zoned_model(device->name) == ZONED_HOST_MANAGED)
+			flags |= O_DIRECT;
+
 		fd = open(device->name, flags);
 		if (fd < 0) {
 			ret = -errno;
diff --git a/mkfs/main.c b/mkfs/main.c
index 314d608a5cc5..11a0989be281 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -893,6 +893,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	bool ssd = false;
 	bool zoned = false;
 	bool force_overwrite = false;
+	int oflags;
 	char *source_dir = NULL;
 	bool source_dir_set = false;
 	bool shrink_rootdir = false;
@@ -1305,12 +1306,16 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 
 	dev_cnt--;
 
+	oflags = O_RDWR;
+	if (zoned && zoned_model(file) == ZONED_HOST_MANAGED)
+		oflags |= O_DIRECT;
+
 	/*
 	 * Open without O_EXCL so that the problem should not occur by the
 	 * following operation in kernel:
 	 * (btrfs_register_one_device() fails if O_EXCL is on)
 	 */
-	fd = open(file, O_RDWR);
+	fd = open(file, oflags);
 	if (fd < 0) {
 		error("unable to open %s: %m", file);
 		goto error;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 4/7] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
  2021-10-05  6:23 ` [PATCH v2 4/7] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
@ 2021-10-06 14:10   ` David Sterba
  0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2021-10-06 14:10 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba, Johannes Thumshirn

On Tue, Oct 05, 2021 at 03:23:02PM +0900, Naohiro Aota wrote:
> -int device_zero_blocks(int fd, off_t start, size_t len)
> +int device_zero_blocks(int fd, off_t start, size_t len, const bool direct)

const for ints or bools does not make sense, the const for pointers is
an API contract that the callee won't change it but for local variables
defined inside parameter list it's not necessary.

>  {
>  	char *buf = malloc(len);
>  	int ret = 0;
> @@ -104,7 +105,7 @@ int device_zero_blocks(int fd, off_t start, size_t len)
>  	if (!buf)
>  		return -ENOMEM;
>  	memset(buf, 0, len);
> -	written = pwrite(fd, buf, len, start);
> +	written = btrfs_pwrite(fd, buf, len, start, direct);
>  	if (written != len)
>  		ret = -EIO;
>  	free(buf);
> @@ -134,7 +135,7 @@ static int zero_dev_clamped(int fd, struct btrfs_zoned_device_info *zinfo,
>  	if (zinfo && zinfo->model == ZONED_HOST_MANAGED)
>  		return zero_zone_blocks(fd, zinfo, start, end - start);
>  
> -	return device_zero_blocks(fd, start, end - start);
> +	return device_zero_blocks(fd, start, end - start, false);
>  }
>  
>  /*
> @@ -176,8 +177,10 @@ static int btrfs_wipe_existing_sb(int fd, struct btrfs_zoned_device_info *zinfo)
>  		len = sizeof(buf);
>  
>  	if (!zone_is_sequential(zinfo, offset)) {
> +		const bool direct = zinfo && zinfo->model == ZONED_HOST_MANAGED;
> +
>  		memset(buf, 0, len);
> -		ret = pwrite(fd, buf, len, offset);
> +		ret = btrfs_pwrite(fd, buf, len, offset, direct);
>  		if (ret < 0) {

This should probably also check for ret == 0 as this does nothing, but
that's a separate fix.

>  			error("cannot wipe existing superblock: %m");
>  			ret = -1;
> @@ -510,3 +513,68 @@ out:
>  	close(sysfs_fd);
>  	return ret;
>  }
> +
> +ssize_t btrfs_direct_pio(int rw, int fd, void *buf, size_t count, off_t offset)
> +{
> +	int alignment;
> +	size_t iosize;
> +	void *bounce_buf = NULL;
> +	struct stat stat_buf;
> +	unsigned long req;
> +	int ret;
> +	ssize_t ret_rw;
> +
> +	ASSERT(rw == READ || rw == WRITE);
> +
> +	if (fstat(fd, &stat_buf) == -1) {
> +		error("fstat failed (%m)");
> +		return 0;
> +	}
> +
> +	if ((stat_buf.st_mode & S_IFMT) == S_IFBLK)
> +		req = BLKSSZGET;
> +	else
> +		req = FIGETBSZ;

> +
> +	if (ioctl(fd, req, &alignment)) {
> +		error("failed to get block size: %m");
> +		return 0;

The ioctls take an int as parameter but it's not well suitable for
alignment checks as it internally does bit operations and this should be
avoided. It should work here but would be good to clean it up.

> +	}
> +
> +	if (IS_ALIGNED((size_t)buf, alignment) && IS_ALIGNED(count, alignment)) {
> +		if (rw == WRITE)
> +			return pwrite(fd, buf, count, offset);
> +		else
> +			return pread(fd, buf, count, offset);
> +	}
> +
> +	/* Cannot do anything if the write size is not aligned */
> +	if (rw == WRITE && !IS_ALIGNED(count, alignment)) {
> +		error("%lu is not aligned to %d", count, alignment);

as count is size_t, the format specifier should be %zu

> +		return 0;
> +	}
> +
> +	iosize = round_up(count, alignment);
> +
> +	ret = posix_memalign(&bounce_buf, alignment, iosize);
> +	if (ret) {
> +		error("failed to allocate bounce buffer: %m");
> +		errno = ret;
> +		return 0;
> +	}
> +
> +	if (rw == WRITE) {
> +		ASSERT(iosize == count);
> +		memcpy(bounce_buf, buf, count);
> +		ret_rw = pwrite(fd, bounce_buf, iosize, offset);
> +	} else {
> +		ret_rw = pread(fd, bounce_buf, iosize, offset);
> +		if (ret_rw >= count) {
> +			ret_rw = count;
> +			memcpy(buf, bounce_buf, count);
> +		}
> +	}
> +
> +	free(bounce_buf);

The wrappers should entirely copy the semantics of pwrite/pread, ie
return <0 on error as -errno in case of error, 0 if there was no write
and otherwise the number of written bytes.

I'll do the minor fixups only, the pwrite cleanups need to audit all
call sites so it's better for a separate patch.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (6 preceding siblings ...)
  2021-10-05  6:23 ` [PATCH v2 7/7] btrfs-progs: use direct-io for zoned device Naohiro Aota
@ 2021-10-06 14:28 ` David Sterba
  2021-10-06 21:02 ` David Sterba
  8 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2021-10-06 14:28 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba

On Tue, Oct 05, 2021 at 03:22:58PM +0900, Naohiro Aota wrote:
> As discussed in the Zoned Storage page [1],  the kernel page cache does not
> guarantee that cached dirty pages will be flushed to a block device in
> sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> device to ensure the write ordering.
> 
> [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> 
> As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> struct extent_buffer), it is difficult to allocate the struct so that the
> writng buffer is aligned.
> 
> This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> which allocates an aligned bounce buffer, copy the buffer contents, and
> proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> 
> Since the allocation and copying are costly, it is better to do them only
> when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> the file is opened with O_DIRECT or not every time doing an IO.
> 
> As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> to determine if it is direct-IO or not. This can cause a false-positive (to
> use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> emulated zoned mode on a non-zoned device or a regular file. Considering
> the emulated zoned mode is mostly for debugging or testing, I believe this
> is acceptable.
> 
> * Changes
> v2
>   - Rebased on the latest "devel" branch
>   - Add patch to fix segfault in several cases
>   - drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
> 
> Patches 1 to 3 are preparation to fix some issues in the current code.
> 
> Patches 4 and 5 wraps pread/pwrite with newly introduced function
> btrfs_pread/btrfs_pwrite.
> 
> Patch 6 deals with the zoned flag while reading the initial trees.
> 
> Patch 7 finally opens a zoned device with O_DIRECT.
> 
> Naohiro Aota (7):
>   btrfs-progs: mkfs: do not set zone size on non-zoned mode
>   btrfs-progs: set eb->fs_info properly
>   btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
>   btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
>   btrfs-progs: introduce btrfs_pread wrapper for pread
>   btrfs-progs: temporally set zoned flag for initial tree reading
>   btrfs-progs: use direct-io for zoned device

Added to devel with some minor fixups, thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device
  2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (7 preceding siblings ...)
  2021-10-06 14:28 ` [PATCH v2 0/7] btrfs-progs: use direct-IO " David Sterba
@ 2021-10-06 21:02 ` David Sterba
  2021-10-20  6:53   ` Naohiro Aota
  8 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2021-10-06 21:02 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba

On Tue, Oct 05, 2021 at 03:22:58PM +0900, Naohiro Aota wrote:
> As discussed in the Zoned Storage page [1],  the kernel page cache does not
> guarantee that cached dirty pages will be flushed to a block device in
> sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> device to ensure the write ordering.
> 
> [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> 
> As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> struct extent_buffer), it is difficult to allocate the struct so that the
> writng buffer is aligned.
> 
> This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> which allocates an aligned bounce buffer, copy the buffer contents, and
> proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> 
> Since the allocation and copying are costly, it is better to do them only
> when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> the file is opened with O_DIRECT or not every time doing an IO.
> 
> As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> to determine if it is direct-IO or not. This can cause a false-positive (to
> use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> emulated zoned mode on a non-zoned device or a regular file. Considering
> the emulated zoned mode is mostly for debugging or testing, I believe this
> is acceptable.
> 
> * Changes
> v2
>   - Rebased on the latest "devel" branch
>   - Add patch to fix segfault in several cases
>   - drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
> 
> Patches 1 to 3 are preparation to fix some issues in the current code.
> 
> Patches 4 and 5 wraps pread/pwrite with newly introduced function
> btrfs_pread/btrfs_pwrite.
> 
> Patch 6 deals with the zoned flag while reading the initial trees.
> 
> Patch 7 finally opens a zoned device with O_DIRECT.
> 
> Naohiro Aota (7):
>   btrfs-progs: mkfs: do not set zone size on non-zoned mode
>   btrfs-progs: set eb->fs_info properly
>   btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
>   btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
>   btrfs-progs: introduce btrfs_pread wrapper for pread
>   btrfs-progs: temporally set zoned flag for initial tree reading
>   btrfs-progs: use direct-io for zoned device

Is this still supposed to work?

  $ ./mkfs.btrfs -f -O zoned -d single -m single img
  ...
  ERROR: 16384 is not aligned to 1048576
  ERROR: error during mkfs: Input/output error

On commit below this patchset it works and creates a filesystem with
zoned mode and zone size 256M.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device
  2021-10-06 21:02 ` David Sterba
@ 2021-10-20  6:53   ` Naohiro Aota
  2021-10-20 16:57     ` David Sterba
  0 siblings, 1 reply; 13+ messages in thread
From: Naohiro Aota @ 2021-10-20  6:53 UTC (permalink / raw)
  To: dsterba, linux-btrfs, David Sterba

On Wed, Oct 06, 2021 at 11:02:47PM +0200, David Sterba wrote:
> On Tue, Oct 05, 2021 at 03:22:58PM +0900, Naohiro Aota wrote:
> > As discussed in the Zoned Storage page [1],  the kernel page cache does not
> > guarantee that cached dirty pages will be flushed to a block device in
> > sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> > device to ensure the write ordering.
> > 
> > [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> > 
> > As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> > struct extent_buffer), it is difficult to allocate the struct so that the
> > writng buffer is aligned.
> > 
> > This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> > which allocates an aligned bounce buffer, copy the buffer contents, and
> > proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> > 
> > Since the allocation and copying are costly, it is better to do them only
> > when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> > the file is opened with O_DIRECT or not every time doing an IO.
> > 
> > As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> > to determine if it is direct-IO or not. This can cause a false-positive (to
> > use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> > emulated zoned mode on a non-zoned device or a regular file. Considering
> > the emulated zoned mode is mostly for debugging or testing, I believe this
> > is acceptable.
> > 
> > * Changes
> > v2
> >   - Rebased on the latest "devel" branch
> >   - Add patch to fix segfault in several cases
> >   - drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
> > 
> > Patches 1 to 3 are preparation to fix some issues in the current code.
> > 
> > Patches 4 and 5 wraps pread/pwrite with newly introduced function
> > btrfs_pread/btrfs_pwrite.
> > 
> > Patch 6 deals with the zoned flag while reading the initial trees.
> > 
> > Patch 7 finally opens a zoned device with O_DIRECT.
> > 
> > Naohiro Aota (7):
> >   btrfs-progs: mkfs: do not set zone size on non-zoned mode
> >   btrfs-progs: set eb->fs_info properly
> >   btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES
> >   btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
> >   btrfs-progs: introduce btrfs_pread wrapper for pread
> >   btrfs-progs: temporally set zoned flag for initial tree reading
> >   btrfs-progs: use direct-io for zoned device
> 
> Is this still supposed to work?
> 
>   $ ./mkfs.btrfs -f -O zoned -d single -m single img
>   ...
>   ERROR: 16384 is not aligned to 1048576
>   ERROR: error during mkfs: Input/output error
> 
> On commit below this patchset it works and creates a filesystem with
> zoned mode and zone size 256M.

I'm sorry to respond so late. My email fetcher was broken.

The mkfs on an image file should work well. I tested the current
"devel" branch. While it needs a patch to replace pwrite with
btrfs_pwrite in create_free_space_tree(), it worked well besides
that. Is this failing on your side?

I'll send the patch soon.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device
  2021-10-20  6:53   ` Naohiro Aota
@ 2021-10-20 16:57     ` David Sterba
  0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2021-10-20 16:57 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: dsterba, linux-btrfs, David Sterba

On Wed, Oct 20, 2021 at 06:53:32AM +0000, Naohiro Aota wrote:
> On Wed, Oct 06, 2021 at 11:02:47PM +0200, David Sterba wrote:
> > On Tue, Oct 05, 2021 at 03:22:58PM +0900, Naohiro Aota wrote:
> > Is this still supposed to work?
> > 
> >   $ ./mkfs.btrfs -f -O zoned -d single -m single img
> >   ...
> >   ERROR: 16384 is not aligned to 1048576
> >   ERROR: error during mkfs: Input/output error
> > 
> > On commit below this patchset it works and creates a filesystem with
> > zoned mode and zone size 256M.
> 
> The mkfs on an image file should work well. I tested the current
> "devel" branch. While it needs a patch to replace pwrite with
> btrfs_pwrite in create_free_space_tree(), it worked well besides
> that. Is this failing on your side?

Yes it still fails with the patched pwrite, same error, with or without
enabling free-space-tree.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-10-20 16:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-05  6:22 [PATCH v2 0/7] btrfs-progs: use direct-IO for zoned device Naohiro Aota
2021-10-05  6:22 ` [PATCH v2 1/7] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
2021-10-05  6:23 ` [PATCH v2 2/7] btrfs-progs: set eb->fs_info properly Naohiro Aota
2021-10-05  6:23 ` [PATCH v2 3/7] btrfs-progs: drop ZONED flag from BTRFS_CONVERT_ALLOWED_FEATURES Naohiro Aota
2021-10-05  6:23 ` [PATCH v2 4/7] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
2021-10-06 14:10   ` David Sterba
2021-10-05  6:23 ` [PATCH v2 5/7] btrfs-progs: introduce btrfs_pread wrapper for pread Naohiro Aota
2021-10-05  6:23 ` [PATCH v2 6/7] btrfs-progs: temporally set zoned flag for initial tree reading Naohiro Aota
2021-10-05  6:23 ` [PATCH v2 7/7] btrfs-progs: use direct-io for zoned device Naohiro Aota
2021-10-06 14:28 ` [PATCH v2 0/7] btrfs-progs: use direct-IO " David Sterba
2021-10-06 21:02 ` David Sterba
2021-10-20  6:53   ` Naohiro Aota
2021-10-20 16:57     ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.