All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
@ 2021-09-27  4:15 Naohiro Aota
  2021-09-27  4:15 ` [PATCH 1/5] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Naohiro Aota @ 2021-09-27  4:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

As discussed in the Zoned Storage page [1],  the kernel page cache does not
guarantee that cached dirty pages will be flushed to a block device in
sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
device to ensure the write ordering.

[1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions

As a writng buffer is embedded in some other struct (e.g., "char data[]" in
struct extent_buffer), it is difficult to allocate the struct so that the
writng buffer is aligned.

This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
which allocates an aligned bounce buffer, copy the buffer contents, and
proceeds the IO. And, it now opens a zoned device with O_DIRECT.

Since the allocation and copying are costly, it is better to do them only
when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
the file is opened with O_DIRECT or not every time doing an IO.

As zoned device forces to use zoned btrfs, I decided to use the zoned flag
to determine if it is direct-IO or not. This can cause a false-positive (to
use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
emulated zoned mode on a non-zoned device or a regular file. Considering
the emulated zoned mode is mostly for debugging or testing, I believe this
is acceptable.

Patch 1 is a preparation not to set an emulated zone_size value when not
needed.

Patches 2 and 3 wraps pread/pwrite with newly introduced function
btrfs_pread/btrfs_pwrite.

Patches 4 deals with the zoned flag while reading the initial trees.

Patch 5 finally opens a zoned device with O_DIRECT.

Naohiro Aota (5):
  btrfs-progs: mkfs: do not set zone size on non-zoned mode
  btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
  btrfs-progs: introduce btrfs_pread wrapper for pread
  btrfs-progs: temporally set zoned flag for initial tree reading
  btrfs-progs: use direct-IO for zoned device

 common/device-utils.c     | 76 ++++++++++++++++++++++++++++++++++++---
 common/device-utils.h     | 29 ++++++++++++++-
 kernel-shared/disk-io.c   | 19 +++++++++-
 kernel-shared/extent_io.c | 14 +++++---
 kernel-shared/volumes.c   |  4 +++
 kernel-shared/zoned.c     |  6 ++--
 mkfs/common.c             | 14 +++++---
 mkfs/main.c               | 12 +++++--
 8 files changed, 153 insertions(+), 21 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/5] btrfs-progs: mkfs: do not set zone size on non-zoned mode
  2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
@ 2021-09-27  4:15 ` Naohiro Aota
  2021-09-27  9:19   ` Johannes Thumshirn
  2021-09-27  4:15 ` [PATCH 2/5] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2021-09-27  4:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

Since zone_size() returns an emulated zone size even for non-zoned device,
we cannot use cfg.zone_size to determine the device is zoned or not. Set
zone_size = 0 on non-zoned mode.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 mkfs/main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index 6f3d6ce42c5d..b925c572b2b3 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1355,7 +1355,10 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	mkfs_cfg.features = features;
 	mkfs_cfg.runtime_features = runtime_features;
 	mkfs_cfg.csum_type = csum_type;
-	mkfs_cfg.zone_size = zone_size(file);
+	if (zoned)
+		mkfs_cfg.zone_size = zone_size(file);
+	else
+		mkfs_cfg.zone_size = 0;
 
 	ret = make_btrfs(fd, &mkfs_cfg);
 	if (ret) {
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/5] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
  2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
  2021-09-27  4:15 ` [PATCH 1/5] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
@ 2021-09-27  4:15 ` Naohiro Aota
  2021-09-27  9:39   ` Johannes Thumshirn
  2021-09-27  4:15 ` [PATCH 3/5] btrfs-progs: introduce btrfs_pread wrapper for pread Naohiro Aota
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2021-09-27  4:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

Wrap pwrite with btrfs_pwrite(). It simply calls pwrite() on non-zoned
btrfs (= opened without O_DIRECT). On zoned mode (= opened with O_DIRECT),
it allocates an aligned bounce buffer, copy the contents and use it for
direct-IO writing.

Writes in device_zero_blocks() and btrfs_wipe_existing_sb() are a little
tricky. We don't have fs_info on our hands, so use zinfo to determine it is
a zoned device or not.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/device-utils.c     | 76 ++++++++++++++++++++++++++++++++++++---
 common/device-utils.h     | 19 +++++++++-
 kernel-shared/extent_io.c |  7 ++--
 kernel-shared/zoned.c     |  4 +--
 mkfs/common.c             | 14 +++++---
 5 files changed, 106 insertions(+), 14 deletions(-)

diff --git a/common/device-utils.c b/common/device-utils.c
index 503705c43754..3ba4dccba689 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -26,6 +26,7 @@
 #include <dirent.h>
 #include <blkid/blkid.h>
 #include <linux/limits.h>
+#include <linux/fs.h>
 #include <limits.h>
 #include "kernel-lib/sizes.h"
 #include "kernel-shared/disk-io.h"
@@ -76,7 +77,7 @@ int device_discard_blocks(int fd, u64 start, u64 len)
 /*
  * Write zeros to the given range [start, start + len)
  */
-int device_zero_blocks(int fd, off_t start, size_t len)
+int device_zero_blocks(int fd, off_t start, size_t len, const bool direct)
 {
 	char *buf = malloc(len);
 	int ret = 0;
@@ -85,7 +86,7 @@ int device_zero_blocks(int fd, off_t start, size_t len)
 	if (!buf)
 		return -ENOMEM;
 	memset(buf, 0, len);
-	written = pwrite(fd, buf, len, start);
+	written = btrfs_pwrite(fd, buf, len, start, direct);
 	if (written != len)
 		ret = -EIO;
 	free(buf);
@@ -115,7 +116,7 @@ static int zero_dev_clamped(int fd, struct btrfs_zoned_device_info *zinfo,
 	if (zinfo && zinfo->model == ZONED_HOST_MANAGED)
 		return zero_zone_blocks(fd, zinfo, start, end - start);
 
-	return device_zero_blocks(fd, start, end - start);
+	return device_zero_blocks(fd, start, end - start, false);
 }
 
 /*
@@ -157,8 +158,10 @@ static int btrfs_wipe_existing_sb(int fd, struct btrfs_zoned_device_info *zinfo)
 		len = sizeof(buf);
 
 	if (!zone_is_sequential(zinfo, offset)) {
+		const bool direct = zinfo && zinfo->model == ZONED_HOST_MANAGED;
+
 		memset(buf, 0, len);
-		ret = pwrite(fd, buf, len, offset);
+		ret = btrfs_pwrite(fd, buf, len, offset, direct);
 		if (ret < 0) {
 			error("cannot wipe existing superblock: %m");
 			ret = -1;
@@ -491,3 +494,68 @@ out:
 	close(sysfs_fd);
 	return ret;
 }
+
+ssize_t btrfs_direct_pio(int rw, int fd, void *buf, size_t count, off_t offset)
+{
+	int alignment;
+	size_t iosize;
+	void *bounce_buf = NULL;
+	struct stat stat_buf;
+	unsigned long req;
+	int ret;
+	ssize_t ret_rw;
+
+	ASSERT(rw == READ || rw == WRITE);
+
+	if (fstat(fd, &stat_buf) == -1) {
+		error("fstat failed (%m)");
+		return 0;
+	}
+
+	if ((stat_buf.st_mode & S_IFMT) == S_IFBLK)
+		req = BLKSSZGET;
+	else
+		req = FIGETBSZ;
+
+	if (ioctl(fd, req, &alignment)) {
+		error("failed to get block size: %m");
+		return 0;
+	}
+
+	if (IS_ALIGNED((size_t)buf, alignment) && IS_ALIGNED(count, alignment)) {
+		if (rw == WRITE)
+			return pwrite(fd, buf, count, offset);
+		else
+			return pread(fd, buf, count, offset);
+	}
+
+	/* Cannot do anything if the write size is not aligned */
+	if (rw == WRITE && !IS_ALIGNED(count, alignment)) {
+		error("%lu is not aligned to %d", count, alignment);
+		return 0;
+	}
+
+	iosize = round_up(count, alignment);
+
+	ret = posix_memalign(&bounce_buf, alignment, iosize);
+	if (ret) {
+		error("failed to allocate bounce buffer: %m");
+		errno = ret;
+		return 0;
+	}
+
+	if (rw == WRITE) {
+		ASSERT(iosize == count);
+		memcpy(bounce_buf, buf, count);
+		ret_rw = pwrite(fd, bounce_buf, iosize, offset);
+	} else {
+		ret_rw = pread(fd, bounce_buf, iosize, offset);
+		if (ret_rw >= count) {
+			ret_rw = count;
+			memcpy(buf, bounce_buf, count);
+		}
+	}
+
+	free(bounce_buf);
+	return ret_rw;
+}
diff --git a/common/device-utils.h b/common/device-utils.h
index 099520bf9737..767dab4370e1 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -17,6 +17,8 @@
 #ifndef __DEVICE_UTILS_H__
 #define __DEVICE_UTILS_H__
 
+#include <stdbool.h>
+#include <unistd.h>
 #include "kerncompat.h"
 #include "sys/stat.h"
 
@@ -35,7 +37,7 @@
  * Generic block device helpers
  */
 int device_discard_blocks(int fd, u64 start, u64 len);
-int device_zero_blocks(int fd, off_t start, size_t len);
+int device_zero_blocks(int fd, off_t start, size_t len, const bool direct);
 u64 device_get_partition_size(const char *dev);
 u64 device_get_partition_size_fd(int fd);
 int device_get_queue_param(const char *file, const char *param, char *buf, size_t len);
@@ -47,5 +49,20 @@ u64 device_get_zone_size(int fd, const char *name);
 u64 btrfs_device_size(int fd, struct stat *st);
 int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		u64 max_block_count, unsigned opflags);
+ssize_t btrfs_direct_pio(int rw, int fd, void *buf, size_t count, off_t offset);
+
+#ifdef BTRFS_ZONED
+static inline ssize_t btrfs_pwrite(int fd, void *buf, size_t count,
+				   off_t offset, bool direct)
+{
+	if (!direct)
+		return pwrite(fd, buf, count, offset);
+
+	return btrfs_direct_pio(WRITE, fd, buf, count, offset);
+}
+#else
+#define btrfs_pwrite(fd, buf, count, offset, direct) \
+	({ (void)(direct); pwrite(fd, buf, count, offset); })
+#endif
 
 #endif
diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index d3d79bc8f748..b5984949f431 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -29,6 +29,7 @@
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/volumes.h"
 #include "common/utils.h"
+#include "common/device-utils.h"
 #include "common/internal.h"
 
 void extent_io_tree_init(struct extent_io_tree *tree)
@@ -809,7 +810,8 @@ out:
 int write_extent_to_disk(struct extent_buffer *eb)
 {
 	int ret;
-	ret = pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr);
+	ret = btrfs_pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr,
+			   eb->fs_info->zoned);
 	if (ret < 0)
 		goto out;
 	if (ret != eb->len) {
@@ -932,7 +934,8 @@ int write_data_to_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
 			this_len = min(this_len, bytes_left);
 			dev_nr++;
 
-			ret = pwrite(device->fd, buf + total_write, this_len, dev_bytenr);
+			ret = btrfs_pwrite(device->fd, buf + total_write,
+					   this_len, dev_bytenr, info->zoned);
 			if (ret != this_len) {
 				if (ret < 0) {
 					fprintf(stderr, "Error writing to "
diff --git a/kernel-shared/zoned.c b/kernel-shared/zoned.c
index 8d94f98a7fce..c2cce3b5f366 100644
--- a/kernel-shared/zoned.c
+++ b/kernel-shared/zoned.c
@@ -424,7 +424,7 @@ int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
 			count = zone_len - (ofst & (zone_len - 1));
 
 		if (!zone_is_sequential(zinfo, ofst)) {
-			ret = device_zero_blocks(fd, ofst, count);
+			ret = device_zero_blocks(fd, ofst, count, true);
 			if (ret != 0)
 				return ret;
 		}
@@ -595,7 +595,7 @@ size_t btrfs_sb_io(int fd, void *buf, off_t offset, int rw)
 	if (rw == READ)
 		ret_sz = pread64(fd, buf, count, mapped);
 	else
-		ret_sz = pwrite64(fd, buf, count, mapped);
+		ret_sz = btrfs_pwrite(fd, buf, count, mapped, true);
 
 	if (ret_sz != count)
 		return ret_sz;
diff --git a/mkfs/common.c b/mkfs/common.c
index 20a7d1155972..5c8d6ac13a3b 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -54,7 +54,7 @@ static int btrfs_write_empty_tree(int fd, struct btrfs_mkfs_config *cfg,
 	btrfs_set_header_nritems(buf, 0);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, block);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize, block, cfg->zone_size);
 	if (ret != cfg->nodesize)
 		return ret < 0 ? -errno : -EIO;
 	return 0;
@@ -134,7 +134,8 @@ static int btrfs_create_tree_root(int fd, struct btrfs_mkfs_config *cfg,
 			     cfg->csum_type);
 
 	/* write back root tree */
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_ROOT_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_ROOT_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize)
 		return (ret < 0 ? -errno : -EIO);
 
@@ -422,7 +423,8 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_header_nritems(buf, nritems);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_EXTENT_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_EXTENT_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize) {
 		ret = (ret < 0 ? -errno : -EIO);
 		goto out;
@@ -510,7 +512,8 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_header_nritems(buf, nritems);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_CHUNK_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_CHUNK_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize) {
 		ret = (ret < 0 ? -errno : -EIO);
 		goto out;
@@ -550,7 +553,8 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_header_nritems(buf, nritems);
 	csum_tree_block_size(buf, btrfs_csum_type_size(cfg->csum_type), 0,
 			     cfg->csum_type);
-	ret = pwrite(fd, buf->data, cfg->nodesize, cfg->blocks[MKFS_DEV_TREE]);
+	ret = btrfs_pwrite(fd, buf->data, cfg->nodesize,
+			   cfg->blocks[MKFS_DEV_TREE], cfg->zone_size);
 	if (ret != cfg->nodesize) {
 		ret = (ret < 0 ? -errno : -EIO);
 		goto out;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/5] btrfs-progs: introduce btrfs_pread wrapper for pread
  2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
  2021-09-27  4:15 ` [PATCH 1/5] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
  2021-09-27  4:15 ` [PATCH 2/5] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
@ 2021-09-27  4:15 ` Naohiro Aota
  2021-09-27 10:23   ` Johannes Thumshirn
  2021-09-27  4:15 ` [PATCH 4/5] btrfs-progs: temporally set zoned flag for initial tree reading Naohiro Aota
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2021-09-27  4:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

Wrap pread with btrfs_pread as well.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/device-utils.h     | 10 ++++++++++
 kernel-shared/disk-io.c   |  4 +++-
 kernel-shared/extent_io.c |  7 ++++---
 kernel-shared/zoned.c     |  2 +-
 4 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/common/device-utils.h b/common/device-utils.h
index 767dab4370e1..f79e746840fc 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -60,9 +60,19 @@ static inline ssize_t btrfs_pwrite(int fd, void *buf, size_t count,
 
 	return btrfs_direct_pio(WRITE, fd, buf, count, offset);
 }
+static inline ssize_t btrfs_pread(int fd, void *buf, size_t count, off_t offset,
+				  bool direct)
+{
+	if (!direct)
+		return pread(fd, buf, count, offset);
+
+	return btrfs_direct_pio(READ, fd, buf, count, offset);
+}
 #else
 #define btrfs_pwrite(fd, buf, count, offset, direct) \
 	({ (void)(direct); pwrite(fd, buf, count, offset); })
+#define btrfs_pread(fd, buf, count, offset, direct) \
+	({ (void)(direct); pread(fd, buf, count, offset); })
 #endif
 
 #endif
diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index 1cda6f3a98af..740500f9fdc9 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -35,6 +35,7 @@
 #include "kernel-shared/print-tree.h"
 #include "common/rbtree-utils.h"
 #include "common/device-scan.h"
+#include "common/device-utils.h"
 #include "crypto/hash.h"
 
 /* specified errno for check_tree_block */
@@ -476,7 +477,8 @@ int read_extent_data(struct btrfs_fs_info *fs_info, char *data, u64 logical,
 		goto err;
 	}
 
-	ret = pread64(device->fd, data, *len, multi->stripes[0].physical);
+	ret = btrfs_pread(device->fd, data, *len, multi->stripes[0].physical,
+			  fs_info->zoned);
 	if (ret != *len)
 		ret = -EIO;
 	else
diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index b5984949f431..af09ade4025f 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -793,7 +793,8 @@ int read_extent_from_disk(struct extent_buffer *eb,
 			  unsigned long offset, unsigned long len)
 {
 	int ret;
-	ret = pread(eb->fd, eb->data + offset, len, eb->dev_bytenr);
+	ret = btrfs_pread(eb->fd, eb->data + offset, len, eb->dev_bytenr,
+			  eb->fs_info->zoned);
 	if (ret < 0) {
 		ret = -errno;
 		goto out;
@@ -850,8 +851,8 @@ int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
 			return -EIO;
 		}
 
-		ret = pread(device->fd, buf + total_read, read_len,
-			    multi->stripes[0].physical);
+		ret = btrfs_pread(device->fd, buf + total_read, read_len,
+				  multi->stripes[0].physical, info->zoned);
 		kfree(multi);
 		if (ret < 0) {
 			fprintf(stderr, "Error reading %llu, %d\n", offset,
diff --git a/kernel-shared/zoned.c b/kernel-shared/zoned.c
index c2cce3b5f366..f5d2299fc744 100644
--- a/kernel-shared/zoned.c
+++ b/kernel-shared/zoned.c
@@ -593,7 +593,7 @@ size_t btrfs_sb_io(int fd, void *buf, off_t offset, int rw)
 		return ret;
 
 	if (rw == READ)
-		ret_sz = pread64(fd, buf, count, mapped);
+		ret_sz = btrfs_pread(fd, buf, count, mapped, true);
 	else
 		ret_sz = btrfs_pwrite(fd, buf, count, mapped, true);
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/5] btrfs-progs: temporally set zoned flag for initial tree reading
  2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (2 preceding siblings ...)
  2021-09-27  4:15 ` [PATCH 3/5] btrfs-progs: introduce btrfs_pread wrapper for pread Naohiro Aota
@ 2021-09-27  4:15 ` Naohiro Aota
  2021-09-27 12:38   ` Johannes Thumshirn
  2021-09-27  4:15 ` [PATCH 5/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2021-09-27  4:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

Functions to read data/metadata e.g. read_extent_from_disk() now depend on
the fs_info->zoned flag to determine if they do direct-IO or not.

The flag (and zone_size) is not known before reading the chunk tree and it
set to 0 while in the initial chunk tree setup process. That will cause
btrfs_pread() to fail because it does not align the buffer.

Use fcntl() to find out the file descriptor is opened with O_DIRECT or not,
and if it is, set the zoned flag to 1 temporally for this initial process.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 kernel-shared/disk-io.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index 740500f9fdc9..dd48599a5f1f 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -1302,10 +1302,22 @@ static struct btrfs_fs_info *__open_ctree_fd(int fp, struct open_ctree_flags *oc
 	if (ret)
 		goto out_devices;
 
+	/*
+	 * fs_info->zone_size (and zoned) are not known before reading the
+	 * chunk tree, so it's 0 at this point. But, fs_info->zoned == 0
+	 * will cause btrfs_pread() not to use an aligned bounce buffer,
+	 * causing EINVAL when the file is opened with O_DIRECT. Temporally
+	 * set zoned = 1 in that case.
+	 */
+	if (fcntl(fp, F_GETFL) & O_DIRECT)
+		fs_info->zoned = 1;
+
 	ret = btrfs_setup_chunk_tree_and_device_map(fs_info, ocf->chunk_tree_bytenr);
 	if (ret)
 		goto out_chunk;
 
+	fs_info->zoned = 0;
+
 	/* Chunk tree root is unable to read, return directly */
 	if (!fs_info->chunk_root)
 		return fs_info;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/5] btrfs-progs: use direct-IO for zoned device
  2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (3 preceding siblings ...)
  2021-09-27  4:15 ` [PATCH 4/5] btrfs-progs: temporally set zoned flag for initial tree reading Naohiro Aota
@ 2021-09-27  4:15 ` Naohiro Aota
  2021-09-27 18:48   ` David Sterba
  2021-09-27 19:26 ` [PATCH 0/5] " David Sterba
  2021-09-27 21:51 ` David Sterba
  6 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2021-09-27  4:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba, Naohiro Aota

We need to use direct-IO for zoned devices to preserve the write ordering.
Instead of detecting if the device is zoned or not, we simply use direct-IO
for any kind of device (even if emulated zoned mode on a regular device).

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 kernel-shared/disk-io.c | 3 +++
 kernel-shared/volumes.c | 4 ++++
 mkfs/main.c             | 7 ++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index dd48599a5f1f..aabeba7821ed 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -1382,6 +1382,9 @@ struct btrfs_fs_info *open_ctree_fs_info(struct open_ctree_flags *ocf)
 	if (!(ocf->flags & OPEN_CTREE_WRITES))
 		oflags = O_RDONLY;
 
+	if ((oflags & O_RDWR) && zoned_model(ocf->filename) == ZONED_HOST_MANAGED)
+		oflags |= O_DIRECT;
+
 	fp = open(ocf->filename, oflags);
 	if (fp < 0) {
 		error("cannot open '%s': %m", ocf->filename);
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index b2a6b04f8e3d..ff4bd0723dbb 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -455,6 +455,10 @@ int btrfs_open_devices(struct btrfs_fs_info *fs_info,
 			continue;
 		}
 
+		if ((flags & O_RDWR) &&
+		    zoned_model(device->name) == ZONED_HOST_MANAGED)
+			flags |= O_DIRECT;
+
 		fd = open(device->name, flags);
 		if (fd < 0) {
 			ret = -errno;
diff --git a/mkfs/main.c b/mkfs/main.c
index b925c572b2b3..01187763a90c 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -894,6 +894,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	int ssd = 0;
 	int zoned = 0;
 	int force_overwrite = 0;
+	int oflags;
 	char *source_dir = NULL;
 	bool source_dir_set = false;
 	bool shrink_rootdir = false;
@@ -1310,12 +1311,16 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 
 	dev_cnt--;
 
+	oflags = O_RDWR;
+	if (zoned && zoned_model(file) == ZONED_HOST_MANAGED)
+		oflags |= O_DIRECT;
+
 	/*
 	 * Open without O_EXCL so that the problem should not occur by the
 	 * following operation in kernel:
 	 * (btrfs_register_one_device() fails if O_EXCL is on)
 	 */
-	fd = open(file, O_RDWR);
+	fd = open(file, oflags);
 	if (fd < 0) {
 		error("unable to open %s: %m", file);
 		goto error;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/5] btrfs-progs: mkfs: do not set zone size on non-zoned mode
  2021-09-27  4:15 ` [PATCH 1/5] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
@ 2021-09-27  9:19   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-09-27  9:19 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: David Sterba

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
  2021-09-27  4:15 ` [PATCH 2/5] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
@ 2021-09-27  9:39   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-09-27  9:39 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: David Sterba

On 27/09/2021 06:16, Naohiro Aota wrote:
> +ssize_t btrfs_direct_pio(int rw, int fd, void *buf, size_t count, off_t offset)

What's the _pio suffix, posix memaligned io?

Apart form the question,

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/5] btrfs-progs: introduce btrfs_pread wrapper for pread
  2021-09-27  4:15 ` [PATCH 3/5] btrfs-progs: introduce btrfs_pread wrapper for pread Naohiro Aota
@ 2021-09-27 10:23   ` Johannes Thumshirn
  2021-09-27 18:41     ` David Sterba
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2021-09-27 10:23 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: David Sterba

I'd squash that one into the previous patch, but that's preference I guess.

Anyways,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] btrfs-progs: temporally set zoned flag for initial tree reading
  2021-09-27  4:15 ` [PATCH 4/5] btrfs-progs: temporally set zoned flag for initial tree reading Naohiro Aota
@ 2021-09-27 12:38   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-09-27 12:38 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: David Sterba

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/5] btrfs-progs: introduce btrfs_pread wrapper for pread
  2021-09-27 10:23   ` Johannes Thumshirn
@ 2021-09-27 18:41     ` David Sterba
  0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2021-09-27 18:41 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Naohiro Aota, linux-btrfs, David Sterba

On Mon, Sep 27, 2021 at 10:23:17AM +0000, Johannes Thumshirn wrote:
> I'd squash that one into the previous patch, but that's preference I guess.

Yeah, 2 patches are preferred.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] btrfs-progs: use direct-IO for zoned device
  2021-09-27  4:15 ` [PATCH 5/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
@ 2021-09-27 18:48   ` David Sterba
  0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2021-09-27 18:48 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba

On Mon, Sep 27, 2021 at 01:15:54PM +0900, Naohiro Aota wrote:
> We need to use direct-IO for zoned devices to preserve the write ordering.
> Instead of detecting if the device is zoned or not, we simply use direct-IO
> for any kind of device (even if emulated zoned mode on a regular device).

I think this should be ok, we don't want to mix direct io and buffered
writes and both main device opening wrappers do that. As long as it's
abstracted like that we don't need any special detection or flags in the
callers. I was thinking about adding one to open_ctree_flags but that
does not seem necessary.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
  2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (4 preceding siblings ...)
  2021-09-27  4:15 ` [PATCH 5/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
@ 2021-09-27 19:26 ` David Sterba
  2021-09-29  2:21   ` Naohiro Aota
  2021-09-27 21:51 ` David Sterba
  6 siblings, 1 reply; 19+ messages in thread
From: David Sterba @ 2021-09-27 19:26 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba

On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> As discussed in the Zoned Storage page [1],  the kernel page cache does not
> guarantee that cached dirty pages will be flushed to a block device in
> sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> device to ensure the write ordering.
> 
> [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> 
> As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> struct extent_buffer), it is difficult to allocate the struct so that the
> writng buffer is aligned.
> 
> This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> which allocates an aligned bounce buffer, copy the buffer contents, and
> proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> 
> Since the allocation and copying are costly, it is better to do them only
> when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> the file is opened with O_DIRECT or not every time doing an IO.

This should be in the changelog somewhere too, the last patch looks like
a good place so I'll copy it there.

> As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> to determine if it is direct-IO or not. This can cause a false-positive (to
> use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> emulated zoned mode on a non-zoned device or a regular file. Considering
> the emulated zoned mode is mostly for debugging or testing, I believe this
> is acceptable.

Agreed.

All patches added to devel. Would be good to add some tests for the
emulated mode, ie. that we can test at least something regularly without
special devices.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
  2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
                   ` (5 preceding siblings ...)
  2021-09-27 19:26 ` [PATCH 0/5] " David Sterba
@ 2021-09-27 21:51 ` David Sterba
  2021-09-29  2:24   ` Naohiro Aota
  6 siblings, 1 reply; 19+ messages in thread
From: David Sterba @ 2021-09-27 21:51 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba

On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> As discussed in the Zoned Storage page [1],  the kernel page cache does not
> guarantee that cached dirty pages will be flushed to a block device in
> sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> device to ensure the write ordering.
> 
> [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> 
> As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> struct extent_buffer), it is difficult to allocate the struct so that the
> writng buffer is aligned.
> 
> This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> which allocates an aligned bounce buffer, copy the buffer contents, and
> proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> 
> Since the allocation and copying are costly, it is better to do them only
> when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> the file is opened with O_DIRECT or not every time doing an IO.
> 
> As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> to determine if it is direct-IO or not. This can cause a false-positive (to
> use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> emulated zoned mode on a non-zoned device or a regular file. Considering
> the emulated zoned mode is mostly for debugging or testing, I believe this
> is acceptable.
> 
> Patch 1 is a preparation not to set an emulated zone_size value when not
> needed.
> 
> Patches 2 and 3 wraps pread/pwrite with newly introduced function
> btrfs_pread/btrfs_pwrite.
> 
> Patches 4 deals with the zoned flag while reading the initial trees.
> 
> Patch 5 finally opens a zoned device with O_DIRECT.
> 
> Naohiro Aota (5):
>   btrfs-progs: mkfs: do not set zone size on non-zoned mode
>   btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
>   btrfs-progs: introduce btrfs_pread wrapper for pread
>   btrfs-progs: temporally set zoned flag for initial tree reading
>   btrfs-progs: use direct-IO for zoned device

I was doing some btrfs-convert changes and found that it crashed, rough
bisection points to this series. With the last patch applied, convert
fails with the following ASAN error:

...
Create initial btrfs filesystem
Create ext2 image file
AddressSanitizer:DEADLYSIGNAL
=================================================================
==18432==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000000496627 bp 0x7ffe5299e4d0 sp 0x7ffe5299e4b0 T0)
==18432==The signal is caused by a READ memory access.
==18432==Hint: this fault was caused by a dereference of a high value address (see register values below).  Dissassemble the provided pc to learn which register was used.
    #0 0x496627 in write_extent_to_disk kernel-shared/extent_io.c:815
    #1 0x470080 in write_and_map_eb kernel-shared/disk-io.c:525
    #2 0x411af9 in migrate_one_reserved_range convert/main.c:402
    #3 0x411fa5 in migrate_reserved_ranges convert/main.c:459
    #4 0x414088 in create_image convert/main.c:878
    #5 0x416d70 in do_convert convert/main.c:1269
    #6 0x41a294 in main convert/main.c:1993
    #7 0x7f7ef7c2753f in __libc_start_call_main (/lib64/libc.so.6+0x2d53f)
    #8 0x7f7ef7c275eb in __libc_start_main_alias_1 (/lib64/libc.so.6+0x2d5eb)
    #9 0x40ed04 in _start (.../btrfs-progs/btrfs-convert+0x40ed04)

kernel-shared/extent_io.c:815:

 811 int write_extent_to_disk(struct extent_buffer *eb)
 812 {
 813         int ret;
 814         ret = btrfs_pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr,
 815                            eb->fs_info->zoned);
 816         if (ret < 0)
 817                 goto out;
 818         if (ret != eb->len) {
 819                 ret = -EIO;
 820                 goto out;
 821         }
 822         ret = 0;
 823 out:
 824         return ret;
 825 }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
  2021-09-27 19:26 ` [PATCH 0/5] " David Sterba
@ 2021-09-29  2:21   ` Naohiro Aota
  2021-09-29 10:16     ` David Sterba
  0 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2021-09-29  2:21 UTC (permalink / raw)
  To: dsterba, linux-btrfs, David Sterba

On Mon, Sep 27, 2021 at 09:26:18PM +0200, David Sterba wrote:
> On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > As discussed in the Zoned Storage page [1],  the kernel page cache does not
> > guarantee that cached dirty pages will be flushed to a block device in
> > sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> > device to ensure the write ordering.
> > 
> > [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> > 
> > As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> > struct extent_buffer), it is difficult to allocate the struct so that the
> > writng buffer is aligned.
> > 
> > This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> > which allocates an aligned bounce buffer, copy the buffer contents, and
> > proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> > 
> > Since the allocation and copying are costly, it is better to do them only
> > when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> > the file is opened with O_DIRECT or not every time doing an IO.
> 
> This should be in the changelog somewhere too, the last patch looks like
> a good place so I'll copy it there.
> 
> > As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> > to determine if it is direct-IO or not. This can cause a false-positive (to
> > use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> > emulated zoned mode on a non-zoned device or a regular file. Considering
> > the emulated zoned mode is mostly for debugging or testing, I believe this
> > is acceptable.
> 
> Agreed.
> 
> All patches added to devel. Would be good to add some tests for the
> emulated mode, ie. that we can test at least something regularly without
> special devices.

Will do. We may also add some tests for zoned device by setting up
null_blk (provided the machine has enough memory).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
  2021-09-27 21:51 ` David Sterba
@ 2021-09-29  2:24   ` Naohiro Aota
  2021-09-29 10:22     ` David Sterba
  0 siblings, 1 reply; 19+ messages in thread
From: Naohiro Aota @ 2021-09-29  2:24 UTC (permalink / raw)
  To: dsterba, linux-btrfs, David Sterba

On Mon, Sep 27, 2021 at 11:51:39PM +0200, David Sterba wrote:
> On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > As discussed in the Zoned Storage page [1],  the kernel page cache does not
> > guarantee that cached dirty pages will be flushed to a block device in
> > sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> > device to ensure the write ordering.
> > 
> > [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> > 
> > As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> > struct extent_buffer), it is difficult to allocate the struct so that the
> > writng buffer is aligned.
> > 
> > This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> > which allocates an aligned bounce buffer, copy the buffer contents, and
> > proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> > 
> > Since the allocation and copying are costly, it is better to do them only
> > when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> > the file is opened with O_DIRECT or not every time doing an IO.
> > 
> > As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> > to determine if it is direct-IO or not. This can cause a false-positive (to
> > use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> > emulated zoned mode on a non-zoned device or a regular file. Considering
> > the emulated zoned mode is mostly for debugging or testing, I believe this
> > is acceptable.
> > 
> > Patch 1 is a preparation not to set an emulated zone_size value when not
> > needed.
> > 
> > Patches 2 and 3 wraps pread/pwrite with newly introduced function
> > btrfs_pread/btrfs_pwrite.
> > 
> > Patches 4 deals with the zoned flag while reading the initial trees.
> > 
> > Patch 5 finally opens a zoned device with O_DIRECT.
> > 
> > Naohiro Aota (5):
> >   btrfs-progs: mkfs: do not set zone size on non-zoned mode
> >   btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
> >   btrfs-progs: introduce btrfs_pread wrapper for pread
> >   btrfs-progs: temporally set zoned flag for initial tree reading
> >   btrfs-progs: use direct-IO for zoned device
> 
> I was doing some btrfs-convert changes and found that it crashed, rough
> bisection points to this series. With the last patch applied, convert
> fails with the following ASAN error:

It looks like eb->fs_info == NULL at this point. In case of
btrfs-convert, we can assume it is non-zoned because we do not support
the converting on a zoned device (we can't create ext*, reiserfs on a
zoned device anyway).

But, I also found a similar issue occurs with "mkfs.btrfs -f -d raid5
-m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3" in
read_extent_from_disk(). Let me check which is better to ensure the
fs_info is set or to check if it's NULL.

> ...
> Create initial btrfs filesystem
> Create ext2 image file
> AddressSanitizer:DEADLYSIGNAL
> =================================================================
> ==18432==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000000496627 bp 0x7ffe5299e4d0 sp 0x7ffe5299e4b0 T0)
> ==18432==The signal is caused by a READ memory access.
> ==18432==Hint: this fault was caused by a dereference of a high value address (see register values below).  Dissassemble the provided pc to learn which register was used.
>     #0 0x496627 in write_extent_to_disk kernel-shared/extent_io.c:815
>     #1 0x470080 in write_and_map_eb kernel-shared/disk-io.c:525
>     #2 0x411af9 in migrate_one_reserved_range convert/main.c:402
>     #3 0x411fa5 in migrate_reserved_ranges convert/main.c:459
>     #4 0x414088 in create_image convert/main.c:878
>     #5 0x416d70 in do_convert convert/main.c:1269
>     #6 0x41a294 in main convert/main.c:1993
>     #7 0x7f7ef7c2753f in __libc_start_call_main (/lib64/libc.so.6+0x2d53f)
>     #8 0x7f7ef7c275eb in __libc_start_main_alias_1 (/lib64/libc.so.6+0x2d5eb)
>     #9 0x40ed04 in _start (.../btrfs-progs/btrfs-convert+0x40ed04)
> 
> kernel-shared/extent_io.c:815:
> 
>  811 int write_extent_to_disk(struct extent_buffer *eb)
>  812 {
>  813         int ret;
>  814         ret = btrfs_pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr,
>  815                            eb->fs_info->zoned);
>  816         if (ret < 0)
>  817                 goto out;
>  818         if (ret != eb->len) {
>  819                 ret = -EIO;
>  820                 goto out;
>  821         }
>  822         ret = 0;
>  823 out:
>  824         return ret;
>  825 }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
  2021-09-29  2:21   ` Naohiro Aota
@ 2021-09-29 10:16     ` David Sterba
  0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2021-09-29 10:16 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: dsterba, linux-btrfs, David Sterba

On Wed, Sep 29, 2021 at 02:21:02AM +0000, Naohiro Aota wrote:
> On Mon, Sep 27, 2021 at 09:26:18PM +0200, David Sterba wrote:
> > On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > > As discussed in the Zoned Storage page [1],  the kernel page cache does not
> > > guarantee that cached dirty pages will be flushed to a block device in
> > > sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> > > device to ensure the write ordering.
> > > 
> > > [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> > > 
> > > As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> > > struct extent_buffer), it is difficult to allocate the struct so that the
> > > writng buffer is aligned.
> > > 
> > > This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> > > which allocates an aligned bounce buffer, copy the buffer contents, and
> > > proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> > > 
> > > Since the allocation and copying are costly, it is better to do them only
> > > when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> > > the file is opened with O_DIRECT or not every time doing an IO.
> > 
> > This should be in the changelog somewhere too, the last patch looks like
> > a good place so I'll copy it there.
> > 
> > > As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> > > to determine if it is direct-IO or not. This can cause a false-positive (to
> > > use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> > > emulated zoned mode on a non-zoned device or a regular file. Considering
> > > the emulated zoned mode is mostly for debugging or testing, I believe this
> > > is acceptable.
> > 
> > Agreed.
> > 
> > All patches added to devel. Would be good to add some tests for the
> > emulated mode, ie. that we can test at least something regularly without
> > special devices.
> 
> Will do. We may also add some tests for zoned device by setting up
> null_blk (provided the machine has enough memory).

As setting up the elated zoned devices requires some resources or
non-trivial setup we can add a separate class of tests. As there are
still limitations to what zoned mode supports, running all current tests
won't work anyway without tons of workarounds or quirks to existing
tests.

Adding a separate class would probably duplicate some of the tests but
that's IMHO less error prone way than changing the other tests. If the
zoned-tests are not run by default it should be safe for regular
testing.

Eventually, to avoid code duplication, we cand do some sort of test
links. The zoned-test will set up the environment and then run the
existing test from other directory.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
  2021-09-29  2:24   ` Naohiro Aota
@ 2021-09-29 10:22     ` David Sterba
  2021-10-05  6:11       ` Naohiro Aota
  0 siblings, 1 reply; 19+ messages in thread
From: David Sterba @ 2021-09-29 10:22 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: dsterba, linux-btrfs, David Sterba

On Wed, Sep 29, 2021 at 02:24:22AM +0000, Naohiro Aota wrote:
> On Mon, Sep 27, 2021 at 11:51:39PM +0200, David Sterba wrote:
> > On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > I was doing some btrfs-convert changes and found that it crashed, rough
> > bisection points to this series. With the last patch applied, convert
> > fails with the following ASAN error:
> 
> It looks like eb->fs_info == NULL at this point. In case of
> btrfs-convert, we can assume it is non-zoned because we do not support
> the converting on a zoned device (we can't create ext*, reiserfs on a
> zoned device anyway).

That would mean that extN/reiserfs was created on a zoned device. One
can still do a image copy to a zoned device and then convert. Even if
this is possible in theory I'd rather not allow that right now because
there are probably more changes required to do full support.

I've just noticed that ZONED bit is mistakenly among the feature flag
bits allowed in convert. Added in 242c8328bcd55175 "btrfs-progs: zoned:
add new ZONED feature flag":

BTRFS_CONVERT_ALLOWED_FEATURES must not contain
BTRFS_FEATURE_INCOMPAT_ZONED.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] btrfs-progs: use direct-IO for zoned device
  2021-09-29 10:22     ` David Sterba
@ 2021-10-05  6:11       ` Naohiro Aota
  0 siblings, 0 replies; 19+ messages in thread
From: Naohiro Aota @ 2021-10-05  6:11 UTC (permalink / raw)
  To: dsterba, linux-btrfs, David Sterba

On Wed, Sep 29, 2021 at 12:22:23PM +0200, David Sterba wrote:
> On Wed, Sep 29, 2021 at 02:24:22AM +0000, Naohiro Aota wrote:
> > On Mon, Sep 27, 2021 at 11:51:39PM +0200, David Sterba wrote:
> > > On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > > I was doing some btrfs-convert changes and found that it crashed, rough
> > > bisection points to this series. With the last patch applied, convert
> > > fails with the following ASAN error:
> > 
> > It looks like eb->fs_info == NULL at this point. In case of
> > btrfs-convert, we can assume it is non-zoned because we do not support
> > the converting on a zoned device (we can't create ext*, reiserfs on a
> > zoned device anyway).
> 
> That would mean that extN/reiserfs was created on a zoned device. One
> can still do a image copy to a zoned device and then convert. Even if
> this is possible in theory I'd rather not allow that right now because
> there are probably more changes required to do full support.
> 
> I've just noticed that ZONED bit is mistakenly among the feature flag
> bits allowed in convert. Added in 242c8328bcd55175 "btrfs-progs: zoned:
> add new ZONED feature flag":
> 
> BTRFS_CONVERT_ALLOWED_FEATURES must not contain
> BTRFS_FEATURE_INCOMPAT_ZONED.

Oops, I thought I did not list BTRFS_FEATURE_INCOMPAT_ZONED in the
ALLOWED_FEATURES list. I'll fix it in the new series.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-10-05  6:11 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-27  4:15 [PATCH 0/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
2021-09-27  4:15 ` [PATCH 1/5] btrfs-progs: mkfs: do not set zone size on non-zoned mode Naohiro Aota
2021-09-27  9:19   ` Johannes Thumshirn
2021-09-27  4:15 ` [PATCH 2/5] btrfs-progs: introduce btrfs_pwrite wrapper for pwrite Naohiro Aota
2021-09-27  9:39   ` Johannes Thumshirn
2021-09-27  4:15 ` [PATCH 3/5] btrfs-progs: introduce btrfs_pread wrapper for pread Naohiro Aota
2021-09-27 10:23   ` Johannes Thumshirn
2021-09-27 18:41     ` David Sterba
2021-09-27  4:15 ` [PATCH 4/5] btrfs-progs: temporally set zoned flag for initial tree reading Naohiro Aota
2021-09-27 12:38   ` Johannes Thumshirn
2021-09-27  4:15 ` [PATCH 5/5] btrfs-progs: use direct-IO for zoned device Naohiro Aota
2021-09-27 18:48   ` David Sterba
2021-09-27 19:26 ` [PATCH 0/5] " David Sterba
2021-09-29  2:21   ` Naohiro Aota
2021-09-29 10:16     ` David Sterba
2021-09-27 21:51 ` David Sterba
2021-09-29  2:24   ` Naohiro Aota
2021-09-29 10:22     ` David Sterba
2021-10-05  6:11       ` Naohiro Aota

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.