All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] Reduce zonefs memory usage
@ 2023-01-10 13:08 Damien Le Moal
  2023-01-10 13:08 ` [PATCH 1/7] zonefs: Detect append writes at invalid locations Damien Le Moal
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

This series improves memory usage by switching to using dynamically
allocated inodes and dentries, similarly to regular file systems. This
drastically reduces the memory consumption of zonefs when the file
system is mounted. E.g., for a 26 TB SMR HDD with over 95000 zones,
memory usage is decreased from about 130 MB down to a little over 5 MB.

Since zonefs does not have persistent metadata of its own and relies
completely on the device zone state management, information such as the 
zone write pointer position (used to determine an inode size) is kept in
memory at all time, even when an inode is evicted from the inode cache. 
Dentries can be trivially searched for and created dynamically as needed
due to the static file tree structure of a zonefs volume (fully
determined by a zoned device zone configuration).

The patch series is organized as follows. Patch 1 fixes a problem with
synchronous write error detection and error recovery. Patch 2
reorganizes the code (splitting out file operation code into a new file)
and patch 3 simplifies the IO error recovery code. Patch 4 and 5 cleanup
inode zone information and split that information out of the zonefs
inode structure. Finally, Patch 6 implements dynamic inode and dentry
allocation and maanagement operations. Ptch 7 introduces a small
optimization to reduce file open latency.

Damien Le Moal (7):
  zonefs: Detect append writes at invalid locations
  zonefs: Reorganize code
  zonefs: Simplify IO error handling
  zonefs: Reduce struct zonefs_inode_info size
  zonefs: Separate zone information from inode information
  zonefs: Dynamically create file inodes when needed
  zonefs: Cache zone group directory inodes

 fs/zonefs/Makefile |    2 +-
 fs/zonefs/file.c   |  878 ++++++++++++++++++++
 fs/zonefs/super.c  | 1923 ++++++++++++++++----------------------------
 fs/zonefs/trace.h  |   20 +-
 fs/zonefs/zonefs.h |  110 ++-
 5 files changed, 1684 insertions(+), 1249 deletions(-)
 create mode 100644 fs/zonefs/file.c

-- 
2.39.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/7] zonefs: Detect append writes at invalid locations
  2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
@ 2023-01-10 13:08 ` Damien Le Moal
  2023-01-11 12:23   ` Johannes Thumshirn
  2023-01-10 13:08 ` [PATCH 2/7] zonefs: Reorganize code Damien Le Moal
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential
files succeeds regardless of the zone write pointer position, as long as
the target zone is not full. This means that if an external (buggy)
application writes to the zone of a sequential file underneath the file
system, subsequent file write() operation will succeed but the file size
will not be correct and the file will contain invalid data written by
another application.

Modify zonefs_file_dio_append() to check the written sector of an append
write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a
mismatch with the file zone wp offset field. This change triggers a call
to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to
not expose the unexpected data after the current inode size when the
errors=remount-ro mode is used. Other error modes are correctly handled
already.

Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/super.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 2c53fbb8d918..a9c5c3f720ad 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -442,6 +442,10 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 			data_size = zonefs_check_zone_condition(inode, zone,
 								false, false);
 		}
+	} else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_RO &&
+		   data_size > isize) {
+		/* Do not expose garbage data */
+		data_size = isize;
 	}
 
 	/*
@@ -805,6 +809,24 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
 
 	ret = submit_bio_wait(bio);
 
+	/*
+	 * If the file zone was written underneath the file system, the zone
+	 * write pointer may not be where we expect it to be, but the zone
+	 * append write can still succeed. So check manually that we wrote where
+	 * we intended to, that is, at zi->i_wpoffset.
+	 */
+	if (!ret) {
+		sector_t wpsector =
+			zi->i_zsector + (zi->i_wpoffset >> SECTOR_SHIFT);
+
+		if (bio->bi_iter.bi_sector != wpsector) {
+			zonefs_warn(inode->i_sb,
+				"Corrupted write pointer %llu for zone at %llu\n",
+				wpsector, zi->i_zsector);
+			ret = -EIO;
+		}
+	}
+
 	zonefs_file_write_dio_end_io(iocb, size, ret, 0);
 	trace_zonefs_file_dio_append(inode, size, ret);
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/7] zonefs: Reorganize code
  2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
  2023-01-10 13:08 ` [PATCH 1/7] zonefs: Detect append writes at invalid locations Damien Le Moal
@ 2023-01-10 13:08 ` Damien Le Moal
  2023-01-11 16:12   ` Johannes Thumshirn
  2023-01-10 13:08 ` [PATCH 3/7] zonefs: Simplify IO error handling Damien Le Moal
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

Move all code related to zone file operations from super.c to the new
file.c file. Inode and zone management code remains in super.c.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/Makefile |   2 +-
 fs/zonefs/file.c   | 874 ++++++++++++++++++++++++++++++++++++++++
 fs/zonefs/super.c  | 973 +++------------------------------------------
 fs/zonefs/zonefs.h |  22 +
 4 files changed, 955 insertions(+), 916 deletions(-)
 create mode 100644 fs/zonefs/file.c

diff --git a/fs/zonefs/Makefile b/fs/zonefs/Makefile
index 9fe54f5319f2..645f7229de4a 100644
--- a/fs/zonefs/Makefile
+++ b/fs/zonefs/Makefile
@@ -3,4 +3,4 @@ ccflags-y				+= -I$(src)
 
 obj-$(CONFIG_ZONEFS_FS) += zonefs.o
 
-zonefs-y	:= super.o sysfs.o
+zonefs-y	:= super.o file.o sysfs.o
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
new file mode 100644
index 000000000000..ece0f3959b6d
--- /dev/null
+++ b/fs/zonefs/file.c
@@ -0,0 +1,874 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Simple file system for zoned block devices exposing zones as files.
+ *
+ * Copyright (C) 2022 Western Digital Corporation or its affiliates.
+ */
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/iomap.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/statfs.h>
+#include <linux/writeback.h>
+#include <linux/quotaops.h>
+#include <linux/seq_file.h>
+#include <linux/parser.h>
+#include <linux/uio.h>
+#include <linux/mman.h>
+#include <linux/sched/mm.h>
+#include <linux/task_io_accounting_ops.h>
+
+#include "zonefs.h"
+
+#include "trace.h"
+
+static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
+				   loff_t length, unsigned int flags,
+				   struct iomap *iomap, struct iomap *srcmap)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	loff_t isize;
+
+	/*
+	 * All blocks are always mapped below EOF. If reading past EOF,
+	 * act as if there is a hole up to the file maximum size.
+	 */
+	mutex_lock(&zi->i_truncate_mutex);
+	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->offset = ALIGN_DOWN(offset, sb->s_blocksize);
+	isize = i_size_read(inode);
+	if (iomap->offset >= isize) {
+		iomap->type = IOMAP_HOLE;
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->length = length;
+	} else {
+		iomap->type = IOMAP_MAPPED;
+		iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + iomap->offset;
+		iomap->length = isize - iomap->offset;
+	}
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	trace_zonefs_iomap_begin(inode, iomap);
+
+	return 0;
+}
+
+static const struct iomap_ops zonefs_read_iomap_ops = {
+	.iomap_begin	= zonefs_read_iomap_begin,
+};
+
+static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
+				    loff_t length, unsigned int flags,
+				    struct iomap *iomap, struct iomap *srcmap)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	loff_t isize;
+
+	/* All write I/Os should always be within the file maximum size */
+	if (WARN_ON_ONCE(offset + length > zi->i_max_size))
+		return -EIO;
+
+	/*
+	 * Sequential zones can only accept direct writes. This is already
+	 * checked when writes are issued, so warn if we see a page writeback
+	 * operation.
+	 */
+	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
+			 !(flags & IOMAP_DIRECT)))
+		return -EIO;
+
+	/*
+	 * For conventional zones, all blocks are always mapped. For sequential
+	 * zones, all blocks after always mapped below the inode size (zone
+	 * write pointer) and unwriten beyond.
+	 */
+	mutex_lock(&zi->i_truncate_mutex);
+	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->offset = ALIGN_DOWN(offset, sb->s_blocksize);
+	iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + iomap->offset;
+	isize = i_size_read(inode);
+	if (iomap->offset >= isize) {
+		iomap->type = IOMAP_UNWRITTEN;
+		iomap->length = zi->i_max_size - iomap->offset;
+	} else {
+		iomap->type = IOMAP_MAPPED;
+		iomap->length = isize - iomap->offset;
+	}
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	trace_zonefs_iomap_begin(inode, iomap);
+
+	return 0;
+}
+
+static const struct iomap_ops zonefs_write_iomap_ops = {
+	.iomap_begin	= zonefs_write_iomap_begin,
+};
+
+static int zonefs_read_folio(struct file *unused, struct folio *folio)
+{
+	return iomap_read_folio(folio, &zonefs_read_iomap_ops);
+}
+
+static void zonefs_readahead(struct readahead_control *rac)
+{
+	iomap_readahead(rac, &zonefs_read_iomap_ops);
+}
+
+/*
+ * Map blocks for page writeback. This is used only on conventional zone files,
+ * which implies that the page range can only be within the fixed inode size.
+ */
+static int zonefs_write_map_blocks(struct iomap_writepage_ctx *wpc,
+				   struct inode *inode, loff_t offset)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+
+	if (WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV))
+		return -EIO;
+	if (WARN_ON_ONCE(offset >= i_size_read(inode)))
+		return -EIO;
+
+	/* If the mapping is already OK, nothing needs to be done */
+	if (offset >= wpc->iomap.offset &&
+	    offset < wpc->iomap.offset + wpc->iomap.length)
+		return 0;
+
+	return zonefs_write_iomap_begin(inode, offset, zi->i_max_size - offset,
+					IOMAP_WRITE, &wpc->iomap, NULL);
+}
+
+static const struct iomap_writeback_ops zonefs_writeback_ops = {
+	.map_blocks		= zonefs_write_map_blocks,
+};
+
+static int zonefs_writepages(struct address_space *mapping,
+			     struct writeback_control *wbc)
+{
+	struct iomap_writepage_ctx wpc = { };
+
+	return iomap_writepages(mapping, wbc, &wpc, &zonefs_writeback_ops);
+}
+
+static int zonefs_swap_activate(struct swap_info_struct *sis,
+				struct file *swap_file, sector_t *span)
+{
+	struct inode *inode = file_inode(swap_file);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+
+	if (zi->i_ztype != ZONEFS_ZTYPE_CNV) {
+		zonefs_err(inode->i_sb,
+			   "swap file: not a conventional zone file\n");
+		return -EINVAL;
+	}
+
+	return iomap_swapfile_activate(sis, swap_file, span,
+				       &zonefs_read_iomap_ops);
+}
+
+const struct address_space_operations zonefs_file_aops = {
+	.read_folio		= zonefs_read_folio,
+	.readahead		= zonefs_readahead,
+	.writepages		= zonefs_writepages,
+	.dirty_folio		= filemap_dirty_folio,
+	.release_folio		= iomap_release_folio,
+	.invalidate_folio	= iomap_invalidate_folio,
+	.migrate_folio		= filemap_migrate_folio,
+	.is_partially_uptodate	= iomap_is_partially_uptodate,
+	.error_remove_page	= generic_error_remove_page,
+	.direct_IO		= noop_direct_IO,
+	.swap_activate		= zonefs_swap_activate,
+};
+
+int zonefs_file_truncate(struct inode *inode, loff_t isize)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	loff_t old_isize;
+	enum req_op op;
+	int ret = 0;
+
+	/*
+	 * Only sequential zone files can be truncated and truncation is allowed
+	 * only down to a 0 size, which is equivalent to a zone reset, and to
+	 * the maximum file size, which is equivalent to a zone finish.
+	 */
+	if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
+		return -EPERM;
+
+	if (!isize)
+		op = REQ_OP_ZONE_RESET;
+	else if (isize == zi->i_max_size)
+		op = REQ_OP_ZONE_FINISH;
+	else
+		return -EPERM;
+
+	inode_dio_wait(inode);
+
+	/* Serialize against page faults */
+	filemap_invalidate_lock(inode->i_mapping);
+
+	/* Serialize against zonefs_iomap_begin() */
+	mutex_lock(&zi->i_truncate_mutex);
+
+	old_isize = i_size_read(inode);
+	if (isize == old_isize)
+		goto unlock;
+
+	ret = zonefs_zone_mgmt(inode, op);
+	if (ret)
+		goto unlock;
+
+	/*
+	 * If the mount option ZONEFS_MNTOPT_EXPLICIT_OPEN is set,
+	 * take care of open zones.
+	 */
+	if (zi->i_flags & ZONEFS_ZONE_OPEN) {
+		/*
+		 * Truncating a zone to EMPTY or FULL is the equivalent of
+		 * closing the zone. For a truncation to 0, we need to
+		 * re-open the zone to ensure new writes can be processed.
+		 * For a truncation to the maximum file size, the zone is
+		 * closed and writes cannot be accepted anymore, so clear
+		 * the open flag.
+		 */
+		if (!isize)
+			ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
+		else
+			zi->i_flags &= ~ZONEFS_ZONE_OPEN;
+	}
+
+	zonefs_update_stats(inode, isize);
+	truncate_setsize(inode, isize);
+	zi->i_wpoffset = isize;
+	zonefs_account_active(inode);
+
+unlock:
+	mutex_unlock(&zi->i_truncate_mutex);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return ret;
+}
+
+static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
+			     int datasync)
+{
+	struct inode *inode = file_inode(file);
+	int ret = 0;
+
+	if (unlikely(IS_IMMUTABLE(inode)))
+		return -EPERM;
+
+	/*
+	 * Since only direct writes are allowed in sequential files, page cache
+	 * flush is needed only for conventional zone files.
+	 */
+	if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV)
+		ret = file_write_and_wait_range(file, start, end);
+	if (!ret)
+		ret = blkdev_issue_flush(inode->i_sb->s_bdev);
+
+	if (ret)
+		zonefs_io_error(inode, true);
+
+	return ret;
+}
+
+static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	vm_fault_t ret;
+
+	if (unlikely(IS_IMMUTABLE(inode)))
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * Sanity check: only conventional zone files can have shared
+	 * writeable mappings.
+	 */
+	if (WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV))
+		return VM_FAULT_NOPAGE;
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vmf->vma->vm_file);
+
+	/* Serialize against truncates */
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	ret = iomap_page_mkwrite(vmf, &zonefs_write_iomap_ops);
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+}
+
+static const struct vm_operations_struct zonefs_file_vm_ops = {
+	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= zonefs_filemap_page_mkwrite,
+};
+
+static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/*
+	 * Conventional zones accept random writes, so their files can support
+	 * shared writable mappings. For sequential zone files, only read
+	 * mappings are possible since there are no guarantees for write
+	 * ordering between msync() and page cache writeback.
+	 */
+	if (ZONEFS_I(file_inode(file))->i_ztype == ZONEFS_ZTYPE_SEQ &&
+	    (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		return -EINVAL;
+
+	file_accessed(file);
+	vma->vm_ops = &zonefs_file_vm_ops;
+
+	return 0;
+}
+
+static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence)
+{
+	loff_t isize = i_size_read(file_inode(file));
+
+	/*
+	 * Seeks are limited to below the zone size for conventional zones
+	 * and below the zone write pointer for sequential zones. In both
+	 * cases, this limit is the inode size.
+	 */
+	return generic_file_llseek_size(file, offset, whence, isize, isize);
+}
+
+static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
+					int error, unsigned int flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+
+	if (error) {
+		zonefs_io_error(inode, true);
+		return error;
+	}
+
+	if (size && zi->i_ztype != ZONEFS_ZTYPE_CNV) {
+		/*
+		 * Note that we may be seeing completions out of order,
+		 * but that is not a problem since a write completed
+		 * successfully necessarily means that all preceding writes
+		 * were also successful. So we can safely increase the inode
+		 * size to the write end location.
+		 */
+		mutex_lock(&zi->i_truncate_mutex);
+		if (i_size_read(inode) < iocb->ki_pos + size) {
+			zonefs_update_stats(inode, iocb->ki_pos + size);
+			zonefs_i_size_write(inode, iocb->ki_pos + size);
+		}
+		mutex_unlock(&zi->i_truncate_mutex);
+	}
+
+	return 0;
+}
+
+static const struct iomap_dio_ops zonefs_write_dio_ops = {
+	.end_io			= zonefs_file_write_dio_end_io,
+};
+
+static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	unsigned int max = bdev_max_zone_append_sectors(bdev);
+	struct bio *bio;
+	ssize_t size;
+	int nr_pages;
+	ssize_t ret;
+
+	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
+	iov_iter_truncate(from, max);
+
+	nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
+	if (!nr_pages)
+		return 0;
+
+	bio = bio_alloc(bdev, nr_pages,
+			REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE, GFP_NOFS);
+	bio->bi_iter.bi_sector = zi->i_zsector;
+	bio->bi_ioprio = iocb->ki_ioprio;
+	if (iocb_is_dsync(iocb))
+		bio->bi_opf |= REQ_FUA;
+
+	ret = bio_iov_iter_get_pages(bio, from);
+	if (unlikely(ret))
+		goto out_release;
+
+	size = bio->bi_iter.bi_size;
+	task_io_account_write(size);
+
+	if (iocb->ki_flags & IOCB_HIPRI)
+		bio_set_polled(bio, iocb);
+
+	ret = submit_bio_wait(bio);
+
+	/*
+	 * If the file zone was written underneath the file system, the zone
+	 * write pointer may not be where we expect it to be, but the zone
+	 * append write can still succeed. So check manually that we wrote where
+	 * we intended to, that is, at zi->i_wpoffset.
+	 */
+	if (!ret) {
+		sector_t wpsector =
+			zi->i_zsector + (zi->i_wpoffset >> SECTOR_SHIFT);
+
+		if (bio->bi_iter.bi_sector != wpsector) {
+			zonefs_warn(inode->i_sb,
+				"Corrupted write pointer %llu for zone at %llu\n",
+				wpsector, zi->i_zsector);
+			ret = -EIO;
+		}
+	}
+
+	zonefs_file_write_dio_end_io(iocb, size, ret, 0);
+	trace_zonefs_file_dio_append(inode, size, ret);
+
+out_release:
+	bio_release_pages(bio, false);
+	bio_put(bio);
+
+	if (ret >= 0) {
+		iocb->ki_pos += size;
+		return size;
+	}
+
+	return ret;
+}
+
+/*
+ * Do not exceed the LFS limits nor the file zone size. If pos is under the
+ * limit it becomes a short access. If it exceeds the limit, return -EFBIG.
+ */
+static loff_t zonefs_write_check_limits(struct file *file, loff_t pos,
+					loff_t count)
+{
+	struct inode *inode = file_inode(file);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	loff_t limit = rlimit(RLIMIT_FSIZE);
+	loff_t max_size = zi->i_max_size;
+
+	if (limit != RLIM_INFINITY) {
+		if (pos >= limit) {
+			send_sig(SIGXFSZ, current, 0);
+			return -EFBIG;
+		}
+		count = min(count, limit - pos);
+	}
+
+	if (!(file->f_flags & O_LARGEFILE))
+		max_size = min_t(loff_t, MAX_NON_LFS, max_size);
+
+	if (unlikely(pos >= max_size))
+		return -EFBIG;
+
+	return min(count, max_size - pos);
+}
+
+static ssize_t zonefs_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	loff_t count;
+
+	if (IS_SWAPFILE(inode))
+		return -ETXTBSY;
+
+	if (!iov_iter_count(from))
+		return 0;
+
+	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+		return -EINVAL;
+
+	if (iocb->ki_flags & IOCB_APPEND) {
+		if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
+			return -EINVAL;
+		mutex_lock(&zi->i_truncate_mutex);
+		iocb->ki_pos = zi->i_wpoffset;
+		mutex_unlock(&zi->i_truncate_mutex);
+	}
+
+	count = zonefs_write_check_limits(file, iocb->ki_pos,
+					  iov_iter_count(from));
+	if (count < 0)
+		return count;
+
+	iov_iter_truncate(from, count);
+	return iov_iter_count(from);
+}
+
+/*
+ * Handle direct writes. For sequential zone files, this is the only possible
+ * write path. For these files, check that the user is issuing writes
+ * sequentially from the end of the file. This code assumes that the block layer
+ * delivers write requests to the device in sequential order. This is always the
+ * case if a block IO scheduler implementing the ELEVATOR_F_ZBD_SEQ_WRITE
+ * elevator feature is being used (e.g. mq-deadline). The block layer always
+ * automatically select such an elevator for zoned block devices during the
+ * device initialization.
+ */
+static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	bool sync = is_sync_kiocb(iocb);
+	bool append = false;
+	ssize_t ret, count;
+
+	/*
+	 * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT
+	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
+	 * on the inode lock but the second goes through but is now unaligned).
+	 */
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync &&
+	    (iocb->ki_flags & IOCB_NOWAIT))
+		return -EOPNOTSUPP;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
+	count = zonefs_write_checks(iocb, from);
+	if (count <= 0) {
+		ret = count;
+		goto inode_unlock;
+	}
+
+	if ((iocb->ki_pos | count) & (sb->s_blocksize - 1)) {
+		ret = -EINVAL;
+		goto inode_unlock;
+	}
+
+	/* Enforce sequential writes (append only) in sequential zones */
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+		mutex_lock(&zi->i_truncate_mutex);
+		if (iocb->ki_pos != zi->i_wpoffset) {
+			mutex_unlock(&zi->i_truncate_mutex);
+			ret = -EINVAL;
+			goto inode_unlock;
+		}
+		mutex_unlock(&zi->i_truncate_mutex);
+		append = sync;
+	}
+
+	if (append)
+		ret = zonefs_file_dio_append(iocb, from);
+	else
+		ret = iomap_dio_rw(iocb, from, &zonefs_write_iomap_ops,
+				   &zonefs_write_dio_ops, 0, NULL, 0);
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
+	    (ret > 0 || ret == -EIOCBQUEUED)) {
+		if (ret > 0)
+			count = ret;
+
+		/*
+		 * Update the zone write pointer offset assuming the write
+		 * operation succeeded. If it did not, the error recovery path
+		 * will correct it. Also do active seq file accounting.
+		 */
+		mutex_lock(&zi->i_truncate_mutex);
+		zi->i_wpoffset += count;
+		zonefs_account_active(inode);
+		mutex_unlock(&zi->i_truncate_mutex);
+	}
+
+inode_unlock:
+	inode_unlock(inode);
+
+	return ret;
+}
+
+static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
+					  struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	ssize_t ret;
+
+	/*
+	 * Direct IO writes are mandatory for sequential zone files so that the
+	 * write IO issuing order is preserved.
+	 */
+	if (zi->i_ztype != ZONEFS_ZTYPE_CNV)
+		return -EIO;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
+	ret = zonefs_write_checks(iocb, from);
+	if (ret <= 0)
+		goto inode_unlock;
+
+	ret = iomap_file_buffered_write(iocb, from, &zonefs_write_iomap_ops);
+	if (ret > 0)
+		iocb->ki_pos += ret;
+	else if (ret == -EIO)
+		zonefs_io_error(inode, true);
+
+inode_unlock:
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+
+	return ret;
+}
+
+static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (unlikely(IS_IMMUTABLE(inode)))
+		return -EPERM;
+
+	if (sb_rdonly(inode->i_sb))
+		return -EROFS;
+
+	/* Write operations beyond the zone size are not allowed */
+	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
+		return -EFBIG;
+
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		ssize_t ret = zonefs_file_dio_write(iocb, from);
+
+		if (ret != -ENOTBLK)
+			return ret;
+	}
+
+	return zonefs_file_buffered_write(iocb, from);
+}
+
+static int zonefs_file_read_dio_end_io(struct kiocb *iocb, ssize_t size,
+				       int error, unsigned int flags)
+{
+	if (error) {
+		zonefs_io_error(file_inode(iocb->ki_filp), false);
+		return error;
+	}
+
+	return 0;
+}
+
+static const struct iomap_dio_ops zonefs_read_dio_ops = {
+	.end_io			= zonefs_file_read_dio_end_io,
+};
+
+static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	loff_t isize;
+	ssize_t ret;
+
+	/* Offline zones cannot be read */
+	if (unlikely(IS_IMMUTABLE(inode) && !(inode->i_mode & 0777)))
+		return -EPERM;
+
+	if (iocb->ki_pos >= zi->i_max_size)
+		return 0;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock_shared(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock_shared(inode);
+	}
+
+	/* Limit read operations to written data */
+	mutex_lock(&zi->i_truncate_mutex);
+	isize = i_size_read(inode);
+	if (iocb->ki_pos >= isize) {
+		mutex_unlock(&zi->i_truncate_mutex);
+		ret = 0;
+		goto inode_unlock;
+	}
+	iov_iter_truncate(to, isize - iocb->ki_pos);
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		size_t count = iov_iter_count(to);
+
+		if ((iocb->ki_pos | count) & (sb->s_blocksize - 1)) {
+			ret = -EINVAL;
+			goto inode_unlock;
+		}
+		file_accessed(iocb->ki_filp);
+		ret = iomap_dio_rw(iocb, to, &zonefs_read_iomap_ops,
+				   &zonefs_read_dio_ops, 0, NULL, 0);
+	} else {
+		ret = generic_file_read_iter(iocb, to);
+		if (ret == -EIO)
+			zonefs_io_error(inode, false);
+	}
+
+inode_unlock:
+	inode_unlock_shared(inode);
+
+	return ret;
+}
+
+/*
+ * Write open accounting is done only for sequential files.
+ */
+static inline bool zonefs_seq_file_need_wro(struct inode *inode,
+					    struct file *file)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+
+	if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
+		return false;
+
+	if (!(file->f_mode & FMODE_WRITE))
+		return false;
+
+	return true;
+}
+
+static int zonefs_seq_file_write_open(struct inode *inode)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	int ret = 0;
+
+	mutex_lock(&zi->i_truncate_mutex);
+
+	if (!zi->i_wr_refcnt) {
+		struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
+		unsigned int wro = atomic_inc_return(&sbi->s_wro_seq_files);
+
+		if (sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
+
+			if (sbi->s_max_wro_seq_files
+			    && wro > sbi->s_max_wro_seq_files) {
+				atomic_dec(&sbi->s_wro_seq_files);
+				ret = -EBUSY;
+				goto unlock;
+			}
+
+			if (i_size_read(inode) < zi->i_max_size) {
+				ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
+				if (ret) {
+					atomic_dec(&sbi->s_wro_seq_files);
+					goto unlock;
+				}
+				zi->i_flags |= ZONEFS_ZONE_OPEN;
+				zonefs_account_active(inode);
+			}
+		}
+	}
+
+	zi->i_wr_refcnt++;
+
+unlock:
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	return ret;
+}
+
+static int zonefs_file_open(struct inode *inode, struct file *file)
+{
+	int ret;
+
+	ret = generic_file_open(inode, file);
+	if (ret)
+		return ret;
+
+	if (zonefs_seq_file_need_wro(inode, file))
+		return zonefs_seq_file_write_open(inode);
+
+	return 0;
+}
+
+static void zonefs_seq_file_write_close(struct inode *inode)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	int ret = 0;
+
+	mutex_lock(&zi->i_truncate_mutex);
+
+	zi->i_wr_refcnt--;
+	if (zi->i_wr_refcnt)
+		goto unlock;
+
+	/*
+	 * The file zone may not be open anymore (e.g. the file was truncated to
+	 * its maximum size or it was fully written). For this case, we only
+	 * need to decrement the write open count.
+	 */
+	if (zi->i_flags & ZONEFS_ZONE_OPEN) {
+		ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_CLOSE);
+		if (ret) {
+			__zonefs_io_error(inode, false);
+			/*
+			 * Leaving zones explicitly open may lead to a state
+			 * where most zones cannot be written (zone resources
+			 * exhausted). So take preventive action by remounting
+			 * read-only.
+			 */
+			if (zi->i_flags & ZONEFS_ZONE_OPEN &&
+			    !(sb->s_flags & SB_RDONLY)) {
+				zonefs_warn(sb,
+					"closing zone at %llu failed %d\n",
+					zi->i_zsector, ret);
+				zonefs_warn(sb,
+					"remounting filesystem read-only\n");
+				sb->s_flags |= SB_RDONLY;
+			}
+			goto unlock;
+		}
+
+		zi->i_flags &= ~ZONEFS_ZONE_OPEN;
+		zonefs_account_active(inode);
+	}
+
+	atomic_dec(&sbi->s_wro_seq_files);
+
+unlock:
+	mutex_unlock(&zi->i_truncate_mutex);
+}
+
+static int zonefs_file_release(struct inode *inode, struct file *file)
+{
+	/*
+	 * If we explicitly open a zone we must close it again as well, but the
+	 * zone management operation can fail (either due to an IO error or as
+	 * the zone has gone offline or read-only). Make sure we don't fail the
+	 * close(2) for user-space.
+	 */
+	if (zonefs_seq_file_need_wro(inode, file))
+		zonefs_seq_file_write_close(inode);
+
+	return 0;
+}
+
+const struct file_operations zonefs_file_operations = {
+	.open		= zonefs_file_open,
+	.release	= zonefs_file_release,
+	.fsync		= zonefs_file_fsync,
+	.mmap		= zonefs_file_mmap,
+	.llseek		= zonefs_file_llseek,
+	.read_iter	= zonefs_file_read_iter,
+	.write_iter	= zonefs_file_write_iter,
+	.splice_read	= generic_file_splice_read,
+	.splice_write	= iter_file_splice_write,
+	.iopoll		= iocb_bio_iopoll,
+};
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index a9c5c3f720ad..e808276b8801 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -30,7 +30,7 @@
 /*
  * Manage the active zone count. Called with zi->i_truncate_mutex held.
  */
-static void zonefs_account_active(struct inode *inode)
+void zonefs_account_active(struct inode *inode)
 {
 	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
@@ -68,7 +68,7 @@ static void zonefs_account_active(struct inode *inode)
 	}
 }
 
-static inline int zonefs_zone_mgmt(struct inode *inode, enum req_op op)
+int zonefs_zone_mgmt(struct inode *inode, enum req_op op)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	int ret;
@@ -99,7 +99,7 @@ static inline int zonefs_zone_mgmt(struct inode *inode, enum req_op op)
 	return 0;
 }
 
-static inline void zonefs_i_size_write(struct inode *inode, loff_t isize)
+void zonefs_i_size_write(struct inode *inode, loff_t isize)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 
@@ -117,167 +117,7 @@ static inline void zonefs_i_size_write(struct inode *inode, loff_t isize)
 	}
 }
 
-static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
-				   loff_t length, unsigned int flags,
-				   struct iomap *iomap, struct iomap *srcmap)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	struct super_block *sb = inode->i_sb;
-	loff_t isize;
-
-	/*
-	 * All blocks are always mapped below EOF. If reading past EOF,
-	 * act as if there is a hole up to the file maximum size.
-	 */
-	mutex_lock(&zi->i_truncate_mutex);
-	iomap->bdev = inode->i_sb->s_bdev;
-	iomap->offset = ALIGN_DOWN(offset, sb->s_blocksize);
-	isize = i_size_read(inode);
-	if (iomap->offset >= isize) {
-		iomap->type = IOMAP_HOLE;
-		iomap->addr = IOMAP_NULL_ADDR;
-		iomap->length = length;
-	} else {
-		iomap->type = IOMAP_MAPPED;
-		iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + iomap->offset;
-		iomap->length = isize - iomap->offset;
-	}
-	mutex_unlock(&zi->i_truncate_mutex);
-
-	trace_zonefs_iomap_begin(inode, iomap);
-
-	return 0;
-}
-
-static const struct iomap_ops zonefs_read_iomap_ops = {
-	.iomap_begin	= zonefs_read_iomap_begin,
-};
-
-static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
-				    loff_t length, unsigned int flags,
-				    struct iomap *iomap, struct iomap *srcmap)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	struct super_block *sb = inode->i_sb;
-	loff_t isize;
-
-	/* All write I/Os should always be within the file maximum size */
-	if (WARN_ON_ONCE(offset + length > zi->i_max_size))
-		return -EIO;
-
-	/*
-	 * Sequential zones can only accept direct writes. This is already
-	 * checked when writes are issued, so warn if we see a page writeback
-	 * operation.
-	 */
-	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
-			 !(flags & IOMAP_DIRECT)))
-		return -EIO;
-
-	/*
-	 * For conventional zones, all blocks are always mapped. For sequential
-	 * zones, all blocks after always mapped below the inode size (zone
-	 * write pointer) and unwriten beyond.
-	 */
-	mutex_lock(&zi->i_truncate_mutex);
-	iomap->bdev = inode->i_sb->s_bdev;
-	iomap->offset = ALIGN_DOWN(offset, sb->s_blocksize);
-	iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + iomap->offset;
-	isize = i_size_read(inode);
-	if (iomap->offset >= isize) {
-		iomap->type = IOMAP_UNWRITTEN;
-		iomap->length = zi->i_max_size - iomap->offset;
-	} else {
-		iomap->type = IOMAP_MAPPED;
-		iomap->length = isize - iomap->offset;
-	}
-	mutex_unlock(&zi->i_truncate_mutex);
-
-	trace_zonefs_iomap_begin(inode, iomap);
-
-	return 0;
-}
-
-static const struct iomap_ops zonefs_write_iomap_ops = {
-	.iomap_begin	= zonefs_write_iomap_begin,
-};
-
-static int zonefs_read_folio(struct file *unused, struct folio *folio)
-{
-	return iomap_read_folio(folio, &zonefs_read_iomap_ops);
-}
-
-static void zonefs_readahead(struct readahead_control *rac)
-{
-	iomap_readahead(rac, &zonefs_read_iomap_ops);
-}
-
-/*
- * Map blocks for page writeback. This is used only on conventional zone files,
- * which implies that the page range can only be within the fixed inode size.
- */
-static int zonefs_write_map_blocks(struct iomap_writepage_ctx *wpc,
-				   struct inode *inode, loff_t offset)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-
-	if (WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV))
-		return -EIO;
-	if (WARN_ON_ONCE(offset >= i_size_read(inode)))
-		return -EIO;
-
-	/* If the mapping is already OK, nothing needs to be done */
-	if (offset >= wpc->iomap.offset &&
-	    offset < wpc->iomap.offset + wpc->iomap.length)
-		return 0;
-
-	return zonefs_write_iomap_begin(inode, offset, zi->i_max_size - offset,
-					IOMAP_WRITE, &wpc->iomap, NULL);
-}
-
-static const struct iomap_writeback_ops zonefs_writeback_ops = {
-	.map_blocks		= zonefs_write_map_blocks,
-};
-
-static int zonefs_writepages(struct address_space *mapping,
-			     struct writeback_control *wbc)
-{
-	struct iomap_writepage_ctx wpc = { };
-
-	return iomap_writepages(mapping, wbc, &wpc, &zonefs_writeback_ops);
-}
-
-static int zonefs_swap_activate(struct swap_info_struct *sis,
-				struct file *swap_file, sector_t *span)
-{
-	struct inode *inode = file_inode(swap_file);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-
-	if (zi->i_ztype != ZONEFS_ZTYPE_CNV) {
-		zonefs_err(inode->i_sb,
-			   "swap file: not a conventional zone file\n");
-		return -EINVAL;
-	}
-
-	return iomap_swapfile_activate(sis, swap_file, span,
-				       &zonefs_read_iomap_ops);
-}
-
-static const struct address_space_operations zonefs_file_aops = {
-	.read_folio		= zonefs_read_folio,
-	.readahead		= zonefs_readahead,
-	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
-	.release_folio		= iomap_release_folio,
-	.invalidate_folio	= iomap_invalidate_folio,
-	.migrate_folio		= filemap_migrate_folio,
-	.is_partially_uptodate	= iomap_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
-	.direct_IO		= noop_direct_IO,
-	.swap_activate		= zonefs_swap_activate,
-};
-
-static void zonefs_update_stats(struct inode *inode, loff_t new_isize)
+void zonefs_update_stats(struct inode *inode, loff_t new_isize)
 {
 	struct super_block *sb = inode->i_sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
@@ -487,7 +327,7 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
  * eventually correct the file size and zonefs inode write pointer offset
  * (which can be out of sync with the drive due to partial write failures).
  */
-static void __zonefs_io_error(struct inode *inode, bool write)
+void __zonefs_io_error(struct inode *inode, bool write)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
@@ -526,749 +366,6 @@ static void __zonefs_io_error(struct inode *inode, bool write)
 	memalloc_noio_restore(noio_flag);
 }
 
-static void zonefs_io_error(struct inode *inode, bool write)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-
-	mutex_lock(&zi->i_truncate_mutex);
-	__zonefs_io_error(inode, write);
-	mutex_unlock(&zi->i_truncate_mutex);
-}
-
-static int zonefs_file_truncate(struct inode *inode, loff_t isize)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	loff_t old_isize;
-	enum req_op op;
-	int ret = 0;
-
-	/*
-	 * Only sequential zone files can be truncated and truncation is allowed
-	 * only down to a 0 size, which is equivalent to a zone reset, and to
-	 * the maximum file size, which is equivalent to a zone finish.
-	 */
-	if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
-		return -EPERM;
-
-	if (!isize)
-		op = REQ_OP_ZONE_RESET;
-	else if (isize == zi->i_max_size)
-		op = REQ_OP_ZONE_FINISH;
-	else
-		return -EPERM;
-
-	inode_dio_wait(inode);
-
-	/* Serialize against page faults */
-	filemap_invalidate_lock(inode->i_mapping);
-
-	/* Serialize against zonefs_iomap_begin() */
-	mutex_lock(&zi->i_truncate_mutex);
-
-	old_isize = i_size_read(inode);
-	if (isize == old_isize)
-		goto unlock;
-
-	ret = zonefs_zone_mgmt(inode, op);
-	if (ret)
-		goto unlock;
-
-	/*
-	 * If the mount option ZONEFS_MNTOPT_EXPLICIT_OPEN is set,
-	 * take care of open zones.
-	 */
-	if (zi->i_flags & ZONEFS_ZONE_OPEN) {
-		/*
-		 * Truncating a zone to EMPTY or FULL is the equivalent of
-		 * closing the zone. For a truncation to 0, we need to
-		 * re-open the zone to ensure new writes can be processed.
-		 * For a truncation to the maximum file size, the zone is
-		 * closed and writes cannot be accepted anymore, so clear
-		 * the open flag.
-		 */
-		if (!isize)
-			ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
-		else
-			zi->i_flags &= ~ZONEFS_ZONE_OPEN;
-	}
-
-	zonefs_update_stats(inode, isize);
-	truncate_setsize(inode, isize);
-	zi->i_wpoffset = isize;
-	zonefs_account_active(inode);
-
-unlock:
-	mutex_unlock(&zi->i_truncate_mutex);
-	filemap_invalidate_unlock(inode->i_mapping);
-
-	return ret;
-}
-
-static int zonefs_inode_setattr(struct user_namespace *mnt_userns,
-				struct dentry *dentry, struct iattr *iattr)
-{
-	struct inode *inode = d_inode(dentry);
-	int ret;
-
-	if (unlikely(IS_IMMUTABLE(inode)))
-		return -EPERM;
-
-	ret = setattr_prepare(&init_user_ns, dentry, iattr);
-	if (ret)
-		return ret;
-
-	/*
-	 * Since files and directories cannot be created nor deleted, do not
-	 * allow setting any write attributes on the sub-directories grouping
-	 * files by zone type.
-	 */
-	if ((iattr->ia_valid & ATTR_MODE) && S_ISDIR(inode->i_mode) &&
-	    (iattr->ia_mode & 0222))
-		return -EPERM;
-
-	if (((iattr->ia_valid & ATTR_UID) &&
-	     !uid_eq(iattr->ia_uid, inode->i_uid)) ||
-	    ((iattr->ia_valid & ATTR_GID) &&
-	     !gid_eq(iattr->ia_gid, inode->i_gid))) {
-		ret = dquot_transfer(mnt_userns, inode, iattr);
-		if (ret)
-			return ret;
-	}
-
-	if (iattr->ia_valid & ATTR_SIZE) {
-		ret = zonefs_file_truncate(inode, iattr->ia_size);
-		if (ret)
-			return ret;
-	}
-
-	setattr_copy(&init_user_ns, inode, iattr);
-
-	return 0;
-}
-
-static const struct inode_operations zonefs_file_inode_operations = {
-	.setattr	= zonefs_inode_setattr,
-};
-
-static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
-			     int datasync)
-{
-	struct inode *inode = file_inode(file);
-	int ret = 0;
-
-	if (unlikely(IS_IMMUTABLE(inode)))
-		return -EPERM;
-
-	/*
-	 * Since only direct writes are allowed in sequential files, page cache
-	 * flush is needed only for conventional zone files.
-	 */
-	if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV)
-		ret = file_write_and_wait_range(file, start, end);
-	if (!ret)
-		ret = blkdev_issue_flush(inode->i_sb->s_bdev);
-
-	if (ret)
-		zonefs_io_error(inode, true);
-
-	return ret;
-}
-
-static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
-{
-	struct inode *inode = file_inode(vmf->vma->vm_file);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	vm_fault_t ret;
-
-	if (unlikely(IS_IMMUTABLE(inode)))
-		return VM_FAULT_SIGBUS;
-
-	/*
-	 * Sanity check: only conventional zone files can have shared
-	 * writeable mappings.
-	 */
-	if (WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV))
-		return VM_FAULT_NOPAGE;
-
-	sb_start_pagefault(inode->i_sb);
-	file_update_time(vmf->vma->vm_file);
-
-	/* Serialize against truncates */
-	filemap_invalidate_lock_shared(inode->i_mapping);
-	ret = iomap_page_mkwrite(vmf, &zonefs_write_iomap_ops);
-	filemap_invalidate_unlock_shared(inode->i_mapping);
-
-	sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
-static const struct vm_operations_struct zonefs_file_vm_ops = {
-	.fault		= filemap_fault,
-	.map_pages	= filemap_map_pages,
-	.page_mkwrite	= zonefs_filemap_page_mkwrite,
-};
-
-static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
-{
-	/*
-	 * Conventional zones accept random writes, so their files can support
-	 * shared writable mappings. For sequential zone files, only read
-	 * mappings are possible since there are no guarantees for write
-	 * ordering between msync() and page cache writeback.
-	 */
-	if (ZONEFS_I(file_inode(file))->i_ztype == ZONEFS_ZTYPE_SEQ &&
-	    (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
-		return -EINVAL;
-
-	file_accessed(file);
-	vma->vm_ops = &zonefs_file_vm_ops;
-
-	return 0;
-}
-
-static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence)
-{
-	loff_t isize = i_size_read(file_inode(file));
-
-	/*
-	 * Seeks are limited to below the zone size for conventional zones
-	 * and below the zone write pointer for sequential zones. In both
-	 * cases, this limit is the inode size.
-	 */
-	return generic_file_llseek_size(file, offset, whence, isize, isize);
-}
-
-static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
-					int error, unsigned int flags)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-
-	if (error) {
-		zonefs_io_error(inode, true);
-		return error;
-	}
-
-	if (size && zi->i_ztype != ZONEFS_ZTYPE_CNV) {
-		/*
-		 * Note that we may be seeing completions out of order,
-		 * but that is not a problem since a write completed
-		 * successfully necessarily means that all preceding writes
-		 * were also successful. So we can safely increase the inode
-		 * size to the write end location.
-		 */
-		mutex_lock(&zi->i_truncate_mutex);
-		if (i_size_read(inode) < iocb->ki_pos + size) {
-			zonefs_update_stats(inode, iocb->ki_pos + size);
-			zonefs_i_size_write(inode, iocb->ki_pos + size);
-		}
-		mutex_unlock(&zi->i_truncate_mutex);
-	}
-
-	return 0;
-}
-
-static const struct iomap_dio_ops zonefs_write_dio_ops = {
-	.end_io			= zonefs_file_write_dio_end_io,
-};
-
-static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	struct block_device *bdev = inode->i_sb->s_bdev;
-	unsigned int max = bdev_max_zone_append_sectors(bdev);
-	struct bio *bio;
-	ssize_t size;
-	int nr_pages;
-	ssize_t ret;
-
-	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
-	iov_iter_truncate(from, max);
-
-	nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
-	if (!nr_pages)
-		return 0;
-
-	bio = bio_alloc(bdev, nr_pages,
-			REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE, GFP_NOFS);
-	bio->bi_iter.bi_sector = zi->i_zsector;
-	bio->bi_ioprio = iocb->ki_ioprio;
-	if (iocb_is_dsync(iocb))
-		bio->bi_opf |= REQ_FUA;
-
-	ret = bio_iov_iter_get_pages(bio, from);
-	if (unlikely(ret))
-		goto out_release;
-
-	size = bio->bi_iter.bi_size;
-	task_io_account_write(size);
-
-	if (iocb->ki_flags & IOCB_HIPRI)
-		bio_set_polled(bio, iocb);
-
-	ret = submit_bio_wait(bio);
-
-	/*
-	 * If the file zone was written underneath the file system, the zone
-	 * write pointer may not be where we expect it to be, but the zone
-	 * append write can still succeed. So check manually that we wrote where
-	 * we intended to, that is, at zi->i_wpoffset.
-	 */
-	if (!ret) {
-		sector_t wpsector =
-			zi->i_zsector + (zi->i_wpoffset >> SECTOR_SHIFT);
-
-		if (bio->bi_iter.bi_sector != wpsector) {
-			zonefs_warn(inode->i_sb,
-				"Corrupted write pointer %llu for zone at %llu\n",
-				wpsector, zi->i_zsector);
-			ret = -EIO;
-		}
-	}
-
-	zonefs_file_write_dio_end_io(iocb, size, ret, 0);
-	trace_zonefs_file_dio_append(inode, size, ret);
-
-out_release:
-	bio_release_pages(bio, false);
-	bio_put(bio);
-
-	if (ret >= 0) {
-		iocb->ki_pos += size;
-		return size;
-	}
-
-	return ret;
-}
-
-/*
- * Do not exceed the LFS limits nor the file zone size. If pos is under the
- * limit it becomes a short access. If it exceeds the limit, return -EFBIG.
- */
-static loff_t zonefs_write_check_limits(struct file *file, loff_t pos,
-					loff_t count)
-{
-	struct inode *inode = file_inode(file);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	loff_t limit = rlimit(RLIMIT_FSIZE);
-	loff_t max_size = zi->i_max_size;
-
-	if (limit != RLIM_INFINITY) {
-		if (pos >= limit) {
-			send_sig(SIGXFSZ, current, 0);
-			return -EFBIG;
-		}
-		count = min(count, limit - pos);
-	}
-
-	if (!(file->f_flags & O_LARGEFILE))
-		max_size = min_t(loff_t, MAX_NON_LFS, max_size);
-
-	if (unlikely(pos >= max_size))
-		return -EFBIG;
-
-	return min(count, max_size - pos);
-}
-
-static ssize_t zonefs_write_checks(struct kiocb *iocb, struct iov_iter *from)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file_inode(file);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	loff_t count;
-
-	if (IS_SWAPFILE(inode))
-		return -ETXTBSY;
-
-	if (!iov_iter_count(from))
-		return 0;
-
-	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
-		return -EINVAL;
-
-	if (iocb->ki_flags & IOCB_APPEND) {
-		if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
-			return -EINVAL;
-		mutex_lock(&zi->i_truncate_mutex);
-		iocb->ki_pos = zi->i_wpoffset;
-		mutex_unlock(&zi->i_truncate_mutex);
-	}
-
-	count = zonefs_write_check_limits(file, iocb->ki_pos,
-					  iov_iter_count(from));
-	if (count < 0)
-		return count;
-
-	iov_iter_truncate(from, count);
-	return iov_iter_count(from);
-}
-
-/*
- * Handle direct writes. For sequential zone files, this is the only possible
- * write path. For these files, check that the user is issuing writes
- * sequentially from the end of the file. This code assumes that the block layer
- * delivers write requests to the device in sequential order. This is always the
- * case if a block IO scheduler implementing the ELEVATOR_F_ZBD_SEQ_WRITE
- * elevator feature is being used (e.g. mq-deadline). The block layer always
- * automatically select such an elevator for zoned block devices during the
- * device initialization.
- */
-static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	struct super_block *sb = inode->i_sb;
-	bool sync = is_sync_kiocb(iocb);
-	bool append = false;
-	ssize_t ret, count;
-
-	/*
-	 * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT
-	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
-	 * on the inode lock but the second goes through but is now unaligned).
-	 */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
-		return -EOPNOTSUPP;
-
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!inode_trylock(inode))
-			return -EAGAIN;
-	} else {
-		inode_lock(inode);
-	}
-
-	count = zonefs_write_checks(iocb, from);
-	if (count <= 0) {
-		ret = count;
-		goto inode_unlock;
-	}
-
-	if ((iocb->ki_pos | count) & (sb->s_blocksize - 1)) {
-		ret = -EINVAL;
-		goto inode_unlock;
-	}
-
-	/* Enforce sequential writes (append only) in sequential zones */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
-		mutex_lock(&zi->i_truncate_mutex);
-		if (iocb->ki_pos != zi->i_wpoffset) {
-			mutex_unlock(&zi->i_truncate_mutex);
-			ret = -EINVAL;
-			goto inode_unlock;
-		}
-		mutex_unlock(&zi->i_truncate_mutex);
-		append = sync;
-	}
-
-	if (append)
-		ret = zonefs_file_dio_append(iocb, from);
-	else
-		ret = iomap_dio_rw(iocb, from, &zonefs_write_iomap_ops,
-				   &zonefs_write_dio_ops, 0, NULL, 0);
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
-	    (ret > 0 || ret == -EIOCBQUEUED)) {
-		if (ret > 0)
-			count = ret;
-
-		/*
-		 * Update the zone write pointer offset assuming the write
-		 * operation succeeded. If it did not, the error recovery path
-		 * will correct it. Also do active seq file accounting.
-		 */
-		mutex_lock(&zi->i_truncate_mutex);
-		zi->i_wpoffset += count;
-		zonefs_account_active(inode);
-		mutex_unlock(&zi->i_truncate_mutex);
-	}
-
-inode_unlock:
-	inode_unlock(inode);
-
-	return ret;
-}
-
-static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
-					  struct iov_iter *from)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	ssize_t ret;
-
-	/*
-	 * Direct IO writes are mandatory for sequential zone files so that the
-	 * write IO issuing order is preserved.
-	 */
-	if (zi->i_ztype != ZONEFS_ZTYPE_CNV)
-		return -EIO;
-
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!inode_trylock(inode))
-			return -EAGAIN;
-	} else {
-		inode_lock(inode);
-	}
-
-	ret = zonefs_write_checks(iocb, from);
-	if (ret <= 0)
-		goto inode_unlock;
-
-	ret = iomap_file_buffered_write(iocb, from, &zonefs_write_iomap_ops);
-	if (ret > 0)
-		iocb->ki_pos += ret;
-	else if (ret == -EIO)
-		zonefs_io_error(inode, true);
-
-inode_unlock:
-	inode_unlock(inode);
-	if (ret > 0)
-		ret = generic_write_sync(iocb, ret);
-
-	return ret;
-}
-
-static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-
-	if (unlikely(IS_IMMUTABLE(inode)))
-		return -EPERM;
-
-	if (sb_rdonly(inode->i_sb))
-		return -EROFS;
-
-	/* Write operations beyond the zone size are not allowed */
-	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
-		return -EFBIG;
-
-	if (iocb->ki_flags & IOCB_DIRECT) {
-		ssize_t ret = zonefs_file_dio_write(iocb, from);
-		if (ret != -ENOTBLK)
-			return ret;
-	}
-
-	return zonefs_file_buffered_write(iocb, from);
-}
-
-static int zonefs_file_read_dio_end_io(struct kiocb *iocb, ssize_t size,
-				       int error, unsigned int flags)
-{
-	if (error) {
-		zonefs_io_error(file_inode(iocb->ki_filp), false);
-		return error;
-	}
-
-	return 0;
-}
-
-static const struct iomap_dio_ops zonefs_read_dio_ops = {
-	.end_io			= zonefs_file_read_dio_end_io,
-};
-
-static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	struct super_block *sb = inode->i_sb;
-	loff_t isize;
-	ssize_t ret;
-
-	/* Offline zones cannot be read */
-	if (unlikely(IS_IMMUTABLE(inode) && !(inode->i_mode & 0777)))
-		return -EPERM;
-
-	if (iocb->ki_pos >= zi->i_max_size)
-		return 0;
-
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!inode_trylock_shared(inode))
-			return -EAGAIN;
-	} else {
-		inode_lock_shared(inode);
-	}
-
-	/* Limit read operations to written data */
-	mutex_lock(&zi->i_truncate_mutex);
-	isize = i_size_read(inode);
-	if (iocb->ki_pos >= isize) {
-		mutex_unlock(&zi->i_truncate_mutex);
-		ret = 0;
-		goto inode_unlock;
-	}
-	iov_iter_truncate(to, isize - iocb->ki_pos);
-	mutex_unlock(&zi->i_truncate_mutex);
-
-	if (iocb->ki_flags & IOCB_DIRECT) {
-		size_t count = iov_iter_count(to);
-
-		if ((iocb->ki_pos | count) & (sb->s_blocksize - 1)) {
-			ret = -EINVAL;
-			goto inode_unlock;
-		}
-		file_accessed(iocb->ki_filp);
-		ret = iomap_dio_rw(iocb, to, &zonefs_read_iomap_ops,
-				   &zonefs_read_dio_ops, 0, NULL, 0);
-	} else {
-		ret = generic_file_read_iter(iocb, to);
-		if (ret == -EIO)
-			zonefs_io_error(inode, false);
-	}
-
-inode_unlock:
-	inode_unlock_shared(inode);
-
-	return ret;
-}
-
-/*
- * Write open accounting is done only for sequential files.
- */
-static inline bool zonefs_seq_file_need_wro(struct inode *inode,
-					    struct file *file)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-
-	if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
-		return false;
-
-	if (!(file->f_mode & FMODE_WRITE))
-		return false;
-
-	return true;
-}
-
-static int zonefs_seq_file_write_open(struct inode *inode)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	int ret = 0;
-
-	mutex_lock(&zi->i_truncate_mutex);
-
-	if (!zi->i_wr_refcnt) {
-		struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
-		unsigned int wro = atomic_inc_return(&sbi->s_wro_seq_files);
-
-		if (sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
-
-			if (sbi->s_max_wro_seq_files
-			    && wro > sbi->s_max_wro_seq_files) {
-				atomic_dec(&sbi->s_wro_seq_files);
-				ret = -EBUSY;
-				goto unlock;
-			}
-
-			if (i_size_read(inode) < zi->i_max_size) {
-				ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
-				if (ret) {
-					atomic_dec(&sbi->s_wro_seq_files);
-					goto unlock;
-				}
-				zi->i_flags |= ZONEFS_ZONE_OPEN;
-				zonefs_account_active(inode);
-			}
-		}
-	}
-
-	zi->i_wr_refcnt++;
-
-unlock:
-	mutex_unlock(&zi->i_truncate_mutex);
-
-	return ret;
-}
-
-static int zonefs_file_open(struct inode *inode, struct file *file)
-{
-	int ret;
-
-	ret = generic_file_open(inode, file);
-	if (ret)
-		return ret;
-
-	if (zonefs_seq_file_need_wro(inode, file))
-		return zonefs_seq_file_write_open(inode);
-
-	return 0;
-}
-
-static void zonefs_seq_file_write_close(struct inode *inode)
-{
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	struct super_block *sb = inode->i_sb;
-	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
-	int ret = 0;
-
-	mutex_lock(&zi->i_truncate_mutex);
-
-	zi->i_wr_refcnt--;
-	if (zi->i_wr_refcnt)
-		goto unlock;
-
-	/*
-	 * The file zone may not be open anymore (e.g. the file was truncated to
-	 * its maximum size or it was fully written). For this case, we only
-	 * need to decrement the write open count.
-	 */
-	if (zi->i_flags & ZONEFS_ZONE_OPEN) {
-		ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_CLOSE);
-		if (ret) {
-			__zonefs_io_error(inode, false);
-			/*
-			 * Leaving zones explicitly open may lead to a state
-			 * where most zones cannot be written (zone resources
-			 * exhausted). So take preventive action by remounting
-			 * read-only.
-			 */
-			if (zi->i_flags & ZONEFS_ZONE_OPEN &&
-			    !(sb->s_flags & SB_RDONLY)) {
-				zonefs_warn(sb,
-					"closing zone at %llu failed %d\n",
-					zi->i_zsector, ret);
-				zonefs_warn(sb,
-					"remounting filesystem read-only\n");
-				sb->s_flags |= SB_RDONLY;
-			}
-			goto unlock;
-		}
-
-		zi->i_flags &= ~ZONEFS_ZONE_OPEN;
-		zonefs_account_active(inode);
-	}
-
-	atomic_dec(&sbi->s_wro_seq_files);
-
-unlock:
-	mutex_unlock(&zi->i_truncate_mutex);
-}
-
-static int zonefs_file_release(struct inode *inode, struct file *file)
-{
-	/*
-	 * If we explicitly open a zone we must close it again as well, but the
-	 * zone management operation can fail (either due to an IO error or as
-	 * the zone has gone offline or read-only). Make sure we don't fail the
-	 * close(2) for user-space.
-	 */
-	if (zonefs_seq_file_need_wro(inode, file))
-		zonefs_seq_file_write_close(inode);
-
-	return 0;
-}
-
-static const struct file_operations zonefs_file_operations = {
-	.open		= zonefs_file_open,
-	.release	= zonefs_file_release,
-	.fsync		= zonefs_file_fsync,
-	.mmap		= zonefs_file_mmap,
-	.llseek		= zonefs_file_llseek,
-	.read_iter	= zonefs_file_read_iter,
-	.write_iter	= zonefs_file_write_iter,
-	.splice_read	= generic_file_splice_read,
-	.splice_write	= iter_file_splice_write,
-	.iopoll		= iocb_bio_iopoll,
-};
-
 static struct kmem_cache *zonefs_inode_cachep;
 
 static struct inode *zonefs_alloc_inode(struct super_block *sb)
@@ -1408,13 +505,47 @@ static int zonefs_remount(struct super_block *sb, int *flags, char *data)
 	return zonefs_parse_options(sb, data);
 }
 
-static const struct super_operations zonefs_sops = {
-	.alloc_inode	= zonefs_alloc_inode,
-	.free_inode	= zonefs_free_inode,
-	.statfs		= zonefs_statfs,
-	.remount_fs	= zonefs_remount,
-	.show_options	= zonefs_show_options,
-};
+static int zonefs_inode_setattr(struct user_namespace *mnt_userns,
+				struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = d_inode(dentry);
+	int ret;
+
+	if (unlikely(IS_IMMUTABLE(inode)))
+		return -EPERM;
+
+	ret = setattr_prepare(&init_user_ns, dentry, iattr);
+	if (ret)
+		return ret;
+
+	/*
+	 * Since files and directories cannot be created nor deleted, do not
+	 * allow setting any write attributes on the sub-directories grouping
+	 * files by zone type.
+	 */
+	if ((iattr->ia_valid & ATTR_MODE) && S_ISDIR(inode->i_mode) &&
+	    (iattr->ia_mode & 0222))
+		return -EPERM;
+
+	if (((iattr->ia_valid & ATTR_UID) &&
+	     !uid_eq(iattr->ia_uid, inode->i_uid)) ||
+	    ((iattr->ia_valid & ATTR_GID) &&
+	     !gid_eq(iattr->ia_gid, inode->i_gid))) {
+		ret = dquot_transfer(mnt_userns, inode, iattr);
+		if (ret)
+			return ret;
+	}
+
+	if (iattr->ia_valid & ATTR_SIZE) {
+		ret = zonefs_file_truncate(inode, iattr->ia_size);
+		if (ret)
+			return ret;
+	}
+
+	setattr_copy(&init_user_ns, inode, iattr);
+
+	return 0;
+}
 
 static const struct inode_operations zonefs_dir_inode_operations = {
 	.lookup		= simple_lookup,
@@ -1434,6 +565,10 @@ static void zonefs_init_dir_inode(struct inode *parent, struct inode *inode,
 	inc_nlink(parent);
 }
 
+static const struct inode_operations zonefs_file_inode_operations = {
+	.setattr	= zonefs_inode_setattr,
+};
+
 static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 				  enum zonefs_ztype type)
 {
@@ -1785,6 +920,14 @@ static int zonefs_read_super(struct super_block *sb)
 	return ret;
 }
 
+static const struct super_operations zonefs_sops = {
+	.alloc_inode	= zonefs_alloc_inode,
+	.free_inode	= zonefs_free_inode,
+	.statfs		= zonefs_statfs,
+	.remount_fs	= zonefs_remount,
+	.show_options	= zonefs_show_options,
+};
+
 /*
  * Check that the device is zoned. If it is, get the list of zones and create
  * sub-directories and files according to the device zone configuration and
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 1dbe78119ff1..839ebe9afb6c 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -209,6 +209,28 @@ static inline struct zonefs_sb_info *ZONEFS_SB(struct super_block *sb)
 #define zonefs_warn(sb, format, args...)	\
 	pr_warn("zonefs (%s) WARNING: " format, sb->s_id, ## args)
 
+/* In super.c */
+void zonefs_account_active(struct inode *inode);
+int zonefs_zone_mgmt(struct inode *inode, enum req_op op);
+void zonefs_i_size_write(struct inode *inode, loff_t isize);
+void zonefs_update_stats(struct inode *inode, loff_t new_isize);
+void __zonefs_io_error(struct inode *inode, bool write);
+
+static inline void zonefs_io_error(struct inode *inode, bool write)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+
+	mutex_lock(&zi->i_truncate_mutex);
+	__zonefs_io_error(inode, write);
+	mutex_unlock(&zi->i_truncate_mutex);
+}
+
+/* In file.c */
+extern const struct address_space_operations zonefs_file_aops;
+extern const struct file_operations zonefs_file_operations;
+int zonefs_file_truncate(struct inode *inode, loff_t isize);
+
+/* In sysfs.c */
 int zonefs_sysfs_register(struct super_block *sb);
 void zonefs_sysfs_unregister(struct super_block *sb);
 int zonefs_sysfs_init(void);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/7] zonefs: Simplify IO error handling
  2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
  2023-01-10 13:08 ` [PATCH 1/7] zonefs: Detect append writes at invalid locations Damien Le Moal
  2023-01-10 13:08 ` [PATCH 2/7] zonefs: Reorganize code Damien Le Moal
@ 2023-01-10 13:08 ` Damien Le Moal
  2023-01-13 10:09   ` Johannes Thumshirn
  2023-01-10 13:08 ` [PATCH 4/7] zonefs: Reduce struct zonefs_inode_info size Damien Le Moal
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

Simplify zonefs_check_zone_condition() by moving the code that changes
an inode access rights to the new function zonefs_inode_update_mode().
Furthermore, since on mount an inode wpoffset is always zero when
zonefs_check_zone_condition() is called during an inode initialization,
the "mount" boolean argument is not necessary for the readonly zone
case. This argument is thus removed.

zonefs_io_error_cb() is also modified to use the inode offline and
zone state flags instead of checking the device zone condition. The
multiple calls to zonefs_check_zone_condition() are reduced to the first
call on entry, which allows removing the "warn" argument.
zonefs_inode_update_mode() is also used to update an inode access rights
as zonefs_io_error_cb() modifies the inode flags depending on the volume
error handling mode (defined with a mount option). Since an inode mode
change differs for read-only zones between mount time and IO error time,
the flag ZONEFS_ZONE_INIT_MODE is used to differentiate both cases.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/super.c  | 110 ++++++++++++++++++++++++---------------------
 fs/zonefs/zonefs.h |   9 ++--
 2 files changed, 64 insertions(+), 55 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index e808276b8801..6307cc95be06 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -155,48 +155,31 @@ void zonefs_update_stats(struct inode *inode, loff_t new_isize)
  * amount of readable data in the zone.
  */
 static loff_t zonefs_check_zone_condition(struct inode *inode,
-					  struct blk_zone *zone, bool warn,
-					  bool mount)
+					  struct blk_zone *zone)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 
 	switch (zone->cond) {
 	case BLK_ZONE_COND_OFFLINE:
-		/*
-		 * Dead zone: make the inode immutable, disable all accesses
-		 * and set the file size to 0 (zone wp set to zone start).
-		 */
-		if (warn)
-			zonefs_warn(inode->i_sb, "inode %lu: offline zone\n",
-				    inode->i_ino);
-		inode->i_flags |= S_IMMUTABLE;
-		inode->i_mode &= ~0777;
-		zone->wp = zone->start;
+		zonefs_warn(inode->i_sb, "inode %lu: offline zone\n",
+			    inode->i_ino);
 		zi->i_flags |= ZONEFS_ZONE_OFFLINE;
 		return 0;
 	case BLK_ZONE_COND_READONLY:
 		/*
-		 * The write pointer of read-only zones is invalid. If such a
-		 * zone is found during mount, the file size cannot be retrieved
-		 * so we treat the zone as offline (mount == true case).
-		 * Otherwise, keep the file size as it was when last updated
-		 * so that the user can recover data. In both cases, writes are
-		 * always disabled for the zone.
+		 * The write pointer of read-only zones is invalid, so we cannot
+		 * determine the zone wpoffset (inode size). We thus keep the
+		 * zone wpoffset as is, which leads to an empty file
+		 * (wpoffset == 0) on mount. For a runtime error, this keeps
+		 * the inode size as it was when last updated so that the user
+		 * can recover data.
 		 */
-		if (warn)
-			zonefs_warn(inode->i_sb, "inode %lu: read-only zone\n",
-				    inode->i_ino);
-		inode->i_flags |= S_IMMUTABLE;
-		if (mount) {
-			zone->cond = BLK_ZONE_COND_OFFLINE;
-			inode->i_mode &= ~0777;
-			zone->wp = zone->start;
-			zi->i_flags |= ZONEFS_ZONE_OFFLINE;
-			return 0;
-		}
+		zonefs_warn(inode->i_sb, "inode %lu: read-only zone\n",
+			    inode->i_ino);
 		zi->i_flags |= ZONEFS_ZONE_READONLY;
-		inode->i_mode &= ~0222;
-		return i_size_read(inode);
+		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
+			return zi->i_max_size;
+		return zi->i_wpoffset;
 	case BLK_ZONE_COND_FULL:
 		/* The write pointer of full zones is invalid. */
 		return zi->i_max_size;
@@ -207,6 +190,30 @@ static loff_t zonefs_check_zone_condition(struct inode *inode,
 	}
 }
 
+/*
+ * Check a zone condition and adjust its inode access permissions for
+ * offline and readonly zones.
+ */
+static void zonefs_inode_update_mode(struct inode *inode)
+{
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+
+	if (zi->i_flags & ZONEFS_ZONE_OFFLINE) {
+		/* Offline zones cannot be read nor written */
+		inode->i_flags |= S_IMMUTABLE;
+		inode->i_mode &= ~0777;
+	} else if (zi->i_flags & ZONEFS_ZONE_READONLY) {
+		/* Readonly zones cannot be written */
+		inode->i_flags |= S_IMMUTABLE;
+		if (zi->i_flags & ZONEFS_ZONE_INIT_MODE)
+			inode->i_mode &= ~0777;
+		else
+			inode->i_mode &= ~0222;
+	}
+
+	zi->i_flags &= ~ZONEFS_ZONE_INIT_MODE;
+}
+
 struct zonefs_ioerr_data {
 	struct inode	*inode;
 	bool		write;
@@ -228,10 +235,9 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * as there is no inconsistency between the inode size and the amount of
 	 * data writen in the zone (data_size).
 	 */
-	data_size = zonefs_check_zone_condition(inode, zone, true, false);
+	data_size = zonefs_check_zone_condition(inode, zone);
 	isize = i_size_read(inode);
-	if (zone->cond != BLK_ZONE_COND_OFFLINE &&
-	    zone->cond != BLK_ZONE_COND_READONLY &&
+	if (!(zi->i_flags & (ZONEFS_ZONE_READONLY | ZONEFS_ZONE_OFFLINE)) &&
 	    !err->write && isize == data_size)
 		return 0;
 
@@ -264,24 +270,22 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * zone condition to read-only and offline respectively, as if the
 	 * condition was signaled by the hardware.
 	 */
-	if (zone->cond == BLK_ZONE_COND_OFFLINE ||
-	    sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZOL) {
+	if ((zi->i_flags & ZONEFS_ZONE_OFFLINE) ||
+	    (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZOL)) {
 		zonefs_warn(sb, "inode %lu: read/write access disabled\n",
 			    inode->i_ino);
-		if (zone->cond != BLK_ZONE_COND_OFFLINE) {
-			zone->cond = BLK_ZONE_COND_OFFLINE;
-			data_size = zonefs_check_zone_condition(inode, zone,
-								false, false);
-		}
-	} else if (zone->cond == BLK_ZONE_COND_READONLY ||
-		   sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZRO) {
+		if (!(zi->i_flags & ZONEFS_ZONE_OFFLINE))
+			zi->i_flags |= ZONEFS_ZONE_OFFLINE;
+		zonefs_inode_update_mode(inode);
+		data_size = 0;
+	} else if ((zi->i_flags & ZONEFS_ZONE_READONLY) ||
+		   (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZRO)) {
 		zonefs_warn(sb, "inode %lu: write access disabled\n",
 			    inode->i_ino);
-		if (zone->cond != BLK_ZONE_COND_READONLY) {
-			zone->cond = BLK_ZONE_COND_READONLY;
-			data_size = zonefs_check_zone_condition(inode, zone,
-								false, false);
-		}
+		if (!(zi->i_flags & ZONEFS_ZONE_READONLY))
+			zi->i_flags |= ZONEFS_ZONE_READONLY;
+		zonefs_inode_update_mode(inode);
+		data_size = isize;
 	} else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_RO &&
 		   data_size > isize) {
 		/* Do not expose garbage data */
@@ -295,8 +299,7 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * close of the zone when the inode file is closed.
 	 */
 	if ((sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) &&
-	    (zone->cond == BLK_ZONE_COND_OFFLINE ||
-	     zone->cond == BLK_ZONE_COND_READONLY))
+	    (zi->i_flags & (ZONEFS_ZONE_READONLY | ZONEFS_ZONE_OFFLINE)))
 		zi->i_flags &= ~ZONEFS_ZONE_OPEN;
 
 	/*
@@ -378,6 +381,7 @@ static struct inode *zonefs_alloc_inode(struct super_block *sb)
 
 	inode_init_once(&zi->i_vnode);
 	mutex_init(&zi->i_truncate_mutex);
+	zi->i_wpoffset = 0;
 	zi->i_wr_refcnt = 0;
 	zi->i_flags = 0;
 
@@ -594,7 +598,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 
 	zi->i_max_size = min_t(loff_t, MAX_LFS_FILESIZE,
 			       zone->capacity << SECTOR_SHIFT);
-	zi->i_wpoffset = zonefs_check_zone_condition(inode, zone, true, true);
+	zi->i_wpoffset = zonefs_check_zone_condition(inode, zone);
 
 	inode->i_uid = sbi->s_uid;
 	inode->i_gid = sbi->s_gid;
@@ -605,6 +609,10 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	inode->i_fop = &zonefs_file_operations;
 	inode->i_mapping->a_ops = &zonefs_file_aops;
 
+	/* Update the inode access rights depending on the zone condition */
+	zi->i_flags |= ZONEFS_ZONE_INIT_MODE;
+	zonefs_inode_update_mode(inode);
+
 	sb->s_maxbytes = max(zi->i_max_size, sb->s_maxbytes);
 	sbi->s_blocks += zi->i_max_size >> sb->s_blocksize_bits;
 	sbi->s_used_blocks += zi->i_wpoffset >> sb->s_blocksize_bits;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 839ebe9afb6c..439096445ee5 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -39,10 +39,11 @@ static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
 	return ZONEFS_ZTYPE_SEQ;
 }
 
-#define ZONEFS_ZONE_OPEN	(1U << 0)
-#define ZONEFS_ZONE_ACTIVE	(1U << 1)
-#define ZONEFS_ZONE_OFFLINE	(1U << 2)
-#define ZONEFS_ZONE_READONLY	(1U << 3)
+#define ZONEFS_ZONE_INIT_MODE	(1U << 0)
+#define ZONEFS_ZONE_OPEN	(1U << 1)
+#define ZONEFS_ZONE_ACTIVE	(1U << 2)
+#define ZONEFS_ZONE_OFFLINE	(1U << 3)
+#define ZONEFS_ZONE_READONLY	(1U << 4)
 
 /*
  * In-memory inode data.
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/7] zonefs: Reduce struct zonefs_inode_info size
  2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
                   ` (2 preceding siblings ...)
  2023-01-10 13:08 ` [PATCH 3/7] zonefs: Simplify IO error handling Damien Le Moal
@ 2023-01-10 13:08 ` Damien Le Moal
  2023-01-13 10:11   ` Johannes Thumshirn
  2023-01-10 13:08 ` [PATCH 5/7] zonefs: Separate zone information from inode information Damien Le Moal
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

Instead of using the i_ztype field in struct zonefs_inode_info to
indicate the zone type of an inode, introduce the new inode flag
ZONEFS_ZONE_CNV to be set in the i_flags field of struct
zonefs_inode_info to identify conventional zones. If this flag is not
set, the zone of an inode is considered to be a sequential zone.

The helpers zonefs_zone_is_cnv(), zonefs_zone_is_seq(),
zonefs_inode_is_cnv() and zonefs_inode_is_seq() are introduced to
simplify testing the zone type of a struct zonefs_inode_info and of a
struct inode.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/file.c   | 35 ++++++++++++++---------------------
 fs/zonefs/super.c  | 12 +++++++-----
 fs/zonefs/zonefs.h | 24 +++++++++++++++++++++---
 3 files changed, 42 insertions(+), 29 deletions(-)

diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index ece0f3959b6d..64873d31d75d 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -77,8 +77,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
 	 * checked when writes are issued, so warn if we see a page writeback
 	 * operation.
 	 */
-	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
-			 !(flags & IOMAP_DIRECT)))
+	if (WARN_ON_ONCE(zonefs_zone_is_seq(zi) && !(flags & IOMAP_DIRECT)))
 		return -EIO;
 
 	/*
@@ -128,7 +127,7 @@ static int zonefs_write_map_blocks(struct iomap_writepage_ctx *wpc,
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 
-	if (WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV))
+	if (WARN_ON_ONCE(zonefs_zone_is_seq(zi)))
 		return -EIO;
 	if (WARN_ON_ONCE(offset >= i_size_read(inode)))
 		return -EIO;
@@ -158,9 +157,8 @@ static int zonefs_swap_activate(struct swap_info_struct *sis,
 				struct file *swap_file, sector_t *span)
 {
 	struct inode *inode = file_inode(swap_file);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 
-	if (zi->i_ztype != ZONEFS_ZTYPE_CNV) {
+	if (zonefs_inode_is_seq(inode)) {
 		zonefs_err(inode->i_sb,
 			   "swap file: not a conventional zone file\n");
 		return -EINVAL;
@@ -196,7 +194,7 @@ int zonefs_file_truncate(struct inode *inode, loff_t isize)
 	 * only down to a 0 size, which is equivalent to a zone reset, and to
 	 * the maximum file size, which is equivalent to a zone finish.
 	 */
-	if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
+	if (!zonefs_zone_is_seq(zi))
 		return -EPERM;
 
 	if (!isize)
@@ -266,7 +264,7 @@ static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
 	 * Since only direct writes are allowed in sequential files, page cache
 	 * flush is needed only for conventional zone files.
 	 */
-	if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV)
+	if (zonefs_inode_is_cnv(inode))
 		ret = file_write_and_wait_range(file, start, end);
 	if (!ret)
 		ret = blkdev_issue_flush(inode->i_sb->s_bdev);
@@ -280,7 +278,6 @@ static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
 static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	vm_fault_t ret;
 
 	if (unlikely(IS_IMMUTABLE(inode)))
@@ -290,7 +287,7 @@ static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
 	 * Sanity check: only conventional zone files can have shared
 	 * writeable mappings.
 	 */
-	if (WARN_ON_ONCE(zi->i_ztype != ZONEFS_ZTYPE_CNV))
+	if (zonefs_inode_is_seq(inode))
 		return VM_FAULT_NOPAGE;
 
 	sb_start_pagefault(inode->i_sb);
@@ -319,7 +316,7 @@ static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * mappings are possible since there are no guarantees for write
 	 * ordering between msync() and page cache writeback.
 	 */
-	if (ZONEFS_I(file_inode(file))->i_ztype == ZONEFS_ZTYPE_SEQ &&
+	if (zonefs_inode_is_seq(file_inode(file)) &&
 	    (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
 		return -EINVAL;
 
@@ -352,7 +349,7 @@ static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
 		return error;
 	}
 
-	if (size && zi->i_ztype != ZONEFS_ZTYPE_CNV) {
+	if (size && zonefs_zone_is_seq(zi)) {
 		/*
 		 * Note that we may be seeing completions out of order,
 		 * but that is not a problem since a write completed
@@ -491,7 +488,7 @@ static ssize_t zonefs_write_checks(struct kiocb *iocb, struct iov_iter *from)
 		return -EINVAL;
 
 	if (iocb->ki_flags & IOCB_APPEND) {
-		if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
+		if (zonefs_zone_is_cnv(zi))
 			return -EINVAL;
 		mutex_lock(&zi->i_truncate_mutex);
 		iocb->ki_pos = zi->i_wpoffset;
@@ -531,8 +528,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
 	 * on the inode lock but the second goes through but is now unaligned).
 	 */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if (zonefs_zone_is_seq(zi) && !sync && (iocb->ki_flags & IOCB_NOWAIT))
 		return -EOPNOTSUPP;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
@@ -554,7 +550,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	/* Enforce sequential writes (append only) in sequential zones */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+	if (zonefs_zone_is_seq(zi)) {
 		mutex_lock(&zi->i_truncate_mutex);
 		if (iocb->ki_pos != zi->i_wpoffset) {
 			mutex_unlock(&zi->i_truncate_mutex);
@@ -570,7 +566,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	else
 		ret = iomap_dio_rw(iocb, from, &zonefs_write_iomap_ops,
 				   &zonefs_write_dio_ops, 0, NULL, 0);
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
+	if (zonefs_zone_is_seq(zi) &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
 			count = ret;
@@ -596,14 +592,13 @@ static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
 					  struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	ssize_t ret;
 
 	/*
 	 * Direct IO writes are mandatory for sequential zone files so that the
 	 * write IO issuing order is preserved.
 	 */
-	if (zi->i_ztype != ZONEFS_ZTYPE_CNV)
+	if (zonefs_inode_is_seq(inode))
 		return -EIO;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
@@ -731,9 +726,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 static inline bool zonefs_seq_file_need_wro(struct inode *inode,
 					    struct file *file)
 {
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-
-	if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
+	if (zonefs_inode_is_cnv(inode))
 		return false;
 
 	if (!(file->f_mode & FMODE_WRITE))
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 6307cc95be06..a4af29dc32e7 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -37,7 +37,7 @@ void zonefs_account_active(struct inode *inode)
 
 	lockdep_assert_held(&zi->i_truncate_mutex);
 
-	if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
+	if (zonefs_zone_is_cnv(zi))
 		return;
 
 	/*
@@ -177,14 +177,14 @@ static loff_t zonefs_check_zone_condition(struct inode *inode,
 		zonefs_warn(inode->i_sb, "inode %lu: read-only zone\n",
 			    inode->i_ino);
 		zi->i_flags |= ZONEFS_ZONE_READONLY;
-		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
+		if (zonefs_zone_is_cnv(zi))
 			return zi->i_max_size;
 		return zi->i_wpoffset;
 	case BLK_ZONE_COND_FULL:
 		/* The write pointer of full zones is invalid. */
 		return zi->i_max_size;
 	default:
-		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
+		if (zonefs_zone_is_cnv(zi))
 			return zi->i_max_size;
 		return (zone->wp - zone->start) << SECTOR_SHIFT;
 	}
@@ -260,7 +260,7 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * In all cases, warn about inode size inconsistency and handle the
 	 * IO error according to the zone condition and to the mount options.
 	 */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && isize != data_size)
+	if (zonefs_zone_is_seq(zi) && isize != data_size)
 		zonefs_warn(sb, "inode %lu: invalid size %lld (should be %lld)\n",
 			    inode->i_ino, isize, data_size);
 
@@ -584,7 +584,9 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
 	inode->i_mode = S_IFREG | sbi->s_perm;
 
-	zi->i_ztype = type;
+	if (type == ZONEFS_ZTYPE_CNV)
+		zi->i_flags |= ZONEFS_ZONE_CNV;
+
 	zi->i_zsector = zone->start;
 	zi->i_zone_size = zone->len << SECTOR_SHIFT;
 	if (zi->i_zone_size > bdev_zone_sectors(sb->s_bdev) << SECTOR_SHIFT &&
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 439096445ee5..1a225f74015a 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -44,6 +44,7 @@ static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
 #define ZONEFS_ZONE_ACTIVE	(1U << 2)
 #define ZONEFS_ZONE_OFFLINE	(1U << 3)
 #define ZONEFS_ZONE_READONLY	(1U << 4)
+#define ZONEFS_ZONE_CNV		(1U << 31)
 
 /*
  * In-memory inode data.
@@ -51,9 +52,6 @@ static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
 struct zonefs_inode_info {
 	struct inode		i_vnode;
 
-	/* File zone type */
-	enum zonefs_ztype	i_ztype;
-
 	/* File zone start sector (512B unit) */
 	sector_t		i_zsector;
 
@@ -91,6 +89,26 @@ static inline struct zonefs_inode_info *ZONEFS_I(struct inode *inode)
 	return container_of(inode, struct zonefs_inode_info, i_vnode);
 }
 
+static inline bool zonefs_zone_is_cnv(struct zonefs_inode_info *zi)
+{
+	return zi->i_flags & ZONEFS_ZONE_CNV;
+}
+
+static inline bool zonefs_zone_is_seq(struct zonefs_inode_info *zi)
+{
+	return !zonefs_zone_is_cnv(zi);
+}
+
+static inline bool zonefs_inode_is_cnv(struct inode *inode)
+{
+	return zonefs_zone_is_cnv(ZONEFS_I(inode));
+}
+
+static inline bool zonefs_inode_is_seq(struct inode *inode)
+{
+	return zonefs_zone_is_seq(ZONEFS_I(inode));
+}
+
 /*
  * On-disk super block (block 0).
  */
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/7] zonefs: Separate zone information from inode information
  2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
                   ` (3 preceding siblings ...)
  2023-01-10 13:08 ` [PATCH 4/7] zonefs: Reduce struct zonefs_inode_info size Damien Le Moal
@ 2023-01-10 13:08 ` Damien Le Moal
  2023-01-13 11:30   ` Johannes Thumshirn
  2023-01-10 13:08 ` [PATCH 6/7] zonefs: Dynamically create file inodes when needed Damien Le Moal
  2023-01-10 13:08 ` [PATCH 7/7] zonefs: Cache zone group directory inodes Damien Le Moal
  6 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

In preparation for adding dynamic inode allocation, separate an inode
zone information from the zonefs inode structure. The new data structure
zonefs_zone is introduced to store in memory information about a zone
that must be kept throughout the lifetime of the device mount.

Linking between a zone file inode and its zone information is done by
setting the inode i_private field to point to a struct zonefs_zone.
Using the i_private pointer avoids the need for adding a pointer in
struct zonefs_inode_info. Beside the vfs inode, this structure is
reduced to a mutex and a write open counter.

One struct zonefs_zone is created per file inode on mount. These
structures are organized in an array using the new struct
zonefs_zone_group data structure to represent zone groups. The
zonefs_zone arrays are indexed per file number (the index of a struct
zonefs_zone in its array directly gives the file number/name for that
zone file inode).

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/file.c   |  99 ++++----
 fs/zonefs/super.c  | 571 +++++++++++++++++++++++++++------------------
 fs/zonefs/trace.h  |  20 +-
 fs/zonefs/zonefs.h |  63 +++--
 4 files changed, 449 insertions(+), 304 deletions(-)

diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 64873d31d75d..738b0e28d74b 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -29,6 +29,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
 				   struct iomap *iomap, struct iomap *srcmap)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct super_block *sb = inode->i_sb;
 	loff_t isize;
 
@@ -46,7 +47,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
 		iomap->length = length;
 	} else {
 		iomap->type = IOMAP_MAPPED;
-		iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + iomap->offset;
+		iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
 		iomap->length = isize - iomap->offset;
 	}
 	mutex_unlock(&zi->i_truncate_mutex);
@@ -65,11 +66,12 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
 				    struct iomap *iomap, struct iomap *srcmap)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct super_block *sb = inode->i_sb;
 	loff_t isize;
 
 	/* All write I/Os should always be within the file maximum size */
-	if (WARN_ON_ONCE(offset + length > zi->i_max_size))
+	if (WARN_ON_ONCE(offset + length > z->z_capacity))
 		return -EIO;
 
 	/*
@@ -77,7 +79,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
 	 * checked when writes are issued, so warn if we see a page writeback
 	 * operation.
 	 */
-	if (WARN_ON_ONCE(zonefs_zone_is_seq(zi) && !(flags & IOMAP_DIRECT)))
+	if (WARN_ON_ONCE(zonefs_zone_is_seq(z) && !(flags & IOMAP_DIRECT)))
 		return -EIO;
 
 	/*
@@ -88,11 +90,11 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
 	mutex_lock(&zi->i_truncate_mutex);
 	iomap->bdev = inode->i_sb->s_bdev;
 	iomap->offset = ALIGN_DOWN(offset, sb->s_blocksize);
-	iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + iomap->offset;
+	iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
 	isize = i_size_read(inode);
 	if (iomap->offset >= isize) {
 		iomap->type = IOMAP_UNWRITTEN;
-		iomap->length = zi->i_max_size - iomap->offset;
+		iomap->length = z->z_capacity - iomap->offset;
 	} else {
 		iomap->type = IOMAP_MAPPED;
 		iomap->length = isize - iomap->offset;
@@ -125,9 +127,9 @@ static void zonefs_readahead(struct readahead_control *rac)
 static int zonefs_write_map_blocks(struct iomap_writepage_ctx *wpc,
 				   struct inode *inode, loff_t offset)
 {
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 
-	if (WARN_ON_ONCE(zonefs_zone_is_seq(zi)))
+	if (WARN_ON_ONCE(zonefs_zone_is_seq(z)))
 		return -EIO;
 	if (WARN_ON_ONCE(offset >= i_size_read(inode)))
 		return -EIO;
@@ -137,7 +139,8 @@ static int zonefs_write_map_blocks(struct iomap_writepage_ctx *wpc,
 	    offset < wpc->iomap.offset + wpc->iomap.length)
 		return 0;
 
-	return zonefs_write_iomap_begin(inode, offset, zi->i_max_size - offset,
+	return zonefs_write_iomap_begin(inode, offset,
+					z->z_capacity - offset,
 					IOMAP_WRITE, &wpc->iomap, NULL);
 }
 
@@ -185,6 +188,7 @@ const struct address_space_operations zonefs_file_aops = {
 int zonefs_file_truncate(struct inode *inode, loff_t isize)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	loff_t old_isize;
 	enum req_op op;
 	int ret = 0;
@@ -194,12 +198,12 @@ int zonefs_file_truncate(struct inode *inode, loff_t isize)
 	 * only down to a 0 size, which is equivalent to a zone reset, and to
 	 * the maximum file size, which is equivalent to a zone finish.
 	 */
-	if (!zonefs_zone_is_seq(zi))
+	if (!zonefs_zone_is_seq(z))
 		return -EPERM;
 
 	if (!isize)
 		op = REQ_OP_ZONE_RESET;
-	else if (isize == zi->i_max_size)
+	else if (isize == z->z_capacity)
 		op = REQ_OP_ZONE_FINISH;
 	else
 		return -EPERM;
@@ -216,7 +220,7 @@ int zonefs_file_truncate(struct inode *inode, loff_t isize)
 	if (isize == old_isize)
 		goto unlock;
 
-	ret = zonefs_zone_mgmt(inode, op);
+	ret = zonefs_inode_zone_mgmt(inode, op);
 	if (ret)
 		goto unlock;
 
@@ -224,7 +228,7 @@ int zonefs_file_truncate(struct inode *inode, loff_t isize)
 	 * If the mount option ZONEFS_MNTOPT_EXPLICIT_OPEN is set,
 	 * take care of open zones.
 	 */
-	if (zi->i_flags & ZONEFS_ZONE_OPEN) {
+	if (z->z_flags & ZONEFS_ZONE_OPEN) {
 		/*
 		 * Truncating a zone to EMPTY or FULL is the equivalent of
 		 * closing the zone. For a truncation to 0, we need to
@@ -234,15 +238,15 @@ int zonefs_file_truncate(struct inode *inode, loff_t isize)
 		 * the open flag.
 		 */
 		if (!isize)
-			ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
+			ret = zonefs_inode_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
 		else
-			zi->i_flags &= ~ZONEFS_ZONE_OPEN;
+			z->z_flags &= ~ZONEFS_ZONE_OPEN;
 	}
 
 	zonefs_update_stats(inode, isize);
 	truncate_setsize(inode, isize);
-	zi->i_wpoffset = isize;
-	zonefs_account_active(inode);
+	z->z_wpoffset = isize;
+	zonefs_inode_account_active(inode);
 
 unlock:
 	mutex_unlock(&zi->i_truncate_mutex);
@@ -349,7 +353,7 @@ static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
 		return error;
 	}
 
-	if (size && zonefs_zone_is_seq(zi)) {
+	if (size && zonefs_inode_is_seq(inode)) {
 		/*
 		 * Note that we may be seeing completions out of order,
 		 * but that is not a problem since a write completed
@@ -375,7 +379,7 @@ static const struct iomap_dio_ops zonefs_write_dio_ops = {
 static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct block_device *bdev = inode->i_sb->s_bdev;
 	unsigned int max = bdev_max_zone_append_sectors(bdev);
 	struct bio *bio;
@@ -392,7 +396,7 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
 
 	bio = bio_alloc(bdev, nr_pages,
 			REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE, GFP_NOFS);
-	bio->bi_iter.bi_sector = zi->i_zsector;
+	bio->bi_iter.bi_sector = z->z_sector;
 	bio->bi_ioprio = iocb->ki_ioprio;
 	if (iocb_is_dsync(iocb))
 		bio->bi_opf |= REQ_FUA;
@@ -417,12 +421,12 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
 	 */
 	if (!ret) {
 		sector_t wpsector =
-			zi->i_zsector + (zi->i_wpoffset >> SECTOR_SHIFT);
+			z->z_sector + (z->z_wpoffset >> SECTOR_SHIFT);
 
 		if (bio->bi_iter.bi_sector != wpsector) {
 			zonefs_warn(inode->i_sb,
 				"Corrupted write pointer %llu for zone at %llu\n",
-				wpsector, zi->i_zsector);
+				wpsector, z->z_sector);
 			ret = -EIO;
 		}
 	}
@@ -450,9 +454,9 @@ static loff_t zonefs_write_check_limits(struct file *file, loff_t pos,
 					loff_t count)
 {
 	struct inode *inode = file_inode(file);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	loff_t limit = rlimit(RLIMIT_FSIZE);
-	loff_t max_size = zi->i_max_size;
+	loff_t max_size = z->z_capacity;
 
 	if (limit != RLIM_INFINITY) {
 		if (pos >= limit) {
@@ -476,6 +480,7 @@ static ssize_t zonefs_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	loff_t count;
 
 	if (IS_SWAPFILE(inode))
@@ -488,10 +493,10 @@ static ssize_t zonefs_write_checks(struct kiocb *iocb, struct iov_iter *from)
 		return -EINVAL;
 
 	if (iocb->ki_flags & IOCB_APPEND) {
-		if (zonefs_zone_is_cnv(zi))
+		if (zonefs_zone_is_cnv(z))
 			return -EINVAL;
 		mutex_lock(&zi->i_truncate_mutex);
-		iocb->ki_pos = zi->i_wpoffset;
+		iocb->ki_pos = z->z_wpoffset;
 		mutex_unlock(&zi->i_truncate_mutex);
 	}
 
@@ -518,6 +523,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct super_block *sb = inode->i_sb;
 	bool sync = is_sync_kiocb(iocb);
 	bool append = false;
@@ -528,7 +534,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
 	 * on the inode lock but the second goes through but is now unaligned).
 	 */
-	if (zonefs_zone_is_seq(zi) && !sync && (iocb->ki_flags & IOCB_NOWAIT))
+	if (zonefs_zone_is_seq(z) && !sync && (iocb->ki_flags & IOCB_NOWAIT))
 		return -EOPNOTSUPP;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
@@ -550,9 +556,9 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	/* Enforce sequential writes (append only) in sequential zones */
-	if (zonefs_zone_is_seq(zi)) {
+	if (zonefs_zone_is_seq(z)) {
 		mutex_lock(&zi->i_truncate_mutex);
-		if (iocb->ki_pos != zi->i_wpoffset) {
+		if (iocb->ki_pos != z->z_wpoffset) {
 			mutex_unlock(&zi->i_truncate_mutex);
 			ret = -EINVAL;
 			goto inode_unlock;
@@ -566,7 +572,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	else
 		ret = iomap_dio_rw(iocb, from, &zonefs_write_iomap_ops,
 				   &zonefs_write_dio_ops, 0, NULL, 0);
-	if (zonefs_zone_is_seq(zi) &&
+	if (zonefs_zone_is_seq(z) &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
 			count = ret;
@@ -577,8 +583,8 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		 * will correct it. Also do active seq file accounting.
 		 */
 		mutex_lock(&zi->i_truncate_mutex);
-		zi->i_wpoffset += count;
-		zonefs_account_active(inode);
+		z->z_wpoffset += count;
+		zonefs_inode_account_active(inode);
 		mutex_unlock(&zi->i_truncate_mutex);
 	}
 
@@ -629,6 +635,7 @@ static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
 static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 
 	if (unlikely(IS_IMMUTABLE(inode)))
 		return -EPERM;
@@ -636,8 +643,8 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (sb_rdonly(inode->i_sb))
 		return -EROFS;
 
-	/* Write operations beyond the zone size are not allowed */
-	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
+	/* Write operations beyond the zone capacity are not allowed */
+	if (iocb->ki_pos >= z->z_capacity)
 		return -EFBIG;
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
@@ -669,6 +676,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct super_block *sb = inode->i_sb;
 	loff_t isize;
 	ssize_t ret;
@@ -677,7 +685,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (unlikely(IS_IMMUTABLE(inode) && !(inode->i_mode & 0777)))
 		return -EPERM;
 
-	if (iocb->ki_pos >= zi->i_max_size)
+	if (iocb->ki_pos >= z->z_capacity)
 		return 0;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
@@ -738,6 +746,7 @@ static inline bool zonefs_seq_file_need_wro(struct inode *inode,
 static int zonefs_seq_file_write_open(struct inode *inode)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	int ret = 0;
 
 	mutex_lock(&zi->i_truncate_mutex);
@@ -755,14 +764,15 @@ static int zonefs_seq_file_write_open(struct inode *inode)
 				goto unlock;
 			}
 
-			if (i_size_read(inode) < zi->i_max_size) {
-				ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
+			if (i_size_read(inode) < z->z_capacity) {
+				ret = zonefs_inode_zone_mgmt(inode,
+							     REQ_OP_ZONE_OPEN);
 				if (ret) {
 					atomic_dec(&sbi->s_wro_seq_files);
 					goto unlock;
 				}
-				zi->i_flags |= ZONEFS_ZONE_OPEN;
-				zonefs_account_active(inode);
+				z->z_flags |= ZONEFS_ZONE_OPEN;
+				zonefs_inode_account_active(inode);
 			}
 		}
 	}
@@ -792,6 +802,7 @@ static int zonefs_file_open(struct inode *inode, struct file *file)
 static void zonefs_seq_file_write_close(struct inode *inode)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct super_block *sb = inode->i_sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	int ret = 0;
@@ -807,8 +818,8 @@ static void zonefs_seq_file_write_close(struct inode *inode)
 	 * its maximum size or it was fully written). For this case, we only
 	 * need to decrement the write open count.
 	 */
-	if (zi->i_flags & ZONEFS_ZONE_OPEN) {
-		ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_CLOSE);
+	if (z->z_flags & ZONEFS_ZONE_OPEN) {
+		ret = zonefs_inode_zone_mgmt(inode, REQ_OP_ZONE_CLOSE);
 		if (ret) {
 			__zonefs_io_error(inode, false);
 			/*
@@ -817,11 +828,11 @@ static void zonefs_seq_file_write_close(struct inode *inode)
 			 * exhausted). So take preventive action by remounting
 			 * read-only.
 			 */
-			if (zi->i_flags & ZONEFS_ZONE_OPEN &&
+			if (z->z_flags & ZONEFS_ZONE_OPEN &&
 			    !(sb->s_flags & SB_RDONLY)) {
 				zonefs_warn(sb,
 					"closing zone at %llu failed %d\n",
-					zi->i_zsector, ret);
+					z->z_sector, ret);
 				zonefs_warn(sb,
 					"remounting filesystem read-only\n");
 				sb->s_flags |= SB_RDONLY;
@@ -829,8 +840,8 @@ static void zonefs_seq_file_write_close(struct inode *inode)
 			goto unlock;
 		}
 
-		zi->i_flags &= ~ZONEFS_ZONE_OPEN;
-		zonefs_account_active(inode);
+		z->z_flags &= ~ZONEFS_ZONE_OPEN;
+		zonefs_inode_account_active(inode);
 	}
 
 	atomic_dec(&sbi->s_wro_seq_files);
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index a4af29dc32e7..270ded209dde 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -28,33 +28,47 @@
 #include "trace.h"
 
 /*
- * Manage the active zone count. Called with zi->i_truncate_mutex held.
+ * Get the name of a zone group directory.
  */
-void zonefs_account_active(struct inode *inode)
+static const char *zonefs_zgroup_name(enum zonefs_ztype ztype)
 {
-	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	switch (ztype) {
+	case ZONEFS_ZTYPE_CNV:
+		return "cnv";
+	case ZONEFS_ZTYPE_SEQ:
+		return "seq";
+	default:
+		WARN_ON_ONCE(1);
+		return "???";
+	}
+}
 
-	lockdep_assert_held(&zi->i_truncate_mutex);
+/*
+ * Manage the active zone count.
+ */
+static void zonefs_account_active(struct super_block *sb,
+				  struct zonefs_zone *z)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 
-	if (zonefs_zone_is_cnv(zi))
+	if (zonefs_zone_is_cnv(z))
 		return;
 
 	/*
 	 * For zones that transitioned to the offline or readonly condition,
 	 * we only need to clear the active state.
 	 */
-	if (zi->i_flags & (ZONEFS_ZONE_OFFLINE | ZONEFS_ZONE_READONLY))
+	if (z->z_flags & (ZONEFS_ZONE_OFFLINE | ZONEFS_ZONE_READONLY))
 		goto out;
 
 	/*
 	 * If the zone is active, that is, if it is explicitly open or
 	 * partially written, check if it was already accounted as active.
 	 */
-	if ((zi->i_flags & ZONEFS_ZONE_OPEN) ||
-	    (zi->i_wpoffset > 0 && zi->i_wpoffset < zi->i_max_size)) {
-		if (!(zi->i_flags & ZONEFS_ZONE_ACTIVE)) {
-			zi->i_flags |= ZONEFS_ZONE_ACTIVE;
+	if ((z->z_flags & ZONEFS_ZONE_OPEN) ||
+	    (z->z_wpoffset > 0 && z->z_wpoffset < z->z_capacity)) {
+		if (!(z->z_flags & ZONEFS_ZONE_ACTIVE)) {
+			z->z_flags |= ZONEFS_ZONE_ACTIVE;
 			atomic_inc(&sbi->s_active_seq_files);
 		}
 		return;
@@ -62,18 +76,29 @@ void zonefs_account_active(struct inode *inode)
 
 out:
 	/* The zone is not active. If it was, update the active count */
-	if (zi->i_flags & ZONEFS_ZONE_ACTIVE) {
-		zi->i_flags &= ~ZONEFS_ZONE_ACTIVE;
+	if (z->z_flags & ZONEFS_ZONE_ACTIVE) {
+		z->z_flags &= ~ZONEFS_ZONE_ACTIVE;
 		atomic_dec(&sbi->s_active_seq_files);
 	}
 }
 
-int zonefs_zone_mgmt(struct inode *inode, enum req_op op)
+/*
+ * Manage the active zone count. Called with zi->i_truncate_mutex held.
+ */
+void zonefs_inode_account_active(struct inode *inode)
 {
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	int ret;
+	lockdep_assert_held(&ZONEFS_I(inode)->i_truncate_mutex);
 
-	lockdep_assert_held(&zi->i_truncate_mutex);
+	return zonefs_account_active(inode->i_sb, zonefs_inode_zone(inode));
+}
+
+/*
+ * Execute a zone management operation.
+ */
+static int zonefs_zone_mgmt(struct super_block *sb,
+			    struct zonefs_zone *z, enum req_op op)
+{
+	int ret;
 
 	/*
 	 * With ZNS drives, closing an explicitly open zone that has not been
@@ -83,37 +108,45 @@ int zonefs_zone_mgmt(struct inode *inode, enum req_op op)
 	 * are exceeded, make sure that the zone does not remain active by
 	 * resetting it.
 	 */
-	if (op == REQ_OP_ZONE_CLOSE && !zi->i_wpoffset)
+	if (op == REQ_OP_ZONE_CLOSE && !z->z_wpoffset)
 		op = REQ_OP_ZONE_RESET;
 
-	trace_zonefs_zone_mgmt(inode, op);
-	ret = blkdev_zone_mgmt(inode->i_sb->s_bdev, op, zi->i_zsector,
-			       zi->i_zone_size >> SECTOR_SHIFT, GFP_NOFS);
+	trace_zonefs_zone_mgmt(sb, z, op);
+	ret = blkdev_zone_mgmt(sb->s_bdev, op, z->z_sector,
+			       z->z_size >> SECTOR_SHIFT, GFP_NOFS);
 	if (ret) {
-		zonefs_err(inode->i_sb,
+		zonefs_err(sb,
 			   "Zone management operation %s at %llu failed %d\n",
-			   blk_op_str(op), zi->i_zsector, ret);
+			   blk_op_str(op), z->z_sector, ret);
 		return ret;
 	}
 
 	return 0;
 }
 
+int zonefs_inode_zone_mgmt(struct inode *inode, enum req_op op)
+{
+	lockdep_assert_held(&ZONEFS_I(inode)->i_truncate_mutex);
+
+	return zonefs_zone_mgmt(inode->i_sb, zonefs_inode_zone(inode), op);
+}
+
 void zonefs_i_size_write(struct inode *inode, loff_t isize)
 {
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 
 	i_size_write(inode, isize);
+
 	/*
 	 * A full zone is no longer open/active and does not need
 	 * explicit closing.
 	 */
-	if (isize >= zi->i_max_size) {
+	if (isize >= z->z_capacity) {
 		struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
 
-		if (zi->i_flags & ZONEFS_ZONE_ACTIVE)
+		if (z->z_flags & ZONEFS_ZONE_ACTIVE)
 			atomic_dec(&sbi->s_active_seq_files);
-		zi->i_flags &= ~(ZONEFS_ZONE_OPEN | ZONEFS_ZONE_ACTIVE);
+		z->z_flags &= ~(ZONEFS_ZONE_OPEN | ZONEFS_ZONE_ACTIVE);
 	}
 }
 
@@ -150,20 +183,18 @@ void zonefs_update_stats(struct inode *inode, loff_t new_isize)
 }
 
 /*
- * Check a zone condition and adjust its file inode access permissions for
- * offline and readonly zones. Return the inode size corresponding to the
- * amount of readable data in the zone.
+ * Check a zone condition. Return the amount of written (and still readable)
+ * data in the zone.
  */
-static loff_t zonefs_check_zone_condition(struct inode *inode,
+static loff_t zonefs_check_zone_condition(struct super_block *sb,
+					  struct zonefs_zone *z,
 					  struct blk_zone *zone)
 {
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-
 	switch (zone->cond) {
 	case BLK_ZONE_COND_OFFLINE:
-		zonefs_warn(inode->i_sb, "inode %lu: offline zone\n",
-			    inode->i_ino);
-		zi->i_flags |= ZONEFS_ZONE_OFFLINE;
+		zonefs_warn(sb, "Zone %llu: offline zone\n",
+			    z->z_sector);
+		z->z_flags |= ZONEFS_ZONE_OFFLINE;
 		return 0;
 	case BLK_ZONE_COND_READONLY:
 		/*
@@ -174,18 +205,18 @@ static loff_t zonefs_check_zone_condition(struct inode *inode,
 		 * the inode size as it was when last updated so that the user
 		 * can recover data.
 		 */
-		zonefs_warn(inode->i_sb, "inode %lu: read-only zone\n",
-			    inode->i_ino);
-		zi->i_flags |= ZONEFS_ZONE_READONLY;
-		if (zonefs_zone_is_cnv(zi))
-			return zi->i_max_size;
-		return zi->i_wpoffset;
+		zonefs_warn(sb, "Zone %llu: read-only zone\n",
+			    z->z_sector);
+		z->z_flags |= ZONEFS_ZONE_READONLY;
+		if (zonefs_zone_is_cnv(z))
+			return z->z_capacity;
+		return z->z_wpoffset;
 	case BLK_ZONE_COND_FULL:
 		/* The write pointer of full zones is invalid. */
-		return zi->i_max_size;
+		return z->z_capacity;
 	default:
-		if (zonefs_zone_is_cnv(zi))
-			return zi->i_max_size;
+		if (zonefs_zone_is_cnv(z))
+			return z->z_capacity;
 		return (zone->wp - zone->start) << SECTOR_SHIFT;
 	}
 }
@@ -196,22 +227,22 @@ static loff_t zonefs_check_zone_condition(struct inode *inode,
  */
 static void zonefs_inode_update_mode(struct inode *inode)
 {
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 
-	if (zi->i_flags & ZONEFS_ZONE_OFFLINE) {
+	if (z->z_flags & ZONEFS_ZONE_OFFLINE) {
 		/* Offline zones cannot be read nor written */
 		inode->i_flags |= S_IMMUTABLE;
 		inode->i_mode &= ~0777;
-	} else if (zi->i_flags & ZONEFS_ZONE_READONLY) {
+	} else if (z->z_flags & ZONEFS_ZONE_READONLY) {
 		/* Readonly zones cannot be written */
 		inode->i_flags |= S_IMMUTABLE;
-		if (zi->i_flags & ZONEFS_ZONE_INIT_MODE)
+		if (z->z_flags & ZONEFS_ZONE_INIT_MODE)
 			inode->i_mode &= ~0777;
 		else
 			inode->i_mode &= ~0222;
 	}
 
-	zi->i_flags &= ~ZONEFS_ZONE_INIT_MODE;
+	z->z_flags &= ~ZONEFS_ZONE_INIT_MODE;
 }
 
 struct zonefs_ioerr_data {
@@ -224,7 +255,7 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 {
 	struct zonefs_ioerr_data *err = data;
 	struct inode *inode = err->inode;
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct super_block *sb = inode->i_sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	loff_t isize, data_size;
@@ -235,9 +266,9 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * as there is no inconsistency between the inode size and the amount of
 	 * data writen in the zone (data_size).
 	 */
-	data_size = zonefs_check_zone_condition(inode, zone);
+	data_size = zonefs_check_zone_condition(sb, z, zone);
 	isize = i_size_read(inode);
-	if (!(zi->i_flags & (ZONEFS_ZONE_READONLY | ZONEFS_ZONE_OFFLINE)) &&
+	if (!(z->z_flags & (ZONEFS_ZONE_READONLY | ZONEFS_ZONE_OFFLINE)) &&
 	    !err->write && isize == data_size)
 		return 0;
 
@@ -260,8 +291,9 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * In all cases, warn about inode size inconsistency and handle the
 	 * IO error according to the zone condition and to the mount options.
 	 */
-	if (zonefs_zone_is_seq(zi) && isize != data_size)
-		zonefs_warn(sb, "inode %lu: invalid size %lld (should be %lld)\n",
+	if (zonefs_zone_is_seq(z) && isize != data_size)
+		zonefs_warn(sb,
+			    "inode %lu: invalid size %lld (should be %lld)\n",
 			    inode->i_ino, isize, data_size);
 
 	/*
@@ -270,20 +302,20 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * zone condition to read-only and offline respectively, as if the
 	 * condition was signaled by the hardware.
 	 */
-	if ((zi->i_flags & ZONEFS_ZONE_OFFLINE) ||
+	if ((z->z_flags & ZONEFS_ZONE_OFFLINE) ||
 	    (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZOL)) {
 		zonefs_warn(sb, "inode %lu: read/write access disabled\n",
 			    inode->i_ino);
-		if (!(zi->i_flags & ZONEFS_ZONE_OFFLINE))
-			zi->i_flags |= ZONEFS_ZONE_OFFLINE;
+		if (!(z->z_flags & ZONEFS_ZONE_OFFLINE))
+			z->z_flags |= ZONEFS_ZONE_OFFLINE;
 		zonefs_inode_update_mode(inode);
 		data_size = 0;
-	} else if ((zi->i_flags & ZONEFS_ZONE_READONLY) ||
+	} else if ((z->z_flags & ZONEFS_ZONE_READONLY) ||
 		   (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_ZRO)) {
 		zonefs_warn(sb, "inode %lu: write access disabled\n",
 			    inode->i_ino);
-		if (!(zi->i_flags & ZONEFS_ZONE_READONLY))
-			zi->i_flags |= ZONEFS_ZONE_READONLY;
+		if (!(z->z_flags & ZONEFS_ZONE_READONLY))
+			z->z_flags |= ZONEFS_ZONE_READONLY;
 		zonefs_inode_update_mode(inode);
 		data_size = isize;
 	} else if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_RO &&
@@ -299,8 +331,8 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 * close of the zone when the inode file is closed.
 	 */
 	if ((sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) &&
-	    (zi->i_flags & (ZONEFS_ZONE_READONLY | ZONEFS_ZONE_OFFLINE)))
-		zi->i_flags &= ~ZONEFS_ZONE_OPEN;
+	    (z->z_flags & (ZONEFS_ZONE_READONLY | ZONEFS_ZONE_OFFLINE)))
+		z->z_flags &= ~ZONEFS_ZONE_OPEN;
 
 	/*
 	 * If error=remount-ro was specified, any error result in remounting
@@ -317,8 +349,8 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
 	 */
 	zonefs_update_stats(inode, data_size);
 	zonefs_i_size_write(inode, data_size);
-	zi->i_wpoffset = data_size;
-	zonefs_account_active(inode);
+	z->z_wpoffset = data_size;
+	zonefs_inode_account_active(inode);
 
 	return 0;
 }
@@ -332,7 +364,7 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
  */
 void __zonefs_io_error(struct inode *inode, bool write)
 {
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
 	struct super_block *sb = inode->i_sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	unsigned int noio_flag;
@@ -348,8 +380,8 @@ void __zonefs_io_error(struct inode *inode, bool write)
 	 * files with aggregated conventional zones, for which the inode zone
 	 * size is always larger than the device zone size.
 	 */
-	if (zi->i_zone_size > bdev_zone_sectors(sb->s_bdev))
-		nr_zones = zi->i_zone_size >>
+	if (z->z_size > bdev_zone_sectors(sb->s_bdev))
+		nr_zones = z->z_size >>
 			(sbi->s_zone_sectors_shift + SECTOR_SHIFT);
 
 	/*
@@ -361,7 +393,7 @@ void __zonefs_io_error(struct inode *inode, bool write)
 	 * the GFP_NOIO context avoids both problems.
 	 */
 	noio_flag = memalloc_noio_save();
-	ret = blkdev_report_zones(sb->s_bdev, zi->i_zsector, nr_zones,
+	ret = blkdev_report_zones(sb->s_bdev, z->z_sector, nr_zones,
 				  zonefs_io_error_cb, &err);
 	if (ret != nr_zones)
 		zonefs_err(sb, "Get inode %lu zone information failed %d\n",
@@ -381,9 +413,7 @@ static struct inode *zonefs_alloc_inode(struct super_block *sb)
 
 	inode_init_once(&zi->i_vnode);
 	mutex_init(&zi->i_truncate_mutex);
-	zi->i_wpoffset = 0;
 	zi->i_wr_refcnt = 0;
-	zi->i_flags = 0;
 
 	return &zi->i_vnode;
 }
@@ -416,8 +446,8 @@ static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_bavail = buf->f_bfree;
 
 	for (t = 0; t < ZONEFS_ZTYPE_MAX; t++) {
-		if (sbi->s_nr_files[t])
-			buf->f_files += sbi->s_nr_files[t] + 1;
+		if (sbi->s_zgroup[t].g_nr_zones)
+			buf->f_files += sbi->s_zgroup[t].g_nr_zones + 1;
 	}
 	buf->f_ffree = 0;
 
@@ -557,11 +587,11 @@ static const struct inode_operations zonefs_dir_inode_operations = {
 };
 
 static void zonefs_init_dir_inode(struct inode *parent, struct inode *inode,
-				  enum zonefs_ztype type)
+				  enum zonefs_ztype ztype)
 {
 	struct super_block *sb = parent->i_sb;
 
-	inode->i_ino = bdev_nr_zones(sb->s_bdev) + type + 1;
+	inode->i_ino = bdev_nr_zones(sb->s_bdev) + ztype + 1;
 	inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
 	inode->i_op = &zonefs_dir_inode_operations;
 	inode->i_fop = &simple_dir_operations;
@@ -573,79 +603,34 @@ static const struct inode_operations zonefs_file_inode_operations = {
 	.setattr	= zonefs_inode_setattr,
 };
 
-static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
-				  enum zonefs_ztype type)
+static void zonefs_init_file_inode(struct inode *inode,
+				   struct zonefs_zone *z)
 {
 	struct super_block *sb = inode->i_sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
-	struct zonefs_inode_info *zi = ZONEFS_I(inode);
-	int ret = 0;
-
-	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
-	inode->i_mode = S_IFREG | sbi->s_perm;
 
-	if (type == ZONEFS_ZTYPE_CNV)
-		zi->i_flags |= ZONEFS_ZONE_CNV;
-
-	zi->i_zsector = zone->start;
-	zi->i_zone_size = zone->len << SECTOR_SHIFT;
-	if (zi->i_zone_size > bdev_zone_sectors(sb->s_bdev) << SECTOR_SHIFT &&
-	    !(sbi->s_features & ZONEFS_F_AGGRCNV)) {
-		zonefs_err(sb,
-			   "zone size %llu doesn't match device's zone sectors %llu\n",
-			   zi->i_zone_size,
-			   bdev_zone_sectors(sb->s_bdev) << SECTOR_SHIFT);
-		return -EINVAL;
-	}
-
-	zi->i_max_size = min_t(loff_t, MAX_LFS_FILESIZE,
-			       zone->capacity << SECTOR_SHIFT);
-	zi->i_wpoffset = zonefs_check_zone_condition(inode, zone);
+	inode->i_private = z;
 
+	inode->i_ino = z->z_sector >> sbi->s_zone_sectors_shift;
+	inode->i_mode = S_IFREG | sbi->s_perm;
 	inode->i_uid = sbi->s_uid;
 	inode->i_gid = sbi->s_gid;
-	inode->i_size = zi->i_wpoffset;
-	inode->i_blocks = zi->i_max_size >> SECTOR_SHIFT;
+	inode->i_size = z->z_wpoffset;
+	inode->i_blocks = z->z_capacity >> SECTOR_SHIFT;
 
 	inode->i_op = &zonefs_file_inode_operations;
 	inode->i_fop = &zonefs_file_operations;
 	inode->i_mapping->a_ops = &zonefs_file_aops;
 
 	/* Update the inode access rights depending on the zone condition */
-	zi->i_flags |= ZONEFS_ZONE_INIT_MODE;
+	z->z_flags |= ZONEFS_ZONE_INIT_MODE;
 	zonefs_inode_update_mode(inode);
-
-	sb->s_maxbytes = max(zi->i_max_size, sb->s_maxbytes);
-	sbi->s_blocks += zi->i_max_size >> sb->s_blocksize_bits;
-	sbi->s_used_blocks += zi->i_wpoffset >> sb->s_blocksize_bits;
-
-	mutex_lock(&zi->i_truncate_mutex);
-
-	/*
-	 * For sequential zones, make sure that any open zone is closed first
-	 * to ensure that the initial number of open zones is 0, in sync with
-	 * the open zone accounting done when the mount option
-	 * ZONEFS_MNTOPT_EXPLICIT_OPEN is used.
-	 */
-	if (type == ZONEFS_ZTYPE_SEQ &&
-	    (zone->cond == BLK_ZONE_COND_IMP_OPEN ||
-	     zone->cond == BLK_ZONE_COND_EXP_OPEN)) {
-		ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_CLOSE);
-		if (ret)
-			goto unlock;
-	}
-
-	zonefs_account_active(inode);
-
-unlock:
-	mutex_unlock(&zi->i_truncate_mutex);
-
-	return ret;
 }
 
 static struct dentry *zonefs_create_inode(struct dentry *parent,
-					const char *name, struct blk_zone *zone,
-					enum zonefs_ztype type)
+					  const char *name,
+					  struct zonefs_zone *z,
+					  enum zonefs_ztype ztype)
 {
 	struct inode *dir = d_inode(parent);
 	struct dentry *dentry;
@@ -661,15 +646,10 @@ static struct dentry *zonefs_create_inode(struct dentry *parent,
 		goto dput;
 
 	inode->i_ctime = inode->i_mtime = inode->i_atime = dir->i_ctime;
-	if (zone) {
-		ret = zonefs_init_file_inode(inode, zone, type);
-		if (ret) {
-			iput(inode);
-			goto dput;
-		}
-	} else {
-		zonefs_init_dir_inode(dir, inode, type);
-	}
+	if (z)
+		zonefs_init_file_inode(inode, z);
+	else
+		zonefs_init_dir_inode(dir, inode, ztype);
 
 	d_add(dentry, inode);
 	dir->i_size++;
@@ -685,100 +665,51 @@ static struct dentry *zonefs_create_inode(struct dentry *parent,
 struct zonefs_zone_data {
 	struct super_block	*sb;
 	unsigned int		nr_zones[ZONEFS_ZTYPE_MAX];
+	sector_t		cnv_zone_start;
 	struct blk_zone		*zones;
 };
 
 /*
- * Create a zone group and populate it with zone files.
+ * Create the inodes for a zone group.
  */
-static int zonefs_create_zgroup(struct zonefs_zone_data *zd,
-				enum zonefs_ztype type)
+static int zonefs_create_zgroup_inodes(struct super_block *sb,
+				       enum zonefs_ztype ztype)
 {
-	struct super_block *sb = zd->sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
-	struct blk_zone *zone, *next, *end;
-	const char *zgroup_name;
-	char *file_name;
+	struct zonefs_zone_group *zgroup = &sbi->s_zgroup[ztype];
 	struct dentry *dir, *dent;
-	unsigned int n = 0;
-	int ret;
+	char *file_name;
+	int i, ret = 0;
+
+	if (!zgroup)
+		return -ENOMEM;
 
 	/* If the group is empty, there is nothing to do */
-	if (!zd->nr_zones[type])
+	if (!zgroup->g_nr_zones)
 		return 0;
 
 	file_name = kmalloc(ZONEFS_NAME_MAX, GFP_KERNEL);
 	if (!file_name)
 		return -ENOMEM;
 
-	if (type == ZONEFS_ZTYPE_CNV)
-		zgroup_name = "cnv";
-	else
-		zgroup_name = "seq";
-
-	dir = zonefs_create_inode(sb->s_root, zgroup_name, NULL, type);
+	dir = zonefs_create_inode(sb->s_root, zonefs_zgroup_name(ztype),
+				  NULL, ztype);
 	if (IS_ERR(dir)) {
 		ret = PTR_ERR(dir);
 		goto free;
 	}
 
-	/*
-	 * The first zone contains the super block: skip it.
-	 */
-	end = zd->zones + bdev_nr_zones(sb->s_bdev);
-	for (zone = &zd->zones[1]; zone < end; zone = next) {
-
-		next = zone + 1;
-		if (zonefs_zone_type(zone) != type)
-			continue;
-
-		/*
-		 * For conventional zones, contiguous zones can be aggregated
-		 * together to form larger files. Note that this overwrites the
-		 * length of the first zone of the set of contiguous zones
-		 * aggregated together. If one offline or read-only zone is
-		 * found, assume that all zones aggregated have the same
-		 * condition.
-		 */
-		if (type == ZONEFS_ZTYPE_CNV &&
-		    (sbi->s_features & ZONEFS_F_AGGRCNV)) {
-			for (; next < end; next++) {
-				if (zonefs_zone_type(next) != type)
-					break;
-				zone->len += next->len;
-				zone->capacity += next->capacity;
-				if (next->cond == BLK_ZONE_COND_READONLY &&
-				    zone->cond != BLK_ZONE_COND_OFFLINE)
-					zone->cond = BLK_ZONE_COND_READONLY;
-				else if (next->cond == BLK_ZONE_COND_OFFLINE)
-					zone->cond = BLK_ZONE_COND_OFFLINE;
-			}
-			if (zone->capacity != zone->len) {
-				zonefs_err(sb, "Invalid conventional zone capacity\n");
-				ret = -EINVAL;
-				goto free;
-			}
-		}
-
-		/*
-		 * Use the file number within its group as file name.
-		 */
-		snprintf(file_name, ZONEFS_NAME_MAX - 1, "%u", n);
-		dent = zonefs_create_inode(dir, file_name, zone, type);
+	for (i = 0; i < zgroup->g_nr_zones; i++) {
+		/* Use the zone number within its group as the file name */
+		snprintf(file_name, ZONEFS_NAME_MAX - 1, "%u", i);
+		dent = zonefs_create_inode(dir, file_name,
+					   &zgroup->g_zones[i], ztype);
 		if (IS_ERR(dent)) {
 			ret = PTR_ERR(dent);
-			goto free;
+			break;
 		}
-
-		n++;
 	}
 
-	zonefs_info(sb, "Zone group \"%s\" has %u file%s\n",
-		    zgroup_name, n, n > 1 ? "s" : "");
-
-	sbi->s_nr_files[type] = n;
-	ret = 0;
-
 free:
 	kfree(file_name);
 
@@ -789,21 +720,38 @@ static int zonefs_get_zone_info_cb(struct blk_zone *zone, unsigned int idx,
 				   void *data)
 {
 	struct zonefs_zone_data *zd = data;
+	struct super_block *sb = zd->sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+
+	/*
+	 * We do not care about the first zone: it contains the super block
+	 * and not exposed as a file.
+	 */
+	if (!idx)
+		return 0;
 
 	/*
-	 * Count the number of usable zones: the first zone at index 0 contains
-	 * the super block and is ignored.
+	 * Count the number of zones that will be exposed as files.
+	 * For sequential zones, we always have as many files as zones.
+	 * FOr conventional zones, the number of files depends on if we have
+	 * conventional zones aggregation enabled.
 	 */
 	switch (zone->type) {
 	case BLK_ZONE_TYPE_CONVENTIONAL:
-		zone->wp = zone->start + zone->len;
-		if (idx)
-			zd->nr_zones[ZONEFS_ZTYPE_CNV]++;
+		if (sbi->s_features & ZONEFS_F_AGGRCNV) {
+			/* One file per set of contiguous conventional zones */
+			if (!(sbi->s_zgroup[ZONEFS_ZTYPE_CNV].g_nr_zones) ||
+			    zone->start != zd->cnv_zone_start)
+				sbi->s_zgroup[ZONEFS_ZTYPE_CNV].g_nr_zones++;
+			zd->cnv_zone_start = zone->start + zone->len;
+		} else {
+			/* One file per zone */
+			sbi->s_zgroup[ZONEFS_ZTYPE_CNV].g_nr_zones++;
+		}
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
-		if (idx)
-			zd->nr_zones[ZONEFS_ZTYPE_SEQ]++;
+		sbi->s_zgroup[ZONEFS_ZTYPE_SEQ].g_nr_zones++;
 		break;
 	default:
 		zonefs_err(zd->sb, "Unsupported zone type 0x%x\n",
@@ -843,11 +791,173 @@ static int zonefs_get_zone_info(struct zonefs_zone_data *zd)
 	return 0;
 }
 
-static inline void zonefs_cleanup_zone_info(struct zonefs_zone_data *zd)
+static inline void zonefs_free_zone_info(struct zonefs_zone_data *zd)
 {
 	kvfree(zd->zones);
 }
 
+/*
+ * Create a zone group and populate it with zone files.
+ */
+static int zonefs_init_zgroup(struct super_block *sb,
+			      struct zonefs_zone_data *zd,
+			      enum zonefs_ztype ztype)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct zonefs_zone_group *zgroup = &sbi->s_zgroup[ztype];
+	struct blk_zone *zone, *next, *end;
+	struct zonefs_zone *z;
+	unsigned int n = 0;
+	int ret;
+
+	/* Allocate the zone group. If it is empty, we have nothing to do. */
+	if (!zgroup->g_nr_zones)
+		return 0;
+
+	zgroup->g_zones = kvcalloc(zgroup->g_nr_zones,
+				   sizeof(struct zonefs_zone), GFP_KERNEL);
+	if (!zgroup->g_zones)
+		return -ENOMEM;
+
+	/*
+	 * Initialize the zone groups using the device zone information.
+	 * We always skip the first zone as it contains the super block
+	 * and is not use to back a file.
+	 */
+	end = zd->zones + bdev_nr_zones(sb->s_bdev);
+	for (zone = &zd->zones[1]; zone < end; zone = next) {
+
+		next = zone + 1;
+		if (zonefs_zone_type(zone) != ztype)
+			continue;
+
+		if (WARN_ON_ONCE(n >= zgroup->g_nr_zones))
+			return -EINVAL;
+
+		/*
+		 * For conventional zones, contiguous zones can be aggregated
+		 * together to form larger files. Note that this overwrites the
+		 * length of the first zone of the set of contiguous zones
+		 * aggregated together. If one offline or read-only zone is
+		 * found, assume that all zones aggregated have the same
+		 * condition.
+		 */
+		if (ztype == ZONEFS_ZTYPE_CNV &&
+		    (sbi->s_features & ZONEFS_F_AGGRCNV)) {
+			for (; next < end; next++) {
+				if (zonefs_zone_type(next) != ztype)
+					break;
+				zone->len += next->len;
+				zone->capacity += next->capacity;
+				if (next->cond == BLK_ZONE_COND_READONLY &&
+				    zone->cond != BLK_ZONE_COND_OFFLINE)
+					zone->cond = BLK_ZONE_COND_READONLY;
+				else if (next->cond == BLK_ZONE_COND_OFFLINE)
+					zone->cond = BLK_ZONE_COND_OFFLINE;
+			}
+		}
+
+		z = &zgroup->g_zones[n];
+		if (ztype == ZONEFS_ZTYPE_CNV)
+			z->z_flags |= ZONEFS_ZONE_CNV;
+		z->z_sector = zone->start;
+		z->z_size = zone->len << SECTOR_SHIFT;
+		if (z->z_size > bdev_zone_sectors(sb->s_bdev) << SECTOR_SHIFT &&
+		    !(sbi->s_features & ZONEFS_F_AGGRCNV)) {
+			zonefs_err(sb,
+				"Invalid zone size %llu (device zone sectors %llu)\n",
+				z->z_size,
+				bdev_zone_sectors(sb->s_bdev) << SECTOR_SHIFT);
+			return -EINVAL;
+		}
+
+		z->z_capacity = min_t(loff_t, MAX_LFS_FILESIZE,
+				      zone->capacity << SECTOR_SHIFT);
+		z->z_wpoffset = zonefs_check_zone_condition(sb, z, zone);
+
+		sb->s_maxbytes = max(z->z_capacity, sb->s_maxbytes);
+		sbi->s_blocks += z->z_capacity >> sb->s_blocksize_bits;
+		sbi->s_used_blocks += z->z_wpoffset >> sb->s_blocksize_bits;
+
+		/*
+		 * For sequential zones, make sure that any open zone is closed
+		 * first to ensure that the initial number of open zones is 0,
+		 * in sync with the open zone accounting done when the mount
+		 * option ZONEFS_MNTOPT_EXPLICIT_OPEN is used.
+		 */
+		if (ztype == ZONEFS_ZTYPE_SEQ &&
+		    (zone->cond == BLK_ZONE_COND_IMP_OPEN ||
+		     zone->cond == BLK_ZONE_COND_EXP_OPEN)) {
+			ret = zonefs_zone_mgmt(sb, z, REQ_OP_ZONE_CLOSE);
+			if (ret)
+				return ret;
+		}
+
+		zonefs_account_active(sb, z);
+
+		n++;
+	}
+
+	if (WARN_ON_ONCE(n != zgroup->g_nr_zones))
+		return -EINVAL;
+
+	zonefs_info(sb, "Zone group \"%s\" has %u file%s\n",
+		    zonefs_zgroup_name(ztype),
+		    zgroup->g_nr_zones,
+		    zgroup->g_nr_zones > 1 ? "s" : "");
+
+	return 0;
+}
+
+static void zonefs_free_zgroups(struct super_block *sb)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	enum zonefs_ztype ztype;
+
+	if (!sbi)
+		return;
+
+	for (ztype = 0; ztype < ZONEFS_ZTYPE_MAX; ztype++) {
+		kvfree(sbi->s_zgroup[ztype].g_zones);
+		sbi->s_zgroup[ztype].g_zones = NULL;
+	}
+}
+
+/*
+ * Create a zone group and populate it with zone files.
+ */
+static int zonefs_init_zgroups(struct super_block *sb)
+{
+	struct zonefs_zone_data zd;
+	enum zonefs_ztype ztype;
+	int ret;
+
+	/* First get the device zone information */
+	memset(&zd, 0, sizeof(struct zonefs_zone_data));
+	zd.sb = sb;
+	ret = zonefs_get_zone_info(&zd);
+	if (ret)
+		goto cleanup;
+
+	/* Allocate and initialize the zone groups */
+	for (ztype = 0; ztype < ZONEFS_ZTYPE_MAX; ztype++) {
+		ret = zonefs_init_zgroup(sb, &zd, ztype);
+		if (ret) {
+			zonefs_info(sb,
+				    "Zone group \"%s\" initialization failed\n",
+				    zonefs_zgroup_name(ztype));
+			break;
+		}
+	}
+
+cleanup:
+	zonefs_free_zone_info(&zd);
+	if (ret)
+		zonefs_free_zgroups(sb);
+
+	return ret;
+}
+
 /*
  * Read super block information from the device.
  */
@@ -945,7 +1055,6 @@ static const struct super_operations zonefs_sops = {
  */
 static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct zonefs_zone_data zd;
 	struct zonefs_sb_info *sbi;
 	struct inode *inode;
 	enum zonefs_ztype t;
@@ -998,16 +1107,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	if (ret)
 		return ret;
 
-	memset(&zd, 0, sizeof(struct zonefs_zone_data));
-	zd.sb = sb;
-	ret = zonefs_get_zone_info(&zd);
-	if (ret)
-		goto cleanup;
-
-	ret = zonefs_sysfs_register(sb);
-	if (ret)
-		goto cleanup;
-
 	zonefs_info(sb, "Mounting %u zones", bdev_nr_zones(sb->s_bdev));
 
 	if (!sbi->s_max_wro_seq_files &&
@@ -1018,6 +1117,11 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 		sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
 	}
 
+	/* Initialize the zone groups */
+	ret = zonefs_init_zgroups(sb);
+	if (ret)
+		goto cleanup;
+
 	/* Create root directory inode */
 	ret = -ENOMEM;
 	inode = new_inode(sb);
@@ -1037,13 +1141,19 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 
 	/* Create and populate files in zone groups directories */
 	for (t = 0; t < ZONEFS_ZTYPE_MAX; t++) {
-		ret = zonefs_create_zgroup(&zd, t);
+		ret = zonefs_create_zgroup_inodes(sb, t);
 		if (ret)
-			break;
+			goto cleanup;
 	}
 
+	ret = zonefs_sysfs_register(sb);
+	if (ret)
+		goto cleanup;
+
+	return 0;
+
 cleanup:
-	zonefs_cleanup_zone_info(&zd);
+	zonefs_free_zgroups(sb);
 
 	return ret;
 }
@@ -1062,6 +1172,7 @@ static void zonefs_kill_super(struct super_block *sb)
 		d_genocide(sb->s_root);
 
 	zonefs_sysfs_unregister(sb);
+	zonefs_free_zgroups(sb);
 	kill_block_super(sb);
 	kfree(sbi);
 }
diff --git a/fs/zonefs/trace.h b/fs/zonefs/trace.h
index 42edcfd393ed..9969db3a9c7d 100644
--- a/fs/zonefs/trace.h
+++ b/fs/zonefs/trace.h
@@ -20,8 +20,9 @@
 #define show_dev(dev) MAJOR(dev), MINOR(dev)
 
 TRACE_EVENT(zonefs_zone_mgmt,
-	    TP_PROTO(struct inode *inode, enum req_op op),
-	    TP_ARGS(inode, op),
+	    TP_PROTO(struct super_block *sb, struct zonefs_zone *z,
+		     enum req_op op),
+	    TP_ARGS(sb, z, op),
 	    TP_STRUCT__entry(
 			     __field(dev_t, dev)
 			     __field(ino_t, ino)
@@ -30,12 +31,12 @@ TRACE_EVENT(zonefs_zone_mgmt,
 			     __field(sector_t, nr_sectors)
 	    ),
 	    TP_fast_assign(
-			   __entry->dev = inode->i_sb->s_dev;
-			   __entry->ino = inode->i_ino;
+			   __entry->dev = sb->s_dev;
+			   __entry->ino =
+				z->z_sector >> ZONEFS_SB(sb)->s_zone_sectors_shift;
 			   __entry->op = op;
-			   __entry->sector = ZONEFS_I(inode)->i_zsector;
-			   __entry->nr_sectors =
-				   ZONEFS_I(inode)->i_zone_size >> SECTOR_SHIFT;
+			   __entry->sector = z->z_sector;
+			   __entry->nr_sectors = z->z_size >> SECTOR_SHIFT;
 	    ),
 	    TP_printk("bdev=(%d,%d), ino=%lu op=%s, sector=%llu, nr_sectors=%llu",
 		      show_dev(__entry->dev), (unsigned long)__entry->ino,
@@ -58,9 +59,10 @@ TRACE_EVENT(zonefs_file_dio_append,
 	    TP_fast_assign(
 			   __entry->dev = inode->i_sb->s_dev;
 			   __entry->ino = inode->i_ino;
-			   __entry->sector = ZONEFS_I(inode)->i_zsector;
+			   __entry->sector = zonefs_inode_zone(inode)->z_sector;
 			   __entry->size = size;
-			   __entry->wpoffset = ZONEFS_I(inode)->i_wpoffset;
+			   __entry->wpoffset =
+				zonefs_inode_zone(inode)->z_wpoffset;
 			   __entry->ret = ret;
 	    ),
 	    TP_printk("bdev=(%d, %d), ino=%lu, sector=%llu, size=%zu, wpoffset=%llu, ret=%zu",
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 1a225f74015a..2d626e18b141 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -47,22 +47,39 @@ static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
 #define ZONEFS_ZONE_CNV		(1U << 31)
 
 /*
- * In-memory inode data.
+ * In-memory per-file inode zone data.
  */
-struct zonefs_inode_info {
-	struct inode		i_vnode;
+struct zonefs_zone {
+	/* Zone state flags */
+	unsigned int		z_flags;
 
-	/* File zone start sector (512B unit) */
-	sector_t		i_zsector;
+	/* Zone start sector (512B unit) */
+	sector_t		z_sector;
 
-	/* File zone write pointer position (sequential zones only) */
-	loff_t			i_wpoffset;
+	/* Zone size (bytes) */
+	loff_t			z_size;
 
-	/* File maximum size */
-	loff_t			i_max_size;
+	/* Zone capacity (file maximum size, bytes) */
+	loff_t			z_capacity;
 
-	/* File zone size */
-	loff_t			i_zone_size;
+	/* Write pointer offset in the zone (sequential zones only, bytes) */
+	loff_t			z_wpoffset;
+};
+
+/*
+ * In memory zone group information: all zones of a group are exposed
+ * as files, one file per zone.
+ */
+struct zonefs_zone_group {
+	unsigned int		g_nr_zones;
+	struct zonefs_zone	*g_zones;
+};
+
+/*
+ * In-memory inode data.
+ */
+struct zonefs_inode_info {
+	struct inode		i_vnode;
 
 	/*
 	 * To serialise fully against both syscall and mmap based IO and
@@ -81,7 +98,6 @@ struct zonefs_inode_info {
 
 	/* guarded by i_truncate_mutex */
 	unsigned int		i_wr_refcnt;
-	unsigned int		i_flags;
 };
 
 static inline struct zonefs_inode_info *ZONEFS_I(struct inode *inode)
@@ -89,24 +105,29 @@ static inline struct zonefs_inode_info *ZONEFS_I(struct inode *inode)
 	return container_of(inode, struct zonefs_inode_info, i_vnode);
 }
 
-static inline bool zonefs_zone_is_cnv(struct zonefs_inode_info *zi)
+static inline bool zonefs_zone_is_cnv(struct zonefs_zone *z)
+{
+	return z->z_flags & ZONEFS_ZONE_CNV;
+}
+
+static inline bool zonefs_zone_is_seq(struct zonefs_zone *z)
 {
-	return zi->i_flags & ZONEFS_ZONE_CNV;
+	return !zonefs_zone_is_cnv(z);
 }
 
-static inline bool zonefs_zone_is_seq(struct zonefs_inode_info *zi)
+static inline struct zonefs_zone *zonefs_inode_zone(struct inode *inode)
 {
-	return !zonefs_zone_is_cnv(zi);
+	return inode->i_private;
 }
 
 static inline bool zonefs_inode_is_cnv(struct inode *inode)
 {
-	return zonefs_zone_is_cnv(ZONEFS_I(inode));
+	return zonefs_zone_is_cnv(zonefs_inode_zone(inode));
 }
 
 static inline bool zonefs_inode_is_seq(struct inode *inode)
 {
-	return zonefs_zone_is_seq(ZONEFS_I(inode));
+	return zonefs_zone_is_seq(zonefs_inode_zone(inode));
 }
 
 /*
@@ -200,7 +221,7 @@ struct zonefs_sb_info {
 	uuid_t			s_uuid;
 	unsigned int		s_zone_sectors_shift;
 
-	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
+	struct zonefs_zone_group s_zgroup[ZONEFS_ZTYPE_MAX];
 
 	loff_t			s_blocks;
 	loff_t			s_used_blocks;
@@ -229,8 +250,8 @@ static inline struct zonefs_sb_info *ZONEFS_SB(struct super_block *sb)
 	pr_warn("zonefs (%s) WARNING: " format, sb->s_id, ## args)
 
 /* In super.c */
-void zonefs_account_active(struct inode *inode);
-int zonefs_zone_mgmt(struct inode *inode, enum req_op op);
+void zonefs_inode_account_active(struct inode *inode);
+int zonefs_inode_zone_mgmt(struct inode *inode, enum req_op op);
 void zonefs_i_size_write(struct inode *inode, loff_t isize);
 void zonefs_update_stats(struct inode *inode, loff_t new_isize);
 void __zonefs_io_error(struct inode *inode, bool write);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 6/7] zonefs: Dynamically create file inodes when needed
  2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
                   ` (4 preceding siblings ...)
  2023-01-10 13:08 ` [PATCH 5/7] zonefs: Separate zone information from inode information Damien Le Moal
@ 2023-01-10 13:08 ` Damien Le Moal
  2023-01-16 11:46   ` Johannes Thumshirn
  2023-01-10 13:08 ` [PATCH 7/7] zonefs: Cache zone group directory inodes Damien Le Moal
  6 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

Allocating and initializing all inodes and dentries for all files
results in a very large memory usage with high capacity zoned block
devices. For instance, with a 26 TB SMR HDD with over 96000 zones,
mounting the disk with zonefs results in about 130 MB of memory used,
the vast majority of this space being used for vfs inodes and dentries.

However, since a user will rarely access all zones at the same time,
dynamically creating file inodes and dentries on demand, similarly to
regular file systems, can significantly reduce memory usage.

This patch modifies mount processing to not create the inodes and
dentries for zone files. Instead, the directory inode operation
zonefs_lookup() and directory file operation zonefs_readdir() are
introduced to allocate and initialize inodes on-demand using the helper
functions zonefs_get_dir_inode() and zonefs_get_zgroup_inode().

Implementation of these functions is simple, relying on the static
nature of zonefs directories and files. Directoy inodes are linked to
the volume zone groups (struct zonefs_zone_group) they represent by
using the directory inode i_private field. This simplifies the
implementation of the lookup and readdir operations.

Unreferenced zone file inodes can be evicted from the inode cache at any
time. In such case, the only inode information that cannot be recreated
from the zone information that is saved in the zone group data
structures attached to the volume super block is the inode uid, gid and
access rights. These values may have been changed by the user. To keep
these attributes for the life time of the mount, as before, the inode
mode, uid and gid are saved in the inode zone information and the saved
values are used to initialize regular file inodes when an inode lookup
happens. The zone information mode, uid and gid are initialized in
zonefs_init_zgroup() using the default values.

With these changes, the static minimal memory usage of a zonefs volume
is mostly reduced to the array of zone information for each zone group.
For the 26 TB SMR hard-disk mentioned above, the memory usage after
mount becomes about 5.4 MB, a reduction by a factor of 24 from the
initial 130 MB memory use.

Co-developed-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/super.c  | 347 ++++++++++++++++++++++++++++++++-------------
 fs/zonefs/zonefs.h |   9 ++
 2 files changed, 257 insertions(+), 99 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 270ded209dde..7d70c327883e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -243,6 +243,7 @@ static void zonefs_inode_update_mode(struct inode *inode)
 	}
 
 	z->z_flags &= ~ZONEFS_ZONE_INIT_MODE;
+	z->z_mode = inode->i_mode;
 }
 
 struct zonefs_ioerr_data {
@@ -578,144 +579,283 @@ static int zonefs_inode_setattr(struct user_namespace *mnt_userns,
 
 	setattr_copy(&init_user_ns, inode, iattr);
 
+	if (S_ISREG(inode->i_mode)) {
+		struct zonefs_zone *z = zonefs_inode_zone(inode);
+
+		z->z_mode = inode->i_mode;
+		z->z_uid = inode->i_uid;
+		z->z_gid = inode->i_gid;
+	}
+
 	return 0;
 }
 
-static const struct inode_operations zonefs_dir_inode_operations = {
-	.lookup		= simple_lookup,
+static const struct inode_operations zonefs_file_inode_operations = {
 	.setattr	= zonefs_inode_setattr,
 };
 
-static void zonefs_init_dir_inode(struct inode *parent, struct inode *inode,
-				  enum zonefs_ztype ztype)
+static long zonefs_fname_to_fno(const struct qstr *fname)
 {
-	struct super_block *sb = parent->i_sb;
+	const char *name = fname->name;
+	unsigned int len = fname->len;
+	long fno = 0, shift = 1;
+	const char *rname;
+	char c = *name;
+	unsigned int i;
 
-	inode->i_ino = bdev_nr_zones(sb->s_bdev) + ztype + 1;
-	inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
-	inode->i_op = &zonefs_dir_inode_operations;
-	inode->i_fop = &simple_dir_operations;
-	set_nlink(inode, 2);
-	inc_nlink(parent);
-}
+	/*
+	 * File names are always a base-10 number string without any
+	 * leading 0s.
+	 */
+	if (!isdigit(c))
+		return -ENOENT;
 
-static const struct inode_operations zonefs_file_inode_operations = {
-	.setattr	= zonefs_inode_setattr,
-};
+	if (len > 1 && c == '0')
+		return -ENOENT;
 
-static void zonefs_init_file_inode(struct inode *inode,
-				   struct zonefs_zone *z)
+	if (len == 1)
+		return c - '0';
+
+	for (i = 0, rname = name + len - 1; i < len; i++, rname--) {
+		c = *rname;
+		if (!isdigit(c))
+			return -ENOENT;
+		fno += (c - '0') * shift;
+		shift *= 10;
+	}
+
+	return fno;
+}
+
+static struct inode *zonefs_get_file_inode(struct inode *dir,
+					   struct dentry *dentry)
 {
-	struct super_block *sb = inode->i_sb;
+	struct zonefs_zone_group *zgroup = dir->i_private;
+	struct super_block *sb = dir->i_sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct zonefs_zone *z;
+	struct inode *inode;
+	ino_t ino;
+	long fno;
 
-	inode->i_private = z;
+	/* Get the file number from the file name */
+	fno = zonefs_fname_to_fno(&dentry->d_name);
+	if (fno < 0)
+		return ERR_PTR(fno);
+
+	if (!zgroup->g_nr_zones || fno >= zgroup->g_nr_zones)
+		return ERR_PTR(-ENOENT);
 
-	inode->i_ino = z->z_sector >> sbi->s_zone_sectors_shift;
-	inode->i_mode = S_IFREG | sbi->s_perm;
-	inode->i_uid = sbi->s_uid;
-	inode->i_gid = sbi->s_gid;
+	z = &zgroup->g_zones[fno];
+	ino = z->z_sector >> sbi->s_zone_sectors_shift;
+	inode = iget_locked(sb, ino);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+	if (!(inode->i_state & I_NEW)) {
+		WARN_ON_ONCE(inode->i_private != z);
+		return inode;
+	}
+
+	inode->i_ino = ino;
+	inode->i_mode = z->z_mode;
+	inode->i_ctime = inode->i_mtime = inode->i_atime = dir->i_ctime;
+	inode->i_uid = z->z_uid;
+	inode->i_gid = z->z_gid;
 	inode->i_size = z->z_wpoffset;
 	inode->i_blocks = z->z_capacity >> SECTOR_SHIFT;
+	inode->i_private = z;
 
 	inode->i_op = &zonefs_file_inode_operations;
 	inode->i_fop = &zonefs_file_operations;
 	inode->i_mapping->a_ops = &zonefs_file_aops;
 
 	/* Update the inode access rights depending on the zone condition */
-	z->z_flags |= ZONEFS_ZONE_INIT_MODE;
 	zonefs_inode_update_mode(inode);
+
+	unlock_new_inode(inode);
+
+	return inode;
 }
 
-static struct dentry *zonefs_create_inode(struct dentry *parent,
-					  const char *name,
-					  struct zonefs_zone *z,
-					  enum zonefs_ztype ztype)
+static struct inode *zonefs_get_zgroup_inode(struct super_block *sb,
+					     enum zonefs_ztype ztype)
 {
-	struct inode *dir = d_inode(parent);
-	struct dentry *dentry;
+	struct inode *root = d_inode(sb->s_root);
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	struct inode *inode;
-	int ret = -ENOMEM;
+	ino_t ino = bdev_nr_zones(sb->s_bdev) + ztype + 1;
 
-	dentry = d_alloc_name(parent, name);
-	if (!dentry)
-		return ERR_PTR(ret);
-
-	inode = new_inode(parent->d_sb);
+	inode = iget_locked(sb, ino);
 	if (!inode)
-		goto dput;
+		return ERR_PTR(-ENOMEM);
+	if (!(inode->i_state & I_NEW))
+		return inode;
+
+	inode->i_ino = ino;
+	inode_init_owner(&init_user_ns, inode, root, S_IFDIR | 0555);
+	inode->i_size = sbi->s_zgroup[ztype].g_nr_zones;
+	inode->i_ctime = inode->i_mtime = inode->i_atime = root->i_ctime;
+	inode->i_private = &sbi->s_zgroup[ztype];
+	set_nlink(inode, 2);
 
-	inode->i_ctime = inode->i_mtime = inode->i_atime = dir->i_ctime;
-	if (z)
-		zonefs_init_file_inode(inode, z);
-	else
-		zonefs_init_dir_inode(dir, inode, ztype);
+	inode->i_op = &zonefs_dir_inode_operations;
+	inode->i_fop = &zonefs_dir_operations;
+
+	unlock_new_inode(inode);
 
-	d_add(dentry, inode);
-	dir->i_size++;
+	return inode;
+}
 
-	return dentry;
 
-dput:
-	dput(dentry);
+static struct inode *zonefs_get_dir_inode(struct inode *dir,
+					  struct dentry *dentry)
+{
+	struct super_block *sb = dir->i_sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	const char *name = dentry->d_name.name;
+	enum zonefs_ztype ztype;
+
+	/*
+	 * We only need to check for the "seq" directory and
+	 * the "cnv" directory if we have conventional zones.
+	 */
+	if (dentry->d_name.len != 3)
+		return ERR_PTR(-ENOENT);
+
+	for (ztype = 0; ztype < ZONEFS_ZTYPE_MAX; ztype++) {
+		if (sbi->s_zgroup[ztype].g_nr_zones &&
+		    memcmp(name, zonefs_zgroup_name(ztype), 3) == 0)
+			break;
+	}
+	if (ztype == ZONEFS_ZTYPE_MAX)
+		return ERR_PTR(-ENOENT);
 
-	return ERR_PTR(ret);
+	return zonefs_get_zgroup_inode(sb, ztype);
 }
 
-struct zonefs_zone_data {
-	struct super_block	*sb;
-	unsigned int		nr_zones[ZONEFS_ZTYPE_MAX];
-	sector_t		cnv_zone_start;
-	struct blk_zone		*zones;
-};
+static struct dentry *zonefs_lookup(struct inode *dir, struct dentry *dentry,
+				    unsigned int flags)
+{
+	struct inode *inode;
 
-/*
- * Create the inodes for a zone group.
- */
-static int zonefs_create_zgroup_inodes(struct super_block *sb,
-				       enum zonefs_ztype ztype)
+	if (dentry->d_name.len > ZONEFS_NAME_MAX)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	if (dir == d_inode(dir->i_sb->s_root))
+		inode = zonefs_get_dir_inode(dir, dentry);
+	else
+		inode = zonefs_get_file_inode(dir, dentry);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	return d_splice_alias(inode, dentry);
+}
+
+static int zonefs_readdir_root(struct file *file, struct dir_context *ctx)
 {
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
-	struct zonefs_zone_group *zgroup = &sbi->s_zgroup[ztype];
-	struct dentry *dir, *dent;
-	char *file_name;
-	int i, ret = 0;
+	enum zonefs_ztype ztype = ZONEFS_ZTYPE_CNV;
+	ino_t base_ino = bdev_nr_zones(sb->s_bdev) + 1;
 
-	if (!zgroup)
-		return -ENOMEM;
+	if (ctx->pos >= inode->i_size)
+		return 0;
 
-	/* If the group is empty, there is nothing to do */
-	if (!zgroup->g_nr_zones)
+	if (!dir_emit_dots(file, ctx))
 		return 0;
 
-	file_name = kmalloc(ZONEFS_NAME_MAX, GFP_KERNEL);
-	if (!file_name)
-		return -ENOMEM;
+	if (ctx->pos == 2) {
+		if (!sbi->s_zgroup[ZONEFS_ZTYPE_CNV].g_nr_zones)
+			ztype = ZONEFS_ZTYPE_SEQ;
 
-	dir = zonefs_create_inode(sb->s_root, zonefs_zgroup_name(ztype),
-				  NULL, ztype);
-	if (IS_ERR(dir)) {
-		ret = PTR_ERR(dir);
-		goto free;
+		if (!dir_emit(ctx, zonefs_zgroup_name(ztype), 3,
+			      base_ino + ztype, DT_DIR))
+			return 0;
+		ctx->pos++;
 	}
 
-	for (i = 0; i < zgroup->g_nr_zones; i++) {
-		/* Use the zone number within its group as the file name */
-		snprintf(file_name, ZONEFS_NAME_MAX - 1, "%u", i);
-		dent = zonefs_create_inode(dir, file_name,
-					   &zgroup->g_zones[i], ztype);
-		if (IS_ERR(dent)) {
-			ret = PTR_ERR(dent);
+	if (ctx->pos == 3 && ztype != ZONEFS_ZTYPE_SEQ) {
+		ztype = ZONEFS_ZTYPE_SEQ;
+		if (!dir_emit(ctx, zonefs_zgroup_name(ztype), 3,
+			      base_ino + ztype, DT_DIR))
+			return 0;
+		ctx->pos++;
+	}
+
+	return 0;
+}
+
+static int zonefs_readdir_zgroup(struct file *file,
+				 struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct zonefs_zone_group *zgroup = inode->i_private;
+	struct super_block *sb = inode->i_sb;
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct zonefs_zone *z;
+	int fname_len;
+	char *fname;
+	ino_t ino;
+	int f;
+
+	/*
+	 * The size of zone group directories is equal to the number
+	 * of zone files in the group and does note include the "." and
+	 * ".." entries. Hence the "+ 2" here.
+	 */
+	if (ctx->pos >= inode->i_size + 2)
+		return 0;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	fname = kmalloc(ZONEFS_NAME_MAX, GFP_KERNEL);
+	if (!fname)
+		return -ENOMEM;
+
+	for (f = ctx->pos - 2; f < zgroup->g_nr_zones; f++) {
+		z = &zgroup->g_zones[f];
+		ino = z->z_sector >> sbi->s_zone_sectors_shift;
+		fname_len = snprintf(fname, ZONEFS_NAME_MAX - 1, "%u", f);
+		if (!dir_emit(ctx, fname, fname_len, ino, DT_REG))
 			break;
-		}
+		ctx->pos++;
 	}
 
-free:
-	kfree(file_name);
+	kfree(fname);
 
-	return ret;
+	return 0;
 }
 
+static int zonefs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+
+	if (inode == d_inode(inode->i_sb->s_root))
+		return zonefs_readdir_root(file, ctx);
+
+	return zonefs_readdir_zgroup(file, ctx);
+}
+
+const struct inode_operations zonefs_dir_inode_operations = {
+	.lookup		= zonefs_lookup,
+	.setattr	= zonefs_inode_setattr,
+};
+
+const struct file_operations zonefs_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= zonefs_readdir,
+};
+
+struct zonefs_zone_data {
+	struct super_block	*sb;
+	unsigned int		nr_zones[ZONEFS_ZTYPE_MAX];
+	sector_t		cnv_zone_start;
+	struct blk_zone		*zones;
+};
+
 static int zonefs_get_zone_info_cb(struct blk_zone *zone, unsigned int idx,
 				   void *data)
 {
@@ -875,6 +1015,17 @@ static int zonefs_init_zgroup(struct super_block *sb,
 				      zone->capacity << SECTOR_SHIFT);
 		z->z_wpoffset = zonefs_check_zone_condition(sb, z, zone);
 
+		z->z_mode = S_IFREG | sbi->s_perm;
+		z->z_uid = sbi->s_uid;
+		z->z_gid = sbi->s_gid;
+
+		/*
+		 * Let zonefs_inode_update_mode() know that we will need
+		 * special initialization of the inode mode the first time
+		 * it is accessed.
+		 */
+		z->z_flags |= ZONEFS_ZONE_INIT_MODE;
+
 		sb->s_maxbytes = max(z->z_capacity, sb->s_maxbytes);
 		sbi->s_blocks += z->z_capacity >> sb->s_blocksize_bits;
 		sbi->s_used_blocks += z->z_wpoffset >> sb->s_blocksize_bits;
@@ -1057,7 +1208,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct zonefs_sb_info *sbi;
 	struct inode *inode;
-	enum zonefs_ztype t;
+	enum zonefs_ztype ztype;
 	int ret;
 
 	if (!bdev_is_zoned(sb->s_bdev)) {
@@ -1122,7 +1273,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	if (ret)
 		goto cleanup;
 
-	/* Create root directory inode */
+	/* Create the root directory inode */
 	ret = -ENOMEM;
 	inode = new_inode(sb);
 	if (!inode)
@@ -1132,20 +1283,20 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	inode->i_mode = S_IFDIR | 0555;
 	inode->i_ctime = inode->i_mtime = inode->i_atime = current_time(inode);
 	inode->i_op = &zonefs_dir_inode_operations;
-	inode->i_fop = &simple_dir_operations;
+	inode->i_fop = &zonefs_dir_operations;
+	inode->i_size = 2;
 	set_nlink(inode, 2);
+	for (ztype = 0; ztype < ZONEFS_ZTYPE_MAX; ztype++) {
+		if (sbi->s_zgroup[ztype].g_nr_zones) {
+			inc_nlink(inode);
+			inode->i_size++;
+		}
+	}
 
 	sb->s_root = d_make_root(inode);
 	if (!sb->s_root)
 		goto cleanup;
 
-	/* Create and populate files in zone groups directories */
-	for (t = 0; t < ZONEFS_ZTYPE_MAX; t++) {
-		ret = zonefs_create_zgroup_inodes(sb, t);
-		if (ret)
-			goto cleanup;
-	}
-
 	ret = zonefs_sysfs_register(sb);
 	if (ret)
 		goto cleanup;
@@ -1168,12 +1319,10 @@ static void zonefs_kill_super(struct super_block *sb)
 {
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 
-	if (sb->s_root)
-		d_genocide(sb->s_root);
+	kill_block_super(sb);
 
 	zonefs_sysfs_unregister(sb);
 	zonefs_free_zgroups(sb);
-	kill_block_super(sb);
 	kfree(sbi);
 }
 
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 2d626e18b141..f88466a4158b 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -64,6 +64,11 @@ struct zonefs_zone {
 
 	/* Write pointer offset in the zone (sequential zones only, bytes) */
 	loff_t			z_wpoffset;
+
+	/* Saved inode uid, gid and access rights */
+	umode_t			z_mode;
+	kuid_t			z_uid;
+	kgid_t			z_gid;
 };
 
 /*
@@ -265,6 +270,10 @@ static inline void zonefs_io_error(struct inode *inode, bool write)
 	mutex_unlock(&zi->i_truncate_mutex);
 }
 
+/* In super.c */
+extern const struct inode_operations zonefs_dir_inode_operations;
+extern const struct file_operations zonefs_dir_operations;
+
 /* In file.c */
 extern const struct address_space_operations zonefs_file_aops;
 extern const struct file_operations zonefs_file_operations;
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 7/7] zonefs: Cache zone group directory inodes
  2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
                   ` (5 preceding siblings ...)
  2023-01-10 13:08 ` [PATCH 6/7] zonefs: Dynamically create file inodes when needed Damien Le Moal
@ 2023-01-10 13:08 ` Damien Le Moal
  2023-01-16 11:49   ` Johannes Thumshirn
  6 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-10 13:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Johannes Thumshirn, Jorgen Hansen

Since looking up any zone file inode requires looking up first the inode
for the directory representing the zone group of the file, ensuring that
the zone group inodes are always cached is desired. To do so, take an
extra reference on the zone groups directory inodes on mount, thus
avoiding the eviction of these inodes from the inode cache until the
volume is unmounted.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/super.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zonefs/zonefs.h |  1 +
 2 files changed, 49 insertions(+)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 7d70c327883e..010b53545e5b 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -1199,6 +1199,42 @@ static const struct super_operations zonefs_sops = {
 	.show_options	= zonefs_show_options,
 };
 
+static int zonefs_get_zgroup_inodes(struct super_block *sb)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	struct inode *dir_inode;
+	enum zonefs_ztype ztype;
+
+	for (ztype = 0; ztype < ZONEFS_ZTYPE_MAX; ztype++) {
+		if (!sbi->s_zgroup[ztype].g_nr_zones)
+			continue;
+
+		dir_inode = zonefs_get_zgroup_inode(sb, ztype);
+		if (IS_ERR(dir_inode))
+			return PTR_ERR(dir_inode);
+
+		sbi->s_zgroup[ztype].g_inode = dir_inode;
+	}
+
+	return 0;
+}
+
+static void zonefs_release_zgroup_inodes(struct super_block *sb)
+{
+	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
+	enum zonefs_ztype ztype;
+
+	if (!sbi)
+		return;
+
+	for (ztype = 0; ztype < ZONEFS_ZTYPE_MAX; ztype++) {
+		if (sbi->s_zgroup[ztype].g_inode) {
+			iput(sbi->s_zgroup[ztype].g_inode);
+			sbi->s_zgroup[ztype].g_inode = NULL;
+		}
+	}
+}
+
 /*
  * Check that the device is zoned. If it is, get the list of zones and create
  * sub-directories and files according to the device zone configuration and
@@ -1297,6 +1333,14 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	if (!sb->s_root)
 		goto cleanup;
 
+	/*
+	 * Take a reference on the zone groups directory inodes
+	 * to keep them in the inode cache.
+	 */
+	ret = zonefs_get_zgroup_inodes(sb);
+	if (ret)
+		goto cleanup;
+
 	ret = zonefs_sysfs_register(sb);
 	if (ret)
 		goto cleanup;
@@ -1304,6 +1348,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	return 0;
 
 cleanup:
+	zonefs_release_zgroup_inodes(sb);
 	zonefs_free_zgroups(sb);
 
 	return ret;
@@ -1319,6 +1364,9 @@ static void zonefs_kill_super(struct super_block *sb)
 {
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 
+	/* Release the reference on the zone group directory inodes */
+	zonefs_release_zgroup_inodes(sb);
+
 	kill_block_super(sb);
 
 	zonefs_sysfs_unregister(sb);
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index f88466a4158b..8175652241b5 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -76,6 +76,7 @@ struct zonefs_zone {
  * as files, one file per zone.
  */
 struct zonefs_zone_group {
+	struct inode		*g_inode;
 	unsigned int		g_nr_zones;
 	struct zonefs_zone	*g_zones;
 };
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/7] zonefs: Detect append writes at invalid locations
  2023-01-10 13:08 ` [PATCH 1/7] zonefs: Detect append writes at invalid locations Damien Le Moal
@ 2023-01-11 12:23   ` Johannes Thumshirn
  2023-01-11 22:08     ` Damien Le Moal
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-11 12:23 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

On 10.01.23 14:08, Damien Le Moal wrote:
> Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential
> files succeeds regardless of the zone write pointer position, as long as
> the target zone is not full. This means that if an external (buggy)
> application writes to the zone of a sequential file underneath the file
> system, subsequent file write() operation will succeed but the file size
> will not be correct and the file will contain invalid data written by
> another application.
> 
> Modify zonefs_file_dio_append() to check the written sector of an append
> write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a
> mismatch with the file zone wp offset field. This change triggers a call
> to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to
> not expose the unexpected data after the current inode size when the
> errors=remount-ro mode is used. Other error modes are correctly handled
> already.

This only happens on ZNS and null_blk, doesn't it? On SCSI the Zone Append
emulation should catch this error before.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/7] zonefs: Reorganize code
  2023-01-10 13:08 ` [PATCH 2/7] zonefs: Reorganize code Damien Le Moal
@ 2023-01-11 16:12   ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-11 16:12 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/7] zonefs: Detect append writes at invalid locations
  2023-01-11 12:23   ` Johannes Thumshirn
@ 2023-01-11 22:08     ` Damien Le Moal
  2023-01-12  5:18       ` Damien Le Moal
  0 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-11 22:08 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-fsdevel; +Cc: Jørgen Hansen

On 1/11/23 21:23, Johannes Thumshirn wrote:
> On 10.01.23 14:08, Damien Le Moal wrote:
>> Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential
>> files succeeds regardless of the zone write pointer position, as long as
>> the target zone is not full. This means that if an external (buggy)
>> application writes to the zone of a sequential file underneath the file
>> system, subsequent file write() operation will succeed but the file size
>> will not be correct and the file will contain invalid data written by
>> another application.
>>
>> Modify zonefs_file_dio_append() to check the written sector of an append
>> write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a
>> mismatch with the file zone wp offset field. This change triggers a call
>> to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to
>> not expose the unexpected data after the current inode size when the
>> errors=remount-ro mode is used. Other error modes are correctly handled
>> already.
> 
> This only happens on ZNS and null_blk, doesn't it? On SCSI the Zone Append
> emulation should catch this error before.

Yes. The zone append will fail with scsi sd emulation so this change is
not useful in that case. But null_blk and ZNS drives will not fail the
zone append, resulting in a bad file size without this check.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/7] zonefs: Detect append writes at invalid locations
  2023-01-11 22:08     ` Damien Le Moal
@ 2023-01-12  5:18       ` Damien Le Moal
  2023-01-12  7:33         ` Johannes Thumshirn
  0 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2023-01-12  5:18 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-fsdevel; +Cc: Jørgen Hansen

On 1/12/23 07:08, Damien Le Moal wrote:
> On 1/11/23 21:23, Johannes Thumshirn wrote:
>> On 10.01.23 14:08, Damien Le Moal wrote:
>>> Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential
>>> files succeeds regardless of the zone write pointer position, as long as
>>> the target zone is not full. This means that if an external (buggy)
>>> application writes to the zone of a sequential file underneath the file
>>> system, subsequent file write() operation will succeed but the file size
>>> will not be correct and the file will contain invalid data written by
>>> another application.
>>>
>>> Modify zonefs_file_dio_append() to check the written sector of an append
>>> write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a
>>> mismatch with the file zone wp offset field. This change triggers a call
>>> to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to
>>> not expose the unexpected data after the current inode size when the
>>> errors=remount-ro mode is used. Other error modes are correctly handled
>>> already.
>>
>> This only happens on ZNS and null_blk, doesn't it? On SCSI the Zone Append
>> emulation should catch this error before.
> 
> Yes. The zone append will fail with scsi sd emulation so this change is
> not useful in that case. But null_blk and ZNS drives will not fail the
> zone append, resulting in a bad file size without this check.

Let me retract that... For scsi sd, the zone append emulation will do its
job if an application writes to a zone under the file system. Then zonefs
issuing a zone append will also succeed and we end up with the bad inode size.

We would get a zone append failure in zonefs with scsi sd if the
corruption to the zone was done with a passthrough command as these will
not result in sd_zbc zone write pointer tracking doing its job.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/7] zonefs: Detect append writes at invalid locations
  2023-01-12  5:18       ` Damien Le Moal
@ 2023-01-12  7:33         ` Johannes Thumshirn
  2023-01-12  8:26           ` Damien Le Moal
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-12  7:33 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

On 12.01.23 06:18, Damien Le Moal wrote:
> On 1/12/23 07:08, Damien Le Moal wrote:
>> On 1/11/23 21:23, Johannes Thumshirn wrote:
>>> On 10.01.23 14:08, Damien Le Moal wrote:
>>>> Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential
>>>> files succeeds regardless of the zone write pointer position, as long as
>>>> the target zone is not full. This means that if an external (buggy)
>>>> application writes to the zone of a sequential file underneath the file
>>>> system, subsequent file write() operation will succeed but the file size
>>>> will not be correct and the file will contain invalid data written by
>>>> another application.
>>>>
>>>> Modify zonefs_file_dio_append() to check the written sector of an append
>>>> write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a
>>>> mismatch with the file zone wp offset field. This change triggers a call
>>>> to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to
>>>> not expose the unexpected data after the current inode size when the
>>>> errors=remount-ro mode is used. Other error modes are correctly handled
>>>> already.
>>>
>>> This only happens on ZNS and null_blk, doesn't it? On SCSI the Zone Append
>>> emulation should catch this error before.
>>
>> Yes. The zone append will fail with scsi sd emulation so this change is
>> not useful in that case. But null_blk and ZNS drives will not fail the
>> zone append, resulting in a bad file size without this check.
> 
> Let me retract that... For scsi sd, the zone append emulation will do its
> job if an application writes to a zone under the file system. Then zonefs
> issuing a zone append will also succeed and we end up with the bad inode size.
> 
> We would get a zone append failure in zonefs with scsi sd if the
> corruption to the zone was done with a passthrough command as these will
> not result in sd_zbc zone write pointer tracking doing its job.
> 

But then the error recovery procedures in sd_zbc should come into place.

Anyways for ZNS and null_blk this is an improvement:
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/7] zonefs: Detect append writes at invalid locations
  2023-01-12  7:33         ` Johannes Thumshirn
@ 2023-01-12  8:26           ` Damien Le Moal
  0 siblings, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2023-01-12  8:26 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-fsdevel; +Cc: Jørgen Hansen

On 1/12/23 16:33, Johannes Thumshirn wrote:
> On 12.01.23 06:18, Damien Le Moal wrote:
>> On 1/12/23 07:08, Damien Le Moal wrote:
>>> On 1/11/23 21:23, Johannes Thumshirn wrote:
>>>> On 10.01.23 14:08, Damien Le Moal wrote:
>>>>> Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential
>>>>> files succeeds regardless of the zone write pointer position, as long as
>>>>> the target zone is not full. This means that if an external (buggy)
>>>>> application writes to the zone of a sequential file underneath the file
>>>>> system, subsequent file write() operation will succeed but the file size
>>>>> will not be correct and the file will contain invalid data written by
>>>>> another application.
>>>>>
>>>>> Modify zonefs_file_dio_append() to check the written sector of an append
>>>>> write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a
>>>>> mismatch with the file zone wp offset field. This change triggers a call
>>>>> to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to
>>>>> not expose the unexpected data after the current inode size when the
>>>>> errors=remount-ro mode is used. Other error modes are correctly handled
>>>>> already.
>>>>
>>>> This only happens on ZNS and null_blk, doesn't it? On SCSI the Zone Append
>>>> emulation should catch this error before.
>>>
>>> Yes. The zone append will fail with scsi sd emulation so this change is
>>> not useful in that case. But null_blk and ZNS drives will not fail the
>>> zone append, resulting in a bad file size without this check.
>>
>> Let me retract that... For scsi sd, the zone append emulation will do its
>> job if an application writes to a zone under the file system. Then zonefs
>> issuing a zone append will also succeed and we end up with the bad inode size.
>>
>> We would get a zone append failure in zonefs with scsi sd if the
>> corruption to the zone was done with a passthrough command as these will
>> not result in sd_zbc zone write pointer tracking doing its job.
>>
> 
> But then the error recovery procedures in sd_zbc should come into place.

Yes, for the passthrough case, because it ends up generating a failed
write, which zonefs catches and handles. All good in that case. It remains
that the patch is also necessary for scsi if the corruption happens with
something like dd, which will be seem by sd_zbc. So subsequent zone append
from zonefs will succeed, even though zonefs is not aware that the wp
changed...

> 
> Anyways for ZNS and null_blk this is an improvement:
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/7] zonefs: Simplify IO error handling
  2023-01-10 13:08 ` [PATCH 3/7] zonefs: Simplify IO error handling Damien Le Moal
@ 2023-01-13 10:09   ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-13 10:09 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/7] zonefs: Reduce struct zonefs_inode_info size
  2023-01-10 13:08 ` [PATCH 4/7] zonefs: Reduce struct zonefs_inode_info size Damien Le Moal
@ 2023-01-13 10:11   ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-13 10:11 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/7] zonefs: Separate zone information from inode information
  2023-01-10 13:08 ` [PATCH 5/7] zonefs: Separate zone information from inode information Damien Le Moal
@ 2023-01-13 11:30   ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-13 11:30 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 6/7] zonefs: Dynamically create file inodes when needed
  2023-01-10 13:08 ` [PATCH 6/7] zonefs: Dynamically create file inodes when needed Damien Le Moal
@ 2023-01-16 11:46   ` Johannes Thumshirn
  2023-01-16 12:17     ` Damien Le Moal
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-16 11:46 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

On 10.01.23 14:08, Damien Le Moal wrote:
> Implementation of these functions is simple, relying on the static
> nature of zonefs directories and files. Directoy inodes are linked to
                                Directory ~^

Otherwise looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 7/7] zonefs: Cache zone group directory inodes
  2023-01-10 13:08 ` [PATCH 7/7] zonefs: Cache zone group directory inodes Damien Le Moal
@ 2023-01-16 11:49   ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2023-01-16 11:49 UTC (permalink / raw)
  To: Damien Le Moal, linux-fsdevel; +Cc: Jørgen Hansen

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 6/7] zonefs: Dynamically create file inodes when needed
  2023-01-16 11:46   ` Johannes Thumshirn
@ 2023-01-16 12:17     ` Damien Le Moal
  0 siblings, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2023-01-16 12:17 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-fsdevel; +Cc: Jørgen Hansen

On 1/16/23 20:46, Johannes Thumshirn wrote:
> On 10.01.23 14:08, Damien Le Moal wrote:
>> Implementation of these functions is simple, relying on the static
>> nature of zonefs directories and files. Directoy inodes are linked to
>                                 Directory ~^

Fixed. Thanks.

> 
> Otherwise looks good,
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-01-16 12:18 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-10 13:08 [PATCH 0/7] Reduce zonefs memory usage Damien Le Moal
2023-01-10 13:08 ` [PATCH 1/7] zonefs: Detect append writes at invalid locations Damien Le Moal
2023-01-11 12:23   ` Johannes Thumshirn
2023-01-11 22:08     ` Damien Le Moal
2023-01-12  5:18       ` Damien Le Moal
2023-01-12  7:33         ` Johannes Thumshirn
2023-01-12  8:26           ` Damien Le Moal
2023-01-10 13:08 ` [PATCH 2/7] zonefs: Reorganize code Damien Le Moal
2023-01-11 16:12   ` Johannes Thumshirn
2023-01-10 13:08 ` [PATCH 3/7] zonefs: Simplify IO error handling Damien Le Moal
2023-01-13 10:09   ` Johannes Thumshirn
2023-01-10 13:08 ` [PATCH 4/7] zonefs: Reduce struct zonefs_inode_info size Damien Le Moal
2023-01-13 10:11   ` Johannes Thumshirn
2023-01-10 13:08 ` [PATCH 5/7] zonefs: Separate zone information from inode information Damien Le Moal
2023-01-13 11:30   ` Johannes Thumshirn
2023-01-10 13:08 ` [PATCH 6/7] zonefs: Dynamically create file inodes when needed Damien Le Moal
2023-01-16 11:46   ` Johannes Thumshirn
2023-01-16 12:17     ` Damien Le Moal
2023-01-10 13:08 ` [PATCH 7/7] zonefs: Cache zone group directory inodes Damien Le Moal
2023-01-16 11:49   ` Johannes Thumshirn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.