From: Hannes Reinecke <firstname.lastname@example.org>
To: Damien Le Moal <email@example.com>,
Christoph Hellwig <firstname.lastname@example.org>,
Johannes Thumshirn <email@example.com>,
Dave Chinner <firstname.lastname@example.org>,
"Darrick J . Wong" <email@example.com>
Cc: Matias Bjorling <firstname.lastname@example.org>
Subject: Re: [PATCH V3] fs: New zonefs file system
Date: Fri, 23 Aug 2019 12:12:09 +0200
Message-ID: <email@example.com>
On 8/21/19 9:03 AM, Damien Le Moal wrote:
> zonefs is a very simple file system exposing each zone of a zoned
> block device as a file. zonefs is in fact closer to a raw block device
> access interface than to a full feature POSIX file system.
> The goal of zonefs is to simplify implementation of zoned block device
> raw access by applications by allowing switching to the well known POSIX
> file API rather than relying on direct block device file ioctls and
> read/write. Zonefs, for instance, greatly simplifies the implementation
> of LSM (log-structured merge) tree structures (such as used in RocksDB
> and LevelDB) on zoned block devices by allowing SSTables to be stored in
> a zone file similarly to a regular file system architecture, hence
> reducing the amount of change needed in the application.
> Zonefs on-disk metadata is reduced to a super block to store a magic
> number, a uuid and optional features flags and values. On mount, zonefs
> uses blkdev_report_zones() to obtain the device zone configuration and
> populates the mount point with a static file tree solely based on this
> information. E.g. file sizes come from zone write pointer offset managed
> by the device itself.
> The zone files created on mount have the following characteristics.
> 1) Files representing zones of the same type are grouped together
> under a common directory:
> * For conventional zones, the directory "cnv" is used.
> * For sequential write zones, the directory "seq" is used.
> These two directories are the only directories that exist in zonefs.
> Users cannot create other directories and cannot rename nor delete
> the "cnv" and "seq" directories.
> 2) The name of zone files is by default the number of the file within
> the zone type directory, in order of increasing zone start sector.
> 3) The size of conventional zone files is fixed to the device zone size.
> Conventional zone files cannot be truncated.
> 4) The size of sequential zone files represents the file zone write
> pointer position relative to the zone start sector. Truncating these
> files is allowed only down to 0, in which case the zone is reset to
> rewind the file zone write pointer position to the start of the zone.
> 5) Read and write operations beyond the file zone size are not
> allowed. Any access exceeding the zone size fails with
> the -EFBIG error.
> 6) Creating, deleting, renaming or modifying any attribute of files
> and directories is not allowed. The only exception is the file
> size of sequential zone files, which can be modified by write
> operations or truncation to 0.
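
The size, truncation and access rules above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions, not code from the patch: `struct zone_file` and all three helpers are hypothetical names, the 512-byte sector unit is the kernel's usual sector size, and the -EPERM returned for a rejected truncation is an assumption, since the description only names -EFBIG for out-of-range accesses.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a zone file; not the structures used by the patch. */
struct zone_file {
	bool     sequential; /* true for files under "seq", false for "cnv" */
	uint64_t max_size;   /* zone size in bytes */
	uint64_t size;       /* current file size in bytes */
};

/*
 * Rule 4: a sequential zone file's size is the write pointer position
 * relative to the zone start sector, converted here to bytes
 * (512-byte sectors).
 */
static uint64_t zone_file_size_bytes(uint64_t start_sector, uint64_t wp_sector)
{
	assert(wp_sector >= start_sector);
	return (wp_sector - start_sector) * 512;
}

/* Rule 5: an access extending past the zone size fails with -EFBIG. */
static int zone_file_check_access(const struct zone_file *zf,
				  uint64_t off, uint64_t len)
{
	if (off > zf->max_size || len > zf->max_size - off)
		return -EFBIG;
	return 0;
}

/*
 * Rules 3 and 4: conventional files cannot be truncated; sequential
 * files can only be truncated to 0, which models a zone reset.
 * The -EPERM error code is an assumption, not taken from the patch.
 */
static int zone_file_truncate(struct zone_file *zf, uint64_t new_size)
{
	if (!zf->sequential || new_size != 0)
		return -EPERM;
	zf->size = 0; /* write pointer rewound to the zone start */
	return 0;
}
```

For a 1024-byte zone, `zone_file_check_access(&zf, 0, 1024)` is accepted while `zone_file_check_access(&zf, 1, 1024)` is rejected, because the second access would end one byte past the zone.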
> Several optional features of zonefs can be enabled at format time.
> * Conventional zone aggregation: contiguous conventional zones can be
> aggregated into a single larger file instead of multiple per-zone files.
> * File naming: the default file number file name can be switched to
> using the base-10 value of the file zone start sector.
> * File ownership: The owner UID and GID of zone files is by default 0
> (root) but can be changed to any valid UID/GID.
> * File access permissions: the default 640 access permissions can be changed.
> The mkzonefs tool is used to format zonefs. This tool is available
> on Github at: firstname.lastname@example.org:damien-lemoal/zonefs-tools.git.
> zonefs-tools includes a simple test suite which can be run against any
> zoned block device, including null_blk block devices created in zoned mode.
> Example: the following formats a host-managed SMR HDD with the
> conventional zone aggregation feature enabled.
> mkzonefs -o aggr_cnv /dev/sdX
> mount -t zonefs /dev/sdX /mnt
> ls -l /mnt/
> total 0
> dr-xr-xr-x 2 root root 0 Apr 11 13:00 cnv
> dr-xr-xr-x 2 root root 0 Apr 11 13:00 seq
> ls -l /mnt/cnv
> total 137363456
> -rw-rw---- 1 root root 140660178944 Apr 11 13:00 0
> ls -Fal -v /mnt/seq
> total 14511243264
> dr-xr-xr-x 2 root root 15942528 Jul 10 11:53 ./
> drwxr-xr-x 4 root root 1152 Jul 10 11:53 ../
> -rw-r----- 1 root root 0 Jul 10 11:53 0
> -rw-r----- 1 root root 33554432 Jul 10 13:43 1
> -rw-r----- 1 root root 0 Jul 10 11:53 2
> -rw-r----- 1 root root 0 Jul 10 11:53 3
> The aggregated conventional zone file can be used as a regular file.
> Operations such as the following work.
> mkfs.ext4 /mnt/cnv/0
> mount -o loop /mnt/cnv/0 /data
> Contains contributions from Johannes Thumshirn <email@example.com>
> and Christoph Hellwig <firstname.lastname@example.org>.
> Signed-off-by: Damien Le Moal <email@example.com>
> Changes from v2:
> * Addressed comments from Darrick: Typo, added checksum to super block,
> enhanced checks of the super block fields validity (used reserved bytes
> and unknown features bits)
> * Rebased on XFS tree iomap-for-next branch
> Changes from v1:
> * Rebased on latest iomap branch iomap-5.4-merge of XFS tree at
> * Addressed all comments from Dave Chinner and others
> MAINTAINERS | 10 +
> fs/Kconfig | 2 +
> fs/Makefile | 1 +
> fs/zonefs/Kconfig | 9 +
> fs/zonefs/Makefile | 4 +
> fs/zonefs/super.c | 1083 ++++++++++++++++++++++++++++++++++++
> fs/zonefs/zonefs.h | 177 ++++++
> include/uapi/linux/magic.h | 1 +
> 8 files changed, 1287 insertions(+)
> create mode 100644 fs/zonefs/Kconfig
> create mode 100644 fs/zonefs/Makefile
> create mode 100644 fs/zonefs/super.c
> create mode 100644 fs/zonefs/zonefs.h
[ .. ]
> @@ -261,6 +262,7 @@ source "fs/romfs/Kconfig"
> source "fs/pstore/Kconfig"
> source "fs/sysv/Kconfig"
> source "fs/ufs/Kconfig"
> +source "fs/zonefs/Kconfig"
> endif # MISC_FILESYSTEMS
> diff --git a/fs/Makefile b/fs/Makefile
> index d60089fd689b..7d3c90e1ad79 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -130,3 +130,4 @@ obj-$(CONFIG_F2FS_FS) += f2fs/
> obj-$(CONFIG_CEPH_FS) += ceph/
> obj-$(CONFIG_PSTORE) += pstore/
> obj-$(CONFIG_EFIVAR_FS) += efivarfs/
> +obj-$(CONFIG_ZONEFS_FS) += zonefs/
> diff --git a/fs/zonefs/Kconfig b/fs/zonefs/Kconfig
> new file mode 100644
> index 000000000000..6490547e9763
> --- /dev/null
> +++ b/fs/zonefs/Kconfig
> @@ -0,0 +1,9 @@
> +config ZONEFS_FS
> + tristate "zonefs filesystem support"
> + depends on BLOCK
> + depends on BLK_DEV_ZONED
> + help
> + zonefs is a simple File System which exposes zones of a zoned block
> + device as files.
> + If unsure, say N.
> diff --git a/fs/zonefs/Makefile b/fs/zonefs/Makefile
> new file mode 100644
> index 000000000000..75a380aa1ae1
> --- /dev/null
> +++ b/fs/zonefs/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_ZONEFS_FS) += zonefs.o
> +zonefs-y := super.o
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> new file mode 100644
> index 000000000000..5521c21fd34b
> --- /dev/null
> +++ b/fs/zonefs/super.c
[ .. ]
That whole thing looks good to me (with my limited fs skills :-),
however, some things I'd like to have clarified:
- zone state handling:
While you do have some handling for offline zones, I'm missing the
same handling during normal I/O. Surely a zone can go offline via other means
(like the admin calling nasty user-space programs), which then would
result in an I/O error in the filesystem.
Shouldn't we handle this case when doing error handling?
I.e., shouldn't we look at the zone state when doing a REPORT ZONES, and
update it if required?
Similarly: How do we present zones which are not accessible? Will they
still show up in the directory? I think they should, but we should be
returning an error to userspace like EPERM or somesuch.
- zone sizes:
From what I've seen, sequential zones can be appended to, i.e. they'll
start off at 0 and will increase in size. Conventional zones, OTOH,
apparently always have a fixed size. Is that correct?
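
To make the zone state question above more concrete, here is a minimal userspace sketch of the behaviour being asked about: keep a cached per-zone condition, refresh it from a zone report in the error path, and reject accesses to inaccessible zones up front. Everything in it is hypothetical: the `zone_cond` enum is a simplified stand-in for the conditions a zone report returns, the helper names are made up, and -EPERM is only the "EPERM or somesuch" placeholder from above, not a settled choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone conditions; names are illustrative only. */
enum zone_cond {
	ZONE_COND_OK,       /* any readable and writable state */
	ZONE_COND_READONLY, /* reads allowed, writes rejected */
	ZONE_COND_OFFLINE,  /* no access at all */
};

/*
 * Decide whether an access may proceed for a zone in the given condition.
 * Returns 0 if allowed, a negative errno otherwise. -EPERM is an
 * assumption taken from the suggestion in this mail.
 */
static int zone_access_check(enum zone_cond cond, bool is_write)
{
	switch (cond) {
	case ZONE_COND_OFFLINE:
		return -EPERM;
	case ZONE_COND_READONLY:
		return is_write ? -EPERM : 0;
	case ZONE_COND_OK:
		return 0;
	}
	return -EINVAL; /* unknown condition value */
}

/*
 * Error-path sketch: after a failed I/O, update the cached condition
 * from a fresh zone report so that later accesses are rejected before
 * they reach the device. Returns true if the cached state changed.
 */
static bool zone_refresh_cond(enum zone_cond *cached, enum zone_cond reported)
{
	assert(cached != NULL);
	if (*cached == reported)
		return false;
	*cached = reported;
	return true;
}
```

With this model the file for an offline zone would still be listed in its directory, but both reads and writes to it would fail with the chosen error code.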
Dr. Hannes Reinecke Teamlead Storage & Networking
firstname.lastname@example.org +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer
Thread overview: 5+ messages
2019-08-21 7:03 [PATCH V3] fs: New zonefs file system Damien Le Moal
2019-08-21 14:58 ` Darrick J. Wong
2019-08-22 1:47 ` Damien Le Moal
2019-08-23 10:12 ` Hannes Reinecke [this message]
2019-08-26 4:32 ` Damien Le Moal