linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hannes Reinecke <hare@suse.de>
To: Damien Le Moal <damien.lemoal@wdc.com>,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	Christoph Hellwig <hch@lst.de>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dave Chinner <david@fromorbit.com>,
	"Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Matias Bjorling <matias.bjorling@wdc.com>
Subject: Re: [PATCH V3] fs: New zonefs file system
Date: Fri, 23 Aug 2019 12:12:09 +0200	[thread overview]
Message-ID: <5b52c0d5-fc9c-f671-a31f-7b828c767788@suse.de> (raw)
In-Reply-To: <20190821070308.28665-1-damien.lemoal@wdc.com>

On 8/21/19 9:03 AM, Damien Le Moal wrote:
> zonefs is a very simple file system exposing each zone of a zoned
> block device as a file. zonefs is in fact closer to a raw block device
> access interface than to a full feature POSIX file system.
> 
> The goal of zonefs is to simplify implementation of zoned block device
> raw access by applications by allowing switching to the well known POSIX
> file API rather than relying on direct block device file ioctls and
> read/write. Zonefs, for instance, greatly simplifies the implementation
> of LSM (log-structured merge) tree structures (such as used in RocksDB
> and LevelDB) on zoned block devices by allowing SSTables to be stored in
> a zone file similarly to a regular file system architecture, hence
> reducing the amount of change needed in the application.
> 
> Zonefs on-disk metadata is reduced to a super block to store a magic
> number, a uuid and optional features flags and values. On mount, zonefs
> uses blkdev_report_zones() to obtain the device zone configuration and
> populates the mount point with a static file tree solely based on this
> information. E.g. file sizes come from zone write pointer offset managed
> by the device itself.
> 
> The zone files created on mount have the following characteristics.
> 1) Files representing zones of the same type are grouped together
>    under a common directory:
>   * For conventional zones, the directory "cnv" is used.
>   * For sequential write zones, the directory "seq" is used.
>   These two directories are the only directories that exist in zonefs.
>   Users cannot create other directories and cannot rename nor delete
>   the "cnv" and "seq" directories.
> 2) The name of zone files is by default the number of the file within
>    the zone type directory, in order of increasing zone start sector.
> 3) The size of conventional zone files is fixed to the device zone size.
>    Conventional zone files cannot be truncated.
> 4) The size of sequential zone files represent the file zone write
>    pointer position relative to the zone start sector. Truncating these
>    files is allowed only down to 0, in wich case, the zone is reset to
>    rewind the file zone write pointer position to the start of the zone.
> 5) All read and write operations to files are not allowed beyond the
>    file zone size. Any access exceeding the zone size is failed with
>    the -EFBIG error.
> 6) Creating, deleting, renaming or modifying any attribute of files
>    and directories is not allowed. The only exception being the file
>    size of sequential zone files which can be modified by write
>    operations or truncation to 0.
> 
> Several optional features of zonefs can be enabled at format time.
> * Conventional zone aggregation: contiguous conventional zones can be
>   agregated into a single larger file instead of multiple per-zone
>   files.
> * File naming: the default file number file name can be switched to
>   using the base-10 value of the file zone start sector.
> * File ownership: The owner UID and GID of zone files is by default 0
>   (root) but can be changed to any valid UID/GID.
> * File access permissions: the default 640 access permissions can be
>   changed.
> 
> The mkzonefs tool is used to format zonefs. This tool is available
> on Github at: git@github.com:damien-lemoal/zonefs-tools.git.
> zonefs-tools includes a simple test suite which can be run against any
> zoned block device, including null_blk block device created with zoned
> mode.
> 
> Example: the following formats a host-managed SMR HDD with the
> conventional zone aggregation feature enabled.
> 
> mkzonefs -o aggr_cnv /dev/sdX
> mount -t zonefs /dev/sdX /mnt
> ls -l /mnt/
> total 0
> dr-xr-xr-x 2 root root 0 Apr 11 13:00 cnv
> dr-xr-xr-x 2 root root 0 Apr 11 13:00 seq
> 
> ls -l /mnt/cnv
> total 137363456
> -rw-rw---- 1 root root 140660178944 Apr 11 13:00 0
> 
> ls -Fal -v /mnt/seq
> total 14511243264
> dr-xr-xr-x 2 root root 15942528 Jul 10 11:53 ./
> drwxr-xr-x 4 root root     1152 Jul 10 11:53 ../
> -rw-r----- 1 root root        0 Jul 10 11:53 0
> -rw-r----- 1 root root 33554432 Jul 10 13:43 1
> -rw-r----- 1 root root        0 Jul 10 11:53 2
> -rw-r----- 1 root root        0 Jul 10 11:53 3
> ...
> 
> The aggregated conventional zone file can be used as a regular file.
> Operations such as the following work.
> 
> mkfs.ext4 /mnt/cnv/0
> mount -o loop /mnt/cnv/0 /data
> 
> Contains contributions from Johannes Thumshirn <jthumshirn@suse.de>
> and Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
> Changes from v2:
> * Addressed comments from Darrick: Typo, added checksum to super block,
>   enhance cheks of the super block fields validity (used reserved bytes
>   and unknown features bits)
> * Rebased on XFS tree iomap-for-next branch
> 
> Changes from v1:
> * Rebased on latest iomap branch iomap-5.4-merge of XFS tree at
>   git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git
> * Addressed all comments from Dave Chinner and others
> 
>  MAINTAINERS                |   10 +
>  fs/Kconfig                 |    2 +
>  fs/Makefile                |    1 +
>  fs/zonefs/Kconfig          |    9 +
>  fs/zonefs/Makefile         |    4 +
>  fs/zonefs/super.c          | 1083 ++++++++++++++++++++++++++++++++++++
>  fs/zonefs/zonefs.h         |  177 ++++++
>  include/uapi/linux/magic.h |    1 +
>  8 files changed, 1287 insertions(+)
>  create mode 100644 fs/zonefs/Kconfig
>  create mode 100644 fs/zonefs/Makefile
>  create mode 100644 fs/zonefs/super.c
>  create mode 100644 fs/zonefs/zonefs.h
> 
[ .. ]
> @@ -261,6 +262,7 @@ source "fs/romfs/Kconfig"
>  source "fs/pstore/Kconfig"
>  source "fs/sysv/Kconfig"
>  source "fs/ufs/Kconfig"
> +source "fs/ufs/Kconfig"
>  
>  endif # MISC_FILESYSTEMS
>  
Hmm?
Duplicate line?

> diff --git a/fs/Makefile b/fs/Makefile
> index d60089fd689b..7d3c90e1ad79 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -130,3 +130,4 @@ obj-$(CONFIG_F2FS_FS)		+= f2fs/
>  obj-$(CONFIG_CEPH_FS)		+= ceph/
>  obj-$(CONFIG_PSTORE)		+= pstore/
>  obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
> +obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
> diff --git a/fs/zonefs/Kconfig b/fs/zonefs/Kconfig
> new file mode 100644
> index 000000000000..6490547e9763
> --- /dev/null
> +++ b/fs/zonefs/Kconfig
> @@ -0,0 +1,9 @@
> +config ZONEFS_FS
> +	tristate "zonefs filesystem support"
> +	depends on BLOCK
> +	depends on BLK_DEV_ZONED
> +	help
> +	  zonefs is a simple File System which exposes zones of a zoned block
> +	  device as files.
> +
> +	  If unsure, say N.
> diff --git a/fs/zonefs/Makefile b/fs/zonefs/Makefile
> new file mode 100644
> index 000000000000..75a380aa1ae1
> --- /dev/null
> +++ b/fs/zonefs/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_ZONEFS_FS) += zonefs.o
> +
> +zonefs-y	:= super.o
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> new file mode 100644
> index 000000000000..5521c21fd34b
> --- /dev/null
> +++ b/fs/zonefs/super.c
[ .. ]

That whole thing looks good to me (with my limited fs skills :-),
however, some things I'd like to have clarified:

- zone state handling:
While you do have some handling for offline zones, I'm missing a
handling during normal I/O. Surely a zone can go offline via other means
(like the admin calling nasty user-space programs), which then would
result in an I/O error in the filesystem.
Shouldn't we handle this case when doing error handling?
IE shouldn't we look at the zone state when doing a REPORT ZONES, and
update it if required?
Similarly: How do we present zones which are not accessible? Will they
still show up in the directory? I think they should, but we should be
returning an error to userspace like EPERM or somesuch.

- zone sizes:
From what I've seen sequential zones can be appended to, ie they'll
start off at 0 and will increase in size. Conventional zones, OTOH,
apparently always have a fixed size. Is that correct?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

  parent reply	other threads:[~2019-08-23 10:12 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-21  7:03 [PATCH V3] fs: New zonefs file system Damien Le Moal
2019-08-21 14:58 ` Darrick J. Wong
2019-08-22  1:47   ` Damien Le Moal
2019-08-23 10:12 ` Hannes Reinecke [this message]
2019-08-26  4:32   ` Damien Le Moal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5b52c0d5-fc9c-f671-a31f-7b828c767788@suse.de \
    --to=hare@suse.de \
    --cc=damien.lemoal@wdc.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=jthumshirn@suse.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=matias.bjorling@wdc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).