From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Hannes Reinecke <hare@suse.de>
Subject: Re: [PATCH RFC] fs: New zonefs file system
Date: Tue, 16 Jul 2019 11:21:45 +0000
Message-ID: <BYAPR04MB5816B0550DE134326E782447E7CE0@BYAPR04MB5816.namprd04.prod.outlook.com>
In-Reply-To: <20190715011935.GM7689@dread.disaster.area>

Dave,

On 2019/07/15 10:20, Dave Chinner wrote:
> Just a few quick things as I read through this to see how it uses
> iomap....
> 
> On Fri, Jul 12, 2019 at 12:00:17PM +0900, Damien Le Moal wrote:
>> +static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>> +			      unsigned int flags, struct iomap *iomap)
>> +{

[...]

>> +	/* An IO cannot exceed the zone size */
>> +	if (offset >= max_isize)
>> +		return -EFBIG;
> 
> So a write() call that is for a length longer than max_isize is
> going to end up being a short write? i.e. iomap_apply() will loop
> mapping the inode until either we reach the end of the user write
> or we hit max_isize?
> 
> How is userspace supposed to tell the difference between a short
> write and a write that overruns max_isize?

offset >= max_isize is already checked when zonefs_file_write_iter() is called,
and the write size is truncated in that function too, so a write never goes
beyond the zone size. The check here is thus probably redundant and not needed.
I left it in temporarily to make sure I was using iomap correctly and to catch
bugs.

>> +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
>> +			     struct inode *inode, loff_t offset)
>> +{
>> +	if (offset >= wpc->iomap.offset &&
>> +	    offset < wpc->iomap.offset + wpc->iomap.length)
>> +		return 0;
>> +
>> +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
>> +	return zonefs_iomap_begin(inode, offset, INT_MAX, 0, &wpc->iomap);
> 
> Why is the write length set to INT_MAX here? What happens when we
> get a zone that is larger than 2GB? i.e. the length parameter is a
> loff_t, not an int....

Yes indeed. Since no access can go beyond the file size,
zonefs_file_max_size(inode) is the right value here I think, even though
LLONG_MAX would work too.
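
Something like this (untested sketch, keeping the existing helper names):

	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
	/* A mapping never needs to extend beyond the zone (max file) size */
	return zonefs_iomap_begin(inode, offset, zonefs_file_max_size(inode),
				  0, &wpc->iomap);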

>> +static int zonefs_truncate_seqfile(struct inode *inode)
>> +{
>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> +	int ret;
>> +
>> +	/* Serialize against page faults */
>> +	down_write(&zi->i_mmap_sem);
>> +
>> +	ret = blkdev_reset_zones(inode->i_sb->s_bdev,
>> +				 zonefs_file_addr(inode) >> SECTOR_SHIFT,
>> +				 zonefs_file_max_size(inode) >> SECTOR_SHIFT,
>> +				 GFP_KERNEL);
> 
> Not sure GFP_KERNEL is safe here. This is called holding a
> filesystem lock here, so it's not immediately clear to me if this
> can deadlock through memory reclaim or not...

Conventional zones can be written with buffered I/O and mmap, so recursing into
the FS on a memory allocation while holding i_mmap_sem is a possibility I
think. I changed this to GFP_NOFS.
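
That is, the reset call simply becomes (only the gfp flag changes):

	/*
	 * The zone reset is done with i_mmap_sem (and i_rwsem) held, so the
	 * allocation must not recurse into the FS through memory reclaim.
	 */
	ret = blkdev_reset_zones(inode->i_sb->s_bdev,
				 zonefs_file_addr(inode) >> SECTOR_SHIFT,
				 zonefs_file_max_size(inode) >> SECTOR_SHIFT,
				 GFP_NOFS);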


>> +static int zonefs_inode_setattr(struct dentry *dentry, struct iattr *iattr)
>> +{
>> +	struct inode *inode = d_inode(dentry);
>> +	int ret;
>> +
>> +	ret = setattr_prepare(dentry, iattr);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if ((iattr->ia_valid & ATTR_UID &&
>> +	     !uid_eq(iattr->ia_uid, inode->i_uid)) ||
>> +	    (iattr->ia_valid & ATTR_GID &&
>> +	     !gid_eq(iattr->ia_gid, inode->i_gid))) {
>> +		ret = dquot_transfer(inode, iattr);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	if (iattr->ia_valid & ATTR_SIZE) {
>> +		/* The size of conventional zone files cannot be changed */
>> +		if (zonefs_file_is_conv(inode))
>> +			return -EPERM;
>> +
>> +		/*
>> +		 * For sequential zone files, we can only allow truncating to
>> +		 * 0 size which is equivalent to a zone reset.
>> +		 */
>> +		if (iattr->ia_size != 0)
>> +			return -EPERM;
>> +
>> +		ret = zonefs_truncate_seqfile(inode);
>> +		if (ret)
>> +			return ret;
> 
> Ok, so we are calling zonefs_truncate_seqfile() holding the i_rwsem
> as well. That does tend to imply GFP_NOFS should probably be used
> for the blkdev_reset_zones() call.

Yes. Fixed. Thanks.

>> +/*
>> + * Open a file.
>> + */
>> +static int zonefs_file_open(struct inode *inode, struct file *file)
>> +{
>> +	/*
>> +	 * Note: here we can do an explicit open of the file zone,
>> +	 * on the first open of the inode. The explicit close can be
>> +	 * done on the last release (close) call for the inode.
>> +	 */
>> +
>> +	return generic_file_open(inode, file);
>> +}
> 
> Why is a wrapper needed for this?

It is not needed for now. As the comment says, we may want a wrapper like this
in the future to do a device-level zone open/close in sync with file
open/release. This is not useful for SMR HDDs but probably will be for NVMe
zoned namespace devices.

>> +static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
>> +			     int datasync)
>> +{
>> +	struct inode *inode = file_inode(file);
>> +	int ret;
>> +
>> +	/*
>> +	 * Since only direct writes are allowed in sequential files, we only
>> +	 * need a device flush for these files.
>> +	 */
>> +	if (zonefs_file_is_seq(inode))
>> +		goto flush;
>> +
>> +	ret = file_write_and_wait_range(file, start, end);
>> +	if (ret == 0)
>> +		ret = file_check_and_advance_wb_err(file);
>> +	if (ret)
>> +		return ret;
> 
>> +
>> +flush:
>> +	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
> 
> The goto can be avoided in this case simply:
> 
> 	if (zonefs_file_is_conv(inode)) {
> 		/* do flush */
> 	}
> 	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);

Good point. Fixed.
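
For reference, the reworked function looks something like this (untested):

static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
			     int datasync)
{
	struct inode *inode = file_inode(file);
	int ret;

	/*
	 * Only conventional zone files can have dirty pages in the page
	 * cache, so only those need writeback before the device cache flush.
	 */
	if (zonefs_file_is_conv(inode)) {
		ret = file_write_and_wait_range(file, start, end);
		if (ret == 0)
			ret = file_check_and_advance_wb_err(file);
		if (ret)
			return ret;
	}

	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
}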

>> +static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
>> +	loff_t max_pos = zonefs_file_max_size(inode);
>> +	size_t count;
>> +	ssize_t ret = 0;
>> +
>> +	/*
>> +	 * Check that the read operation does not go beyond the maximum
>> +	 * file size.
>> +	 */
>> +	if (iocb->ki_pos >= zonefs_file_max_size(inode))
>> +		return -EFBIG;
>> +
>> +	/*
>> +	 * For sequential zones, limit reads to written data.
>> +	 */
>> +	if (zonefs_file_is_seq(inode))
>> +		max_pos = i_size_read(inode);
>> +	if (iocb->ki_pos >= max_pos)
>> +		return 0;
> 
> Isn't this true for both types of zone at this point? i.e. at this
> point:
> 
> 	max_pos = i_size_read(inode);
> 	if (iocb->ki_pos >= max_pos)
> 		return 0;
> 
> because i_size is either the zonefs_file_max_size() for conventional
> zones (which we've already checked) or it's the write pointer for
> a sequential zone. i.e. it's the max position for either case.

Yes, correct. I fixed it.

> 
>> +	iov_iter_truncate(to, max_pos - iocb->ki_pos);
>> +	count = iov_iter_count(to);
>> +	if (!count)
>> +		return 0;
> 
> The iov_iter should never be zero length here, because that implies
> the position was >= max_pos and that will be caught by the above
> checks...

Absolutely. Fixed.

> 
>> +	/* Direct IO reads must be aligned to device physical sector size */
>> +	if ((iocb->ki_flags & IOCB_DIRECT) &&
>> +	    ((iocb->ki_pos | count) & sbi->s_blocksize_mask))
>> +		return -EINVAL;
>> +
>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>> +		if (!inode_trylock_shared(inode))
>> +			return -EAGAIN;
>> +	} else {
>> +		inode_lock_shared(inode);
>> +	}
> 
> IIUC, write IO completion takes the inode lock to serialise file
> size updates for sequential zones. In that case, shouldn't this lock
> be taken before we do the EOF checks above?

Yes. Fixed.
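
With the read path fixes above, the checks at the top of
zonefs_file_read_iter() end up looking something like this (untested sketch;
out_unlock is simply a label dropping the shared lock before returning ret):

	/* Reads can never go beyond the zone size */
	if (iocb->ki_pos >= zonefs_file_max_size(inode))
		return -EFBIG;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
			return -EAGAIN;
	} else {
		inode_lock_shared(inode);
	}

	/*
	 * i_size is the zone size for conventional files and the write
	 * pointer position for sequential files, so it is the read limit in
	 * both cases. Check it with i_rwsem held so that the check is
	 * serialized against IO completions updating the file size.
	 */
	max_pos = i_size_read(inode);
	if (iocb->ki_pos >= max_pos) {
		ret = 0;
		goto out_unlock;
	}
	iov_iter_truncate(to, max_pos - iocb->ki_pos);

	/* Direct IO reads must be aligned to the device physical sector size */
	if ((iocb->ki_flags & IOCB_DIRECT) &&
	    ((iocb->ki_pos | iov_iter_count(to)) & sbi->s_blocksize_mask)) {
		ret = -EINVAL;
		goto out_unlock;
	}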

>> +/*
>> + * We got a write error: get the sequenial zone information from the device to
>> + * figure out where the zone write pointer is and verify the inode size against
>> + * it.
>> + */
>> +static int zonefs_write_failed(struct inode *inode, int error)
>> +{
>> +	struct super_block *sb = inode->i_sb;
>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> +	sector_t sector = zi->i_addr >> SECTOR_SHIFT;
>> +	unsigned int noio_flag;
>> +	struct blk_zone zone;
>> +	int n = 1, ret;
>> +
>> +	zonefs_warn(sb, "Updating inode zone %llu info\n", sector);
>> +
>> +	noio_flag = memalloc_noio_save();
>> +	ret = blkdev_report_zones(sb->s_bdev, sector, &zone, &n);
>> +	memalloc_noio_restore(noio_flag);
> 
> What deadlock does the memalloc_noio_save() avoid? There should be a
> comment explaining what problem memalloc_noio_save() avoids
> everywhere it is used like this. If it isn't safe to do GFP_KERNEL
> allocations here under the i_rwsem, then why would it be safe to
> do GFP_KERNEL allocations in the truncate code under the i_rwsem?

Similarly to the truncate (zone reset) code, GFP_NOFS is safer here. So I
changed the call to memalloc_nofs_save/restore and added a comment.
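
That is (also renaming the local flag variable):

	/*
	 * The zone report is done with i_rwsem held from the write path, so
	 * allocations done by blkdev_report_zones() must not recurse into
	 * the FS through memory reclaim.
	 */
	nofs_flag = memalloc_nofs_save();
	ret = blkdev_report_zones(sb->s_bdev, sector, &zone, &n);
	memalloc_nofs_restore(nofs_flag);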

>> +
>> +	if (!n)
>> +		ret = -EIO;
>> +	if (ret) {
>> +		zonefs_err(sb, "Get zone %llu report failed %d\n",
>> +			   sector, ret);
>> +		return ret;
>> +	}
>> +
>> +	zi->i_wpoffset = (zone.wp - zone.start) << SECTOR_SHIFT;
>> +	if (i_size_read(inode) != zi->i_wpoffset) {
>> +		i_size_write(inode, zi->i_wpoffset);
>> +		truncate_pagecache(inode, zi->i_wpoffset);
>> +	}
> 
> This looks .... dangerous. If the write pointer was advanced, but
> the data wasn't written properly, this causes stale data exposure on
> write failure. i.e. it's not failsafe.

It is safe because the device will return zeros or the format pattern for sector
reads that are beyond the zone write pointer. No stale data will be exposed, as
long as the device can be trusted to correctly handle reads beyond the write
pointer.

> 
> I suspect that on a sequential zone write failure and the write
> pointer does not equal the offset of the write, we should consider
> the zone corrupt. Also, this is for direct IO completion for
> sequential writes, yes? So what does the page cache truncation
> acheive given that only direct IO writes are allowed to these files?

Yes, the page cache truncation can be removed.

>> +static int zonefs_dio_seqwrite_end_io(struct kiocb *iocb, ssize_t size,
>> +				      unsigned int flags)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	int ret;
>> +
>> +	inode_lock(inode);
>> +	if (size < 0)
>> +		ret = zonefs_write_failed(inode, size);
>> +	else
>> +		ret = zonefs_update_size(inode, iocb->ki_pos + size);
>> +	inode_unlock(inode);
>> +	return ret;
> 
> Shouldn't this have a check that it's being called on a sequential
> zone inode?

Yes, a check can be added, but since the only caller is
zonefs_file_dio_aio_write(), after it has already checked that the inode is not
a conventional zone file, I thought it was not needed.

>> +static ssize_t zonefs_file_dio_aio_write(struct kiocb *iocb,
>> +					 struct iov_iter *from)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> +	size_t count;
>> +
>> +	/*
>> +	 * The size of conventional zone files is fixed to the zone size.
>> +	 * So only direct writes to sequential zones need adjusting the
>> +	 * inode size on IO completion.
>> +	 */
>> +	if (zonefs_file_is_conv(inode))
>> +		return iomap_dio_rw(iocb, from, &zonefs_iomap_ops, NULL);
>> +
>> +	/* Enforce append only sequential writes */
>> +	count = iov_iter_count(from);
>> +	if (iocb->ki_pos != zi->i_wpoffset) {
>> +		zonefs_err(inode->i_sb,
>> +			   "Unaligned write at %llu + %zu (wp %llu)\n",
>> +			   iocb->ki_pos, count, zi->i_wpoffset);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (is_sync_kiocb(iocb)) {
>> +		/*
>> +		 * Don't use the end_io callback for synchronous iocbs,
>> +		 * as we'd deadlock on i_rwsem.  Instead perform the same
>> +		 * actions manually here.
>> +		 */
>> +		count = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, NULL);
>> +		if (count < 0)
>> +			return zonefs_write_failed(inode, count);
>> +		zonefs_update_size(inode, iocb->ki_pos);
>> +		return count;
> 
> Urk. This locking is nasty, and doesn't avoid the problem.....

Do you mean the stale data exposure problem? As said above, the device
guarantees that stale data will not be returned for reads past the write
pointer. Old data (data written before the last zone reset) is not readable.

> 
>> +	}
>> +
>> +	return iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
>> +			    zonefs_dio_seqwrite_end_io);
> 
> ... because I think this can deadlock.
> 
> AFAIA, the rule is that IO completion callbacks cannot take
> a lock that is held across IO submission. The reason is that
> IO can complete so fast that the submission code runs the
> completion. i.e. iomap_dio_rw() can be the function that calls
> iomap_dio_complete() and runs the IO completion.
> 
> In which case, this will deadlock because we are already holding the
> i_rwsem and the end_io completion will try to take it again.

Arrg! Yes! Beginner's mistake. Will rework this.
> 
> 
> 
>> +}
>> +
>> +static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
>> +	size_t count;
>> +	ssize_t ret;
>> +
>> +	/*
>> +	 * Check that the read operation does not go beyond the file
>> +	 * zone boundary.
>> +	 */
>> +	if (iocb->ki_pos >= zonefs_file_max_size(inode))
>> +		return -EFBIG;
>> +	iov_iter_truncate(from, zonefs_file_max_size(inode) - iocb->ki_pos);
>> +	count = iov_iter_count(from);
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	/*
>> +	 * Direct IO writes are mandatory for sequential zones so that write IO
>> +	 * order is preserved. The direct writes also must be aligned to
>> +	 * device physical sector size.
>> +	 */
>> +	if (iocb->ki_flags & IOCB_DIRECT) {
>> +		if ((iocb->ki_pos | count) & sbi->s_blocksize_mask)
>> +			return -EINVAL;
>> +	} else {
>> +		if (zonefs_file_is_seq(inode))
>> +			return -EOPNOTSUPP;
> 
> zonefs_iomap_begin() returns -EIO in this case and issues a warning.
> This seems somewhat inconsistent....

Yes. The check in zonefs_iomap_begin() is probably redundant. Will remove it.

>> +	}
>> +
>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>> +		if (!inode_trylock(inode))
>> +			return -EAGAIN;
>> +	} else {
>> +		inode_lock(inode);
>> +	}
>> +
>> +	ret = generic_write_checks(iocb, from);
>> +	if (ret <= 0)
>> +		goto out;
> 
> Shouldn't this be done before the iov_iter is truncated?

Yes. Will move it.

> 
>> +
>> +	if (iocb->ki_flags & IOCB_DIRECT)
>> +		ret = zonefs_file_dio_aio_write(iocb, from);
>> +	else
>> +		ret = iomap_file_buffered_write(iocb, from, &zonefs_iomap_ops);
>> +
>> +out:
>> +	inode_unlock(inode);
>> +
>> +	if (ret > 0 && (!(iocb->ki_flags & IOCB_DIRECT))) {
>> +		iocb->ki_pos += ret;
>> +		ret = generic_write_sync(iocb, ret);
>> +	}
> 
> Hmmm. The split of checks and doing stuff between direct IO and
> buffered IO seems a bit arbitrary. e.g. the "sequential zones can
> only do append writes" is in zonefs_file_dio_aio_write(), but we
> do a check that "sequential zones can only do direct IO" here.
> 
> And then we have the sync code that can only occur on buffered IO,
> which we don't have a wrapper function for but really should.  And I
> suspect that the locking is going to have to change here because of
> the direct IO issues, so maybe it would be best to split this up
> similar to the way XFS has two completely separate functions for the
> two paths....

Yes, very good idea. I will clean this up to clarify all cases.
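
Roughly, something like the sketch below (untested), with all the sequential
zone constraints and the end_io handling kept inside the direct IO path, which
then also has to take i_rwsem itself. The buffered write helper is new and its
name is made up:

static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
					  struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/* Only conventional zone files can be written with buffered IO */
	if (zonefs_file_is_seq(inode))
		return -EOPNOTSUPP;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock(inode))
			return -EAGAIN;
	} else {
		inode_lock(inode);
	}

	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		goto out;

	/* Writes can never go beyond the zone size */
	if (iocb->ki_pos >= zonefs_file_max_size(inode)) {
		ret = -EFBIG;
		goto out;
	}
	iov_iter_truncate(from, zonefs_file_max_size(inode) - iocb->ki_pos);

	ret = iomap_file_buffered_write(iocb, from, &zonefs_iomap_ops);
out:
	inode_unlock(inode);

	if (ret > 0) {
		iocb->ki_pos += ret;
		ret = generic_write_sync(iocb, ret);
	}

	return ret;
}

static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	if (iocb->ki_flags & IOCB_DIRECT)
		return zonefs_file_dio_aio_write(iocb, from);

	return zonefs_file_buffered_write(iocb, from);
}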

>> +static struct kmem_cache *zonefs_inode_cachep;
>> +
>> +static struct inode *zonefs_alloc_inode(struct super_block *sb)
>> +{
>> +	struct zonefs_inode_info *zi;
>> +
>> +	zi = kmem_cache_alloc(zonefs_inode_cachep, GFP_KERNEL);
>> +	if (!zi)
>> +		return NULL;
>> +
>> +	init_rwsem(&zi->i_mmap_sem);
>> +	inode_init_once(&zi->i_vnode);
>> +
>> +	return &zi->i_vnode;
>> +}
>> +
>> +static void zonefs_destroy_cb(struct rcu_head *head)
>> +{
>> +	struct inode *inode = container_of(head, struct inode, i_rcu);
>> +
>> +	kmem_cache_free(zonefs_inode_cachep, ZONEFS_I(inode));
>> +}
>> +
>> +static void zonefs_destroy_inode(struct inode *inode)
>> +{
>> +	call_rcu(&inode->i_rcu, zonefs_destroy_cb);
>> +}
> 
> If this is all the inode destructor is, then implement ->free_inode
> instead. i.e.
> 
> zonefs_free_inode(inode)
> {
> 	kmem_cache_free(zonefs_inode_cachep, ZONEFS_I(inode));
> }
> 
> and the VFS takes care of the RCU freeing of the inode.

Yes, I remember now that the RCU inode freeing code was made generic in the VFS
a while back. I will clean this up.

> 
>> +/*
>> + * File system stat.
>> + */
>> +static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf)
>> +{
>> +	struct super_block *sb = dentry->d_sb;
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> +	sector_t nr_sectors = sb->s_bdev->bd_part->nr_sects;
>> +	enum zonefs_ztype t;
>> +
>> +	buf->f_type = ZONEFS_MAGIC;
>> +	buf->f_bsize = dentry->d_sb->s_blocksize;
>> +	buf->f_namelen = ZONEFS_NAME_MAX;
>> +
>> +	buf->f_blocks = nr_sectors >> (sb->s_blocksize_bits - SECTOR_SHIFT);
>> +	buf->f_bfree = 0;
>> +	buf->f_bavail = 0;
>> +
>> +	buf->f_files = sbi->s_nr_zones[ZONEFS_ZTYPE_ALL] - 1;
>> +	for (t = ZONEFS_ZTYPE_ALL; t < ZONEFS_ZTYPE_MAX; t++) {
>> +		if (sbi->s_nr_zones[t])
>> +			buf->f_files++;
>> +	}
>> +	buf->f_ffree = 0;
>> +
>> +	/* buf->f_fsid = 0; uuid, see ext2 */
> 
> This doesn't tell me anything useful. Does it mean "we should use
> the uuid like ext2" or something else? is it a "TODO:" item?

Basically yes. Forgot to code it after too much copy-paste :)
Will fix that.
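
For reference, something along the lines of what ext2 does, assuming the
zonefs super block gains a UUID field (there is none in this RFC, so s_uuid
below is hypothetical):

	u64 fsid;

	fsid = le64_to_cpup((void *)sbi->s_uuid) ^
	       le64_to_cpup((void *)sbi->s_uuid + sizeof(u64));
	buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
	buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;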

>> +static char *zgroups_name[ZONEFS_ZTYPE_MAX] = {
>> +	NULL,
>> +	"cnv",
>> +	"seq"
>> +};
> 
> What's the reason for a NULL in the first entry?

This is for ZONEFS_ZTYPE_ALL, which does not have a directory. The NULL is only
there because the array is sized with ZONEFS_ZTYPE_MAX. This can be simplified.
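
E.g. with designated initializers the NULL entry goes away (and the array
shrinks further once ZONEFS_ZTYPE_ALL is removed, see below):

static const char *zgroups_name[ZONEFS_ZTYPE_MAX] = {
	[ZONEFS_ZTYPE_CNV] = "cnv",
	[ZONEFS_ZTYPE_SEQ] = "seq",
};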

> 
>> +
>> +/*
>> + * Create a zone group and populate it with zone files.
>> + */
>> +static int zonefs_create_zgroup(struct super_block *sb, struct blk_zone *zones,
>> +				enum zonefs_ztype type)
>> +{
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> +	struct blk_zone *zone, *next, *end;
>> +	char name[ZONEFS_NAME_MAX];
>> +	unsigned int nr_files = 0;
>> +	struct dentry *dir;
>> +
>> +	/* If the group is empty, nothing to do */
>> +	if (!sbi->s_nr_zones[type])
>> +		return 0;
>> +
>> +	dir = zonefs_create_inode(sb->s_root, zgroups_name[type], NULL);
>> +	if (!dir)
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * Note: The first zone contains the super block: skip it.
>> +	 */
>> +	end = zones + sbi->s_nr_zones[ZONEFS_ZTYPE_ALL];
>> +	for (zone = &zones[1]; zone < end; zone = next) {
>> +
>> +		next = zone + 1;
>> +		if (zonefs_zone_type(zone) != type)
>> +			continue;
>> +
>> +		/* Ignore offline zones */
>> +		if (zonefs_zone_offline(zone))
>> +			continue;
>> +
>> +		/*
>> +		 * For conventional zones, contiguous zones can be aggregated
>> +		 * together to form larger files.
>> +		 * Note that this overwrites the length of the first zone of
>> +		 * the set of contiguous zones aggregated together.
>> +		 * Only zones with the same condition can be agreggated so that
>> +		 * offline zones are excluded and readonly zones are aggregated
>> +		 * together into a read only file.
>> +		 */
>> +		if (type == ZONEFS_ZTYPE_CNV &&
>> +		    zonefs_has_feature(sbi, ZONEFS_F_AGRCNV)) {
>> +			for (; next < end; next++) {
>> +				if (zonefs_zone_type(next) != type ||
>> +				    next->cond != zone->cond)
>> +					break;
>> +				zone->len += next->len;
>> +			}
>> +		}
>> +
>> +		if (zonefs_has_feature(sbi, ZONEFS_F_STARTSECT_NAME))
>> +			/* Use zone start sector as file names */
>> +			snprintf(name, ZONEFS_NAME_MAX - 1, "%llu",
>> +				 zone->start);
>> +		else
>> +			/* Use file number as file names */
>> +			snprintf(name, ZONEFS_NAME_MAX - 1, "%u", nr_files);
>> +		nr_files++;
>> +
>> +		if (!zonefs_create_inode(dir, name, zone))
>> +			return -ENOMEM;
> 
> I guess this means partial setup due to failure needs to be torn
> down by the kill_super() code?

Yes. And the call to d_genocide(sb->s_root) does that unless I missed something.

>> +static struct blk_zone *zonefs_get_zone_info(struct super_block *sb)
>> +{
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> +	struct block_device *bdev = sb->s_bdev;
>> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
>> +	unsigned int i, n, nr_zones = 0;
>> +	struct blk_zone *zones, *zone;
>> +	sector_t sector = 0;
>> +	int ret;
>> +
>> +	sbi->s_blocksize_mask = sb->s_blocksize - 1;
>> +	sbi->s_nr_zones[ZONEFS_ZTYPE_ALL] = blkdev_nr_zones(bdev);
>> +	zones = kvcalloc(sbi->s_nr_zones[ZONEFS_ZTYPE_ALL],
>> +			 sizeof(struct blk_zone), GFP_KERNEL);
>> +	if (!zones)
>> +		return ERR_PTR(-ENOMEM);
> 
> Hmmm. That's a big allocation. That might be several megabytes for a
> typical 16TB SMR drive, right? It might be worth adding a comment
> indicating just how large this is, because it's somewhat unusual in
> kernel space, even for temporary storage.

Yes. A full zone report (64 B per zone) comes to 4-5 MB for a 15+ TB disk. I
will add a comment about this. Doing the big allocation to have all zones on
hand before creating the inodes makes the code simpler, but this can also be
done with partial reports in a loop. I will rewrite this code to avoid the big
temporary allocation.
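
The loop over partial reports would look something like this (untested sketch;
the chunk size macro is made up):

#define ZONEFS_REPORT_NR_ZONES	4096	/* 4096 * 64 B = 256 KB buffer */

	struct blk_zone *zones;
	unsigned int i, nr;
	sector_t sector = 0;
	int ret = 0;

	zones = kvcalloc(ZONEFS_REPORT_NR_ZONES, sizeof(struct blk_zone),
			 GFP_KERNEL);
	if (!zones)
		return -ENOMEM;

	while (sector < nr_sectors) {
		nr = ZONEFS_REPORT_NR_ZONES;
		ret = blkdev_report_zones(bdev, sector, zones, &nr);
		if (ret || !nr)
			break;

		for (i = 0; i < nr; i++) {
			/* Count zones[i] and create its inode here */
		}

		sector = zones[nr - 1].start + zones[nr - 1].len;
	}

	kvfree(zones);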

>> + * defined in linux/blkzoned.h, that is, BLK_ZONE_TYPE_SEQWRITE_REQ and
>> + * BLK_ZONE_TYPE_SEQWRITE_PREF.
>> + */
>> +enum zonefs_ztype {
>> +	ZONEFS_ZTYPE_ALL = 0,
>> +	ZONEFS_ZTYPE_CNV,
>> +	ZONEFS_ZTYPE_SEQ,
>> +	ZONEFS_ZTYPE_MAX,
>> +};
> 
> What is ZONEFS_ZTYPE_ALL supposed to be used for?

It is defined only to allow the declaration of sbi->s_nr_zones as an array of
ZONEFS_ZTYPE_MAX unsigned ints, so that we get 3 counters: the total number of
zones is always larger than or equal to the number of conventional zones plus
sequential zones. Initially, this made the code somewhat cleaner, but I realize
now that it is not really easier to understand :)

I will remove it and make things more obvious.
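
That is, simply:

enum zonefs_ztype {
	ZONEFS_ZTYPE_CNV,
	ZONEFS_ZTYPE_SEQ,
	ZONEFS_ZTYPE_MAX,
};

with sbi->s_nr_zones[] keeping only the per-type counts and the total number of
zones taken directly from blkdev_nr_zones(sb->s_bdev) where it is needed.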

>> +static inline bool zonefs_zone_offline(struct blk_zone *zone)
>> +{
>> +	return zone->cond == BLK_ZONE_COND_OFFLINE;
>> +}
>> +
>> +static inline bool zonefs_zone_readonly(struct blk_zone *zone)
>> +{
>> +	return zone->cond == BLK_ZONE_COND_READONLY;
>> +}
> 
> These should be block layer helpers as the operate on blk_zone,
> not zonefs structures.

Yes. Will fix that.
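
E.g. in include/linux/blkdev.h (the helper names below are tentative):

static inline bool blk_zone_is_offline(struct blk_zone *zone)
{
	return zone->cond == BLK_ZONE_COND_OFFLINE;
}

static inline bool blk_zone_is_readonly(struct blk_zone *zone)
{
	return zone->cond == BLK_ZONE_COND_READONLY;
}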

>> +/*
>> + * Address (byte offset) on disk of a file zone.
>> + */
>> +static inline loff_t zonefs_file_addr(struct inode *inode)
>> +{
>> +	return ZONEFS_I(inode)->i_addr;
>> +}
> 
> so it's a disk address, but it's encoded in bytes rather than sectors
> so that makes it an offset. That's kinda confusing coming from a
> filesystem that makes a clear distinction between these two things.

Yes. This simplifies the code a little, but it is not obvious. I will rework this.

>> +
>> +/*
>> + * Maximum possible size of a file (i.e. the zone size).
>> + */
>> +static inline loff_t zonefs_file_max_size(struct inode *inode)
>> +{
>> +	return ZONEFS_I(inode)->i_max_size;
>> +}
>> +
>> +/*
>> + * On-disk super block (block 0).
>> + */
>> +struct zonefs_super {
>> +
>> +	/* Magic number */
>> +	__le32		s_magic;		/*    4 */
>> +
>> +	/* Metadata version number */
>> +	__le32		s_version;		/*    8 */
>> +
>> +	/* Features */
>> +	__le64		s_features;		/*   16 */
> 
> On-disk version numbers are kinda redundant when you have
> fine grained feature fields to indicate the on-disk layout...

Yes, I guess. I was reserving the s_features field for optional things, hence
the version number for mandatory on-disk format changes. But I may as well use
the feature flags for everything, both mandatory format features and options.

>> +/*
>> + * Feature flags.
>> + */
>> +enum zonefs_features {
>> +	/*
>> +	 * Use a zone start sector value as file name.
>> +	 */
>> +	ZONEFS_F_STARTSECT_NAME,
>> +	/*
>> +	 * Aggregate contiguous conventional zones into a single file.
>> +	 */
>> +	ZONEFS_F_AGRCNV,
>> +	/*
>> +	 * Use super block specified UID for files instead of default.
>> +	 */
>> +	ZONEFS_F_UID,
>> +	/*
>> +	 * Use super block specified GID for files instead of default.
>> +	 */
>> +	ZONEFS_F_GID,
>> +	/*
>> +	 * Use super block specified file permissions instead of default 640.
>> +	 */
>> +	ZONEFS_F_PERM,
>> +};
> 
> Are these the on-disk feature bit definitions, or just used in
> memory? Or both?

These are used for both the on-disk super block and the in-memory sb info.

Thank you very much for the detailed review. I will rework this and repost.

Best regards.

-- 
Damien Le Moal
Western Digital Research

