From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v2 2/3] zonefs: open/close zone on file open/close
Date: Wed, 9 Sep 2020 12:50:07 +0000 [thread overview]
Message-ID: <CY4PR04MB3751CFEAF07FAFF3A2E48156E7260@CY4PR04MB3751.namprd04.prod.outlook.com> (raw)
In-Reply-To: <20200909102614.40585-3-johannes.thumshirn@wdc.com>
On 2020/09/09 19:26, Johannes Thumshirn wrote:
> NVMe Zoned Namespace introduced the concept of active zones, which are
> zones in the implicit open, explicit open or closed condition. Drives may
> have a limit on the number of zones that can be simultaneously active.
> This potential limitation translates into a risk for applications to see
> write IO errors due to this limit if the zone of a file being written to is
> not already active when a write request is issued.
>
> To avoid these potential errors, the zone of a file can explicitly be made
> active using an open zone command when the file is opened for the first
> time. If the zone open command succeeds, the application is then
> guaranteed that write requests can be processed. This indirect management
> of active zones relies on the maximum number of open zones of a drive,
> which is always lower than or equal to the maximum number of active zones.
>
> On the first open of a sequential zone file, send a REQ_OP_ZONE_OPEN
> command to the block device. Conversely, on the last release of a zone
> file, send a REQ_OP_ZONE_CLOSE command to the device if the zone is
> neither full nor empty.
>
> As truncating a zone file to 0 or max can deactivate a zone as well, we
> need to serialize against truncates and also be careful not to close a
> zone as the file may still be open for writing, e.g. the user called
> ftruncate(). If the zone file is not open and a process does a truncate(),
> then no close operation is needed.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/zonefs/super.c | 160 +++++++++++++++++++++++++++++++++++++++++++--
> fs/zonefs/zonefs.h | 10 +++
> 2 files changed, 166 insertions(+), 4 deletions(-)
>
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index dc828bd1210b..07717df2fac9 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -44,6 +44,80 @@ static inline int zonefs_zone_mgmt(struct inode *inode,
> return 0;
> }
>
> +static int zonefs_open_zone(struct inode *inode)
> +{
> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
> + struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> + int ret = 0;
> +
> + mutex_lock(&zi->i_truncate_mutex);
> +
> + zi->i_wr_refcnt++;
> + if (zi->i_wr_refcnt == 1) {
> +
> + if (atomic_inc_return(&sbi->s_open_zones) > sbi->s_max_open_zones) {
> + atomic_dec(&sbi->s_open_zones);
> + ret = -EBUSY;
> + goto unlock;
> + }
> +
> + if (i_size_read(inode) < zi->i_max_size) {
> + ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
> + if (ret) {
> + zi->i_wr_refcnt--;
> + atomic_dec(&sbi->s_open_zones);
> + goto unlock;
> + }
> + zi->i_flags |= ZONEFS_ZONE_OPEN;
> + }
> + }
> +
> +unlock:
> + mutex_unlock(&zi->i_truncate_mutex);
> +
> + return ret;
> +}
> +
> +static int zonefs_close_zone(struct inode *inode)
> +{
> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
> + int ret = 0;
> +
> + mutex_lock(&zi->i_truncate_mutex);
> +
> + zi->i_wr_refcnt--;
> + if (!zi->i_wr_refcnt) {
> + struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> +
> + if (zi->i_flags & ZONEFS_ZONE_OPEN) {
> + ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_CLOSE);
> + if (ret)
> + goto unlock;
> + zi->i_flags &= ~ZONEFS_ZONE_OPEN;
> + }
> +
> + atomic_dec(&sbi->s_open_zones);
> + }
> +
> +unlock:
> + mutex_unlock(&zi->i_truncate_mutex);
> +
> + return ret;
> +}
> +
> +static inline void zonefs_i_size_write(struct inode *inode, loff_t isize)
> +{
> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
> +
> + i_size_write(inode, isize);
> + /*
> + * A full zone is no longer open/active and does not need
> + * explicit closing.
> + */
> + if (isize >= zi->i_max_size)
> + zi->i_flags &= ~ZONEFS_ZONE_OPEN;
> +}
> +
> static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned int flags, struct iomap *iomap,
> struct iomap *srcmap)
> @@ -335,7 +409,7 @@ static int zonefs_io_error_cb(struct blk_zone *zone, unsigned int idx,
> * invalid data.
> */
> zonefs_update_stats(inode, data_size);
> - i_size_write(inode, data_size);
> + zonefs_i_size_write(inode, data_size);
> zi->i_wpoffset = data_size;
>
> return 0;
> @@ -421,6 +495,25 @@ static int zonefs_file_truncate(struct inode *inode, loff_t isize)
> if (ret)
> goto unlock;
>
> + /*
> + * If the mount option ZONEFS_MNTOPT_EXPLICIT_OPEN is set,
> + * take care of open zones.
> + */
> + if (zi->i_flags & ZONEFS_ZONE_OPEN) {
> + /*
> + * Truncating a zone to EMPTY or FULL is the equivalent of
> + * closing the zone. For a truncation to 0, we need to
> + * re-open the zone to ensure new writes can be processed.
> + * For a truncation to the maximum file size, the zone is
> + * closed and writes cannot be accepted anymore, so clear
> + * the open flag.
> + */
> + if (!isize)
> + ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
> + else
> + zi->i_flags &= ~ZONEFS_ZONE_OPEN;
> + }
> +
> zonefs_update_stats(inode, isize);
> truncate_setsize(inode, isize);
> zi->i_wpoffset = isize;
> @@ -599,7 +692,7 @@ static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
> mutex_lock(&zi->i_truncate_mutex);
> if (i_size_read(inode) < iocb->ki_pos + size) {
> zonefs_update_stats(inode, iocb->ki_pos + size);
> - i_size_write(inode, iocb->ki_pos + size);
> + zonefs_i_size_write(inode, iocb->ki_pos + size);
> }
> mutex_unlock(&zi->i_truncate_mutex);
> }
> @@ -880,8 +973,55 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> return ret;
> }
>
> +static inline bool zonefs_file_use_exp_open(struct inode *inode, struct file *file)
> +{
> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
> + struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> +
> + if (!(sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN))
> + return false;
> +
> + if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
> + return false;
> +
> + if (!(file->f_mode & FMODE_WRITE))
> + return false;
> +
> + return true;
> +}
> +
> +static int zonefs_file_open(struct inode *inode, struct file *file)
> +{
> + int ret;
> +
> + ret = generic_file_open(inode, file);
> + if (ret)
> + return ret;
> +
> + if (zonefs_file_use_exp_open(inode, file))
> + return zonefs_open_zone(inode);
> +
> + return 0;
> +}
> +
> +static int zonefs_file_release(struct inode *inode, struct file *file)
> +{
> + /*
> + * If we explicitly open a zone we must close it again as well, but the
> + * zone management operation can fail (either due to an IO error or as
> + * the zone has gone offline or read-only). Make sure we don't fail the
> + * close(2) for user-space.
> + */
> + if (zonefs_file_use_exp_open(inode, file))
> + if (zonefs_close_zone(inode))
> + zonefs_io_error(inode, false);
OK. But zonefs_io_error() needs to clear the ZONEFS_ZONE_OPEN flag if the zone
went read-only or offline. Otherwise the flag will be left set, and you will
also get an error on the next close if the zone already transitioned to
read-only or offline during IOs.
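Something along these lines, as a minimal standalone sketch of the suggested
behavior (model_inode, model_io_error and the zone_cond enum are simplified
stand-ins for illustration, not the actual kernel types or API):

```c
/*
 * Standalone model of the suggested fix: when an IO error reveals that a
 * zone transitioned to read-only or offline, the zone is no longer open on
 * the device, so the error handler must drop ZONEFS_ZONE_OPEN and release
 * the open-zone slot to keep the accounting consistent and avoid a failing
 * close(2) path later. Simplified stand-in types, not the kernel code.
 */
#include <assert.h>

#define ZONEFS_ZONE_OPEN (1 << 0)

enum zone_cond { ZONE_COND_ACTIVE, ZONE_COND_READONLY, ZONE_COND_OFFLINE };

struct model_inode {
	unsigned int i_flags;
	int open_zones;		/* stand-in for sbi->s_open_zones */
};

static void model_io_error(struct model_inode *zi, enum zone_cond cond)
{
	/* A read-only or offline zone cannot be open anymore: clear the
	 * flag and release the accounted open-zone slot. */
	if ((cond == ZONE_COND_READONLY || cond == ZONE_COND_OFFLINE) &&
	    (zi->i_flags & ZONEFS_ZONE_OPEN)) {
		zi->i_flags &= ~ZONEFS_ZONE_OPEN;
		zi->open_zones--;
	}
}
```

With that in place, zonefs_file_release() would find the flag already
cleared and skip the doomed REQ_OP_ZONE_CLOSE.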
> +
> + return 0;
> +}
> +
> static const struct file_operations zonefs_file_operations = {
> - .open = generic_file_open,
> + .open = zonefs_file_open,
> + .release = zonefs_file_release,
> .fsync = zonefs_file_fsync,
> .mmap = zonefs_file_mmap,
> .llseek = zonefs_file_llseek,
> @@ -905,6 +1045,7 @@ static struct inode *zonefs_alloc_inode(struct super_block *sb)
> inode_init_once(&zi->i_vnode);
> mutex_init(&zi->i_truncate_mutex);
> init_rwsem(&zi->i_mmap_sem);
> + zi->i_wr_refcnt = 0;
>
> return &zi->i_vnode;
> }
> @@ -955,7 +1096,7 @@ static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf)
>
> enum {
> Opt_errors_ro, Opt_errors_zro, Opt_errors_zol, Opt_errors_repair,
> - Opt_err,
> + Opt_explicit_open, Opt_err,
> };
>
> static const match_table_t tokens = {
> @@ -963,6 +1104,7 @@ static const match_table_t tokens = {
> { Opt_errors_zro, "errors=zone-ro"},
> { Opt_errors_zol, "errors=zone-offline"},
> { Opt_errors_repair, "errors=repair"},
> + { Opt_explicit_open, "explicit-open" },
> { Opt_err, NULL}
> };
>
> @@ -999,6 +1141,9 @@ static int zonefs_parse_options(struct super_block *sb, char *options)
> sbi->s_mount_opts &= ~ZONEFS_MNTOPT_ERRORS_MASK;
> sbi->s_mount_opts |= ZONEFS_MNTOPT_ERRORS_REPAIR;
> break;
> + case Opt_explicit_open:
> + sbi->s_mount_opts |= ZONEFS_MNTOPT_EXPLICIT_OPEN;
> + break;
> default:
> return -EINVAL;
> }
> @@ -1418,6 +1563,13 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
> sbi->s_gid = GLOBAL_ROOT_GID;
> sbi->s_perm = 0640;
> sbi->s_mount_opts = ZONEFS_MNTOPT_ERRORS_RO;
> + sbi->s_max_open_zones = bdev_max_open_zones(sb->s_bdev);
> + atomic_set(&sbi->s_open_zones, 0);
> + if (!sbi->s_max_open_zones &&
> + sbi->s_mount_opts & ZONEFS_MNTOPT_EXPLICIT_OPEN) {
> + zonefs_info(sb, "No open zones limit. Ignoring explicit_open mount option\n");
> + sbi->s_mount_opts &= ~ZONEFS_MNTOPT_EXPLICIT_OPEN;
> + }
>
> ret = zonefs_read_super(sb);
> if (ret)
> diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
> index 55b39970acb2..51141907097c 100644
> --- a/fs/zonefs/zonefs.h
> +++ b/fs/zonefs/zonefs.h
> @@ -38,6 +38,8 @@ static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
> return ZONEFS_ZTYPE_SEQ;
> }
>
> +#define ZONEFS_ZONE_OPEN (1 << 0)
> +
> /*
> * In-memory inode data.
> */
> @@ -74,6 +76,10 @@ struct zonefs_inode_info {
> */
> struct mutex i_truncate_mutex;
> struct rw_semaphore i_mmap_sem;
> +
> + /* guarded by i_truncate_mutex */
> + unsigned int i_wr_refcnt;
> + unsigned int i_flags;
> };
>
> static inline struct zonefs_inode_info *ZONEFS_I(struct inode *inode)
> @@ -154,6 +160,7 @@ enum zonefs_features {
> #define ZONEFS_MNTOPT_ERRORS_MASK \
> (ZONEFS_MNTOPT_ERRORS_RO | ZONEFS_MNTOPT_ERRORS_ZRO | \
> ZONEFS_MNTOPT_ERRORS_ZOL | ZONEFS_MNTOPT_ERRORS_REPAIR)
> +#define ZONEFS_MNTOPT_EXPLICIT_OPEN (1 << 4) /* Explicit open/close of zones on open/close */
>
> /*
> * In-memory Super block information.
> @@ -175,6 +182,9 @@ struct zonefs_sb_info {
>
> loff_t s_blocks;
> loff_t s_used_blocks;
> +
> + unsigned int s_max_open_zones;
> + atomic_t s_open_zones;
> };
>
> static inline struct zonefs_sb_info *ZONEFS_SB(struct super_block *sb)
>
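For reference, the accounting scheme in zonefs_open_zone()/zonefs_close_zone()
can be modeled in a small standalone sketch: a per-file writer refcount plus a
filesystem-wide open-zone counter capped at the device limit. The types and
names below (model_sb, model_file, model_open, model_close) are hypothetical
stand-ins for illustration, not the kernel code:

```c
/*
 * Standalone model of the open-zone accounting: the first opener of a file
 * claims one open-zone slot (failing with -EBUSY at the device limit), and
 * the last closer releases it. Simplified stand-in types.
 */
#include <assert.h>
#include <errno.h>

struct model_sb {
	int s_open_zones;	/* currently claimed slots */
	int s_max_open_zones;	/* device limit */
};

struct model_file {
	int i_wr_refcnt;	/* writers holding this file open */
	struct model_sb *sb;
};

static int model_open(struct model_file *f)
{
	f->i_wr_refcnt++;
	if (f->i_wr_refcnt == 1) {
		/* First opener: claim a slot, back out if over the limit. */
		if (++f->sb->s_open_zones > f->sb->s_max_open_zones) {
			f->sb->s_open_zones--;
			f->i_wr_refcnt--;
			return -EBUSY;
		}
	}
	return 0;
}

static void model_close(struct model_file *f)
{
	f->i_wr_refcnt--;
	if (!f->i_wr_refcnt)
		f->sb->s_open_zones--;
}
```

This is why a successful open guarantees writes will be accepted: the slot is
reserved up front against the max open zones limit, which is never larger than
the max active zones limit.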
--
Damien Le Moal
Western Digital Research