All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: jane.chu@oracle.com, linux-xfs@vger.kernel.org,
	hch@infradead.org, dan.j.williams@intel.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 3/5] vfs: add a zero-initialization mode to fallocate
Date: Tue, 21 Sep 2021 10:44:31 +1000	[thread overview]
Message-ID: <20210921004431.GO1756565@dread.disaster.area> (raw)
In-Reply-To: <163192866125.417973.7293598039998376121.stgit@magnolia>

On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a new mode to fallocate to zero-initialize all the storage backing a
> file.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/open.c                   |    5 +++++
>  include/linux/falloc.h      |    1 +
>  include/uapi/linux/falloc.h |    9 +++++++++
>  3 files changed, 15 insertions(+)
> 
> 
> diff --git a/fs/open.c b/fs/open.c
> index daa324606a41..230220b8f67a 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  	    (mode & ~FALLOC_FL_INSERT_RANGE))
>  		return -EINVAL;
>  
> +	/* Zeroinit should only be used by itself and keep size must be set. */
> +	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> +	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> +		return -EINVAL;
> +
>  	/* Unshare range should only be used with allocate mode. */
>  	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
>  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..4597b416667b 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -29,6 +29,7 @@ struct space_resv {
>  					 FALLOC_FL_PUNCH_HOLE |		\
>  					 FALLOC_FL_COLLAPSE_RANGE |	\
>  					 FALLOC_FL_ZERO_RANGE |		\
> +					 FALLOC_FL_ZEROINIT_RANGE |	\
>  					 FALLOC_FL_INSERT_RANGE |	\
>  					 FALLOC_FL_UNSHARE_RANGE)
>  
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..8144403b6102 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,13 @@
>   */
>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>  
> +/*
> + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> + * writing zeros to it.  Subsequent read and writes should not fail due to any
> + * previous media errors.  Blocks must be not be shared or require copy on
> + * write.  Holes and unwritten extents are left untouched.  This mode must be
> + * used with FALLOC_FL_KEEP_SIZE.
> + */
> +#define FALLOC_FL_ZEROINIT_RANGE	0x80

Hmmmm.

I think this wants to be a behavioural modifier for existing
operations rather than an operation unto itself. i.e. similar to how
KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
the guarantees ALLOC provides userspace.

In this case, the change of behaviour over ZERO_RANGE is that we
want physical zeros to be written instead of the filesystem
optimising away the physical zeros by manipulating the layout
of the file.

There's been requests in the past for a way to make ALLOC also
behave like this - in the case that users want fully allocated space
to be preallocated so their applications don't take unwritten extent
conversion penalties on first writes. Databases are an example here,
where setup of a new WAL file isn't performance critical, but writes
to the WAL are and the WAL files are write-once. Hence they always
take unwritten conversion penalties and the only way around that is
to physically zero the files before use...

So it seems to me what we actually need here is a "write zeroes"
modifier to fallocate() operations to tell the filesystem that the
application really wants it to write zeroes over that range, not
just guarantee space has been physically allocated....

Then we have and API that looks like:

	ALLOC		- allocate space efficiently
	ALLOC | INIT	- allocate space by writing zeros to it
	ZERO		- zero data and preallocate space efficiently
	ZERO | INIT	- zero range by writing zeros to it

Which seems to cater for all the cases I know of where physically
writing zeros instead of allocating unwritten extents is the
preferred behaviour of fallocate()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2021-09-21  0:46 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-18  1:30 [PATCHSET RFC v2 jane 0/5] vfs: enable userspace to reset damaged file storage Darrick J. Wong
2021-09-18  1:30 ` [PATCH 1/5] dax: prepare pmem for use by zero-initializing contents and clearing poisons Darrick J. Wong
2021-09-18 16:54   ` riteshh
2021-09-20 17:22     ` Darrick J. Wong
2021-09-21  4:07       ` riteshh
2021-09-22 18:26         ` Darrick J. Wong
2021-09-22 19:47           ` riteshh
2021-09-22 20:26           ` Dan Williams
2021-09-21  8:34   ` Christoph Hellwig
2021-09-22 18:10     ` Darrick J. Wong
2021-09-18  1:30 ` [PATCH 2/5] iomap: use accelerated zeroing on a block device to zero a file range Darrick J. Wong
2021-09-18 16:55   ` riteshh
2021-09-21  8:29   ` Christoph Hellwig
2021-09-22 18:53     ` Darrick J. Wong
2021-09-21 22:33   ` Dave Chinner
2021-09-22 18:54     ` Darrick J. Wong
2021-09-18  1:31 ` [PATCH 3/5] vfs: add a zero-initialization mode to fallocate Darrick J. Wong
2021-09-18 16:58   ` riteshh
2021-09-20 17:52   ` Eric Biggers
2021-09-20 18:06     ` Darrick J. Wong
2021-09-21  0:44   ` Dave Chinner [this message]
2021-09-21  8:31     ` Christoph Hellwig
2021-09-22  2:16       ` Dan Williams
2021-09-22  2:38         ` Darrick J. Wong
2021-09-22  3:59           ` Dave Chinner
2021-09-22  4:13             ` Darrick J. Wong
2021-09-22  5:49               ` Dave Chinner
2021-09-22 21:27                 ` Darrick J. Wong
2021-09-23  0:02                   ` Darrick J. Wong
2021-09-23  0:44                     ` Darrick J. Wong
2021-09-23  1:42                     ` Dave Chinner
2021-09-23  2:43                       ` Dan Williams
2021-09-23  5:42                         ` Dan Williams
2021-09-23 22:54                           ` Dave Chinner
2021-09-24  1:18                             ` Dan Williams
2021-09-24  1:21                               ` Jane Chu
2021-09-24  1:35                                 ` Darrick J. Wong
2021-09-27 21:07                                   ` Dave Chinner
2021-09-27 21:57                                     ` Jane Chu
2021-09-28  0:08                                       ` Dan Williams
2021-09-22  5:28     ` riteshh
2021-09-18  1:31 ` [PATCH 4/5] xfs: implement FALLOC_FL_ZEROINIT_RANGE Darrick J. Wong
2021-09-18  1:31 ` [PATCH 5/5] ext4: " Darrick J. Wong
2021-09-18 17:07   ` riteshh
2021-09-20 18:11     ` Darrick J. Wong
2021-09-21  6:10       ` riteshh
2021-09-18 18:05 ` [PATCHSET RFC v2 jane 0/5] vfs: enable userspace to reset damaged file storage Dan Williams
2021-09-23  0:51 ` Darrick J. Wong
2021-09-23  1:17   ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210921004431.GO1756565@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=dan.j.williams@intel.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jane.chu@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.