All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Andres Freund <andres@anarazel.de>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?
Date: Sat, 9 Jan 2021 07:32:41 +1100	[thread overview]
Message-ID: <20210108203241.GI331610@dread.disaster.area> (raw)
In-Reply-To: <20210106234009.b6gbzl7bjm2evxj6@alap3.anarazel.de>

On Wed, Jan 06, 2021 at 03:40:09PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2021-01-07 09:52:01 +1100, Dave Chinner wrote:
> > On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> > > Which brings me to $subject:
> > > 
> > > Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> > > doesn't convert extents into unwritten extents, but instead uses
> > > blkdev_issue_zeroout() if supported?  Mostly interested in xfs/ext4
> > > myself, but ...
> > 
> > We have explicit requests from users (think initialising large VM
> > images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
> > zeroes manually.
> 
> That behaviour makes a lot of sense for quite a few use cases - I wasn't
> trying to make it sound like it should not be available. Nor that
> FALLOC_FL_ZERO_RANGE should behave differently.
> 
> 
> > IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
> > zeros, we have users who explicitly don't want it to do this.
> 
> Right - which is why I was asking for a variant of FALLOC_FL_ZERO_RANGE
> (jokingly named FALLOC_FL_ZERO_RANGE_BUT_REALLY in the subject), rather
> than changing the behaviour.
> 
> 
> > Perhaps we should add want FALLOC_FL_CONVERT_RANGE, which tells the
> > filesystem to convert an unwritten range of zeros to a written range
> > by manually writing zeros. i.e. you do FALLOC_FL_ZERO_RANGE to zero
> > the range and fill holes using metadata manipulation, followed by
> > FALLOC_FL_WRITE_RANGE to then convert the "metadata zeros" to real
> > written zeros.
> 
> Yep, something like that would do the trick. Perhaps
> FALLOC_FL_MATERIALIZE_RANGE?

[ FWIW, I really dislike the "RANGE" part of fallocate flag names.
It's redundant (fallocate always operates on a range!) and just
makes names unnecessarily longer. ]

I used "convert range" as the name explicitly because it has
specific meaning for extent space manipulation. i.e. we "convert"
extents from one state to another. "write range" is also has
explicit meaning, in that it will convert extents from unwritten to
written data.

In comparison, "materialise" is something undefined, and could be
easily thought to take something ephemeral (such as a hole) and turn
it into something real (an allocated extent). We wouldn't want this
operation to allocate space, so I think "materialise" is just too
much magic to encoding into an API for an explicit, well defined
state change.

We also have people asking for ZERO_RANGE to just flip existing
extents from written to unwritten (rather than the punch/preallocate
we do now). This is also a "convert" operation, just in the other
direction (from data to zeros rather than from zeros to data).

The observation I'm making here is that these "convert" oeprations
will both makes SEEK_HOLE/SEEK_DATA behave differently for the
underlying data. preallocated space is considered a HOLE, written
zeros are considered DATA. So we do expose the ability to check that
a "convert" operation has actually changed the state of the
underlying extents in either direction...

CONVERT_TO_DATA/CONVERT_TO_ZERO as an operational pair whose
behaviour is visible and easily testable via SEEK_HOLE/SEEK_DATA
makes a lot more sense to me. Also defining them to fail fast if
unwritten extents are not supported by the filesystem (i.e. they
should -never- physically write anything) would also allow
applications to fall back to ZERO_RANGE on filesystems that don't
support unwritten extents to explicitly write zeros if
CONVERT_TO_ZERO fails....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

      reply	other threads:[~2021-01-08 20:33 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-30  6:28 fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents? Andres Freund
2021-01-04 18:19 ` Darrick J. Wong
2021-01-04 19:10   ` Andres Freund
2021-01-04 19:57     ` Avi Kivity
2021-01-12 18:16       ` Christoph Hellwig
2021-01-12 18:39         ` Andreas Dilger
2021-01-12 18:43           ` Christoph Hellwig
2021-01-12 18:51             ` Andreas Dilger
2021-01-12 21:14               ` Darrick J. Wong
2021-01-12 21:36                 ` Andres Freund
2021-01-13  7:44                   ` Avi Kivity
2021-01-19  3:44                     ` Andreas Dilger
2021-01-04 19:17 ` Theodore Ts'o
2021-01-04 19:24   ` Matthew Wilcox
2021-01-04 20:29   ` Andres Freund
2021-01-04 22:40   ` Eric Sandeen
2021-01-06 22:52 ` Dave Chinner
2021-01-06 23:40   ` Andres Freund
2021-01-08 20:32     ` Dave Chinner [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210108203241.GI331610@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=andres@anarazel.de \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.