All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Christoph Hellwig <hch@infradead.org>,
	Jane Chu <jane.chu@oracle.com>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH 3/5] vfs: add a zero-initialization mode to fallocate
Date: Thu, 23 Sep 2021 11:42:09 +1000	[thread overview]
Message-ID: <20210923014209.GW1756565@dread.disaster.area> (raw)
In-Reply-To: <20210923000255.GO570615@magnolia>

On Wed, Sep 22, 2021 at 05:02:55PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 22, 2021 at 02:27:25PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote:
> > > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > > > > > I think this wants to be a behavioural modifier for existing
> > > > > > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > > > > > the guarantees ALLOC provides userspace.
> > > > > > > > >
> > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > > > > > want physical zeros to be written instead of the filesystem
> > > > > > > > > optimising away the physical zeros by manipulating the layout
> > > > > > > > > of the file.
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > > Then we have and API that looks like:
> > > > > > > > >
> > > > > > > > >       ALLOC           - allocate space efficiently
> > > > > > > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > > > > > > >       ZERO            - zero data and preallocate space efficiently
> > > > > > > > >       ZERO | INIT     - zero range by writing zeros to it
> > > > > > > > >
> > > > > > > > > Which seems to cater for all the cases I know of where physically
> > > > > > > > > writing zeros instead of allocating unwritten extents is the
> > > > > > > > > preferred behaviour of fallocate()....
> > > > > > > >
> > > > > > > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > > > > > > up with a better idea offhand.
> > > > > > > 
> > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > > > > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > > > > > reads.
> > > > > 
> > > > > Yes, that's the semantic we want, but FUA already defines specific
> > > > > data integrity behaviour in the storage stack w.r.t. volatile
> > > > > caches.
> > > > > 
> > > > > Also, FUA is associated with devices - it's low level storage jargon
> > > > > and so is not really appropriate to call a user interface operation
> > > > > FUA where users have no idea what a "unit" or "access" actually
> > > > > means.
> > > > > 
> > > > > Hence we should not overload this name with some other operation
> > > > > that does not have (and should not have) explicit data integrity
> > > > > requirements. That will just cause confusion for everyone.
> > > > > 
> > > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > > > > > already exists at that file range?
> > > > > 
> > > > > IMO that doesn't work as a behavioural modifier for ALLOC because
> > > > > the ALLOC semantics are explicitly "don't touch existing user
> > > > > data"...
> > > > 
> > > > Well since you can't preallocate /and/ zerorange at the same time...
> > > > 
> > > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
> > > > #define FALLOC_FL_ZERO_EXISTING		(0x80)
> > > 
> > > Except we also want the newly allocated regions (i.e. where holes
> > > were) in that range being zeroed to have zeroes written to them as
> > > well, yes? Otherwise we end up with a combination of unwritten
> > > extents and physical zeroes, and you can't use
> > > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT
> 
> Ooookay.  This is drifting further from the original problem of wanting
> to write a buffer of zeroes to an already-mapped extent.

Yup, that's because fallocate() is a multiplexed set of generic
operations. Adding one-off functionality for a specific use case
ends up breaking down the problem into how it maps as a generic
operation to the rest of the functionality that fallocate()
provides.

Please don't get frustrated here - the discussion needs to be had as
to why fallocate() is the wrong place to be managing *storage
hardware state* - the API and scope creep once the operation is
broken down to a generic behaviour demonstrate this clearly.

> What if part of the region is shared?  If the goal is to make a read
> return zeroes, then the shared extent must be punched, right?  Should
> the new region be preallocated?

[....]

These are implementation details once the API has been defined.

And, yes, I agree, it's nuts that just adding "DAX clear poison"
effectively degenerates into "rewrite the entire fallocate()
implementation in XFS", but that's what adding "write physical
zeroes rather than minimising IO overhead" semantics to generic
fallocate() operations results in.

Fundamentally, fallocate() is not for managing storage hardware
state. It's for optimising away bulk data operations by replacing
them with fast filesystem metadata manipulations and/or hardware
acceleration of the bulk data operation.

So, given that for clearing pmem hardware poison, we already know
it requires writing zeros to the poisoned range and where
in the file that we need to write zeroes to. So what exactly is the
problem with doing this:

	iov.iov_base = zero_buf;
	iov.iov_len = zero_len;
	pwritev2(fd, &iov, 1, zero_offset, RWF_SYNC | RWF_CLEAR_HWERRORS);

Where RWF_CLEAR_HWERRORS tells the low level hardware IO code to
clear error states before performing the write of the user data
provided?

The API doesn't get much simpler or explict, and the
RWF_CLEAR_HWERRORS flag is trivial to propagate all the way down
into the DAX IO code where it can be performed appropriately before
performing the write of the new user data into that range. And it's
way simpler than plumbing this sort of things into fallocate().


We need an API that provides a one-off, single data write semantic
to be defined and implemented to manage pmem hardware state. We
already have that in pwritev2().

Hence this discussion leads me to conclude that fallocate() simply
isn't the right interface to clear storage hardware poison state and
it's much simpler for everyone - kernel and userspace - to provide a
pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
clear hardware error state before issuing this user write to the
hardware.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2021-09-23  1:42 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-18  1:30 [PATCHSET RFC v2 jane 0/5] vfs: enable userspace to reset damaged file storage Darrick J. Wong
2021-09-18  1:30 ` [PATCH 1/5] dax: prepare pmem for use by zero-initializing contents and clearing poisons Darrick J. Wong
2021-09-18 16:54   ` riteshh
2021-09-20 17:22     ` Darrick J. Wong
2021-09-21  4:07       ` riteshh
2021-09-22 18:26         ` Darrick J. Wong
2021-09-22 19:47           ` riteshh
2021-09-22 20:26           ` Dan Williams
2021-09-21  8:34   ` Christoph Hellwig
2021-09-22 18:10     ` Darrick J. Wong
2021-09-18  1:30 ` [PATCH 2/5] iomap: use accelerated zeroing on a block device to zero a file range Darrick J. Wong
2021-09-18 16:55   ` riteshh
2021-09-21  8:29   ` Christoph Hellwig
2021-09-22 18:53     ` Darrick J. Wong
2021-09-21 22:33   ` Dave Chinner
2021-09-22 18:54     ` Darrick J. Wong
2021-09-18  1:31 ` [PATCH 3/5] vfs: add a zero-initialization mode to fallocate Darrick J. Wong
2021-09-18 16:58   ` riteshh
2021-09-20 17:52   ` Eric Biggers
2021-09-20 18:06     ` Darrick J. Wong
2021-09-21  0:44   ` Dave Chinner
2021-09-21  8:31     ` Christoph Hellwig
2021-09-22  2:16       ` Dan Williams
2021-09-22  2:38         ` Darrick J. Wong
2021-09-22  3:59           ` Dave Chinner
2021-09-22  4:13             ` Darrick J. Wong
2021-09-22  5:49               ` Dave Chinner
2021-09-22 21:27                 ` Darrick J. Wong
2021-09-23  0:02                   ` Darrick J. Wong
2021-09-23  0:44                     ` Darrick J. Wong
2021-09-23  1:42                     ` Dave Chinner [this message]
2021-09-23  2:43                       ` Dan Williams
2021-09-23  5:42                         ` Dan Williams
2021-09-23 22:54                           ` Dave Chinner
2021-09-24  1:18                             ` Dan Williams
2021-09-24  1:21                               ` Jane Chu
2021-09-24  1:35                                 ` Darrick J. Wong
2021-09-27 21:07                                   ` Dave Chinner
2021-09-27 21:57                                     ` Jane Chu
2021-09-28  0:08                                       ` Dan Williams
2021-09-22  5:28     ` riteshh
2021-09-18  1:31 ` [PATCH 4/5] xfs: implement FALLOC_FL_ZEROINIT_RANGE Darrick J. Wong
2021-09-18  1:31 ` [PATCH 5/5] ext4: " Darrick J. Wong
2021-09-18 17:07   ` riteshh
2021-09-20 18:11     ` Darrick J. Wong
2021-09-21  6:10       ` riteshh
2021-09-18 18:05 ` [PATCHSET RFC v2 jane 0/5] vfs: enable userspace to reset damaged file storage Dan Williams
2021-09-23  0:51 ` Darrick J. Wong
2021-09-23  1:17   ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210923014209.GW1756565@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=dan.j.williams@intel.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jane.chu@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.