From: "Theodore Ts'o" <tytso@mit.edu>
To: "Kiselev, Oleg" <okiselev@amazon.com>
Cc: Andreas Dilger <adilger@dilger.ca>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices
Date: Wed, 22 Sep 2021 23:57:35 -0400
Message-ID: <YUv7LzBIOodL6xyW@mit.edu>
In-Reply-To: <0A4B11C1-A119-4733-A841-683889E9DC7B@amazon.com>

On Thu, Sep 23, 2021 at 03:31:00AM +0000, Kiselev, Oleg wrote:
> Wouldn't it make more sense to use "write-same" of 0 instead of
> writing a page of zeros and task the layers that do thin
> provisioning and return 0 on read from unallocated blocks to check
> if a block exists before writing zeros to it?

The problem is we have absolutely no idea what "write-same" of 0 will
actually do in terms of whether it will consume storage for various
thinly provisioned devices.  We also have no idea what the performance
might be.  It might be the same speed as explicitly passing in
zero-filled buffers and sending DMA requests to a hard drive.  (e.g.,
potentially very S-L-O-W.)

That's technically true for "discard" as well, except there's a vague
understanding that discard will generally be faster than writing all
zeros --- it's just that it might also be a no-op, or it might
randomly be a no-op, depending on the phase of the moon, or any
other random variable, including whether "the storage device feels
like it or not".

Bottom line --- unfortunately, the SATA/SCSI standards authors were
mealy-mouthed and made discard something which is completely useless
for our purposes.  And since we don't know anything about the
performance of write same and what it might do from the perspective of
thin-provisioned storage, we can't really depend on it either.

The problem is mke2fs really does need to care about the performance
of discard or write same.  Users want mke2fs to be fast, especially
during the distro installation process.  That's why we implemented the
lazy inode table initialization feature in the first place.  So
reading each block from the inode table to see if it's zero might
be slow, and so we might be better off just doing the lazy itable init
instead.
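
To make the cost of that check concrete, here is a minimal sketch of the
"read each block and see if it's zero" idea.  It is only an illustration:
the function name, the 4096-byte block size, and the in-memory image are
assumptions, and this is not what mke2fs does.  The point is that every
block in the region has to be read, so the time grows with the size of
the inode table.

```python
import io


def nonzero_blocks(dev, offset, length, block_size=4096):
    """Return indices (relative to offset) of blocks holding any non-zero byte.

    `dev` is any seekable binary file object; a real device node would
    work the same way, but every block in the region must be read.
    """
    zero = bytes(block_size)          # reference all-zero block
    dev.seek(offset)
    found = []
    for index in range(length // block_size):
        block = dev.read(block_size)
        if len(block) < block_size:   # short read: end of device/image
            break
        if block != zero:
            found.append(index)
    return found


# Hypothetical 4-block image with one stray byte in the third block.
image = io.BytesIO(bytes(4 * 4096))
image.seek(2 * 4096 + 10)
image.write(b"\x01")
print(nonzero_blocks(image, 0, 4 * 4096))
```

An empty result would mean the region is already zeroed and could be
left alone; anything else would still have to be zeroed, eagerly or
lazily.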

Hence, I think Sarthak's approach of giving an explicit hint is a good
approach.

The other approach we can use is to depend on metadata checksums, and
the fact that a new file system will use a different UUID for the seed
for the checksum.  Unfortunately, in order to make this work well, we
need to change e2fsck so that if the checksum doesn't work out ---
especially if all of the checksums in an inode table block are
incorrect --- it presumes that the inode table block is from an old
instance of the file system, and returns a zero-filled block when
reading that inode table block.  (Right now, e2fsck still offers the
chance to just fix the checksum, a holdover from when we were worried
there might be bugs in the metadata checksum code.)
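
A rough sketch of that policy, under loud assumptions: zlib's CRC-32
stands in for the crc32c that ext4 metadata checksums use, the record
layout (payload followed by a 4-byte checksum) is made up for the
example, and all function names are hypothetical.  It only shows the
shape of the idea: seed the checksum with the file system UUID, and
treat a block in which no inode verifies as stale, i.e. as zeros.

```python
import struct
import uuid
import zlib

INODE_SIZE = 256   # assumed inode record size for this sketch
CSUM_SIZE = 4      # hypothetical layout: checksum in the last 4 bytes


def inode_csum(fs_uuid, ino, payload):
    """Checksum seeded by the file system UUID and the inode number."""
    seed = zlib.crc32(fs_uuid.bytes)
    seed = zlib.crc32(struct.pack("<I", ino), seed)
    return zlib.crc32(payload, seed)


def make_inode(fs_uuid, ino, payload):
    """Build one record: zero-padded payload plus its checksum."""
    payload = payload.ljust(INODE_SIZE - CSUM_SIZE, b"\0")
    return payload + struct.pack("<I", inode_csum(fs_uuid, ino, payload))


def read_itable_block(fs_uuid, first_ino, block):
    """Return the block, or zeros if no inode in it verifies under fs_uuid."""
    for i in range(0, len(block), INODE_SIZE):
        raw = block[i:i + INODE_SIZE]
        payload = raw[:-CSUM_SIZE]
        stored = struct.unpack("<I", raw[-CSUM_SIZE:])[0]
        if inode_csum(fs_uuid, first_ino + i // INODE_SIZE, payload) == stored:
            return block              # at least one inode belongs to this fs
    return bytes(len(block))          # presume leftovers from an old instance


old_fs = uuid.UUID(int=1)             # previous file system on the device
new_fs = uuid.UUID(int=2)             # freshly created file system
stale = b"".join(make_inode(old_fs, ino, b"old data") for ino in range(1, 17))
```

With this sketch, reading `stale` under `old_fs` returns the block
unchanged, while reading it under `new_fs` returns zeros, because the
different UUID seed makes every stored checksum mismatch.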

But I don't think the two approaches are mutually exclusive.  The
approach of an explicit hint is "safe" and a lot easier to review.

Cheers,

					- Ted

Thread overview: 9+ messages
2021-09-21  3:42 [PATCH] mke2fs: Add extended option for prezeroed storage devices Sarthak Kukreti
2021-09-21 21:39 ` Andreas Dilger
2021-09-23  3:31   ` Kiselev, Oleg
2021-09-23  3:57     ` Theodore Ts'o [this message]
2021-09-27 10:39   ` [PATCH v2] " Sarthak Kukreti
2021-10-05  3:49     ` Sarthak Kukreti
2021-10-25  4:25     ` Theodore Ts'o
2022-03-11  9:49       ` Gwendal Grignou
2021-09-27 10:43   ` [PATCH] " Sarthak Kukreti