* [PATCH] mke2fs: Add extended option for prezeroed storage devices
@ 2021-09-21  3:42 Sarthak Kukreti
  2021-09-21 21:39 ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Sarthak Kukreti @ 2021-09-21  3:42 UTC
  To: linux-ext4; +Cc: gwendal, tytso, Sarthak Kukreti

From: Sarthak Kukreti <sarthakkukreti@chromium.org>

This patch adds an extended option "assume_storage_prezeroed" to
mke2fs. When enabled, this option acts as a hint to mke2fs that
the underlying block device was zeroed before mke2fs was called.
This allows mke2fs to skip zeroing the inode table and the
journal, which speeds up filesystem creation.

Additionally, on thinly provisioned storage devices (like Ceph,
dm-thin), reads on unmapped extents return zero. This property
allows mke2fs (with assume_storage_prezeroed) to avoid
pre-allocating metadata space for inode tables for the entire
filesystem and saves space that would normally be preallocated
for zero inode tables.
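
As an illustration of the property described above (not part of the
patch), here is a minimal shell sketch that creates a sparse file,
reports how many blocks it actually occupies, and confirms that it
reads back as zeros. It assumes GNU coreutils (dd size suffixes,
stat -c, cmp -n); the function names and the scratch file are
hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only; function names are hypothetical.

# make_sparse FILE SIZE: give FILE an apparent size of SIZE (e.g. 16M)
# without writing any data, using a seek-only dd.
make_sparse() {
	dd if=/dev/zero of="$1" bs=1 count=0 seek="$2" 2>/dev/null
}

# allocated_blocks FILE: blocks actually allocated, in units of %B bytes.
allocated_blocks() {
	stat -c "%b" "$1"
}

# reads_as_zero FILE: succeed if every byte of FILE reads back as zero.
reads_as_zero() {
	cmp -s -n "$(stat -c "%s" "$1")" "$1" /dev/zero
}

img=$(mktemp)
make_sparse "$img" 16M
echo "apparent bytes: $(stat -c "%s" "$img")"
echo "allocated blocks: $(allocated_blocks "$img")"
reads_as_zero "$img" && echo "reads back as all zeros"
rm -f "$img"
```

The allocated block count depends on the underlying filesystem, so
the sketch prints it rather than assuming a particular value.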

Testing on ChromeOS (running linux kernel 4.19) with dm-thin
and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':

- Time taken by mke2fs drops from 1.07s to 0.08s.
- Avoiding zeroing out the inode table and journal reduces the
  initial metadata space allocation from 0.48% to 0.01%.
- Lazy inode table zeroing results in a further 1.45% of logical
  volume space getting allocated for inode tables, even if no file
  data is added to the filesystem. With assume_storage_prezeroed,
  the metadata allocation remains at 0.01%.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
 misc/mke2fs.8.in |  6 ++++++
 misc/mke2fs.c    | 21 ++++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index c0b53245..b82f8445 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -364,6 +364,12 @@ This speeds up file system initialization noticeably, but carries some
 small risk if the system crashes before the journal has been overwritten
 entirely one time.  If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
+.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled,
+.BR mke2fs
+assumes that the storage device has been prezeroed, skips zeroing the journal
+and inode tables, and annotates the block group flags to signal that the inode
+table has been zeroed.
 .TP
 .B no_copy_xattrs
 Normally
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 04b2fbce..5293d9b0 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -95,6 +95,7 @@ int	journal_size;
 int	journal_flags;
 int	journal_fc_size;
 static int	lazy_itable_init;
+static int	assume_storage_prezeroed;
 static int	packed_meta_blocks;
 int		no_copy_xattrs;
 static char	*bad_blocks_filename = NULL;
@@ -1012,6 +1013,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				lazy_itable_init = strtoul(arg, &p, 0);
 			else
 				lazy_itable_init = 1;
+		} else if (!strcmp(token, "assume_storage_prezeroed")) {
+			if (arg)
+				assume_storage_prezeroed = strtoul(arg, &p, 0);
+			else
+				assume_storage_prezeroed = 1;
 		} else if (!strcmp(token, "lazy_journal_init")) {
 			if (arg)
 				journal_flags |= strtoul(arg, &p, 0) ?
@@ -1115,7 +1121,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tnodiscard\n"
 			"\tencoding=<encoding>\n"
 			"\tencoding_flags=<flags>\n"
-			"\tquotatype=<quota type(s) to be enabled>\n\n"),
+			"\tquotatype=<quota type(s) to be enabled>\n"
+			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
 			badopt ? badopt : "");
 		free(buf);
 		exit(1);
@@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
 		io_channel_set_options(fs->io, opt_string);
 	}
 
+	if (assume_storage_prezeroed) {
+	  if (verbose)
+			printf("%s",
+				       _("Assuming the storage device is prezeroed "
+                         "- skipping inode table and journal wipe\n"));
+
+	  lazy_itable_init = 1;
+	  itable_zeroed = 1;
+	  zero_hugefile = 0;
+	  journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+	}
+
 	/* Can't undo discard ... */
 	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
 		retval = mke2fs_discard_device(fs);
-- 
2.31.0



* Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices
  2021-09-21  3:42 [PATCH] mke2fs: Add extended option for prezeroed storage devices Sarthak Kukreti
@ 2021-09-21 21:39 ` Andreas Dilger
  2021-09-23  3:31   ` Kiselev, Oleg
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Andreas Dilger @ 2021-09-21 21:39 UTC
  To: Sarthak Kukreti; +Cc: linux-ext4, gwendal, tytso


On Sep 20, 2021, at 9:42 PM, Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> 
> From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> 
> This patch adds an extended option "assume_storage_prezeroed" to
> mke2fs. When enabled, this option acts as a hint to mke2fs that
> the underlying block device was zeroed before mke2fs was called.
> This allows mke2fs to optimize out the zeroing of the inode
> table and the journal, which speeds up the filesystem creation
> time.
> 
> Additionally, on thinly provisioned storage devices (like Ceph,
> dm-thin),

... and newly-created sparse loopback files

> reads on unmapped extents return zero. This property
> allows mke2fs (with assume_storage_prezeroed) to avoid
> pre-allocating metadata space for inode tables for the entire
> filesystem and saves space that would normally be preallocated
> for zero inode tables.
> 
> Testing on ChromeOS (running linux kernel 4.19) with dm-thin
> and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
> 
> - Time taken by mke2fs drops from 1.07s to 0.08s.
> - Avoiding zeroing out the inode table and journal reduces the
>  initial metadata space allocation from 0.48% to 0.01%.
> - Lazy inode table zeroing results in a further 1.45% of logical
>  volume space getting allocated for inode tables, even if no file
>  data is added to the filesystem. With assume_storage_prezeroed,
>  the metadata allocation remains at 0.01%.

This seems beneficial, but I'm wondering if this could also be
done automatically when TRIM/DISCARD is used by mke2fs to erase
a device?

One safe option to do this automatically would be to start by
*reading* the disk blocks and check if they are all zero, and only
switch to zero-block writes if any block is found with non-zero
data.  That would avoid the extra space usage from zero-block
writes in the above cases, and also work for the huge majority of
users that won't know the "assume_storage_prezeroed" option even
exists, though it won't necessarily reduce the runtime.
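
To make the read-before-write idea concrete, here is a minimal shell
sketch under stated assumptions: it is not mke2fs code, the function
names and the 4096-byte block size are hypothetical, and it relies on
GNU cmp (-n, -i) and dd. It reads a block range first and writes
zeros only if non-zero data is found.

```shell
#!/bin/sh
# Illustrative sketch only; names and the block size are assumptions.
BLOCK_SIZE=4096

# range_is_zero FILE FIRST_BLOCK NBLOCKS
# Succeeds only if the whole block range reads back as zeros.
range_is_zero() {
	cmp -s -n "$(($3 * BLOCK_SIZE))" -i "$(($2 * BLOCK_SIZE)):0" \
		"$1" /dev/zero
}

# zero_range_if_dirty FILE FIRST_BLOCK NBLOCKS
# Reads first; writes zeros only when non-zero data is found.
zero_range_if_dirty() {
	if range_is_zero "$1" "$2" "$3"; then
		echo "skipped"
	else
		dd if=/dev/zero of="$1" bs="$BLOCK_SIZE" seek="$2" \
			count="$3" conv=notrunc 2>/dev/null
		echo "zeroed"
	fi
}

img=$(mktemp)
dd if=/dev/zero of="$img" bs=1 count=0 seek=64K 2>/dev/null  # sparse
zero_range_if_dirty "$img" 0 16   # all zero: nothing is written
printf 'stale' | dd of="$img" bs=1 seek=8192 conv=notrunc 2>/dev/null
zero_range_if_dirty "$img" 0 16   # non-zero data: range is rewritten
rm -f "$img"
```

The read pass costs I/O even when nothing needs to be written, which
is why this approach saves space but not necessarily time.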

> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 04b2fbce..5293d9b0 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
> 		io_channel_set_options(fs->io, opt_string);
> 	}
> 
> +	if (assume_storage_prezeroed) {
> +	  if (verbose)
> +			printf("%s",
> +				       _("Assuming the storage device is prezeroed "
> +                         "- skipping inode table and journal wipe\n"));
> +
> +	  lazy_itable_init = 1;
> +	  itable_zeroed = 1;
> +	  zero_hugefile = 0;
> +	  journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +	}

Indentation appears to be broken here - only 2 spaces instead of a tab.

This is also missing any kind of test case.  Since a large number of
the e2fsck test cases use loopback filesystems created on a sparse
file, this would both make a good test case and reduce the time/space
used during testing.

Cheers, Andreas

* Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices
  2021-09-21 21:39 ` Andreas Dilger
@ 2021-09-23  3:31   ` Kiselev, Oleg
  2021-09-23  3:57     ` Theodore Ts'o
  2021-09-27 10:39   ` [PATCH v2] " Sarthak Kukreti
  2021-09-27 10:43   ` [PATCH] " Sarthak Kukreti
  2 siblings, 1 reply; 9+ messages in thread
From: Kiselev, Oleg @ 2021-09-23  3:31 UTC
  To: Andreas Dilger; +Cc: linux-ext4

Wouldn't it make more sense to use "write-same" of 0 instead of writing a page of zeros and task the layers that do thin provisioning and return 0 on read from unallocated blocks to check if a block exists before writing zeros to it?


* Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices
  2021-09-23  3:31   ` Kiselev, Oleg
@ 2021-09-23  3:57     ` Theodore Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2021-09-23  3:57 UTC
  To: Kiselev, Oleg; +Cc: Andreas Dilger, linux-ext4

On Thu, Sep 23, 2021 at 03:31:00AM +0000, Kiselev, Oleg wrote:
> Wouldn't it make more sense to use "write-same" of 0 instead of
> writing a page of zeros and task the layers that do thin
> provisioning and return 0 on read from unallocated blocks to check
> if a block exists before writing zeros to it?

The problem is we have absolutely no idea what "write-same" of 0 will
actually do in terms of whether it will consume storage for various
thinly provisioned devices.  We also have no idea what the performance
might be.  It might be the same speed as explicitly passing in
zero-filled buffers and sending DMA requests to a hard drive.  (e.g.,
potentially very S-L-O-W.)

That's technically true for "discard" as well, except there's a vague
understanding that discard will generally be faster than writing all
zeros --- it's just that it might also be a no-op, or it might
randomly be a no-op, depending on the phase of the moon, or any
other random variable, including whether "the storage device feels
like it or not".

Bottom line --- unfortunately, the SATA/SCSI standards authors were
mealy-mouthed and made discard something which is completely useless
for our purposes.  And since we don't know anything about the
performance of write same and what it might do from the perspective of
thin-provisioned storage, we can't really depend on it either.

The problem is mke2fs really does need to care about the performance
of discard or write same.  Users want mke2fs to be fast, especially
during the distro installation process.  That's why we implemented the
lazy inode table initialization feature in the first place.  So
reading each block from the inode table to see if it's zero might
be slow, and so we might be better off just doing the lazy itable init
instead.

Hence, I think Sarthak's approach of giving an explicit hint is a good
approach.

The other approach we can use is to depend on metadata checksums, and
the fact that a new file system will use a different UUID for the seed
for the checksum.  Unfortunately, in order to make this work well, we
need to change e2fsck so that if the checksum doesn't work out ---
especially if all of the checksums in an inode table block are
incorrect --- it presumes that the inode table block is from an old
instance of the file system, and returns a zero-filled block when
reading that inode table block.  (Right now, e2fsck still offers the
chance to just fix the checksum, a holdover from when we were worried
there might be bugs in the metadata checksum code.)
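
A minimal shell sketch of that idea, under loud assumptions: cksum
(POSIX CRC-32) stands in for ext4's crc32c, the UUIDs and function
names are made up, and this is not how e2fsck is implemented.  It
only shows that a block checksummed under the old UUID fails to
verify under the new one, which is the signal for treating it as
zero-filled.

```shell
#!/bin/sh
# Illustrative sketch only. cksum (POSIX CRC-32) stands in for ext4's
# crc32c; the names and UUIDs below are made up.

# seeded_sum UUID FILE: checksum of FILE's contents prefixed by UUID.
seeded_sum() {
	{ printf '%s' "$1"; cat "$2"; } | cksum | cut -d ' ' -f 1
}

# block_is_current UUID FILE STORED_SUM
# Succeeds if STORED_SUM matches the checksum computed with UUID as seed.
block_is_current() {
	[ "$(seeded_sum "$1" "$2")" = "$3" ]
}

old_uuid=11111111-1111-1111-1111-111111111111
new_uuid=22222222-2222-2222-2222-222222222222

blk=$(mktemp)
printf 'inode table block from the old file system' > "$blk"
stored=$(seeded_sum "$old_uuid" "$blk")   # as written by the old fs

if block_is_current "$new_uuid" "$blk" "$stored"; then
	echo "checksum matches: block belongs to this file system"
else
	echo "checksum mismatch: treat block as zero-filled"
fi
rm -f "$blk"
```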

But I don't think the two approaches are mutually exclusive.  The
approach of an explicit hint is "safe" and a lot easier to review.

Cheers,

					- Ted


* [PATCH v2] mke2fs: Add extended option for prezeroed storage devices
  2021-09-21 21:39 ` Andreas Dilger
  2021-09-23  3:31   ` Kiselev, Oleg
@ 2021-09-27 10:39   ` Sarthak Kukreti
  2021-10-05  3:49     ` Sarthak Kukreti
  2021-10-25  4:25     ` Theodore Ts'o
  2021-09-27 10:43   ` [PATCH] " Sarthak Kukreti
  2 siblings, 2 replies; 9+ messages in thread
From: Sarthak Kukreti @ 2021-09-27 10:39 UTC
  To: linux-ext4; +Cc: adilger, gwendal, tytso, okiselev

This patch adds an extended option "assume_storage_prezeroed" to
mke2fs. When enabled, this option acts as a hint to mke2fs that
the underlying block device was zeroed before mke2fs was called.
This allows mke2fs to skip zeroing the inode table and the
journal, which speeds up filesystem creation.

Additionally, on thinly provisioned storage devices (like Ceph,
dm-thin, newly created sparse loopback files), reads on unmapped extents
return zero. This property allows mke2fs (with assume_storage_prezeroed)
to avoid pre-allocating metadata space for inode tables for the entire
filesystem and saves space that would normally be preallocated
for zero inode tables.

Tests
-----
1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
filesystem drops the time taken by mke2fs from 0.09s to 0.04s
and reduces the initial metadata space allocation (stat on
sparse file) from 139736 blocks (545M) to 8672 blocks (34M).

2) On ChromeOS (running linux kernel 4.19) with dm-thin
and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':

- Time taken by mke2fs drops from 1.07s to 0.08s.
- Avoiding zeroing out the inode table and journal reduces the
  initial metadata space allocation from 0.48% to 0.01%.
- Lazy inode table zeroing results in a further 1.45% of logical
  volume space getting allocated for inode tables, even if no file
  data is added to the filesystem. With assume_storage_prezeroed,
  the metadata allocation remains at 0.01%.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
--
Changes in v2: Added regression test, fixed indentation.
---
 misc/mke2fs.8.in                        |  7 ++++++
 misc/mke2fs.c                           | 21 ++++++++++++++++-
 tests/m_assume_storage_prezeroed/expect |  2 ++
 tests/m_assume_storage_prezeroed/script | 31 +++++++++++++++++++++++++
 4 files changed, 60 insertions(+), 1 deletion(-)
 create mode 100644 tests/m_assume_storage_prezeroed/expect
 create mode 100644 tests/m_assume_storage_prezeroed/script

diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index c0b53245..5c6ea5ec 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
 entirely one time.  If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
 .TP
+.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled,
+.BR mke2fs
+assumes that the storage device has been prezeroed, skips zeroing the journal
+and inode tables, and annotates the block group flags to signal that the inode
+table has been zeroed.
+.TP
 .B no_copy_xattrs
 Normally
 .B mke2fs
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 04b2fbce..24c69966 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -95,6 +95,7 @@ int	journal_size;
 int	journal_flags;
 int	journal_fc_size;
 static int	lazy_itable_init;
+static int	assume_storage_prezeroed;
 static int	packed_meta_blocks;
 int		no_copy_xattrs;
 static char	*bad_blocks_filename = NULL;
@@ -1012,6 +1013,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				lazy_itable_init = strtoul(arg, &p, 0);
 			else
 				lazy_itable_init = 1;
+		} else if (!strcmp(token, "assume_storage_prezeroed")) {
+			if (arg)
+				assume_storage_prezeroed = strtoul(arg, &p, 0);
+			else
+				assume_storage_prezeroed = 1;
 		} else if (!strcmp(token, "lazy_journal_init")) {
 			if (arg)
 				journal_flags |= strtoul(arg, &p, 0) ?
@@ -1115,7 +1121,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tnodiscard\n"
 			"\tencoding=<encoding>\n"
 			"\tencoding_flags=<flags>\n"
-			"\tquotatype=<quota type(s) to be enabled>\n\n"),
+			"\tquotatype=<quota type(s) to be enabled>\n"
+			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
 			badopt ? badopt : "");
 		free(buf);
 		exit(1);
@@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
 		io_channel_set_options(fs->io, opt_string);
 	}
 
+	if (assume_storage_prezeroed) {
+		if (verbose)
+			printf("%s",
+			       _("Assuming the storage device is prezeroed "
+			       "- skipping inode table and journal wipe\n"));
+
+		lazy_itable_init = 1;
+		itable_zeroed = 1;
+		zero_hugefile = 0;
+		journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+	}
+
 	/* Can't undo discard ... */
 	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
 		retval = mke2fs_discard_device(fs);
diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
new file mode 100644
index 00000000..2ca3784a
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/expect
@@ -0,0 +1,2 @@
+2384
+336
diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
new file mode 100644
index 00000000..0745fb28
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/script
@@ -0,0 +1,31 @@
+test_description="test prezeroed storage metadata allocation"
+FILE_SIZE=16M
+
+LOG=$test_name.log
+OUT=$test_name.out
+EXP=$test_dir/expect
+
+dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+
+$MKE2FS -o Linux -t ext4 -O has_journal $TMPFILE.1 >> $LOG 2>&1
+stat -c "%b" $TMPFILE.1 > $OUT
+
+$MKE2FS -o Linux -t ext4 -O has_journal -E assume_storage_prezeroed=1 $TMPFILE.2 >> $LOG 2>&1
+stat -c "%b" $TMPFILE.2 >> $OUT
+
+rm -f $TMPFILE.1 $TMPFILE.2
+
+cmp -s $OUT $EXP
+status=$?
+
+if [ "$status" = 0 ] ; then
+	echo "$test_name: $test_description: ok"
+	touch $test_name.ok
+else
+	echo "$test_name: $test_description: failed"
+	cat $LOG > $test_name.failed
+	diff $EXP $OUT >> $test_name.failed
+fi
+
+unset LOG OUT EXP FILE_SIZE
\ No newline at end of file
-- 
2.31.0



* Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices
  2021-09-21 21:39 ` Andreas Dilger
  2021-09-23  3:31   ` Kiselev, Oleg
  2021-09-27 10:39   ` [PATCH v2] " Sarthak Kukreti
@ 2021-09-27 10:43   ` Sarthak Kukreti
  2 siblings, 0 replies; 9+ messages in thread
From: Sarthak Kukreti @ 2021-09-27 10:43 UTC
  To: Andreas Dilger; +Cc: linux-ext4, Gwendal Grignou, Theodore Ts'o

Thanks for reviewing the patch, Andreas!

On Tue, Sep 21, 2021 at 2:39 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Sep 20, 2021, at 9:42 PM, Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> >
> > From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> >
...
> > Additionally, on thinly provisioned storage devices (like Ceph,
> > dm-thin),
>
> ... and newly-created sparse loopback files
>
Thanks for pointing that out, added to the commit message in v2.
...
> > Testing on ChromeOS (running linux kernel 4.19) with dm-thin
> > and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
> >
> > - Time taken by mke2fs drops from 1.07s to 0.08s.
> > - Avoiding zeroing out the inode table and journal reduces the
> >  initial metadata space allocation from 0.48% to 0.01%.
> > - Lazy inode table zeroing results in a further 1.45% of logical
> >  volume space getting allocated for inode tables, even if no file
> >  data is added to the filesystem. With assume_storage_prezeroed,
> >  the metadata allocation remains at 0.01%.
>
> This seems beneficial, but I'm wondering if this could also be
> done automatically when TRIM/DISCARD is used by mke2fs to erase
> a device?
>
> One safe option to do this automatically would be to start by
> *reading* the disk blocks and check if they are all zero, and only
> switch to zero-block writes if any block is found with non-zero
> data.  That would avoid the extra space usage from zero-block
> writes in the above cases, and also work for the huge majority of
> users that won't know the "assume_storage_prezeroed" option even
> exists, though it won't necessarily reduce the runtime.
>
I agree with Ted (quoting a reply on a forked thread below) that
reading all inode table blocks on the device will slow down mke2fs a
lot depending on the storage medium and size. Maybe it can be done
instead at first mount in conjunction with lazy_itable_init, i.e., ext4
reads the block and only issues a zero-out if the block is not already
zero? Even so, an explicit hint would be compatible with this
approach: it avoids (unnecessarily) reading through all the inode
table blocks as long as the hint was passed at creation time.

On Wed, Sep 22, 2021 at 8:57 PM Theodore Ts'o <tytso@mit.edu> wrote:
> The problem is mke2fs really does need to care about the performance
> of discard or write same.  Users want mke2fs to be fast, especially
> during the distro installation process.  That's why we implemented the
> lazy inode table initialization feature in the first place.  So
> reading all each block from the inode table to see if it's zero might
> be slow, and so we might be better off just doing the lazy itable init
> instead.
...
> > +     if (assume_storage_prezeroed) {
> > +       if (verbose)
> > +                     printf("%s",
> > +                                    _("Assuming the storage device is prezeroed "
> > +                         "- skipping inode table and journal wipe\n"));
> > +
> > +       lazy_itable_init = 1;
> > +       itable_zeroed = 1;
> > +       zero_hugefile = 0;
> > +       journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> > +     }
>
> Indentation appears to be broken here - only 2 spaces instead of a tab.
>
> This is also missing any kind of test case.  Since a large number of
> the e2fsck test cases are using loopback filesystems created on a sparse
> file, this would both be good test cases, as well as reducing time/space
> used during testing.
>
Oops, thanks for catching that! Fixed in v2 and I added a test case
for this option. I was playing around with adding the option as a
default to tests/mke2fs.conf.in; that didn't affect the overall test
run time much (a lot of the tests seem to be dd'ing entire files and
not using sparse files).

Best
Sarthak


* Re: [PATCH v2] mke2fs: Add extended option for prezeroed storage devices
  2021-09-27 10:39   ` [PATCH v2] " Sarthak Kukreti
@ 2021-10-05  3:49     ` Sarthak Kukreti
  2021-10-25  4:25     ` Theodore Ts'o
  1 sibling, 0 replies; 9+ messages in thread
From: Sarthak Kukreti @ 2021-10-05  3:49 UTC
  To: linux-ext4; +Cc: Andreas Dilger, Gwendal Grignou, Theodore Ts'o, okiselev

Hi all,

Thanks for the discussions on the original patch. I wanted to circle
back and see if you had any further comments/concerns on the second
version of the patchset.

Best
Sarthak

On Mon, Sep 27, 2021 at 3:44 AM Sarthak Kukreti
<sarthakkukreti@chromium.org> wrote:
>
> This patch adds an extended option "assume_storage_prezeroed" to
> mke2fs. When enabled, this option acts as a hint to mke2fs that
> the underlying block device was zeroed before mke2fs was called.
> This allows mke2fs to optimize out the zeroing of the inode
> table and the journal, which speeds up the filesystem creation
> time.
>
> Additionally, on thinly provisioned storage devices (like Ceph,
> dm-thin, newly created sparse loopback files), reads on unmapped extents
> return zero. This property allows mke2fs (with assume_storage_prezeroed)
> to avoid pre-allocating metadata space for inode tables for the entire
> filesystem and saves space that would normally be preallocated
> for zero inode tables.
>
> Tests
> -----
> 1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
> filesystem drops the time taken by mke2fs from 0.09s to 0.04s
> and reduces the initial metadata space allocation (stat on
> sparse file) from 139736 blocks (545M) to 8672 blocks (34M).
>
> 2) On ChromeOS (running linux kernel 4.19) with dm-thin
> and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
>
> - Time taken by mke2fs drops from 1.07s to 0.08s.
> - Avoiding zeroing out the inode table and journal reduces the
>   initial metadata space allocation from 0.48% to 0.01%.
> - Lazy inode table zeroing results in a further 1.45% of logical
>   volume space getting allocated for inode tables, even if no file
>   data is added to the filesystem. With assume_storage_prezeroed,
>   the metadata allocation remains at 0.01%.
>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> --
> Changes in v2: Added regression test, fixed indentation.
> ---
>  misc/mke2fs.8.in                        |  7 ++++++
>  misc/mke2fs.c                           | 21 ++++++++++++++++-
>  tests/m_assume_storage_prezeroed/expect |  2 ++
>  tests/m_assume_storage_prezeroed/script | 31 +++++++++++++++++++++++++
>  4 files changed, 60 insertions(+), 1 deletion(-)
>  create mode 100644 tests/m_assume_storage_prezeroed/expect
>  create mode 100644 tests/m_assume_storage_prezeroed/script
>
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index c0b53245..5c6ea5ec 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
>  entirely one time.  If the option value is omitted, it defaults to 1 to
>  enable lazy journal inode zeroing.
>  .TP
> +.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
> +If enabled,
> +.BR mke2fs
> +assumes that the storage device has been prezeroed, skips zeroing the journal
> +and inode tables, and annotates the block group flags to signal that the inode
> +table has been zeroed.
> +.TP
>  .B no_copy_xattrs
>  Normally
>  .B mke2fs
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 04b2fbce..24c69966 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -95,6 +95,7 @@ int   journal_size;
>  int    journal_flags;
>  int    journal_fc_size;
>  static int     lazy_itable_init;
> +static int     assume_storage_prezeroed;
>  static int     packed_meta_blocks;
>  int            no_copy_xattrs;
>  static char    *bad_blocks_filename = NULL;
> @@ -1012,6 +1013,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                                 lazy_itable_init = strtoul(arg, &p, 0);
>                         else
>                                 lazy_itable_init = 1;
> +               } else if (!strcmp(token, "assume_storage_prezeroed")) {
> +                       if (arg)
> +                               assume_storage_prezeroed = strtoul(arg, &p, 0);
> +                       else
> +                               assume_storage_prezeroed = 1;
>                 } else if (!strcmp(token, "lazy_journal_init")) {
>                         if (arg)
>                                 journal_flags |= strtoul(arg, &p, 0) ?
> @@ -1115,7 +1121,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                         "\tnodiscard\n"
>                         "\tencoding=<encoding>\n"
>                         "\tencoding_flags=<flags>\n"
> -                       "\tquotatype=<quota type(s) to be enabled>\n\n"),
> +                       "\tquotatype=<quota type(s) to be enabled>\n"
> +                       "\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
>                         badopt ? badopt : "");
>                 free(buf);
>                 exit(1);
> @@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
>                 io_channel_set_options(fs->io, opt_string);
>         }
>
> +       if (assume_storage_prezeroed) {
> +               if (verbose)
> +                       printf("%s",
> +                              _("Assuming the storage device is prezeroed "
> +                              "- skipping inode table and journal wipe\n"));
> +
> +               lazy_itable_init = 1;
> +               itable_zeroed = 1;
> +               zero_hugefile = 0;
> +               journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +       }
> +
>         /* Can't undo discard ... */
>         if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
>                 retval = mke2fs_discard_device(fs);
> diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
> new file mode 100644
> index 00000000..2ca3784a
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/expect
> @@ -0,0 +1,2 @@
> +2384
> +336
> diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
> new file mode 100644
> index 00000000..0745fb28
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/script
> @@ -0,0 +1,31 @@
> +test_description="test prezeroed storage metadata allocation"
> +FILE_SIZE=16M
> +
> +LOG=$test_name.log
> +OUT=$test_name.out
> +EXP=$test_dir/expect
> +
> +dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +
> +$MKE2FS -o Linux -t ext4 -O has_journal $TMPFILE.1 >> $LOG 2>&1
> +stat -c "%b" $TMPFILE.1 > $OUT
> +
> +$MKE2FS -o Linux -t ext4 -O has_journal -E assume_storage_prezeroed=1 $TMPFILE.2 >> $LOG 2>&1
> +stat -c "%b" $TMPFILE.2 >> $OUT
> +
> +rm -f $TMPFILE.1 $TMPFILE.2
> +
> +cmp -s $OUT $EXP
> +status=$?
> +
> +if [ "$status" = 0 ] ; then
> +       echo "$test_name: $test_description: ok"
> +       touch $test_name.ok
> +else
> +       echo "$test_name: $test_description: failed"
> +       cat $LOG > $test_name.failed
> +       diff $EXP $OUT >> $test_name.failed
> +fi
> +
> +unset LOG OUT EXP FILE_SIZE
> \ No newline at end of file
> --
> 2.31.0
>


* Re: [PATCH v2] mke2fs: Add extended option for prezeroed storage devices
  2021-09-27 10:39   ` [PATCH v2] " Sarthak Kukreti
  2021-10-05  3:49     ` Sarthak Kukreti
@ 2021-10-25  4:25     ` Theodore Ts'o
  2022-03-11  9:49       ` Gwendal Grignou
  1 sibling, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2021-10-25  4:25 UTC (permalink / raw)
  To: Sarthak Kukreti; +Cc: linux-ext4, adilger, gwendal, okiselev

I tried running the regression test, and it was failing for me; it
showed that even with -E assume_storage_prezeroed, the sizes of
$TMPFILE.1 and $TMPFILE.2 were the same.  Looking into this, it was
because in lib/ext2fs/unix_io.c, when the file is a plain file,
io_channel_discard_zeroes_data() returns true, since it assumes that
we can use PUNCH_HOLE to implement unix_io_discard(), which is
guaranteed to work.

So I had to change the regression test to use losetup, which also
meant that the test had to run as root....
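
A minimal sketch of the measurement the test relies on, assuming GNU dd
and stat: the apparent size (%s) of a sparse file is fixed when it is
created, while the allocated block count (%b) only grows once something
actually writes to it.  The scratch path from mktemp is an illustrative
assumption and not part of the test suite.

```shell
#!/bin/sh
# Hypothetical scratch file; any writable location works.
IMG=$(mktemp)

# Create a 16 MiB sparse file: count=0 writes no data, seek sets the size.
dd if=/dev/zero of="$IMG" bs=1 count=0 seek=16M 2>/dev/null

APPARENT=$(stat -c "%s" "$IMG")   # apparent size in bytes
ALLOCATED=$(stat -c "%b" "$IMG")  # allocated blocks, counted in %B-byte units
UNIT=$(stat -c "%B" "$IMG")       # size in bytes of one %b block

echo "apparent size: $APPARENT bytes"
echo "allocated:     $ALLOCATED blocks of $UNIT bytes"

rm -f "$IMG"
```

Comparing the allocated count before and after running mke2fs on the
backing file is what distinguishes the two cases; the exact numbers
depend on the host filesystem.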

Anyway, this is what I've checked into e2fsprogs.

                                  - Ted

commit bd2e72c5c5521b561d20a881c843a64a5832721a
Author: Sarthak Kukreti <sarthakkukreti@chromium.org>
Date:   Mon Sep 27 03:39:10 2021 -0700

    mke2fs: add extended option for prezeroed storage devices
    
    This patch adds an extended option "assume_storage_prezeroed" to
    mke2fs. When enabled, this option acts as a hint to mke2fs that the
    underlying block device was zeroed before mke2fs was called.  This
    allows mke2fs to optimize out the zeroing of the inode table and the
    journal, which speeds up the filesystem creation time.
    
    Additionally, on thinly provisioned storage devices (like Ceph,
    dm-thin, newly created sparse loopback files), reads on unmapped
    extents return zero. This property allows mke2fs (with
    assume_storage_prezeroed) to avoid pre-allocating metadata space for
    inode tables for the entire filesystem and saves space that would
    normally be preallocated for zero inode tables.
    
    Tests
    -----
    1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
    filesystem drops the time taken by mke2fs from 0.09s to 0.04s
    and reduces the initial metadata space allocation (stat on
    sparse file) from 139736 blocks (545M) to 8672 blocks (34M).
    
    2) On ChromeOS (running linux kernel 4.19) with dm-thin
    and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
    
    - Time taken by mke2fs drops from 1.07s to 0.08s.
    - Avoiding zeroing out the inode table and journal reduces the
      initial metadata space allocation from 0.48% to 0.01%.
    - Lazy inode table zeroing results in a further 1.45% of logical
      volume space getting allocated for inode tables, even if no file
      data is added to the filesystem. With assume_storage_prezeroed,
      the metadata allocation remains at 0.01%.
    
    [ Fixed regression test to work on newer versions of e2fsprogs -- TYT ]
    
    Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index b378e4d7..30f97bb5 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
 entirely one time.  If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
 .TP
+.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled,
+.BR mke2fs
+assumes that the storage device has been prezeroed, skips zeroing the journal
+and inode tables, and annotates the block group flags to signal that the inode
+table has been zeroed.
+.TP
 .B no_copy_xattrs
 Normally
 .B mke2fs
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index c955b318..76b8b8c6 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -96,6 +96,7 @@ int	journal_flags;
 int	journal_fc_size;
 static e2_blkcnt_t	orphan_file_blocks;
 static int	lazy_itable_init;
+static int	assume_storage_prezeroed;
 static int	packed_meta_blocks;
 int		no_copy_xattrs;
 static char	*bad_blocks_filename = NULL;
@@ -1013,6 +1014,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				lazy_itable_init = strtoul(arg, &p, 0);
 			else
 				lazy_itable_init = 1;
+		} else if (!strcmp(token, "assume_storage_prezeroed")) {
+			if (arg)
+				assume_storage_prezeroed = strtoul(arg, &p, 0);
+			else
+				assume_storage_prezeroed = 1;
 		} else if (!strcmp(token, "lazy_journal_init")) {
 			if (arg)
 				journal_flags |= strtoul(arg, &p, 0) ?
@@ -1131,7 +1137,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tnodiscard\n"
 			"\tencoding=<encoding>\n"
 			"\tencoding_flags=<flags>\n"
-			"\tquotatype=<quota type(s) to be enabled>\n\n"),
+			"\tquotatype=<quota type(s) to be enabled>\n"
+			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
 			badopt ? badopt : "");
 		free(buf);
 		exit(1);
@@ -3125,6 +3132,18 @@ int main (int argc, char *argv[])
 		io_channel_set_options(fs->io, opt_string);
 	}
 
+	if (assume_storage_prezeroed) {
+		if (verbose)
+			printf("%s",
+			       _("Assuming the storage device is prezeroed "
+			       "- skipping inode table and journal wipe\n"));
+
+		lazy_itable_init = 1;
+		itable_zeroed = 1;
+		zero_hugefile = 0;
+		journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+	}
+
 	/* Can't undo discard ... */
 	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
 		retval = mke2fs_discard_device(fs);
diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
new file mode 100644
index 00000000..b735e242
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/expect
@@ -0,0 +1,2 @@
+> 10000
+224
diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
new file mode 100644
index 00000000..1a8d8463
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/script
@@ -0,0 +1,63 @@
+test_description="test prezeroed storage metadata allocation"
+FILE_SIZE=16M
+
+LOG=$test_name.log
+OUT=$test_name.out
+EXP=$test_dir/expect
+
+if test "$(id -u)" -ne 0 ; then
+    echo "$test_name: $test_description: skipped (not root)"
+elif ! command -v losetup >/dev/null ; then
+    echo "$test_name: $test_description: skipped (no losetup)"
+else
+    dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+    dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+
+    LOOP1=$(losetup --show --sector-size 4096 -f $TMPFILE.1)
+    if [ ! -b "$LOOP1" ]; then
+        echo "$test_name: $test_description: skipped (no loop devices)"
+        rm -f $TMPFILE.1 $TMPFILE.2
+        exit 0
+    fi
+    LOOP2=$(losetup --show --sector-size 4096 -f $TMPFILE.2)
+    if [ ! -b "$LOOP2" ]; then
+        echo "$test_name: $test_description: skipped (no loop devices)"
+        rm -f $TMPFILE.1 $TMPFILE.2
+	losetup -d $LOOP1
+        exit 0
+    fi
+
+    echo $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
+    $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
+    sync
+    stat $TMPFILE.1 >> $LOG 2>&1
+    SZ=$(stat -c "%b" $TMPFILE.1)
+    if test $SZ -gt 10000 ; then
+	echo "> 10000" > $OUT
+    else
+	echo "$SZ" > $OUT
+    fi
+
+    echo $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
+    $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
+    sync
+    stat $TMPFILE.2 >> $LOG 2>&1
+    stat -c "%b" $TMPFILE.2 >> $OUT
+
+    losetup -d $LOOP1
+    losetup -d $LOOP2
+    rm -f $TMPFILE.1 $TMPFILE.2
+
+    cmp -s $OUT $EXP
+    status=$?
+
+    if [ "$status" = 0 ] ; then
+	echo "$test_name: $test_description: ok"
+	touch $test_name.ok
+    else
+	echo "$test_name: $test_description: failed"
+	cat $LOG > $test_name.failed
+	diff $EXP $OUT >> $test_name.failed
+    fi
+fi
+unset LOG OUT EXP FILE_SIZE LOOP1 LOOP2



* Re: [PATCH v2] mke2fs: Add extended option for prezeroed storage devices
  2021-10-25  4:25     ` Theodore Ts'o
@ 2022-03-11  9:49       ` Gwendal Grignou
  0 siblings, 0 replies; 9+ messages in thread
From: Gwendal Grignou @ 2022-03-11  9:49 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Sarthak Kukreti, linux-ext4, adilger, okiselev

Ted,

I noticed Sarthak's patch is not in the e2fsprogs-1.46.5 December release.
His patch has been in the |master| branch (commit bd2e72c5c552 ("mke2fs: add
extended option for prezeroed storage devices")) since September, but
not in the |maint| branch. Other patches were not included either;
see below. Is this expected?

git log --cherry-mark --oneline --left-right  origin/master...origin/maint
< 96185e9b (origin/next, origin/master, origin/HEAD) Merge branch 'maint' into next
< f85b4526 tune2fs: implement support for set/get label iocts
< 8adeabee Merge branch 'maint' into next
< 02827d06 ext2fs: avoid re-reading inode multiple times
< bd2e72c5 mke2fs: add extended option for prezeroed storage devices
< a8f52588 dumpe2fs, debugfs, e2image: Add support for orphan file
< 795101dd tune2fs: Add support for orphan_file feature
< d0c52ffb e2fsck: Add support for handling orphan file
< 818da4a9 mke2fs: Add support for orphan_file feature
< 1d551c68 libext2fs: Support for orphan file feature
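
For a single commit, the same question can be asked directly with
git merge-base --is-ancestor, which exits 0 when the commit is reachable
from the given branch.  The sketch below builds a throwaway repository to
show the check; the branch layout, commit messages, and demo identity are
illustrative assumptions and not the e2fsprogs history.

```shell
#!/bin/sh
# Throwaway repository with two branches standing in for master and maint.
REPO=$(mktemp -d)
g() { git -C "$REPO" -c user.name=demo -c user.email=demo@example.com "$@"; }

g init -q
g symbolic-ref HEAD refs/heads/maint        # first commit lands on maint
g commit -q --allow-empty -m "common base"
g branch master
g checkout -q master
g commit -q --allow-empty -m "feature only on master"
FEATURE=$(g rev-parse HEAD)

# Exit status 0 means the commit is reachable from the branch tip.
if g merge-base --is-ancestor "$FEATURE" maint; then
    IN_MAINT=yes
else
    IN_MAINT=no
fi
if g merge-base --is-ancestor "$FEATURE" master; then
    IN_MASTER=yes
else
    IN_MASTER=no
fi
echo "in maint: $IN_MAINT, in master: $IN_MASTER"

rm -rf "$REPO"
```

Against a real clone, the equivalent check would pass the commit hash and
origin/maint in place of the demo names.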

Gwendal.

On Sun, Oct 24, 2021 at 9:25 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> I tried running the regression test, and it was failing for me; it
> showed that even with -E assume_storage_prezeroed, the sizes of
> $TMPFILE.1 and $TMPFILE.2 were the same.  Looking into this, it was
> because in lib/ext2fs/unix_io.c, when the file is a plain file,
> io_channel_discard_zeroes_data() returns true, since it assumes that
> we can use PUNCH_HOLE to implement unix_io_discard(), which is
> guaranteed to work.
>
> So I had to change the regression test to use losetup, which also
> meant that the test had to run as root....
>
> Anyway, this is what I've checked into e2fsprogs.
>
>                                   - Ted
>
> commit bd2e72c5c5521b561d20a881c843a64a5832721a
> Author: Sarthak Kukreti <sarthakkukreti@chromium.org>
> Date:   Mon Sep 27 03:39:10 2021 -0700
>
>     mke2fs: add extended option for prezeroed storage devices
>
>     This patch adds an extended option "assume_storage_prezeroed" to
>     mke2fs. When enabled, this option acts as a hint to mke2fs that the
>     underlying block device was zeroed before mke2fs was called.  This
>     allows mke2fs to optimize out the zeroing of the inode table and the
>     journal, which speeds up the filesystem creation time.
>
>     Additionally, on thinly provisioned storage devices (like Ceph,
>     dm-thin, newly created sparse loopback files), reads on unmapped
>     extents return zero. This property allows mke2fs (with
>     assume_storage_prezeroed) to avoid pre-allocating metadata space for
>     inode tables for the entire filesystem and saves space that would
>     normally be preallocated for zero inode tables.
>
>     Tests
>     -----
>     1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
>     filesystem drops the time taken by mke2fs from 0.09s to 0.04s
>     and reduces the initial metadata space allocation (stat on
>     sparse file) from 139736 blocks (545M) to 8672 blocks (34M).
>
>     2) On ChromeOS (running linux kernel 4.19) with dm-thin
>     and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
>
>     - Time taken by mke2fs drops from 1.07s to 0.08s.
>     - Avoiding zeroing out the inode table and journal reduces the
>       initial metadata space allocation from 0.48% to 0.01%.
>     - Lazy inode table zeroing results in a further 1.45% of logical
>       volume space getting allocated for inode tables, even if no file
>       data is added to the filesystem. With assume_storage_prezeroed,
>       the metadata allocation remains at 0.01%.
>
>     [ Fixed regression test to work on newer versions of e2fsprogs -- TYT ]
>
>     Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index b378e4d7..30f97bb5 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
>  entirely one time.  If the option value is omitted, it defaults to 1 to
>  enable lazy journal inode zeroing.
>  .TP
> +.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
> +If enabled,
> +.BR mke2fs
> +assumes that the storage device has been prezeroed, skips zeroing the journal
> +and inode tables, and annotates the block group flags to signal that the inode
> +table has been zeroed.
> +.TP
>  .B no_copy_xattrs
>  Normally
>  .B mke2fs
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index c955b318..76b8b8c6 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -96,6 +96,7 @@ int   journal_flags;
>  int    journal_fc_size;
>  static e2_blkcnt_t     orphan_file_blocks;
>  static int     lazy_itable_init;
> +static int     assume_storage_prezeroed;
>  static int     packed_meta_blocks;
>  int            no_copy_xattrs;
>  static char    *bad_blocks_filename = NULL;
> @@ -1013,6 +1014,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                                 lazy_itable_init = strtoul(arg, &p, 0);
>                         else
>                                 lazy_itable_init = 1;
> +               } else if (!strcmp(token, "assume_storage_prezeroed")) {
> +                       if (arg)
> +                               assume_storage_prezeroed = strtoul(arg, &p, 0);
> +                       else
> +                               assume_storage_prezeroed = 1;
>                 } else if (!strcmp(token, "lazy_journal_init")) {
>                         if (arg)
>                                 journal_flags |= strtoul(arg, &p, 0) ?
> @@ -1131,7 +1137,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                         "\tnodiscard\n"
>                         "\tencoding=<encoding>\n"
>                         "\tencoding_flags=<flags>\n"
> -                       "\tquotatype=<quota type(s) to be enabled>\n\n"),
> +                       "\tquotatype=<quota type(s) to be enabled>\n"
> +                       "\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
>                         badopt ? badopt : "");
>                 free(buf);
>                 exit(1);
> @@ -3125,6 +3132,18 @@ int main (int argc, char *argv[])
>                 io_channel_set_options(fs->io, opt_string);
>         }
>
> +       if (assume_storage_prezeroed) {
> +               if (verbose)
> +                       printf("%s",
> +                              _("Assuming the storage device is prezeroed "
> +                              "- skipping inode table and journal wipe\n"));
> +
> +               lazy_itable_init = 1;
> +               itable_zeroed = 1;
> +               zero_hugefile = 0;
> +               journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +       }
> +
>         /* Can't undo discard ... */
>         if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
>                 retval = mke2fs_discard_device(fs);
> diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
> new file mode 100644
> index 00000000..b735e242
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/expect
> @@ -0,0 +1,2 @@
> +> 10000
> +224
> diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
> new file mode 100644
> index 00000000..1a8d8463
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/script
> @@ -0,0 +1,63 @@
> +test_description="test prezeroed storage metadata allocation"
> +FILE_SIZE=16M
> +
> +LOG=$test_name.log
> +OUT=$test_name.out
> +EXP=$test_dir/expect
> +
> +if test "$(id -u)" -ne 0 ; then
> +    echo "$test_name: $test_description: skipped (not root)"
> +elif ! command -v losetup >/dev/null ; then
> +    echo "$test_name: $test_description: skipped (no losetup)"
> +else
> +    dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +    dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +
> +    LOOP1=$(losetup --show --sector-size 4096 -f $TMPFILE.1)
> +    if [ ! -b "$LOOP1" ]; then
> +        echo "$test_name: $test_description: skipped (no loop devices)"
> +        rm -f $TMPFILE.1 $TMPFILE.2
> +        exit 0
> +    fi
> +    LOOP2=$(losetup --show --sector-size 4096 -f $TMPFILE.2)
> +    if [ ! -b "$LOOP2" ]; then
> +        echo "$test_name: $test_description: skipped (no loop devices)"
> +        rm -f $TMPFILE.1 $TMPFILE.2
> +       losetup -d $LOOP1
> +        exit 0
> +    fi
> +
> +    echo $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
> +    $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
> +    sync
> +    stat $TMPFILE.1 >> $LOG 2>&1
> +    SZ=$(stat -c "%b" $TMPFILE.1)
> +    if test $SZ -gt 10000 ; then
> +       echo "> 10000" > $OUT
> +    else
> +       echo "$SZ" > $OUT
> +    fi
> +
> +    echo $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
> +    $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
> +    sync
> +    stat $TMPFILE.2 >> $LOG 2>&1
> +    stat -c "%b" $TMPFILE.2 >> $OUT
> +
> +    losetup -d $LOOP1
> +    losetup -d $LOOP2
> +    rm -f $TMPFILE.1 $TMPFILE.2
> +
> +    cmp -s $OUT $EXP
> +    status=$?
> +
> +    if [ "$status" = 0 ] ; then
> +       echo "$test_name: $test_description: ok"
> +       touch $test_name.ok
> +    else
> +       echo "$test_name: $test_description: failed"
> +       cat $LOG > $test_name.failed
> +       diff $EXP $OUT >> $test_name.failed
> +    fi
> +fi
> +unset LOG OUT EXP FILE_SIZE LOOP1 LOOP2
>



Thread overview: 9+ messages
2021-09-21  3:42 [PATCH] mke2fs: Add extended option for prezeroed storage devices Sarthak Kukreti
2021-09-21 21:39 ` Andreas Dilger
2021-09-23  3:31   ` Kiselev, Oleg
2021-09-23  3:57     ` Theodore Ts'o
2021-09-27 10:39   ` [PATCH v2] " Sarthak Kukreti
2021-10-05  3:49     ` Sarthak Kukreti
2021-10-25  4:25     ` Theodore Ts'o
2022-03-11  9:49       ` Gwendal Grignou
2021-09-27 10:43   ` [PATCH] " Sarthak Kukreti
