* Overwrite faster than fallocate
@ 2022-06-17 16:38 Santosh S
  2022-06-17 22:12 ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Santosh S @ 2022-06-17 16:38 UTC (permalink / raw)
  To: linux-ext4

Dear ext4 developers,

This is my test: preallocate a large file (2G) and then do sequential
4K direct-I/O writes to that file, with fdatasync after every write.
I am preallocating using fallocate mode 0. I noticed that if the 2G
file is pre-written rather than fallocate'd, I get more than twice the
throughput. I could reproduce this with fio. The storage is NVMe.
The kernel version is 5.3.18 on SUSE.

1. Clear the test directory
# rm -rf /mnt/nvme1n1/*

2. Run fio using fallocate
# taskset -c 0 ./fio -directory=/mnt/nvme1n1 -ioengine=io_uring
-fdatasync=1 -direct=1 -rw=write -iodepth=128 -iodepth_batch=64
-iodepth_batch_complete=64 -fallocate=native -bs=4k -size=2G -thread=1
-time_based=0 -numjobs=1 -group_reporting -output=fio.out
-name=fiotest

3. Results
write: IOPS=188k, BW=732MiB/s (768MB/s)(2048MiB/2796msec)

4. Run the same test again; this time the file already exists from the
previous run.
write: IOPS=420k, BW=1640MiB/s (1719MB/s)(2048MiB/1249msec)

It doesn't matter if I pass -fallocate to fio or not in step 4.

When I run ftrace (and if I am understanding the output correctly) I see
that in the first run ext4_convert_unwritten_extents() seems to be
taking a lot of time. This call is not present in the second run.

 110)  <...>-11449   | # 1102.026 us |      } /* ext4_convert_unwritten_extents [ext4] */
 110)  <...>-11449   |   0.117 us    |      ext4_release_io_end [ext4]();
 110)  <...>-11449   | # 1102.421 us |    } /* ext4_put_io_end [ext4] */
 110)  <...>-11449   | # 1102.599 us |  } /* ext4_end_io_dio [ext4] */
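
A graph trace like this can be captured along these lines (a rough sketch,
assuming the function_graph tracer with tracefs mounted at
/sys/kernel/tracing; function names and paths may differ on other kernels):

# echo ext4_end_io_dio > /sys/kernel/tracing/set_graph_function
# echo function_graph > /sys/kernel/tracing/current_tracer
# echo 1 > /sys/kernel/tracing/tracing_on
  ... run the fio job ...
# cat /sys/kernel/tracing/trace > ftrace.out
# echo nop > /sys/kernel/tracing/current_tracer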

Am I doing something wrong, or is this difference expected? Any
suggestions for getting better throughput without actually pre-writing
the file would be appreciated.

Thank you for your time,
Santosh


* Re: Overwrite faster than fallocate
  2022-06-17 16:38 Overwrite faster than fallocate Santosh S
@ 2022-06-17 22:12 ` Theodore Ts'o
  2022-06-17 23:56   ` Santosh S
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2022-06-17 22:12 UTC (permalink / raw)
  To: Santosh S; +Cc: linux-ext4

On Fri, Jun 17, 2022 at 12:38:20PM -0400, Santosh S wrote:
> Dear ext4 developers,
> 
> This is my test - preallocate a large file (2G) and then do sequential
> 4K direct-io writes to that file, with fdatasync after every write.
> I am preallocating using fallocate mode 0. I noticed that if the 2G
> file is pre-written rather than fallocate'd I get more than twice the
> throughput. I could reproduce this with fio. The storage is nvme.
> Kernel version is 5.3.18 on Suse.
>
> Am I doing something wrong or is this difference expected? Any
> suggestion to get a better throughput without actually pre-writing the
> file.

This is, alas, expected.  The reason is that when you use
fallocate, the extent is marked as uninitialized, so that when you
read from those newly allocated blocks, you don't see previously
written data belonging to deleted files.  Those files could contain
someone else's e-mail, or medical information, etc.  So if we didn't
do this, it would be a walking, talking HIPAA or PCI violation.

So when you write to an fallocated region and then call fdatasync(2),
we need to update the metadata blocks to clear the uninitialized bit,
so that when you read from the file after a crash, you actually get
the data that was written.  That makes fdatasync(2) quite a
heavyweight operation, since it requires a journal commit because of
the required metadata update.  When you do an overwrite, there is no
need to force a metadata update and journal commit, which is why
write(2) plus fdatasync(2) is much lighter weight when you do an
overwrite.

What enterprise databases (e.g., Oracle Enterprise Database and IBM's
Informix DB) tend to do is fallocate a chunk of space (say, 16MB or
32MB), because on legacy Unix OS's this tends to make some file
systems' block allocators more likely to allocate a contiguous block
range, and then immediately write zeroes over that 16MB or 32MB, plus
an fdatasync(2).  That fdatasync(2) updates the extent tree once, so
that the whole 16MB or 32MB of the database's tablespace file is
marked initialized, and you only pay the metadata update once, instead
of every few dozen kilobytes as you write each database commit into
the tablespace file.
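
A rough shell approximation of that pattern, for the first 16MB chunk of
a tablespace file (a sketch only -- a real database would issue
fallocate(2)/write(2)/fdatasync(2) directly, and "tablespace.db" is just
a placeholder name):

# fallocate -o 0 -l 16M tablespace.db
# dd if=/dev/zero of=tablespace.db bs=1M count=16 oflag=direct \
     conv=notrunc,fdatasync

The conv=fdatasync flag makes dd issue the fdatasync(2) after the
zero-fill, so the extent-tree update for the whole 16MB is paid in a
single journal commit.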

There is also an old, out-of-tree patch which enables an fallocate
mode called "no hide stale", which marks the extents which are
allocated using fallocate(2) as initialized.  This substantially
speeds things up, but it is potentially a walking, talking HIPAA or
PCI violation, in that revealing previously written data is considered
a horrible security violation by most file system developers.

If you know, say, that a cluster file system is the only user of the
file system, and all data is written encrypted at rest using a
per-user key, such that exposing stale data is not a security
disaster, the "no hide stale" flag could be "safe" in that highly
specialized use case.

But that assumes that file system authors can trust application
writers not to do something stupid and insecure, and historically,
file system authors (possibly with good reason, given bitter past
experience) don't trust application writers to resist doing something
which is very easy and gooses performance, even if it has terrible
side effects on either data robustness or data security.

Effectively, the no hide stale flag could be considered an "Attractive
Nuisance"[1], and so support for this feature has never been accepted
into the mainline kernel, nor into any distro kernels, since the
distribution companies don't want to be held liable for creating an
"attractive nuisance" that might enable application authors to shoot
themselves in the foot.

[1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine

In any case, the technique of fallocate(2) plus zero-fill write plus
fdatasync(2) isn't *that* slow, and is only needed when you are first
extending the tablespace file.  In the steady state, most database
applications tend to be overwriting space, so this isn't an issue.

In any case, if you need to get that last 5% or so of performance ---
say, if you are an enterprise database company interested in taking
out a full-page advertisement on the back cover of Business Week
Magazine touting how your enterprise database benchmarks are better
than the competition's --- the simple solution is to use a raw block
device.  Of course, most end users want the convenience of the file
system, but that's not the point if you are engaging in
benchmarketing.   :-)

Cheers,

						- Ted


* Re: Overwrite faster than fallocate
  2022-06-17 22:12 ` Theodore Ts'o
@ 2022-06-17 23:56   ` Santosh S
  2022-06-18  0:41     ` Santosh S
  2022-06-20 18:52     ` Andreas Dilger
  0 siblings, 2 replies; 8+ messages in thread
From: Santosh S @ 2022-06-17 23:56 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

 On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> [snip]
>

Thank you for a comprehensive answer :-)

I have one more question - when I gradually increase the I/O transfer
size, the performance degradation begins to lessen, and at 32K it is
similar to the "overwriting the file" case. I assume this is because
the metadata update is now spread over 32K of data rather than 4K.
However, my understanding is that, in my case, an extent should cover
up to the maximum of 128MiB of data, and so the clearing of the
uninitialized bit for an extent should happen only once every 128MiB,
so why does a higher transfer size make a difference?

Santosh


* Re: Overwrite faster than fallocate
  2022-06-17 23:56   ` Santosh S
@ 2022-06-18  0:41     ` Santosh S
  2022-06-20 18:52     ` Andreas Dilger
  1 sibling, 0 replies; 8+ messages in thread
From: Santosh S @ 2022-06-18  0:41 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On Fri, Jun 17, 2022 at 7:56 PM Santosh S <santosh.letterz@gmail.com> wrote:
>
>  On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@mit.edu> wrote:
> >
> > [snip]
>
> Thank you for a comprehensive answer :-)
>
> I have one more question - when I gradually increase the i/o transfer
> size the performance degradation begins to lessen and at 32K it is
> similar to the "overwriting the file" case. I assume this is because
> the metadata update is now spread over 32K of data rather than 4K.
> However, my understanding is that, in my case, an extent should
> represent the max 128MiB of data and so the clearing of the
> uninitialized bit for an extent should happen once every 128MiB, so
> then why is a higher transfer size making a difference?
>

I think I understand. The metadata update cannot be just clearing the
uninitialized bit; it must also update a high-water mark recording the
length of the initialized part of the extent.

> Santosh


* Re: Overwrite faster than fallocate
  2022-06-17 23:56   ` Santosh S
  2022-06-18  0:41     ` Santosh S
@ 2022-06-20 18:52     ` Andreas Dilger
  2022-06-23 18:28       ` Santosh S
  1 sibling, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2022-06-20 18:52 UTC (permalink / raw)
  To: Santosh S; +Cc: Theodore Ts'o, linux-ext4


On Jun 17, 2022, at 5:56 PM, Santosh S <santosh.letterz@gmail.com> wrote:
> 
> On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> [snip]
> 
> Thank you for a comprehensive answer :-)
> 
> I have one more question - when I gradually increase the i/o transfer
> size the performance degradation begins to lessen and at 32K it is
> similar to the "overwriting the file" case. I assume this is because
> the metadata update is now spread over 32K of data rather than 4K.

When splitting unwritten extents, the ext4 code will write out zero
blocks up to 32KB by default (/sys/fs/ext4/*/extent_max_zeroout_kb)
to avoid having millions of very small extents in a file (e.g. in
case of a pathological alternating 4KB write pattern).  If your test
is writing >= 32KB blocks then this no longer needs to be done.  If
writing smaller blocks then it makes sense that the speed is 1/2 the
raw speed because the file blocks are all being written twice (first
with zeroes, then with actual data on a later write).

32KB (or 64KB) is a reasonable minimum size because any disk write
will take the same time to write a single block or a whole sector,
so doing writes in smaller units is not very efficient.  Depending
on the underlying storage (e.g. RAID-6) it might be more efficient
to set extent_max_zeroout_kb=1024 or similar.
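
For example, to check and raise the threshold for the file system in this
thread (a sketch; the sysfs directory name is assumed to be nvme1n1 here,
and 32 is the default value mentioned above):

# cat /sys/fs/ext4/nvme1n1/extent_max_zeroout_kb
32
# echo 1024 > /sys/fs/ext4/nvme1n1/extent_max_zeroout_kb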

> However, my understanding is that, in my case, an extent should
> represent the max 128MiB of data and so the clearing of the
> uninitialized bit for an extent should happen once every 128MiB, so
> then why is a higher transfer size making a difference?

You are misunderstanding how uninitialized extents are cleared.  The
uninitialized extent is split into two/three parts, where only the
extent that has data written to it (min 32KB) is set to "initialized"
and the remaining one/two extents are left uninitialized.  Otherwise,
each write to an uninitialized extent would need up to 128MB of zeroes
written to disk each time, which would be slow/high latency.

Cheers, Andreas








* Re: Overwrite faster than fallocate
  2022-06-20 18:52     ` Andreas Dilger
@ 2022-06-23 18:28       ` Santosh S
  2022-06-23 19:43         ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Santosh S @ 2022-06-23 18:28 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Ts'o, linux-ext4

On Mon, Jun 20, 2022 at 2:50 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Jun 17, 2022, at 5:56 PM, Santosh S <santosh.letterz@gmail.com> wrote:
> >
> > On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@mit.edu> wrote:
> >>
> >> [snip]
> >
> > Thank you for a comprehensive answer :-)
> >
> > I have one more question - when I gradually increase the i/o transfer
> > size the performance degradation begins to lessen and at 32K it is
> > similar to the "overwriting the file" case. I assume this is because
> > the metadata update is now spread over 32K of data rather than 4K.
>
> When splitting unwritten extents, the ext4 code will write out zero
> blocks up to 32KB by default (/sys/fs/ext4/*/extent_max_zeroout_kb)
> to avoid having millions of very small extents in a file (e.g. in
> case of a pathological alternating 4KB write pattern).  If your test
> is writing >= 32KB blocks then this no longer needs to be done.  If
> writing smaller blocks then it makes sense that the speed is 1/2 the
> raw speed because the file blocks are all being written twice (first
> with zeroes, then with actual data on a later write).
>
> 32KB (or 64KB) is a reasonable minimum size because any disk write
> will take the same time to write a single block or a whole sector,
> so doing writes in smaller units is not very efficient.  Depending
> on the underlying storage (e.g. RAID-6) it might be more efficient
> to set extent_max_zeroout_kb=1024 or similar.
>
> > However, my understanding is that, in my case, an extent should
> > represent the max 128MiB of data and so the clearing of the
> > uninitialized bit for an extent should happen once every 128MiB, so
> > then why is a higher transfer size making a difference?
>
> You are misunderstanding how uninitialized extents are cleared.  The
> uninitialized extent is split into two/three parts, where only the
> extent that has data written to it (min 32KB) is set to "initialized"
> and the remaining one/two extents are left uninitialized.  Otherwise,
> each write to an uninitialized extent would need up to 128MB of zeroes
> written to disk each time, which would be slow/high latency.
>
> Cheers, Andreas
>
>
Thank you and sorry for the delay in responding.

What kind of write will stop an uninitialized extent from splitting?
For example, I want to create a file, fallocate 512MB, and zero-fill
it. But I want the file system to only create 4 extents so they all
reside in the inode itself, and each extent represents the entire
128MB (so no splitting).
Even if I do large writes, my understanding is that ultimately the
kernel / hardware restrictions will split the I/O into smaller chunks,
thus causing the extent to split. For example, this is what I see on
my test system:

# cat /sys/block/nvme1n1/queue/max_hw_sectors_kb
128
# cat /sys/block/nvme1n1/queue/max_sectors_kb
128

Santosh


* Re: Overwrite faster than fallocate
  2022-06-23 18:28       ` Santosh S
@ 2022-06-23 19:43         ` Theodore Ts'o
  2022-06-23 21:55           ` Santosh S
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2022-06-23 19:43 UTC (permalink / raw)
  To: Santosh S; +Cc: Andreas Dilger, linux-ext4

On Thu, Jun 23, 2022 at 02:28:47PM -0400, Santosh S wrote:
> 
> What kind of write will stop an uninitialized extent from splitting?
> For example, I want to create a file, fallocate 512MB, and zero-fill
> it. But I want the file system to only create 4 extents so they all
> reside in the inode itself, and each extent represents the entire
> 128MB (so no splitting).

If you write into an uninitialized extent, it *has* to be split, since
we have to record what has been initialized and what has not.  So for
example:

root@kvm-xfstests:/vdc# fallocate  -l 1M test-file
root@kvm-xfstests:/vdc# filefrag -vs test-file
Filesystem type is: ef53
File size of test-file is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:      68864..     69119:    256:             last,unwritten,eof
test-file: 1 extent found
root@kvm-xfstests:/vdc# dd if=/dev/zero of=test-file bs=1k conv=notrunc bs=4k count=1 seek=10
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000252186 s, 16.2 MB/s
root@kvm-xfstests:/vdc# filefrag -vs test-file
Filesystem type is: ef53
File size of test-file is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       9:      68864..     68873:     10:             unwritten
   1:       10..      10:      68874..     68874:      1:            
   2:       11..     255:      68875..     69119:    245:             last,unwritten,eof
test-file: 1 extent found

However, if you write to an adjacent block, the extent will again get
split --- and the newly written block will then be merged with the
adjacent initialized extent.  So for example, if we write to block 9:

root@kvm-xfstests:/vdc# dd if=/dev/zero of=test-file bs=1k conv=notrunc bs=4k count=1 seek=9
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000205357 s, 19.9 MB/s
root@kvm-xfstests:/vdc# filefrag -vs test-file
Filesystem type is: ef53
File size of test-file is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       8:      68864..     68872:      9:             unwritten
   1:        9..      10:      68873..     68874:      2:            
   2:       11..     255:      68875..     69119:    245:             last,unwritten,eof
test-file: 1 extent found

So if you eventually write all of the blocks, then because of this
splitting and merging behavior, the extent tree ends up in an
efficient state:

root@kvm-xfstests:/vdc# dd if=/dev/zero of=test-file bs=1k conv=notrunc bs=4k count=9 seek=0
    ...
root@kvm-xfstests:/vdc# filefrag -vs test-file
Filesystem type is: ef53
File size of test-file is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      10:      68864..     68874:     11:            
   1:       11..     255:      68875..     69119:    245:             last,unwritten,eof
test-file: 1 extent found
root@kvm-xfstests:/vdc# dd if=/dev/zero of=test-file bs=1k conv=notrunc bs=4k count=240 seek=11
    ...
root@kvm-xfstests:/vdc# filefrag -vs test-file
Filesystem type is: ef53
File size of test-file is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     250:      68864..     69114:    251:            
   1:      251..     255:      69115..     69119:      5:             last,unwritten,eof
test-file: 1 extent found
root@kvm-xfstests:/vdc# dd if=/dev/zero of=test-file bs=1k conv=notrunc bs=4k count=5 seek=251
    ...
root@kvm-xfstests:/vdc# filefrag -vs test-file
Filesystem type is: ef53
File size of test-file is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:      68864..     69119:    256:             last,eof
test-file: 1 extent found
root@kvm-xfstests:/vdc# 

Bottom line: there isn't just splitting, but also merging going on.
So it's not really something that you need to worry about.

Cheers,

						- Ted


* Re: Overwrite faster than fallocate
  2022-06-23 19:43         ` Theodore Ts'o
@ 2022-06-23 21:55           ` Santosh S
  0 siblings, 0 replies; 8+ messages in thread
From: Santosh S @ 2022-06-23 21:55 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

On Thu, Jun 23, 2022 at 3:43 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> [snip]

Nice! Thank you.

