linux-fsdevel.vger.kernel.org archive mirror
* Questions about filesystems from SQLite author presentation
@ 2020-01-06  7:24 Sitsofe Wheeler
  2020-01-06 10:15 ` Dave Chinner
  2020-01-06 15:40 ` Amir Goldstein
  0 siblings, 2 replies; 11+ messages in thread
From: Sitsofe Wheeler @ 2020-01-06  7:24 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: drh

At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite
(https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled
"Things to discuss"
(https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 )
and had a few questions:

1. Reliable ways to discover detailed filesystem properties
2. fbarrier()
3. Notify the OS about unused regions in the database file

For 1, I think Jan Kara said that supporting it was undesirable for
details like just how many additional fsync calls were needed, due to
competing constraints (https://youtu.be/-oP2BOsMpdo?t=6063 ). Someone
mentioned there was a patch for fsinfo to discover whether you were on
a network filesystem
(https://www.youtube.com/watch?v=-oP2BOsMpdo&feature=youtu.be&t=5525 )...
For 2, there was a talk by MySQL dev Sergei Golubchik
(https://youtu.be/-oP2BOsMpdo?t=1219 ) about how barriers had been
removed and whether there was a replacement. In
https://youtu.be/-oP2BOsMpdo?t=1731 Chris Mason(?) seems to suggest
that the desired effect could be achieved with io_uring chaining.
For 3, it sounded like Jan Kara was saying there wasn't anything at
the moment (hypothetically you could introduce a call that marked the
extents as "unwritten" but it doesn't sound like you can do that
today) and even if you wanted to use something like TRIM it wouldn't
be worth it unless you were trimming a large (gigabytes) amount of
data (https://youtu.be/-oP2BOsMpdo?t=6330 ).

However, there were even more questions in the briefing paper
(https://sqlite.org/lpc2019/doc/trunk/briefing.md and search for '?')
that couldn't be asked due to limited time. Does anyone know the
answers to the extended questions, and whether the above is the right
deduction for the questions that were asked?

-- 
Sitsofe | http://sucs.org/~sits/


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06  7:24 Questions about filesystems from SQLite author presentation Sitsofe Wheeler
@ 2020-01-06 10:15 ` Dave Chinner
  2020-01-07  8:40   ` Sitsofe Wheeler
  2020-01-07  8:47   ` Jan Kara
  2020-01-06 15:40 ` Amir Goldstein
  1 sibling, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2020-01-06 10:15 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: linux-fsdevel, drh

On Mon, Jan 06, 2020 at 07:24:53AM +0000, Sitsofe Wheeler wrote:
> At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite
> (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled
> "Things to discuss"
> (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 )
> and had a few questions:
> 
> 1. Reliable ways to discover detailed filesystem properties
> 2. fbarrier()
> 3. Notify the OS about unused regions in the database file
> 
> For 1, I think Jan Kara said that supporting it was undesirable for
> details like just how many additional fsync calls were needed, due to
> competing constraints (https://youtu.be/-oP2BOsMpdo?t=6063 ). Someone
> mentioned there was a patch for fsinfo to discover whether you were on
> a network filesystem
> (https://www.youtube.com/watch?v=-oP2BOsMpdo&feature=youtu.be&t=5525 )...
> For 2, there was a talk by MySQL dev Sergei Golubchik
> (https://youtu.be/-oP2BOsMpdo?t=1219 ) about how barriers had been
> removed and whether there was a replacement. In
> https://youtu.be/-oP2BOsMpdo?t=1731 Chris Mason(?) seems to suggest
> that the desired effect could be achieved with io_uring chaining.

Even though it wasn't explicitly mentioned, I'm pretty sure that
those "write barriers" for ordering groups of writes against other
groups of writes are intended to be used for data integrity
purposes.

The problem is that data integrity writes also require any
uncommitted filesystem metadata to be written to disk in the correct
order along with the data. i.e. you can write to the log file, but if
the transactions that allocate space and/or convert it to written
space during that write have not been committed to the journal, then
the data is not on stable storage, and so write completion ordering
cannot be relied on for integrity-related operations.

This is why write ordering always comes back to "you need to use
fdatasync(), O_DSYNC or RWF_DSYNC" - it is the only way to guarantee
the integrity of the initial data write(s) right down to the hardware
before starting the new dependent write(s).
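
A minimal userspace sketch of that pattern (the fd, buffers and
offsets are hypothetical; error handling abbreviated):

  #include <err.h>
  #include <unistd.h>

  /* Make write A durable before issuing the dependent write B. */
  void ordered_pair(int fd, const void *a, size_t alen, off_t aoff,
                    const void *b, size_t blen, off_t boff)
  {
          if (pwrite(fd, a, alen, aoff) < 0)
                  err(1, "pwrite A");
          /* The barrier: the data *and* the metadata needed to read it
             back are on stable storage once fdatasync() returns. */
          if (fdatasync(fd) != 0)
                  err(1, "fdatasync");
          /* Only now is it safe to issue the dependent write. */
          if (pwrite(fd, b, blen, boff) < 0)
                  err(1, "pwrite B");
  }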

Hence AIO_FSYNC, and now chained operations in io_uring, allow fsync
to be issued asynchronously and used as a "write barrier" between
groups of order-dependent IOs...
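
A rough sketch of that chaining with liburing (assumes a kernel with
IORING_OP_WRITE, i.e. 5.6+; fd/buffers/offsets are placeholders and
NULL checks on io_uring_get_sqe() are omitted):

  #include <liburing.h>

  /* Queue write A -> fdatasync -> write B as one linked chain: each SQE
     is started only after the previous one completes successfully. */
  int chained_writes(struct io_uring *ring, int fd,
                     const void *a, unsigned alen, off_t aoff,
                     const void *b, unsigned blen, off_t boff)
  {
          struct io_uring_sqe *sqe;

          sqe = io_uring_get_sqe(ring);
          io_uring_prep_write(sqe, fd, a, alen, aoff);
          sqe->flags |= IOSQE_IO_LINK;          /* link to the next SQE */

          sqe = io_uring_get_sqe(ring);
          io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
          sqe->flags |= IOSQE_IO_LINK;          /* this is the "barrier" */

          sqe = io_uring_get_sqe(ring);
          io_uring_prep_write(sqe, fd, b, blen, boff);

          return io_uring_submit(ring);
  }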

> For 3, it sounded like Jan Kara was saying there wasn't anything at
> the moment (hypothetically you could introduce a call that marked the
> extents as "unwritten" but it doesn't sound like you can do that

You can do that with fallocate() - FALLOC_FL_ZERO_RANGE will mark
the unused range as unwritten in XFS, or you can just punch a hole
to free the unused space with FALLOC_FL_PUNCH_HOLE...
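
For example (fd/off/len are placeholders; the flags are exposed via
<fcntl.h> with _GNU_SOURCE):

  #define _GNU_SOURCE
  #include <fcntl.h>

  /* Keep the blocks allocated but mark the range unwritten; subsequent
     reads return zeros without touching the media: */
  int zero_range(int fd, off_t off, off_t len)
  {
          return fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len);
  }

  /* ...or give the blocks back to the filesystem entirely.  PUNCH_HOLE
     must be ORed with KEEP_SIZE so the file size does not change: */
  int punch_range(int fd, off_t off, off_t len)
  {
          return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           off, len);
  }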

> today) and even if you wanted to use something like TRIM it wouldn't
> be worth it unless you were trimming a large (gigabytes) amount of
> data (https://youtu.be/-oP2BOsMpdo?t=6330 ).

Punch the space out, then run a periodic background fstrim so the
filesystem can issue efficient TRIM commands over free space...
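
That background trim boils down to the FITRIM ioctl that fstrim(8)
issues; a minimal sketch (needs CAP_SYS_ADMIN; the fd can be any open
file or directory on the filesystem):

  #include <linux/fs.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Ask the filesystem containing fd to discard all of its free space. */
  int trim_free_space(int fd)
  {
          struct fstrim_range r = {
                  .start  = 0,
                  .len    = UINT64_MAX,   /* the whole filesystem */
                  .minlen = 0,            /* no minimum extent size */
          };
          return ioctl(fd, FITRIM, &r);
  }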

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06  7:24 Questions about filesystems from SQLite author presentation Sitsofe Wheeler
  2020-01-06 10:15 ` Dave Chinner
@ 2020-01-06 15:40 ` Amir Goldstein
  2020-01-06 16:42   ` Matthew Wilcox
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Amir Goldstein @ 2020-01-06 15:40 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: linux-fsdevel, drh, Jan Kara, Dave Chinner, Theodore Tso,
	harshad shirwadkar

On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote:
>
> At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite
> (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled
> "Things to discuss"
> (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 )
> and had a few questions:
>
[...]
>
> However, there were even more questions in the briefing paper
> (https://sqlite.org/lpc2019/doc/trunk/briefing.md and search for '?')
> that couldn't be asked due to limited time. Does anyone know the
> answers to the extended questions, and whether the above is the right
> deduction for the questions that were asked?
>

As Jan said, there is a difference between the answer to "what is the
current behavior" and "what are filesystem developers willing to
commit to as behavior that will remain the same in the future", but I
will try to provide some answers to your questions.

> If a power loss occurs at about the same time that a file is being extended
> with new data, will the file be guaranteed to contain valid data after reboot,
> or might the extended area of the file contain all zeros or all ones or
> arbitrary content? In other words, is the file data always committed to disk
> ahead of the file size?

While that statement would generally be true (ever since ext3
journal=ordered...), you have no such guarantee. Getting such a
guarantee would require a new API like O_ATOMIC.

> If a power loss occurs at about the same time as a file truncation, is it possible
> that the truncated area of the file will contain arbitrary data after reboot?
> In other words, is the file size guaranteed to be committed to disk before the
> data sections are released?

That statement is generally true for filesystems that claim to be
crash consistent. And the filesystems that do not claim to be crash
consistent provide no guarantees at all w.r.t. power loss, so it's not
worth talking about them in this context.

> If a write occurs on one or two bytes of a file at about the same time as a power
> loss, are other bytes of the file guaranteed to be unchanged after reboot?
> Or might some other bytes within the same sector have been modified as well?

I don't see how other bytes could change in this scenario, but I
don't know if the hardware provides this guarantee. Maybe someone else
knows the answer.

> When you create a new file, write to it, and fdatasync() successfully, is it also
> necessary to open and fsync() the containing directory in order to ensure that the
> file will still be there following reboot from a power loss?

There is no guarantee that the file will be there after power loss
without an fsync() of the containing directory. In practice, with
current upstream xfs and ext4 the file will be there after reboot,
because at the moment fdatasync() of a new file implies a journal
flush, which also includes the file creation.
With current upstream btrfs the file may not be there after reboot.
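
A minimal sketch of the safe sequence (paths are placeholders, error
handling abbreviated):

  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Create a file, make its data durable, then fsync the parent
     directory so the new directory entry is durable as well. */
  void create_durably(const char *dirpath, const char *filepath,
                      const void *buf, size_t len)
  {
          int fd = open(filepath, O_CREAT | O_WRONLY | O_TRUNC, 0644);
          if (fd < 0)
                  err(1, "open %s", filepath);
          if (pwrite(fd, buf, len, 0) < 0 || fdatasync(fd) != 0)
                  err(1, "write/fdatasync");
          close(fd);

          int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
          if (dfd < 0)
                  err(1, "open %s", dirpath);
          if (fsync(dfd) != 0)            /* persists the dirent */
                  err(1, "fsync dir");
          close(dfd);
  }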

I tried to promote a new API to provide a weaker guarantee at LSF/MM
2019 [1][2]. The idea is an API for an application that does not need
durability - it doesn't care whether the new file is there or not
after power loss, but if the file is there, its data should be valid.

I do not know if sqlite could potentially use such an API. If there
is a potential use, I did not find it. Specifically, the proposed API
DOES NOT have the semantics of fbarrier() mentioned in the sqlite
briefing doc.

[See more about fdatasync() at the bottom of my reply...]

> Has a file been unlinked or renamed since it was opened?
> (SQLite accomplishes this now by remembering the device and inode numbers
> obtained from fstat() and comparing them against the results of subsequent stat()
> calls against the original filename. Is there a more efficient way to do this?)

name_to_handle_at() is a better way to make sure that a file with the
same name wasn't replaced by another, because inode numbers get
recycled frequently in create/delete workloads.
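
A sketch of that check (the helper name is mine; assumes the
filesystem supports file handles):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>

  /* Return 1 if the object now at `path` is the same inode for which
     `old` was recorded earlier (via the same call), 0 otherwise. */
  int same_file(const char *path, const struct file_handle *old)
  {
          int mount_id, same = 0;
          struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);

          fh->handle_bytes = MAX_HANDLE_SZ;
          if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) == 0)
                  same = fh->handle_bytes == old->handle_bytes &&
                         fh->handle_type  == old->handle_type  &&
                         !memcmp(fh->f_handle, old->f_handle,
                                 fh->handle_bytes);
          free(fh);
          return same;
  }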

> Has a particular file been created since the most recent reboot?

statx(2) exposes "birth time" (STATX_BTIME), which some filesystems
support depending on how they were formatted (e.g. ext4 inode size).
In any case, statx reports whether btime info is available or not.
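
For example (requires glibc 2.28+ for the statx() wrapper; the helper
is illustrative only):

  #define _GNU_SOURCE
  #include <fcntl.h>          /* AT_FDCWD */
  #include <stdio.h>
  #include <sys/stat.h>       /* statx() */

  void print_birth_time(const char *path)
  {
          struct statx stx;

          if (statx(AT_FDCWD, path, 0, STATX_BTIME, &stx) != 0)
                  return;
          if (stx.stx_mask & STATX_BTIME)   /* fs actually recorded it */
                  printf("born: %lld\n", (long long)stx.stx_btime.tv_sec);
          else
                  printf("birth time not available\n");
  }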

> Is it possible (or helpful) to tell the filesystem that the content of a particular file
> does not need to survive reboot?

Not that I know of.

> Is it possible (or helpful) to tell the filesystem that a particular file can be
> unlinked upon reboot?

Not that I know of.

> Is it possible (or helpful) to tell the filesystem about parts of the database
> file that are currently unused and that the filesystem can zero-out without
> harming the database?

As Dave already replied, FALLOC_FL_ZERO_RANGE.

[...more about fdatasync()]

One thing that I think is worth mentioning (I discussed it at LSF
[3]) is the cost of requiring application developers to use the
strictest API (i.e. fsync()) because filesystem developers don't want
to commit to new APIs -

When the same filesystem hosts two different workloads:
1. sqlite with many frequent small transaction commits
2. Creating many small files with no need for durability (e.g. untar)

Both workloads may in practice hurt each other on many filesystems.
The frequent fdatasync() calls from sqlite will sometimes cause
journal flushes, which flush more than sqlite needs, take more time to
commit, and slow down the other metadata-intensive workload.

Ext4 is trying to address this issue without extending the API [4].
XFS was a bit better than ext4 at avoiding unneeded journal flushes,
but those could still take place. Btrfs is generally better in this
regard (fdatasync() effects are quite isolated to the file).

So how can sqlite developers help to improve the situation?
If you ask me, I would suggest providing benchmark results from mixed
workloads, like the one I described above.

If you can demonstrate the negative effects that frequent fdatasync()
calls on a single sqlite db have on the system performance as a whole,
then there is surely something that could be done to fix the problem.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/
[3] https://lwn.net/Articles/788938/
[4] https://lore.kernel.org/linux-ext4/20191001074101.256523-1-harshadshirwadkar@gmail.com/


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06 15:40 ` Amir Goldstein
@ 2020-01-06 16:42   ` Matthew Wilcox
  2020-01-07  9:28     ` Sitsofe Wheeler
  2020-01-06 18:31   ` Amir Goldstein
  2020-01-07  9:16   ` Jan Kara
  2 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2020-01-06 16:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sitsofe Wheeler, linux-fsdevel, drh, Jan Kara, Dave Chinner,
	Theodore Tso, harshad shirwadkar

On Mon, Jan 06, 2020 at 05:40:20PM +0200, Amir Goldstein wrote:
> On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> > If a write occurs on one or two bytes of a file at about the same time as a power
> > loss, are other bytes of the file guaranteed to be unchanged after reboot?
> > Or might some other bytes within the same sector have been modified as well?
> 
> I don't see how other bytes could change in this scenario, but I
> don't know if the hardware provides this guarantee. Maybe someone else
> knows the answer.

The question is nonsense because there is no way to write less than one
sector to a hardware device, by definition.  So, treating this question
as being a read-modify-write of a single sector (assuming the "two bytes"
don't cross a sector boundary):

Hardware vendors are reluctant to provide this guarantee, but it's
essential to constructing a reliable storage system.  We wrote the NVMe
spec in such a way that vendors must provide single-sector-atomicity
guarantees, and I hope they haven't managed to wiggle some nonsense
into the spec that allows them to not make that guarantee.  The below
is a quote from the 1.4 spec.  For those not versed in NVMe spec-ese,
"0's based value" means that putting a zero in this field means the
value of AWUPF is 1.

  Atomic Write Unit Power Fail (AWUPF): This field indicates the size of
  the write operation guaranteed to be written atomically to the NVM across
  all namespaces with any supported namespace format during a power fail
  or error condition.

  If a specific namespace guarantees a larger size than is reported in
  this field, then this namespace specific size is reported in the NAWUPF
  field in the Identify Namespace data structure. Refer to section 6.4.

  This field is specified in logical blocks and is a 0’s based value. The
  AWUPF value shall be less than or equal to the AWUN value.

  If a write command is submitted with size less than or equal to the
  AWUPF value, the host is guaranteed that the write is atomic to the
  NVM with respect to other read or write commands. If a write command
  is submitted that is greater than this size, there is no guarantee of
  command atomicity. If the write size is less than or equal to the AWUPF
  value and the write command fails, then subsequent read commands for the
  associated logical blocks shall return data from the previous successful
  write command. If a write command is submitted with size greater than
  the AWUPF value, then there is no guarantee of data returned on
  subsequent reads of the associated logical blocks.

I take neither blame nor credit for what other storage standards may
implement; this is the only one I had a hand in, and I had to fight
hard to get it.


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06 15:40 ` Amir Goldstein
  2020-01-06 16:42   ` Matthew Wilcox
@ 2020-01-06 18:31   ` Amir Goldstein
  2020-01-07  9:16   ` Jan Kara
  2 siblings, 0 replies; 11+ messages in thread
From: Amir Goldstein @ 2020-01-06 18:31 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: linux-fsdevel, drh, Jan Kara, Dave Chinner, Theodore Tso,
	harshad shirwadkar

On Mon, Jan 6, 2020 at 5:40 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> >
> > At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite
> > (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled
> > "Things to discuss"
> > (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 )
> > and had a few questions:
> >
> [...]
> >
> > However, there were even more questions in the briefing paper
> > (https://sqlite.org/lpc2019/doc/trunk/briefing.md and search for '?')
> > that couldn't be asked due to limited time. Does anyone know the
> > answers to the extended questions, and whether the above is the right
> > deduction for the questions that were asked?
> >
>
> As Jan said, there is a difference between the answer to "what is the
> current behavior" and "what are filesystem developers willing to
> commit to as behavior that will remain the same in the future", but I
> will try to provide some answers to your questions.
>
> > If a power loss occurs at about the same time that a file is being extended
> > with new data, will the file be guaranteed to contain valid data after reboot,
> > or might the extended area of the file contain all zeros or all ones or
> > arbitrary content? In other words, is the file data always committed to disk
> > ahead of the file size?
>
> While that statement would generally be true (ever since ext3
> journal=ordered...),

Bah! Scratch that. The statement is generally not true.
Due to delayed allocation with xfs/ext4 you are much more likely to
find that the extended areas contain all zeroes.
The only guarantee AFAIK is that with a truncate+extend sequence, you
won't find the old data in the re-extended area.

Thanks,
Amir.


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06 10:15 ` Dave Chinner
@ 2020-01-07  8:40   ` Sitsofe Wheeler
  2020-01-07  8:55     ` Jan Kara
  2020-01-07  8:47   ` Jan Kara
  1 sibling, 1 reply; 11+ messages in thread
From: Sitsofe Wheeler @ 2020-01-07  8:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, drh, Jan Kara

On Mon, 6 Jan 2020 at 10:16, Dave Chinner <david@fromorbit.com> wrote:
>
> Hence AIO_FSYNC, and now chained operations in io_uring, allow fsync
> to be issued asynchronously and used as a "write barrier" between
> groups of order-dependent IOs...

Thanks for detailing this.

> > For 3, it sounded like Jan Kara was saying there wasn't anything at
> > the moment (hypothetically you could introduce a call that marked the
> > extents as "unwritten" but it doesn't sound like you can do that
>
> You can do that with fallocate() - FALLOC_FL_ZERO_RANGE will mark
> the unused range as unwritten in XFS, or you can just punch a hole
> to free the unused space with FALLOC_FL_PUNCH_HOLE...

Ah!

> > today) and even if you wanted to use something like TRIM it wouldn't
> > be worth it unless you were trimming a large (gigabytes) amount of
> > data (https://youtu.be/-oP2BOsMpdo?t=6330 ).
>
> Punch the space out, then run a periodic background fstrim so the
> filesystem can issue efficient TRIM commands over free space...

Jan mentions this over on https://youtu.be/-oP2BOsMpdo?t=6268 .
Basically he advises against hole punching if you're going to write to
that area again because it fragments the file, hurts future
performance etc. But I guess if you were using FALLOC_FL_ZERO_RANGE no
hole is punched (so no fragmentation) and you likely get faster reads
of that area until the data is rewritten too. Are areas that have had
FALLOC_FL_ZERO_RANGE run on them eligible for trimming if someone goes
on to do a background trim (Jan - doesn't this sound like the best of
both worlds)?

My question is: what happens if you call FALLOC_FL_ZERO_RANGE and
your filesystem is too dumb to mark extents unwritten - will it
literally go and write a bunch of zeros over that region even when
your disk is a slow HDD, or will the call just fail? It's almost like
you need something that can tell you whether FALLOC_FL_ZERO_RANGE is
efficient...

-- 
Sitsofe | http://sucs.org/~sits/


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06 10:15 ` Dave Chinner
  2020-01-07  8:40   ` Sitsofe Wheeler
@ 2020-01-07  8:47   ` Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Jan Kara @ 2020-01-07  8:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Sitsofe Wheeler, linux-fsdevel, drh

On Mon 06-01-20 21:15:18, Dave Chinner wrote:
> On Mon, Jan 06, 2020 at 07:24:53AM +0000, Sitsofe Wheeler wrote:
> > For 3, it sounded like Jan Kara was saying there wasn't anything at
> > the moment (hypothetically you could introduce a call that marked the
> > extents as "unwritten" but it doesn't sound like you can do that
> 
> You can do that with fallocate() - FALLOC_FL_ZERO_RANGE will mark
> the unused range as unwritten in XFS, or you can just punch a hole
> to free the unused space with FALLOC_FL_PUNCH_HOLE...

Yes, this works for ext4 the same way.

> > today) and even if you wanted to use something like TRIM it wouldn't
> > be worth it unless you were trimming a large (gigabytes) amount of
> > data (https://youtu.be/-oP2BOsMpdo?t=6330 ).
> 
> Punch the space out, then run a periodic background fstrim so the
> filesystem can issue efficient TRIM commands over free space...

Yes. In that particular case Richard was mentioning with SQLite, he was
asking about a situation where he has a DB file which has 64k free here,
256k free there, and whether it helps the OS in any way to tell it that
these areas are free (but will likely get reused in the future). And in
this case I told him that punching out the free space is going to do more
harm than good (due to fragmentation) and that using FALLOC_FL_ZERO_RANGE
isn't going to bring any benefit to the filesystem or the storage. He was
also wondering whether using TRIM for these free areas on disk is useful,
and I told him that for current devices I don't think it will bring any
benefit at the sizes he is talking about.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Questions about filesystems from SQLite author presentation
  2020-01-07  8:40   ` Sitsofe Wheeler
@ 2020-01-07  8:55     ` Jan Kara
  2020-01-07 17:18       ` Darrick J. Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-01-07  8:55 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Dave Chinner, linux-fsdevel, drh, Jan Kara

On Tue 07-01-20 08:40:00, Sitsofe Wheeler wrote:
> On Mon, 6 Jan 2020 at 10:16, Dave Chinner <david@fromorbit.com> wrote:
> > > today) and even if you wanted to use something like TRIM it wouldn't
> > > be worth it unless you were trimming a large (gigabytes) amount of
> > > data (https://youtu.be/-oP2BOsMpdo?t=6330 ).
> >
> > Punch the space out, then run a periodic background fstrim so the
> > filesystem can issue efficient TRIM commands over free space...
> 
> Jan mentions this over on https://youtu.be/-oP2BOsMpdo?t=6268 .
> Basically he advises against hole punching if you're going to write to
> that area again because it fragments the file, hurts future
> performance etc. But I guess if you were using FALLOC_FL_ZERO_RANGE no
> hole is punched (so no fragmentation) and you likely get faster reads
> of that area until the data is rewritten too.

Yes, no fragmentation in this case (well, there's still the fact that
the extent tree needs to record that a particular range is marked as
unwritten so that will get fragmented but it is merged again as soon as the
range is written).

> Are areas that have had
> FALLOC_FL_ZERO_RANGE run on them eligible for trimming if someone goes
> on to do a background trim (Jan - doesn't this sound like the best of
> both worlds)?

No, these areas are still allocated to the file and thus background
trim will not touch them. Conceivably, we could use trim for such
areas, but it is going to be too expensive to discover them (you'd
need to read all the inodes and their extent trees), at least for ext4
and I believe for xfs as well.

> My question is: what happens if you call FALLOC_FL_ZERO_RANGE and
> your filesystem is too dumb to mark extents unwritten - will it
> literally go and write a bunch of zeros over that region even when
> your disk is a slow HDD, or will the call just fail? It's almost like
> you need something that can tell you whether FALLOC_FL_ZERO_RANGE is
> efficient...

It is up to the filesystem how it implements the operation, but so
far we have managed to maintain a situation where FALLOC_FL_ZERO_RANGE
returns an error if it is not efficient.
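
So a caller can simply try it and skip the optimization when the
filesystem declines; a sketch (the helper name is mine):

  #define _GNU_SOURCE
  #include <errno.h>
  #include <fcntl.h>

  /* Try the cheap unwritten-extent path.  If the filesystem cannot do
     it efficiently the call fails (typically EOPNOTSUPP) rather than
     writing zeros, so we just skip the optimization. */
  int mark_range_unused(int fd, off_t off, off_t len)
  {
          if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
                  return 0;
          return errno == EOPNOTSUPP ? 1 : -1;  /* 1: unsupported here */
  }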

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06 15:40 ` Amir Goldstein
  2020-01-06 16:42   ` Matthew Wilcox
  2020-01-06 18:31   ` Amir Goldstein
@ 2020-01-07  9:16   ` Jan Kara
  2 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2020-01-07  9:16 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sitsofe Wheeler, linux-fsdevel, drh, Jan Kara, Dave Chinner,
	Theodore Tso, harshad shirwadkar

On Mon 06-01-20 17:40:20, Amir Goldstein wrote:
> On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> > If a power loss occurs at about the same time that a file is being extended
> > with new data, will the file be guaranteed to contain valid data after reboot,
> > or might the extended area of the file contain all zeros or all ones or
> > arbitrary content? In other words, is the file data always committed to disk
> > ahead of the file size?
> 
> While that statement would generally be true (ever since ext3
> journal=ordered...), you have no such guarantee. Getting such a
> guarantee would require a new API like O_ATOMIC.

This is not quite true.

1) The rule you can rely on is: no random data in a file. So after a power
failure the state of the file can be:
  a) the original file state
  b) the file size increased (possibly only partially), with each block in
     the extended area containing either correct data or zeros.

There are exceptions to this for filesystems that don't maintain metadata
consistency on crash, such as ext2, vfat, udf, or ext4 in data=writeback
mode. There the outcome after a crash is undefined...

> > If a write occurs on one or two bytes of a file at about the same time as a power
> > loss, are other bytes of the file guaranteed to be unchanged after reboot?
> > Or might some other bytes within the same sector have been modified as well?
> 
> I don't see how other bytes could change in this scenario, but I don't
> know if the hardware provides this guarantee. Maybe someone else knows
> the answer.

As Matthew wrote, this boils down to whether the HW provides sector write
atomicity. In practice, that seems to be the case.

> > Is it possible (or helpful) to tell the filesystem that the content of a particular file
> > does not need to survive reboot?
> 
> Not that I know of.
> 
> > Is it possible (or helpful) to tell the filesystem that a particular file can be
> > unlinked upon reboot?
> 
> Not that I know of.

Well, you could create the file with the O_TMPFILE flag. That will
give you an unlinked inode which will just get deleted once the file
is closed (and also on reboot). If you don't want to keep the file
open all the time you use it, then I don't know of a way.
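
For example (the directory path is a placeholder):

  #define _GNU_SOURCE
  #include <fcntl.h>

  /* Open an unlinked scratch file inside dirpath: no name is ever
     created, so the inode is reclaimed on close() -- and on reboot,
     if we crash first. */
  int open_scratch(const char *dirpath)
  {
          return open(dirpath, O_TMPFILE | O_RDWR, 0600);
  }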

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Questions about filesystems from SQLite author presentation
  2020-01-06 16:42   ` Matthew Wilcox
@ 2020-01-07  9:28     ` Sitsofe Wheeler
  0 siblings, 0 replies; 11+ messages in thread
From: Sitsofe Wheeler @ 2020-01-07  9:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Amir Goldstein, linux-fsdevel, drh, Jan Kara, Dave Chinner,
	Theodore Tso, harshad shirwadkar

On Mon, 6 Jan 2020 at 16:42, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 06, 2020 at 05:40:20PM +0200, Amir Goldstein wrote:
> > On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> > > If a write occurs on one or two bytes of a file at about the same time as a power
> > > loss, are other bytes of the file guaranteed to be unchanged after reboot?
> > > Or might some other bytes within the same sector have been modified as well?
> >
> > I don't see how other bytes could change in this scenario, but I
> > don't know if the hardware provides this guarantee. Maybe someone
> > else knows the answer.
>
> The question is nonsense because there is no way to write less than one
> sector to a hardware device, by definition.  So, treating this question
> as being a read-modify-write of a single sector (assuming the "two bytes"
> don't cross a sector boundary):
>
> Hardware vendors are reluctant to provide this guarantee, but it's
> essential to constructing a reliable storage system.  We wrote the NVMe
> spec in such a way that vendors must provide single-sector-atomicity
> guarantees, and I hope they haven't managed to wiggle some nonsense
> into the spec that allows them to not make that guarantee.  The below
> is a quote from the 1.4 spec.  For those not versed in NVMe spec-ese,
> "0's based value" means that putting a zero in this field means the
> value of AWUPF is 1.

Wow - that's the first time I've seen someone go on the record as
saying a sector write is atomic (albeit only for NVMe disks) without
having it instantly debated! Sadly there's no way of guaranteeing this
atomicity from userspace if https://youtu.be/-oP2BOsMpdo?t=3557 (where
Chris Mason(?) warns there can be corner cases when trying to use
O_DIRECT) is to be believed, though?

> I take neither blame nor credit for what other storage standards may
> implement; this is the only one I had a hand in, and I had to fight
> hard to get it.

So there's no consensus for SATA/SCSI etc
(https://stackoverflow.com/questions/2009063/are-disk-sector-writes-atomic
)? Just need to wait until there's NVMe everywhere :-)

-- 
Sitsofe | http://sucs.org/~sits/


* Re: Questions about filesystems from SQLite author presentation
  2020-01-07  8:55     ` Jan Kara
@ 2020-01-07 17:18       ` Darrick J. Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2020-01-07 17:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: Sitsofe Wheeler, Dave Chinner, linux-fsdevel, drh

On Tue, Jan 07, 2020 at 09:55:06AM +0100, Jan Kara wrote:
> On Tue 07-01-20 08:40:00, Sitsofe Wheeler wrote:
> > On Mon, 6 Jan 2020 at 10:16, Dave Chinner <david@fromorbit.com> wrote:
> > > > today) and even if you wanted to use something like TRIM it wouldn't
> > > > be worth it unless you were trimming a large (gigabytes) amount of
> > > > data (https://youtu.be/-oP2BOsMpdo?t=6330 ).
> > >
> > > Punch the space out, then run a periodic background fstrim so the
> > > filesystem can issue efficient TRIM commands over free space...
> > 
> > Jan mentions this over on https://youtu.be/-oP2BOsMpdo?t=6268 .
> > Basically he advises against hole punching if you're going to write to
> > that area again because it fragments the file, hurts future
> > performance etc. But I guess if you were using FALLOC_FL_ZERO_RANGE no
> > hole is punched (so no fragmentation) and you likely get faster reads
> > of that area until the data is rewritten too.
> 
> Yes, no fragmentation in this case (well, there's still the fact that
> the extent tree needs to record that a particular range is marked as
> unwritten so that will get fragmented but it is merged again as soon as the
> range is written).
> 
> > Are areas that have had
> > FALLOC_FL_ZERO_RANGE run on them eligible for trimming if someone goes
> > on to do a background trim (Jan - doesn't this sound like the best of
> > both worlds)?
> 
> No, these areas are still allocated to the file and thus background
> trim will not touch them. Conceivably, we could use trim for such
> areas, but it is going to be too expensive to discover them (you'd
> need to read all the inodes and their extent trees), at least for ext4
> and I believe for xfs as well.
> 
> > My question is: what happens if you call FALLOC_FL_ZERO_RANGE and
> > your filesystem is too dumb to mark extents unwritten - will it
> > literally go and write a bunch of zeros over that region even when
> > your disk is a slow HDD, or will the call just fail? It's almost like
> > you need something that can tell you whether FALLOC_FL_ZERO_RANGE is
> > efficient...
> 
> It is up to the filesystem how it implements the operation, but so
> far we have managed to maintain a situation where FALLOC_FL_ZERO_RANGE
> returns an error if it is not efficient.

The manpage says "...the specified range will not be physically zeroed
out on the device (except for partial blocks at the either end of the
range), and I/O is (otherwise) required only to update metadata."

I think that should be sufficient to hold the fs authors to
"FALLOC_FL_ZERO_RANGE must be efficient".

Though I've also wondered if that means the fs is free to call
blkdev_issue_zeroout() with NOFALLBACK in lieu of using unwritten
extents?

--D

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

