* [LSF/MM/BPF TOPIC] untorn buffered writes
@ 2024-02-28  6:12 Theodore Ts'o
  2024-02-28 11:38 ` [Lsf-pc] " Amir Goldstein
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Theodore Ts'o @ 2024-02-28  6:12 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel, linux-mm

Last year, I talked about an interest in providing databases such as
MySQL with the ability to issue writes that would not be torn as they
write 16k database pages[1].

[1] https://lwn.net/Articles/932900/

There is a patch set being worked on by John Garry which provides
stronger guarantees than what is actually required for this use case,
called "atomic writes".  The proposed interface for this facility
involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
that the specific write be written to the storage device in an
all-or-nothing fashion, and that the write fail if that can not be
guaranteed.  With this interface, if userspace sends a 128k write with
the RWF_ATOMIC flag, and the storage device supports an all-or-nothing
write of the given size and alignment, the kernel will guarantee that
it is sent as a single 128k request --- although from the database's
perspective, if it is using 16k database pages, it only needs the
guarantee that if the write is torn, it is torn only on a 16k boundary.
That is, if the write is split into 32k and 96k requests, that would be
totally fine as far as the database is concerned --- and so the
RWF_ATOMIC interface is a stronger guarantee than what might be needed.
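
For illustration, a 16k database page write under the proposed
interface would look roughly like the sketch below (RWF_ATOMIC comes
from John's patch set and is not yet in released uapi headers, so the
value used here is an assumption):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* assumed value, per the proposed series */
#endif

/* Sketch only: write one 16k database page; the whole page either
 * reaches the device or the write fails. */
static ssize_t write_db_page(int fd, const void *page, off_t offset)
{
        struct iovec iov = {
                .iov_base = (void *)page,
                .iov_len  = 16384,
        };

        return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
}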

So far, the "atomic write" patch set has only focused on Direct I/O,
where this stronger guarantee is mostly harmless, even if it is
unneeded for the original motivating use case.  Which might be OK,
since perhaps there are future use cases where a process might want
some 32k writes to be "atomic" while other 128k writes should also be
"atomic" (that is to say, persisted with all-or-nothing semantics), and
the proposed RWF_ATOMIC interface might permit that --- even though no
one seems able to come up with a credible use case that would require
this.


However, this proposed interface is highly problematic when it comes
to buffered writes, and the Postgres database uses buffered, not direct
I/O writes.  Suppose the database performs a 16k write, followed by a
64k write, followed by a 128k write --- and these writes are done using
a file descriptor that does not have O_DIRECT enabled, and let's
suppose they are issued with the proposed RWF_ATOMIC flag.  In order to
provide the (stronger than we need) RWF_ATOMIC guarantee, the kernel
would need to record the fact that certain pages in the page cache were
dirtied as part of a 16k RWF_ATOMIC write, other pages were dirtied as
part of a 64k RWF_ATOMIC write, etc., so that the writeback code knows
what "atomic" guarantee was made at write time.  This very quickly
becomes a mess.

Another interface, one that would be much simpler to implement for
buffered writes, would be one where the untorn write granularity is set
on a per-file-descriptor basis, using fcntl(2).  We validate whether
the untorn write granularity is one that can be supported when fcntl(2)
is called, and we also store in the inode the largest untorn write
granularity that has been requested by a file descriptor for that
inode.  (When the last file descriptor opened for writing has been
closed, the largest untorn write granularity for that inode can be set
back down to zero.)
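
As a rough sketch of what this could look like from userspace (the
F_SET_UNTORN_GRANULARITY command below is purely hypothetical --- it is
a stand-in for whatever fcntl(2) command we would actually define):

#include <fcntl.h>
#include <stdio.h>

#ifndef F_SET_UNTORN_GRANULARITY
#define F_SET_UNTORN_GRANULARITY 1050   /* hypothetical command number */
#endif

/* Sketch only: ask that writes via this fd not be torn at a finer
 * granularity than 'granularity' bytes (e.g. 16384 for 16k DB pages).
 * The call fails if the file system / device cannot support it. */
static int request_untorn_writes(int fd, int granularity)
{
        if (fcntl(fd, F_SET_UNTORN_GRANULARITY, granularity) < 0) {
                perror("F_SET_UNTORN_GRANULARITY");
                return -1;
        }
        return 0;
}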

The write(2) system call will check whether the size and alignment of
the write are valid given the requested untorn write granularity.  And
in the writeback path, the writeback code will detect whether there are
contiguous (aligned) dirty pages, and make sure they are sent to the
storage device in multiples of the largest requested untorn write
granularity.  This provides only the guarantees required by databases,
and obviates the need to track which pages were dirtied by an
RWF_ATOMIC write, and the size of that write.
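
The write-time check itself would be trivial; roughly (a sketch only,
not actual kernel code):

#include <stdbool.h>
#include <sys/types.h>

/* Sketch only: a write is acceptable if it starts on a granularity
 * boundary and is a whole multiple of the granularity requested via
 * fcntl(2); a granularity of zero means no untorn guarantee is in
 * effect. */
static bool untorn_write_ok(off_t pos, size_t len, size_t granularity)
{
        if (granularity == 0)
                return true;
        return (pos % granularity == 0) && (len % granularity == 0);
}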

I'd like to discuss at LSF/MM what the best interface would be for
buffered, untorn writes (I am deliberately avoiding the use of the
word "atomic" since that presumes stronger guarantees than what we
need, and because it has led to confusion in previous discussions),
and what might be needed to support it.

						- Ted


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
@ 2024-02-28 11:38 ` Amir Goldstein
  2024-02-28 20:21   ` Theodore Ts'o
  2024-02-28 14:11 ` Matthew Wilcox
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2024-02-28 11:38 UTC (permalink / raw)
  To: Theodore Ts'o, Luis R. Rodriguez
  Cc: lsf-pc, linux-fsdevel, linux-mm, Jan Kara

On Wed, Feb 28, 2024 at 8:13 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> Last year, I talked about an interest to provide database such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
>
> [1] https://lwn.net/Articles/932900/
>
> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes".  The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail.  In this interface, if the userspace sends an 128k
> write with the RWF_ATOMIC flag, if the storage device will support
> that an all-or-nothing write with the given size and alignment the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs to guarantee that if the write is torn,
> it only happen on a 16k boundary.  That is, if the write is split into
> 32k and 96k request, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.
>
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case.  Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem top come up with a credible use case
> that would require this.
>
>
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.
>
> Another interface that one be much simpler to implement for buffered
> writes would be one the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)
>
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity.  And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity.  This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
>
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
>


Seems a duplicate of this topic proposed by Luis?

https://lore.kernel.org/linux-fsdevel/ZdfDxN26VOFaT_Tv@bombadil.infradead.org/

Maybe you guys want to co-lead this session?

Thanks,
Amir.


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
  2024-02-28 11:38 ` [Lsf-pc] " Amir Goldstein
@ 2024-02-28 14:11 ` Matthew Wilcox
  2024-02-28 23:33   ` Theodore Ts'o
  2024-02-28 16:06 ` John Garry
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2024-02-28 14:11 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Wed, Feb 28, 2024 at 12:12:57AM -0600, Theodore Ts'o wrote:
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.

I'm not entirely sure that it does become a mess.  If our implementation
of this ensures that each write ends up in a single folio (even if the
entire folio is larger than the write), then we will have satisfied the
semantics of the flag.

That's not to say that such an implementation would be easy.  We'd have
to be able to allocate a folio of the correct size (or fail the I/O),
and we'd have to cope with already-present smaller-than-needed folios
in the page cache, but it seems like a SMOP.

> Another interface that one be much simpler to implement for buffered
> writes would be one the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)

I'm not opposed to this API either.

> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity.  And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity.  This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.

I think we'd be better off treating RWF_ATOMIC like it's a bs>PS device.
That takes two somewhat special cases and makes them use the same code
paths, which probably means fewer bugs as both camps will be testing
the same code.


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
  2024-02-28 11:38 ` [Lsf-pc] " Amir Goldstein
  2024-02-28 14:11 ` Matthew Wilcox
@ 2024-02-28 16:06 ` John Garry
  2024-02-28 23:24   ` Theodore Ts'o
  2024-02-29  0:52 ` Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: John Garry @ 2024-02-28 16:06 UTC (permalink / raw)
  To: Theodore Ts'o, lsf-pc; +Cc: linux-fsdevel, linux-mm

On 28/02/2024 06:12, Theodore Ts'o wrote:
> Last year, I talked about an interest to provide database such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
> 
> [1] https://lwn.net/Articles/932900/
> 
> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes".  The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail.  In this interface, if the userspace sends an 128k
> write with the RWF_ATOMIC flag, if the storage device will support
> that an all-or-nothing write with the given size and alignment the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs to guarantee that if the write is torn,
> it only happen on a 16k boundary.  That is, if the write is split into
> 32k and 96k request, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.

Note that the initial RFC for my series did propose an interface that 
does allow a write to be split in the kernel on a boundary, and that 
boundary was evaluated on a per-write basis by the length and alignment 
of the write along with any extent alignment granularity.

We decided not to pursue that, and instead require a write per 16K page, 
for the example above.

> 
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case.  Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem top come up with a credible use case
> that would require this.
> 
> 
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess. >
> Another interface that one be much simpler to implement for buffered
> writes would be one the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)

If you check the latest discussion on XFS support, we are proposing 
something along those lines:
https://lore.kernel.org/linux-fsdevel/Zc1GwE%2F7QJisKZCX@dread.disaster.area/

There, FS_IOC_FSSETXATTR would be used to set the extent size via 
fsx.fsx_extsize, and the new flag FS_XFLAG_FORCEALIGN would guarantee 
extent alignment; this alignment would be the largest untorn write 
granularity.
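
Roughly, from userspace that would look like the sketch below
(FS_IOC_FSSETXATTR and struct fsxattr are existing uapi;
FS_XFLAG_FORCEALIGN is only proposed, so the value used here is an
assumption):

#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef FS_XFLAG_FORCEALIGN
#define FS_XFLAG_FORCEALIGN 0x00020000  /* proposed flag, value assumed */
#endif

/* Sketch only: make 'granule_bytes' (e.g. 16384) the extent size and
 * force-aligned allocation granule for this file, which would then be
 * the largest untorn write granularity. */
static int set_untorn_granule(int fd, unsigned int granule_bytes)
{
        struct fsxattr fsx;

        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;

        fsx.fsx_extsize = granule_bytes;
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN;

        return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}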

Note that I already got push back on using fcntl for this.

So whether you are more interested in ext4, and how ext4 can adopt that 
API, is another matter... but I did consider adding something like 
struct inode.i_blkbits for this untorn write granularity, so that an FS 
would just need to set that.  But I am not proposing that ATM.

> 
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity.  And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity.  This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
> 
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
> 

Thanks,
John



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28 11:38 ` [Lsf-pc] " Amir Goldstein
@ 2024-02-28 20:21   ` Theodore Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Theodore Ts'o @ 2024-02-28 20:21 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Luis R. Rodriguez, lsf-pc, linux-fsdevel, linux-mm, Jan Kara

On Wed, Feb 28, 2024 at 01:38:44PM +0200, Amir Goldstein wrote:
> 
> Seems a duplicate of this topic proposed by Luis?
> 
> https://lore.kernel.org/linux-fsdevel/ZdfDxN26VOFaT_Tv@bombadil.infradead.org/

Maybe.  I did see Luis's topic, but it seemed to me to be largely
orthogonal to what I was interested in talking about.  Maybe I'm
missing something, but my observations were largely similar to Dave
Chinner's comments here:

https://lore.kernel.org/r/ZdvXAn1Q%2F+QX5sPQ@dread.disaster.area/

To wit, there are two cases here; either the desired untorn write
granularity is smaller than the large block size, in which case there
is really nothing that needs to be done from an API perspective.
Alternatively, if the desired untorn granularity is *larger* than the
large block size, then the API considerations are the same with or
without LBS support.

From the implementation perspective, yes, there is a certain amount of
commonality, but that to me is relatively trivial --- or at least, it
isn't a particularly subtle design.  That is, the writeback code needs
to know what the desired write granularity is, whether it is required
by the device because the logical sector size is larger than the page
size, or because there is an untorn write granularity requested by the
userspace process doing the writing (in practice, pretty much always
16k for databases).  In terms of what the writeback code needs to do,
it needs to make sure that it gathers up pages respecting the alignment
and required size, and if a page is locked, we have to wait until it is
available, instead of skipping that page as we would in the case of a
non-data-integrity writeback.

As far as tooling/testing is concerned, again, it appears to me that
the requirements of LBS and the desire for untorn writes in units of
granularity larger than the block size are quite orthogonal.  For LBS,
all you need is some kind of synthetic/debug device which has a
logical block size larger than the page size.  This could be done a
number of ways:

    * via the VMM --- e.g., a QEMU block device that has a 64k logical
      sector size.
    * via a loop device that exports a larger logical sector size
    * via blktrace (or its ebpf or ftrace equivalent), making sure that
      the size of every write request is the right multiple of 512-byte
      sectors

For testing untorn writes, life is a bit trickier, because not all
writes will be larger than the page size.  For example, we might have
an ext4 file system with a 4k block size, so metadata writes to the
inode table, etc., will be 4k writes.  However, when writing to the
database file, *those* writes need to be in multiples of 16k, with 16k
alignment required, and if a write needs to be broken up it must be at
a 16k boundary.

The tooling for this, which is untorn-write specific and completely
irrelevant for the LBS case, needs to know which parts of the storage
device are assigned to the database file --- and which are not.  If
the database file is not getting deleted or truncated, it's relatively
easy to take a blktrace (or ebpf or ftrace equivalent) and validate
all of the I/O's after the fact.  The tooling to do this isn't
terribly complicated; it would involve using filefrag -v if the file
system is already mounted, and a file-system-specific tool (i.e.,
debugfs for ext4, or xfs_db for xfs) if the file system is not mounted.

Cheers,

					- Ted


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28 16:06 ` John Garry
@ 2024-02-28 23:24   ` Theodore Ts'o
  2024-02-29 16:28     ` John Garry
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2024-02-28 23:24 UTC (permalink / raw)
  To: John Garry; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Wed, Feb 28, 2024 at 04:06:43PM +0000, John Garry wrote:
> 
> Note that the initial RFC for my series did propose an interface that does
> allow a write to be split in the kernel on a boundary, and that boundary was
> evaluated on a per-write basis by the length and alignment of the write
> along with any extent alignment granularity.
> 
> We decided not to pursue that, and instead require a write per 16K page, for
> the example above.

Yes, I did see that.  And that leads to the problem where, if you do an
RWF_ATOMIC write which is 32k, then we are promising that it will be
sent as a single 32k SCSI or NVMe request --- even though that isn't
required by the database, the API is *promising* that we will honor
it.  For buffered writes, that means we need to track which dirty
pages were part of write #1, where we had promised a 32k "atomic"
write, which pages were part of writes #2 and #3, which were each
promised to be 16k "atomic" writes, and which pages were part of
write #4, which was promised to be a 64k write.  If the pages dirtied
by writes #1, #2, #3, and #4 are all contiguous, how do we know what
promise we had made about which pages should be atomically sent
together in a single write request?  Do we have to store all of this
information somewhere in the struct page or struct folio?

And if we use Matthew's suggestion that we treat each folio as the
atomic write unit, does that mean that we have to break apart or join
folios together depending on which writes were sent with an RWF_ATOMIC
write flag and by their size?

You see?  This is why I think the RWF_ATOMIC flag, which was mostly
harmless when it over-promised unneeded semantics for Direct I/O, is
actively harmful and problematic for buffered I/O.

> 
> If you check the latest discussion on XFS support we are proposing something
> along those lines:
> https://lore.kernel.org/linux-fsdevel/Zc1GwE%2F7QJisKZCX@dread.disaster.area/
> 
> There FS_IOC_FSSETXATTR would be used to set extent size w/ fsx.fsx_extsize
> and new flag FS_XGLAG_FORCEALIGN to guarantee extent alignment, and this
> alignment would be the largest untorn write granularity.
> 
> Note that I already got push back on using fcntl for this.

There are two separable untorn write granularities that you might need
to set.  One specifies the constraints that must be honored for all
block allocations associated with the file.  This needs to be
persistent, and stored with the file or directory (or for the entire
file system; I'll talk about this option in a moment) so that we know
that a particular file has blocks allocated in contiguous chunks with
the correct alignment, allowing us to make the untorn write guarantee.
Since this needs to be persistent, and set when the file is first
created, that's why I could imagine that someone pushed back on using
fcntl(2) --- since fcntl is a property of the file descriptor, not of
the inode, and when you close the file descriptor, nothing that you
set via fcntl(2) is persisted.

However, the second untorn write granularity is the one required for
writes issued via a particular file descriptor.  And please note that
these two values don't necessarily need to be the same.  For example,
if the first granularity is 32k, such that block allocations are done
in 32k clusters, aligned on 32k boundaries, then you can provide untorn
write guarantees of 8k, 16k, or 32k --- so long as (a) the file or
block device has the appropriate alignment guarantees, and (b) the
hardware can support untorn write guarantees of the requested size.

And for some file systems, and for block devices, you might not need
to set the first untorn write granularity size at all.  For example,
if the block device represents the entire disk, or represents a
partition which is aligned on a 1MB boundary (which tends to be the
case for GPT partitions IIRC), then we don't need to set any kind of
magic persistent granularity size, because it's a fundamental property
of the partition.  As another example, ext4 has the bigalloc file
system feature, which allows you to set, at file system creation time,
a cluster allocation size which is a power-of-two multiple of the
block size.  So for example, if you have a block size of 4k, and the
block/cluster ratio is 16, then the cluster size is 64k, and all data
block allocations will be done in aligned 64k chunks.

The ext4 bigalloc feature has been around since 2011, so it's
something that can be enabled even for a really ancient distro kernel.
:-) Hence, we don't actually *need* any file system format changes.
If there was a way that we could set a requested untorn write
granularity size associated with all writes to a particular file
descriptor, via fcntl(2), that's all we actually need.  That is, we
just need the non-persistent, file-descriptor-specific write
granularity parameter which applies to writes; and this would work for
raw block devices, where we wouldn't have any *place* to store a file
attribute.  And as with ext4 bigalloc file systems, we don't need
any file system format changes in order to support untorn writes for
block devices, so long as the starting offset of the block device
(zero if it's the whole disk) is appropriately aligned.

Cheers,

						- Ted


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28 14:11 ` Matthew Wilcox
@ 2024-02-28 23:33   ` Theodore Ts'o
  2024-02-29  1:07     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2024-02-28 23:33 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Wed, Feb 28, 2024 at 02:11:06PM +0000, Matthew Wilcox wrote:
> I'm not entirely sure that it does become a mess.  If our implementation
> of this ensures that each write ends up in a single folio (even if the
> entire folio is larger than the write), then we will have satisfied the
> semantics of the flag.

What if we do a 32k write which spans two folios?  And what
if the physical pages for those 32k in the buffer cache are not
contiguous?  Are you going to have to join the two 16k folios
together, or maybe two 8k folios and a 16k folio, and relocate pages
to make a contiguous 32k folio when we do a buffered RWF_ATOMIC write
of size 32k?

Folios have to consist of physically contiguous pages, right?  But we
can send a single 32k write request using scatter-gather even if
the pages are not physically contiguous.  So it seems to me that
trying to overload the folio size to represent the "atomic write
guarantee" of RWF_ATOMIC is unwise.

(And yes, the database might not need it to be a 32k untorn write, but
what if it sends a 32k write, for example because it's writing a set
of pages to the database journal file?  The RWF_ATOMIC interface
doesn't *know* what is really required; the only thing it knows is the
overly strong guarantee that we set in the definition of that
interface.  Or are we going to make the RWF_ATOMIC interface fail all
writes that aren't exactly 16k?  That seems.... baroque.)

> I think we'd be better off treating RWF_ATOMIC like it's a bs>PS device.
> That takes two somewhat special cases and makes them use the same code
> paths, which probably means fewer bugs as both camps will be testing
> the same code.

But for a bs > PS device, where the logical block size is greater than
the page size, you don't need the RWF_ATOMIC flag at all.  All direct
I/O writes *must* be a multiple of the logical sector size, and
buffered writes, if they are smaller than the block size, *must* be
handled as a read-modify-write, since you can't send writes to the
device smaller than the logical sector size.

This is why I claim that LBS devices and untorn writes are largely
orthogonal; for LBS devices no special API is needed at all, and
certainly not the highly problematic RWF_ATOMIC API that has been
proposed.  (Well, not problematic for Direct I/O, which is what we had
originally focused upon, but highly problematic for buffered I/O.)

	   	   	     	    		    - Ted


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
                   ` (2 preceding siblings ...)
  2024-02-28 16:06 ` John Garry
@ 2024-02-29  0:52 ` Dave Chinner
  2024-03-11  8:42 ` John Garry
  2024-05-15 19:54 ` John Garry
  5 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2024-02-29  0:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Wed, Feb 28, 2024 at 12:12:57AM -0600, Theodore Ts'o wrote:
> Last year, I talked about an interest to provide database such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
> 
> [1] https://lwn.net/Articles/932900/
> 
> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes".  The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail.  In this interface, if the userspace sends an 128k
> write with the RWF_ATOMIC flag, if the storage device will support
> that an all-or-nothing write with the given size and alignment the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs to guarantee that if the write is torn,
> it only happen on a 16k boundary.  That is, if the write is split into
> 32k and 96k request, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.
> 
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case.  Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem top come up with a credible use case
> that would require this.
> 
> 
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag. 

Not problematic at all, we're already intending to handle this
"software RWF_ATOMIC" situation for buffered writes in XFS via a
forced COW operation.  That is, we'll allocate new blocks for the
write, and then when the data IO is complete we'll do an atomic swap
of the new data extent over the old one. We'll probably even enable
this for direct IO on hardware that doesn't support REQ_ATOMIC so
that software can just code for RWF_ATOMIC existing for all types of
IO on XFS....

> In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.

The simplification of this is using a single high-order folio for
the RWF_ATOMIC write data; then there's just a single folio that
needs to be written back.  RWF_ATOMIC already has a constraint of
only being supported for aligned power-of-2 IOs, so it matches
high-order folio cache indexing exactly.  We can then run RWF_ATOMIC
IO as a write-through operation (i.e.  fdatawrite_range()) and IO
completion will then swap the entire range with the new data.

Hence on return from the syscall, we have new data on disk, and the
only thing that we need to do to make it permanent is commit the
journal (e.g. via RWF_DSYNC or explicit fdatasync()).  This largely
makes the software RWF_ATOMIC behave exactly the same as hardware
based direct IO RWF_ATOMIC, i.e. the atomic extent swap on data IO
completion is the data integrity pivot that provides the RWF_ATOMIC
semantics, not the REQ_ATOMIC bio flag...

Yes, I know that ext4 has neither COW nor high order folio support,
but that just means that ext4 needs to add high-order folio support
and whatever internal code it needs to implement write-anywhere data
semantics for software RWF_ATOMIC support.

> Another interface that one be much simpler to implement for buffered
> writes would be one the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)

fcntl has already been rejected for reasons (i.e. alignment is a
persistent inode property, not an ephemeral file property).  The way
we intend to do this is via fsx.fsx_extsize hints and a
FS_XFLAG_FORCEALIGN control of an on-disk inode flag.  This triggers
all the alignment restrictions needed to guarantee atomic writes
from the filesystem and/or hardware.

> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.

I think we're almost all the way there already, and that it is
likely this will already be scheduled for discussion at LSFMM...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28 23:33   ` Theodore Ts'o
@ 2024-02-29  1:07     ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2024-02-29  1:07 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm

On Wed, Feb 28, 2024 at 05:33:54PM -0600, Theodore Ts'o wrote:
> On Wed, Feb 28, 2024 at 02:11:06PM +0000, Matthew Wilcox wrote:
> > I'm not entirely sure that it does become a mess.  If our implementation
> > of this ensures that each write ends up in a single folio (even if the
> > entire folio is larger than the write), then we will have satisfied the
> > semantics of the flag.
> 
> What if we do a 32k write which spans two folios?  And what
> if the physical pages for those 32k in the buffer cache are not
> contiguous?  Are you going to have to join the two 16k folios
> together, or maybe two 8k folios and an 16k folio, and relocate pages
> to make a contiguous 32k folio when we do a buffered RWF_ATOMIC write
> of size 32k?

RWF_ATOMIC defines constraints such that a 32kB write must be 32kB
aligned.  So the only way a 32kB write would span two folios is if
a 16kB write had already been done in this space.

We are already dealing with this problem for bs > ps with the min
order mapping constraint.  We can deal with it easily at the point
where we set the inode as supporting atomic writes: that already
ensures physical extent allocation alignment, and we can also set the
mapping folio order at this time to ensure that we only allocate
RWF_ATOMIC compatible aligned/sized folios....

> > I think we'd be better off treating RWF_ATOMIC like it's a bs>PS device.

Which is why Willy says this...

> > That takes two somewhat special cases and makes them use the same code
> > paths, which probably means fewer bugs as both camps will be testing
> > the same code.
> 
> But for a bs > PS device, where the logical block size is greater than
> the page size, you don't need the RWF_ATOMIC flag at all.

Yes we do - hardware already supports REQ_ATOMIC sizes larger than
64kB filesystem blocks. i.e. RWF_ATOMIC is not restricted to 64kB
or any specific filesystem block size, and can always be larger than
the filesystem block size.

> All direct
> I/O writes *must* be a multiple of the logical sector size, and
> buffered writes, if they are smaller than the block size, *must* be
> handled as a read-modify-write, since you can't send writes to the
> device smaller than the logical sector size.

The filesystem will likely need to constrain minimum RWF_ATOMIC
sizes to a single filesystem block. That's the whole point of having
the statx interface - the application is going to have to query what
the min/max atomic write sizes supported are and adjust to those.
Applications will not be able to use 2kB RWF_ATOMIC writes on a 4kB
block size filesystem, and it's no different with larger filesystem
block sizes.
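
As a sketch of that query (the STATX_WRITE_ATOMIC flag and the
stx_atomic_write_unit_{min,max} fields come from John's series, so the
code below assumes headers that carry those definitions):

#define _GNU_SOURCE
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

/* Sketch only: query the supported atomic write unit sizes for a file
 * so the application can adjust its IO sizes to fit within them. */
static void show_atomic_write_limits(const char *path)
{
        struct statx stx;

        if (statx(AT_FDCWD, path, 0,
                  STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stx) < 0)
                return;

        printf("atomic write unit: min %u max %u\n",
               stx.stx_atomic_write_unit_min,
               stx.stx_atomic_write_unit_max);
}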

-Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28 23:24   ` Theodore Ts'o
@ 2024-02-29 16:28     ` John Garry
  2024-02-29 21:21       ` Ritesh Harjani
  0 siblings, 1 reply; 16+ messages in thread
From: John Garry @ 2024-02-29 16:28 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm

On 28/02/2024 23:24, Theodore Ts'o wrote:
> On Wed, Feb 28, 2024 at 04:06:43PM +0000, John Garry wrote:
>> Note that the initial RFC for my series did propose an interface that does
>> allow a write to be split in the kernel on a boundary, and that boundary was
>> evaluated on a per-write basis by the length and alignment of the write
>> along with any extent alignment granularity.
>>
>> We decided not to pursue that, and instead require a write per 16K page, for
>> the example above.
> Yes, I did see that.  And that leads to the problem where if you do an
> RWF_ATOMIC write which is 32k, then we are promising that it will be
> sent as a single 32k SCSI or NVMe request 

We actually guarantee that it will be sent as part of a single request 
which is at least 32K, as we may merge atomic writes in the block layer. 
But that's not so important here.

> --- even though that isn't
> required by the database,

Then I do wonder why the DB is asking for some 32K of data to be written 
with a no-tear guarantee.  Convenience, I guess.

> the API is*promising*  that we will honor
> it.  But that leads to the problem where for buffered writes, we need
> to track which dirty pages are part of write #1, where we had promised
> a 32k "atomic" write, which pages were part of writes #2, and #3,
> which were each promised to be 16k "atomic writes", and which pages
> were part of write #4, which was promised to be a 64k write.  If the
> pages dirtied by writes #1, #2, and #3, and #4 are all contiguous, how
> do we know what promise we had made about which pages should be
> atomically sent together in a single write request?  Do we have to
> store all of this information somewhere in the struct page or struct
> folio?
> 
> And if we use Matthew's suggestion that we treat each folio as the
> atomic write unit, does that mean that we have to break part or join
> folios together depending on which writes were sent with an RWF_ATOMIC
> write flag and by their size?
> 
> You see?  This is why I think the RWF_ATOMIC flag, which was mostly >
> harmless when it over-promised unneeded semantics for Direct I/O, is
> actively harmful and problematic for buffered I/O.
> 
>> If you check the latest discussion on XFS support we are proposing something
>> along those lines:
>> https://lore.kernel.org/linux-fsdevel/Zc1GwE%2F7QJisKZCX@dread.disaster.area/
>>
>> There FS_IOC_FSSETXATTR would be used to set extent size w/ fsx.fsx_extsize
>> and new flag FS_XGLAG_FORCEALIGN to guarantee extent alignment, and this
>> alignment would be the largest untorn write granularity.
>>
>> Note that I already got push back on using fcntl for this.
> There are two separable untorn write granularity that you might need to
> set, One is specifying the constraints that must be required for all
> block allocations associated with the file.  This needs to be
> persistent, and stored with the file or directory (or for the entire
> file system; I'll talk about this option in a moment) so that we know
> that a particular file has blocks allocated in contiguous chunks with
> the correct alignment so we can make the untorn write guarantee.
> Since this needs to be persistent, and set when the file is first
> created, that's why I could imagine that someone pushed back on using
> fcntl(2) --- since fcntl is a property of the file descriptor, not of
> the inode, and when you close the file descriptor, nothing that you
> set via fcntl(2) is persisted.
> 
> However, the second untorn write granularity which is required for
> writes using a particular file descriptor.  And please note that these
> two values don't necessarily need to be the same.  For example, if the
> first granularity is 32k, such that block allocations are done in 32k
> clusters, aligned on 32k boundaries, then you can provide untorn write
> guarantees of 8k, 16k, or 32k ---- so long as (a) the file or block
> device has the appropriate alignment guarantees, and (b) the hardware
> can support untorn write guarantees of the requested size.
> 
> And for some file systems, and for block devices, you might not need
> to set the first untorn write granularity size at all.  For example,
> if the block device represents the entire disk, or represents a
> partition which is aligned on a 1MB boundary (which tends to be case
> for GPT partitions IIRC), then we don't need to set any kind of magic
> persistent granularity size, because it's a fundamental propert of the
> partition.  As another example, ext4 has the bigalloc file system
> feature, which allows you to set at file system creation time, a
> cluster allocation size which is a power of two multiple of the
> blocksize.  So for example, if you have a block size of 4k, and
> block/cluster ratio is 16, then the cluster size is 64k, and all data
> blocks will be done in aligned 64k chunks.
> 
> The ext4 bigalloc feature has been around since 2011, so it's
> something that can be enabled even for a really ancient distro kernel.
> 🙂 Hence, we don't actually*need*  any file system format changes.

That's what I thought, until I saw the following proposal: 
https://lore.kernel.org/linux-ext4/cover.1701339358.git.ojaswin@linux.ibm.com/

> If there was a way that we could set a requeted untorn write
> granularity size associated with all writes to a particular file
> descriptor, via fcntl(2), that's all we actually need.

Would there be a conflict if we had two fds for the same inode with 
different untorn write granularities set via fcntl(2)?

And how does this interact with regular buffered IO?

I am just not sure how this would be implemented.

>  That is, we
> just need the non-persistent, file descriptor-specific write
> granularity parameter which applies to writes; and this would work for
> raw block devices, where we wouldn't have any*place*  to store file
> attribute.  And like with ext4 bigalloc file systems, we don't need
> any file system format changes in order to support untorn writes for
> block devices, so long as the starting offset of the block device
> (zero if it's the whole disk) is appropriately aligned.

Judging from Dave Chinner's response, he has some idea of how this would 
work.

For me, my thought was that we will need to employ some writeback when 
partially or fully overwriting an untorn write sitting in the page 
cache.  And a folio seems a good way to track an individual untorn write.

Thanks,
John



* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-29 16:28     ` John Garry
@ 2024-02-29 21:21       ` Ritesh Harjani
  0 siblings, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2024-02-29 21:21 UTC (permalink / raw)
  To: John Garry, Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm

John Garry <john.g.garry@oracle.com> writes:

> On 28/02/2024 23:24, Theodore Ts'o wrote:
>> On Wed, Feb 28, 2024 at 04:06:43PM +0000, John Garry wrote:
>>> Note that the initial RFC for my series did propose an interface that does
>>> allow a write to be split in the kernel on a boundary, and that boundary was
>>> evaluated on a per-write basis by the length and alignment of the write
>>> along with any extent alignment granularity.
>>>
>>> We decided not to pursue that, and instead require a write per 16K page, for
>>> the example above.
>> Yes, I did see that.  And that leads to the problem where if you do an
>> RWF_ATOMIC write which is 32k, then we are promising that it will be
>> sent as a single 32k SCSI or NVMe request
>
> We actually guarantee that it will be sent as part of a single request
> which is at least 32K, as we may merge atomic writes in the block layer.
> But that's not so important here.
>
>> --- even though that isn't
>> required by the database,
>
> Then I do wonder why the DB is asking for some 32K of data to be written
> with no-tears guarantee. Convenience, I guess.
>
>> the API is*promising*  that we will honor
>> it.  But that leads to the problem where for buffered writes, we need
>> to track which dirty pages are part of write #1, where we had promised
>> a 32k "atomic" write, which pages were part of writes #2, and #3,
>> which were each promised to be 16k "atomic writes", and which pages
>> were part of write #4, which was promised to be a 64k write.  If the
>> pages dirtied by writes #1, #2, and #3, and #4 are all contiguous, how
>> do we know what promise we had made about which pages should be
>> atomically sent together in a single write request?  Do we have to
>> store all of this information somewhere in the struct page or struct
>> folio?
>>
>> And if we use Matthew's suggestion that we treat each folio as the
>> atomic write unit, does that mean that we have to break part or join
>> folios together depending on which writes were sent with an RWF_ATOMIC
>> write flag and by their size?
>>
>> You see?  This is why I think the RWF_ATOMIC flag, which was mostly >
>> harmless when it over-promised unneeded semantics for Direct I/O, is
>> actively harmful and problematic for buffered I/O.
>>
>>> If you check the latest discussion on XFS support we are proposing something
>>> along those lines:
>>> https://lore.kernel.org/linux-fsdevel/Zc1GwE%2F7QJisKZCX@dread.disaster.area/
>>>
>>> There FS_IOC_FSSETXATTR would be used to set extent size w/ fsx.fsx_extsize
>>> and new flag FS_XGLAG_FORCEALIGN to guarantee extent alignment, and this
>>> alignment would be the largest untorn write granularity.
>>>
>>> Note that I already got push back on using fcntl for this.
>> There are two separable untorn write granularity that you might need to
>> set, One is specifying the constraints that must be required for all
>> block allocations associated with the file.  This needs to be
>> persistent, and stored with the file or directory (or for the entire
>> file system; I'll talk about this option in a moment) so that we know
>> that a particular file has blocks allocated in contiguous chunks with
>> the correct alignment so we can make the untorn write guarantee.
>> Since this needs to be persistent, and set when the file is first
>> created, that's why I could imagine that someone pushed back on using
>> fcntl(2) --- since fcntl is a property of the file descriptor, not of
>> the inode, and when you close the file descriptor, nothing that you
>> set via fcntl(2) is persisted.
>>
>> However, the second untorn write granularity which is required for
>> writes using a particular file descriptor.  And please note that these
>> two values don't necessarily need to be the same.  For example, if the
>> first granularity is 32k, such that block allocations are done in 32k
>> clusters, aligned on 32k boundaries, then you can provide untorn write
>> guarantees of 8k, 16k, or 32k ---- so long as (a) the file or block
>> device has the appropriate alignment guarantees, and (b) the hardware
>> can support untorn write guarantees of the requested size.
>> 
>> And for some file systems, and for block devices, you might not need
>> to set the first untorn write granularity size at all.  For example,
>> if the block device represents the entire disk, or represents a
>> partition which is aligned on a 1MB boundary (which tends to be case
>> for GPT partitions IIRC), then we don't need to set any kind of magic
>> persistent granularity size, because it's a fundamental propert of the
>> partition.  As another example, ext4 has the bigalloc file system
>> feature, which allows you to set at file system creation time, a
>> cluster allocation size which is a power of two multiple of the
>> blocksize.  So for example, if you have a block size of 4k, and
>> block/cluster ratio is 16, then the cluster size is 64k, and all data
>> blocks will be done in aligned 64k chunks.
>>
>> The ext4 bigalloc feature has been around since 2011, so it's
>> something that can be enabled even for a really ancient distro kernel.
>> 🙂 Hence, we don't actually*need*  any file system format changes.
>
> That's what I thought, until this following proposal:
> https://lore.kernel.org/linux-ext4/cover.1701339358.git.ojaswin@linux.ibm.com/
>

So there are two ways ext4 could achieve the aligned block allocation
requirement which is needed to guarantee atomic writes:

1. Format the filesystem with bigalloc, which ensures allocation happens
in units of clusters, or format it with -b <BS> on a larger-page-size
system.
2. Add intelligence to ext4's multi-block allocator to provide aligned
allocations (this option won't require any formatting).

The patch series you pointed to is an initial RFC for doing option 2,
i.e. adding allocator changes to provide aligned allocations.
But I agree none of that should require any on-disk fs layout changes.

Currently we are looking into utilizing option 1, which should be
relatively easier to do than option 2, more so when the interfaces
for doing atomic writes are still being discussed.

-ritesh


* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
                   ` (3 preceding siblings ...)
  2024-02-29  0:52 ` Dave Chinner
@ 2024-03-11  8:42 ` John Garry
  2024-05-15 19:54 ` John Garry
  5 siblings, 0 replies; 16+ messages in thread
From: John Garry @ 2024-03-11  8:42 UTC (permalink / raw)
  To: Theodore Ts'o, lsf-pc; +Cc: linux-fsdevel, linux-mm

On 28/02/2024 06:12, Theodore Ts'o wrote:
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.

Having done some research, I found that postgres has a fixed "page" size 
per file, and this is typically 8KB.  It is configured at compile time. 
The page size may differ between certain file types, but it is possible 
to have all file types configured for the same page size.  This all seems 
like standard DB stuff.

So, as I mentioned in response to Matthew here:
https://lore.kernel.org/linux-scsi/47d264c2-bc97-4313-bce0-737557312106@oracle.com/

.. for untorn buffered writes support, we could just set 
atomic_write_unit_min = atomic_write_unit_max = FS file alignment 
granule = DB page size. That would seem easier to support in the page 
cache and still provide the RWF_ATOMIC guarantee. For ext4, bigalloc 
cluster size could be this FS file alignment granule. For XFS, it would 
be the extsize with forcealign.

It might be argued that we would like to submit larger untorn write IOs 
from userspace for performance benefit and allow the kernel to split on 
some page boundary, but I doubt that this will be utilised by userspace. 
On the other hand, the block atomic writes kernel series does support 
block layer merging (of atomic writes).

About advertising untorn buffered write capability, the current statx 
field updates for atomic writes are here:
https://lore.kernel.org/linux-api/20240124112731.28579-2-john.g.garry@oracle.com/

Only direct IO support is mentioned there.  For supporting buffered IO, I 
suppose an additional flag could be added for getting buffered IO info, 
like STATX_ATTR_WRITE_ATOMIC_BUFFERED, and the atomic_write_unit_{min, 
max, segments_max} fields reused for buffered IO.  Setting the direct IO 
and buffered IO flags would be mutually exclusive.

Is there any anticipated problem with this idea?

On another topic, there is some development to allow postgres to use 
direct IO, see:
https://wiki.postgresql.org/wiki/AIO

Assuming all info there is accurate and up to date, it still seems to 
be lagging behind kernel untorn write support.

John

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
                   ` (4 preceding siblings ...)
  2024-03-11  8:42 ` John Garry
@ 2024-05-15 19:54 ` John Garry
  2024-05-22 21:56   ` Luis Chamberlain
  2024-05-23 12:59   ` Christoph Hellwig
  5 siblings, 2 replies; 16+ messages in thread
From: John Garry @ 2024-05-15 19:54 UTC (permalink / raw)
  To: Theodore Ts'o, lsf-pc
  Cc: linux-fsdevel, linux-mm, Luis Chamberlain, Martin K. Petersen,
	Matthew Wilcox, Dave Chinner, linux-kernel

On 27/02/2024 23:12, Theodore Ts'o wrote:
> Last year, I talked about an interest to provide database such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
> 
> [1] https://lwn.net/Articles/932900/
> 

After discussing this topic earlier this week, I would like to know if 
there are still objections or concerns with the untorn-writes userspace 
API proposed in 
https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/

I feel that the series for supporting direct-IO only, above, is stuck 
because of this topic of buffered IO.

So I sent an RFC for buffered untorn-writes last month in 
https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/, 
which did leverage the bs > ps effort. Maybe it did not get noticed due 
to being an RFC. It works on the following principles:

- A buffered atomic write requires RWF_ATOMIC flag be set, same as
   direct IO. The same other atomic writes rules apply.
- For an inode, only a single size of buffered write is allowed. So for
   statx, atomic_write_unit_min = atomic_write_unit_max always for
   buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. So inode
   address_space folio min order = max order = atomic_write_unit_min/max
- A folio is tagged as "atomic" when atomically written and written back
   to storage "atomically", same as direct-IO method would do for an
   atomic write.
- If userspace wants to guarantee a buffered atomic write is written to
   storage atomically after the write syscall returns, it must use
   RWF_SYNC or similar (along with RWF_ATOMIC).

This is all along the lines of what I described on Monday.
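
For illustration, a userspace write under those principles might look 
like the sketch below (not taken from the RFC; RWF_ATOMIC needs uapi 
headers from a kernel carrying the atomic writes series, the fd is 
assumed to be opened without O_DIRECT, and DB_PAGE_SIZE is assumed to 
equal atomic_write_unit_min/max for the inode):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define DB_PAGE_SIZE	16384	/* == atomic_write_unit_min == max */

/* Write one DB page through the page cache: RWF_ATOMIC asks for the
 * untorn guarantee, RWF_SYNC forces the (single, large) folio to be
 * written back before the syscall returns, per the last bullet. */
static int write_db_page(int fd, const void *page, off_t pgno)
{
	struct iovec iov = {
		.iov_base = (void *)page,
		.iov_len  = DB_PAGE_SIZE,
	};

	return pwritev2(fd, &iov, 1, pgno * DB_PAGE_SIZE,
			RWF_ATOMIC | RWF_SYNC) == DB_PAGE_SIZE ? 0 : -1;
}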

There are no concrete semantics for buffered untorn-writes ATM - like 
mixing RWF_ATOMIC writes with non-RWF_ATOMIC writes in the pagecache - 
but I don't think that this needs to be formalized yet. Or, if it really 
does, let me know.

There was also talk in the "limits of buffered IO" session - as I 
understand it - that RWF_ATOMIC for buffered IO should be writethrough. 
If anyone wants to discuss that further or describe that issue, then 
please do.

Anyway, I plan to push the direct IO series for merging in the next 
cycle, so let me know what else needs to be discussed and concluded on.


> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes".  The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail.  In this interface, if the userspace sends an 128k
> write with the RWF_ATOMIC flag, if the storage device will support
> that an all-or-nothing write with the given size and alignment the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs to guarantee that if the write is torn,
> it only happen on a 16k boundary.  That is, if the write is split into
> 32k and 96k request, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.
> 
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case.  Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem top come up with a credible use case
> that would require this.
> 
> 
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.
> 
> Another interface that one be much simpler to implement for buffered
> writes would be one the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)
> 
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity.  And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity.  This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
> 
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
> 
> 						- Ted
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-05-15 19:54 ` John Garry
@ 2024-05-22 21:56   ` Luis Chamberlain
  2024-05-23 11:59     ` John Garry
  2024-05-23 12:59   ` Christoph Hellwig
  1 sibling, 1 reply; 16+ messages in thread
From: Luis Chamberlain @ 2024-05-22 21:56 UTC (permalink / raw)
  To: John Garry, David Bueso
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm,
	Martin K. Petersen, Matthew Wilcox, Dave Chinner, linux-kernel

On Wed, May 15, 2024 at 01:54:39PM -0600, John Garry wrote:
> On 27/02/2024 23:12, Theodore Ts'o wrote:
> > Last year, I talked about an interest to provide database such as
> > MySQL with the ability to issue writes that would not be torn as they
> > write 16k database pages[1].
> > 
> > [1] https://lwn.net/Articles/932900/
> > 
> 
> After discussing this topic earlier this week, I would like to know if there
> are still objections or concerns with the untorn-writes userspace API
> proposed in https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/
> 
> I feel that the series for supporting direct-IO only, above, is stuck
> because of this topic of buffered IO.

I think it was good we had the discussions at LSFMM over it; however,
I personally don't perceive it as stuck. That said, without any
consensus being made explicit or written down anywhere, it would not be
clear to anyone that we reached any consensus at all. The hope is that
LWN captures any consensus, if any was indeed reached, as you're not
making it clear any was.

In case it helps, as we did with the LBS effort, it may also be useful 
to put together bi-monthly cabals to follow up on progress and divide 
and conquer any pending work items.

> So I sent an RFC for buffered untorn-writes last month in https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/,
> which did leverage the bs > ps effort. Maybe it did not get noticed due to
> being an RFC. It works on the following principles:
> 
> - A buffered atomic write requires RWF_ATOMIC flag be set, same as
>   direct IO. The same other atomic writes rules apply.
> - For an inode, only a single size of buffered write is allowed. So for
>   statx, atomic_write_unit_min = atomic_write_unit_max always for
>   buffered atomic writes.
> - A single folio maps to an atomic write in the pagecache. So inode
>   address_space folio min order = max order = atomic_write_unit_min/max
> - A folio is tagged as "atomic" when atomically written and written back
>   to storage "atomically", same as direct-IO method would do for an
>   atomic write.
> - If userspace wants to guarantee a buffered atomic write is written to
>   storage atomically after the write syscall returns, it must use
>   RWF_SYNC or similar (along with RWF_ATOMIC).

From my perspective the above just needs the IOCB atomic support, and
the pending long-term work item there is the near-write-through buffered
IO support. We could just hold off on buffered-IO support until we have
that. I can't think of anything blocking DIO support though, now that
we at least have a mental model of how buffered IO *should* work.

What about testing? Are you extending fstests, blktests?

  Luis

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-05-22 21:56   ` Luis Chamberlain
@ 2024-05-23 11:59     ` John Garry
  0 siblings, 0 replies; 16+ messages in thread
From: John Garry @ 2024-05-23 11:59 UTC (permalink / raw)
  To: Luis Chamberlain, David Bueso
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm,
	Martin K. Petersen, Matthew Wilcox, Dave Chinner, linux-kernel,
	catherine.hoang

On 22/05/2024 22:56, Luis Chamberlain wrote:
> On Wed, May 15, 2024 at 01:54:39PM -0600, John Garry wrote:
>> On 27/02/2024 23:12, Theodore Ts'o wrote:
>>> Last year, I talked about an interest to provide database such as
>>> MySQL with the ability to issue writes that would not be torn as they
>>> write 16k database pages[1].
>>>
>>> [1] https://lwn.net/Articles/932900/
>>>
>>
>> After discussing this topic earlier this week, I would like to know if there
>> are still objections or concerns with the untorn-writes userspace API
>> proposed in https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/
>>
>> I feel that the series for supporting direct-IO only, above, is stuck
>> because of this topic of buffered IO.
> 
> I think it was good we had the discussions at LSFMM over it; however,
> I personally don't perceive it as stuck. That said, without any
> consensus being made explicit or written down anywhere, it would not be
> clear to anyone that we reached any consensus at all.

> The hope is that
> LWN captures any consensus, if any was indeed reached, as you're not
> making it clear any was.

That's my point really. There was some positive discussion. I put 
across the idea of implementing buffered atomic writes, and now I want 
to ensure that everyone is satisfied with that going forward. I think 
that an LWN report is now being written.

> 
> In case it helps, as we did with the LBS effort it may also be useful to
> put together bi-monthly cabals to follow up progress, and divide and
> conquer any pending work items.

ok, we can consider that.

> 
>> So I sent an RFC for buffered untorn-writes last month in https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/,
>> which did leverage the bs > ps effort. Maybe it did not get noticed due to
>> being an RFC. It works on the following principles:
>>
>> - A buffered atomic write requires RWF_ATOMIC flag be set, same as
>>    direct IO. The same other atomic writes rules apply.
>> - For an inode, only a single size of buffered write is allowed. So for
>>    statx, atomic_write_unit_min = atomic_write_unit_max always for
>>    buffered atomic writes.
>> - A single folio maps to an atomic write in the pagecache. So inode
>>    address_space folio min order = max order = atomic_write_unit_min/max
>> - A folio is tagged as "atomic" when atomically written and written back
>>    to storage "atomically", same as direct-IO method would do for an
>>    atomic write.
>> - If userspace wants to guarantee a buffered atomic write is written to
>>    storage atomically after the write syscall returns, it must use
>>    RWF_SYNC or similar (along with RWF_ATOMIC).
> 
> From my perspective the above just needs the IOCB atomic support, and
> the pending long-term work item there is the near-write-through buffered
> IO support. We could just hold off on buffered-IO support until we have
> that. I can't think of anything blocking DIO support though, now that
> we at least have a mental model of how buffered IO *should* work.

Yes, these are my thoughts as well.

> 
> What about testing? Are you extending fstests, blktests?

Yes, so three things to mention here:

- We have been looking at adding full test coverage in xfstests. 
Catherine Hoang recently started working on this. Most tests will 
actually cover the forcealign feature; testing just atomic writes 
support would be quite limited compared to forcealign testing. 
Furthermore, we are also looking at forcealign and atomic writes 
testing in fsx.c, as the formalized tests would be limited in finding 
forcealign corner cases.

- For blktests, we were going to add some basic atomic writes tests 
there, like ensuring that misaligned or mis-sized writes are rejected. 
This would really be the same for xfstests, above. I don't think that 
there are many more cases which we can cover. scsi_debug will support 
atomic writes, which can be used for blktests.

- I have done some limited power-fail testing for my NVMe card.

I have two challenges here:
- My host does not allow the card port to be manually powered down, so I 
need to physically pull out the power cable to test :(
- My NVMe card only supports 4KB power-fail atomic writes, which is 
quite small.

The actual power-fail testing involves using fio in verify mode. In 
that mode, each data block has a CRC written per test loop. I just 
verify that the CRCs are valid after the power cycle (which they are 
when block size is 4KB and lower :)).
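
For reference, the shape of that verification is roughly the following 
(this is not fio itself, and fio's real verify header layout differs; 
it just assumes each 4KB block carries a CRC of its payload in its 
last 4 bytes):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>	/* link with -lz */

#define BLK	4096

/* Walk the file block by block and recompute each block's CRC; a
 * mismatch after a power cycle would indicate a torn or stale block. */
static int verify_blocks(const char *path)
{
	unsigned char buf[BLK];
	long bad = 0, total = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	while (read(fd, buf, BLK) == BLK) {
		uint32_t stored;

		memcpy(&stored, buf + BLK - sizeof(stored), sizeof(stored));
		if ((uint32_t)crc32(0, buf, BLK - sizeof(stored)) != stored)
			bad++;
		total++;
	}
	close(fd);
	printf("%ld of %ld blocks failed CRC after power cycle\n",
	       bad, total);
	return bad ? -1 : 0;
}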

Thanks,
John


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  2024-05-15 19:54 ` John Garry
  2024-05-22 21:56   ` Luis Chamberlain
@ 2024-05-23 12:59   ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2024-05-23 12:59 UTC (permalink / raw)
  To: John Garry
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm,
	Luis Chamberlain, Martin K. Petersen, Matthew Wilcox,
	Dave Chinner, linux-kernel

On Wed, May 15, 2024 at 01:54:39PM -0600, John Garry wrote:
> On 27/02/2024 23:12, Theodore Ts'o wrote:
> > Last year, I talked about an interest to provide database such as
> > MySQL with the ability to issue writes that would not be torn as they
> > write 16k database pages[1].
> > 
> > [1] https://lwn.net/Articles/932900/
> > 
> 
> After discussing this topic earlier this week, I would like to know if there
> are still objections or concerns with the untorn-writes userspace API
> proposed in https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/
> 
> I feel that the series for supporting direct-IO only, above, is stuck
> because of this topic of buffered IO.

Just my 2 cents, but I think supporting untorn I/O for buffered I/O
is an amazingly bad idea that opens up a whole can of worms in terms
of potential failure paths while not actually having a convincing use
case.

For buffered I/O something like the atomic msync proposal makes a lot
more sense, because it actually provides a useful API for non-trivial
transactions.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-05-23 12:59 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-28  6:12 [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
2024-02-28 11:38 ` [Lsf-pc] " Amir Goldstein
2024-02-28 20:21   ` Theodore Ts'o
2024-02-28 14:11 ` Matthew Wilcox
2024-02-28 23:33   ` Theodore Ts'o
2024-02-29  1:07     ` Dave Chinner
2024-02-28 16:06 ` John Garry
2024-02-28 23:24   ` Theodore Ts'o
2024-02-29 16:28     ` John Garry
2024-02-29 21:21       ` Ritesh Harjani
2024-02-29  0:52 ` Dave Chinner
2024-03-11  8:42 ` John Garry
2024-05-15 19:54 ` John Garry
2024-05-22 21:56   ` Luis Chamberlain
2024-05-23 11:59     ` John Garry
2024-05-23 12:59   ` Christoph Hellwig
