linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] man2: Document RWF_ATOMIC
@ 2023-09-29  9:37 John Garry
  2023-09-29  9:37 ` [PATCH 1/4] statx.2: Document STATX_WRITE_ATOMIC John Garry
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: John Garry @ 2023-09-29  9:37 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani, John Garry

Document RWF_ATOMIC flag for pwritev2().

RWF_ATOMIC atomic is used for enabling torn-write protection.

We use RWF_ATOMIC as this is legacy name for similar feature proposed in
the past.

Himanshu Madhani (2):
  statx.2: Document STATX_WRITE_ATOMIC
  readv.2: Document RWF_ATOMIC flag

John Garry (2):
  man2/open.2: Document RWF_ATOMIC
  io_submit.2: Document RWF_ATOMIC

 man2/io_submit.2 |  5 +++++
 man2/open.2      | 13 +++++++++++++
 man2/readv.2     | 45 +++++++++++++++++++++++++++++++++++++++++++++
 man2/statx.2     | 18 ++++++++++++++++++
 4 files changed, 81 insertions(+)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] statx.2: Document STATX_WRITE_ATOMIC
  2023-09-29  9:37 [PATCH 0/4] man2: Document RWF_ATOMIC John Garry
@ 2023-09-29  9:37 ` John Garry
  2023-09-29  9:37 ` [PATCH 2/4] readv.2: Document RWF_ATOMIC flag John Garry
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2023-09-29  9:37 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add the text to the statx man page.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 man2/statx.2 | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/man2/statx.2 b/man2/statx.2
index bcc67c8e3bd3..23b9c6953ec5 100644
--- a/man2/statx.2
+++ b/man2/statx.2
@@ -68,6 +68,9 @@ struct statx {
     /* Direct I/O alignment restrictions */
     __u32 stx_dio_mem_align;
     __u32 stx_dio_offset_align;
+\&
+    __u32 stx_atomic_write_unit_max;
+    __u32 stx_atomic_write_unit_min;
 };
 .EE
 .in
@@ -255,6 +258,8 @@ STATX_ALL	The same as STATX_BASIC_STATS | STATX_BTIME.
 STATX_MNT_ID	Want stx_mnt_id (since Linux 5.8)
 STATX_DIOALIGN	Want stx_dio_mem_align and stx_dio_offset_align
 	(since Linux 6.1; support varies by filesystem)
+STATX_WRITE_ATOMIC	Want stx_atomic_write_unit_min and stx_atomic_write_unit_max
+	(since Linux 6.7; support varies by filesystem)
 .TE
 .in
 .PP
@@ -439,6 +444,16 @@ or 0 if direct I/O is not supported on this file.
 This will only be nonzero if
 .I stx_dio_mem_align
 is nonzero, and vice versa.
+.TP
+.I stx_atomic_write_unit_min
+The minimum size (in bytes) supported for direct I/O
+.RB ( O_DIRECT )
+on the file to be written with torn-write protection. This value is guaranteed to be power-of-2.
+.TP
+.I stx_atomic_write_unit_max
+The maximum size (in bytes) supported for direct I/O
+.RB ( O_DIRECT )
+on the file to be written with torn-write protection. This value is guaranteed to be power-of-2.
 .PP
 For further information on the above fields, see
 .BR inode (7).
@@ -492,6 +507,9 @@ It cannot be written to, and all reads from it will be verified
 against a cryptographic hash that covers the
 entire file (e.g., via a Merkle tree).
 .TP
+.BR STATX_ATTR_WRITE_ATOMIC " (since Linux 6.7)"
+The file supports torn-write protection.
+.TP
 .BR STATX_ATTR_DAX " (since Linux 5.8)"
 The file is in the DAX (cpu direct access) state.
 DAX state attempts to
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-09-29  9:37 [PATCH 0/4] man2: Document RWF_ATOMIC John Garry
  2023-09-29  9:37 ` [PATCH 1/4] statx.2: Document STATX_WRITE_ATOMIC John Garry
@ 2023-09-29  9:37 ` John Garry
  2023-10-03 19:25   ` Bart Van Assche
  2023-10-09 17:44   ` Darrick J. Wong
  2023-09-29  9:37 ` [PATCH 3/4] man2/open.2: Document RWF_ATOMIC John Garry
  2023-09-29  9:37 ` [PATCH 4/4] io_submit.2: " John Garry
  3 siblings, 2 replies; 18+ messages in thread
From: John Garry @ 2023-09-29  9:37 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add RWF_ATOMIC flag description for pwritev2().

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
#jpg: complete rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/man2/readv.2 b/man2/readv.2
index fa9b0e4e44a2..ff09f3bc9792 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -193,6 +193,51 @@ which provides lower latency, but may use additional resources.
 .B O_DIRECT
 flag.)
 .TP
+.BR RWF_ATOMIC " (since Linux 6.7)"
+Allows block-based filesystems to indicate that write operations will be issued
+with torn-write protection. Torn-write protection means that for a power or any
+other hardware failure, all or none of the data from the write will be stored,
+but never a mix of old and new data. This flag is meaningful only for
+.BR pwritev2 (),
+and its effect applies only to the data range written by the system call.
+The total write length must be power-of-2 and must be sized between
+stx_atomic_write_unit_min and stx_atomic_write_unit_max, both inclusive. The
+write must be at a natural offset within the file with respect to the total
+write length. Torn-write protection only works with
+.B O_DIRECT
+flag, i.e. buffered writes are not supported. To guarantee consistency from
+the write between a file's in-core state with the storage device,
+.BR fdatasync (2)
+or
+.BR fsync (2)
+or
+.BR open (2)
+and
+.B O_SYNC
+or
+.B O_DSYNC
+or
+.B pwritev2 ()
+flag
+.B RWF_SYNC
+or
+.B RWF_DSYNC
+is required.
+For when regular files are opened with
+.BR open (2)
+but without
+.B O_SYNC
+or
+.B O_DSYNC
+and the
+.BR pwritev2()
+call is made without
+.B RWF_SYNC
+or
+.BR RWF_DSYNC
+set, the range metadata must already be flushed to storage and the data range
+must not be in unwritten state, shared, a preallocation, or a hole.
+.TP
 .BR RWF_SYNC " (since Linux 4.7)"
 .\" commit e864f39569f4092c2b2bc72c773b6e486c7e3bd9
 Provide a per-write equivalent of the
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/4] man2/open.2: Document RWF_ATOMIC
  2023-09-29  9:37 [PATCH 0/4] man2: Document RWF_ATOMIC John Garry
  2023-09-29  9:37 ` [PATCH 1/4] statx.2: Document STATX_WRITE_ATOMIC John Garry
  2023-09-29  9:37 ` [PATCH 2/4] readv.2: Document RWF_ATOMIC flag John Garry
@ 2023-09-29  9:37 ` John Garry
  2023-09-29  9:37 ` [PATCH 4/4] io_submit.2: " John Garry
  3 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2023-09-29  9:37 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani, John Garry

Document special rule when using RWF_ATOMIC flag in conjunction with
O_DIRECT (which is the only time which it can be used).

Some block devices have a virtual boundary, which is a boundary at which
write iovec vectors start and end addresses need to align to - see
virt_boundary_mask in linux kernel sysfs-block Doc.

To avoid splitting writes with torn-write protection in the kernel for
when vectors cross this boundary, special iovec boundary rules are added
such that we don't cross this boundary.

Even though a virt_boundary_mask may be larger than the CPU page size,
we just impose that as a reasonable limit, which matches what the linux
kernel NVMe driver uses today - NVMe is of special interest for supporting
RWF_ATOMIC.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 man2/open.2 | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/man2/open.2 b/man2/open.2
index 4c921723c95c..e5b9d107859a 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -1782,6 +1782,19 @@ blockdev \-\-getss
 .EE
 .in
 .PP
+When
+.B O_DIRECT
+is used in conjunction with
+.BR pwritev2 ()
+and
+.BR RWF_ATOMIC
+flag, in addition to any alignment rules already imposed by the filesystem and
+underlying block device, it must be ensured that
+.I iovec
+vectors have no gaps such that the end alignment of a vector must have the same
+start alignment of any subsequent vector and that alignment must be least
+at a page boundary.
+.PP
 .B O_DIRECT
 I/Os should never be run concurrently with the
 .BR fork (2)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/4] io_submit.2: Document RWF_ATOMIC
  2023-09-29  9:37 [PATCH 0/4] man2: Document RWF_ATOMIC John Garry
                   ` (2 preceding siblings ...)
  2023-09-29  9:37 ` [PATCH 3/4] man2/open.2: Document RWF_ATOMIC John Garry
@ 2023-09-29  9:37 ` John Garry
  2023-10-09 17:45   ` Darrick J. Wong
  3 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2023-09-29  9:37 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani, John Garry

Document RWF_ATOMIC for asynchronous I/O.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 man2/io_submit.2 | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/man2/io_submit.2 b/man2/io_submit.2
index 1030bb6cd965..beba8865ac2a 100644
--- a/man2/io_submit.2
+++ b/man2/io_submit.2
@@ -140,6 +140,11 @@ as well the description of
 .B O_SYNC
 in
 .BR open (2).
+.TP
+.BR RWF_ATOMIC " (since Linux 6.7)"
+Write a block of data such that a write will never be
+torn from power fail or similar. See See the description
+of the flag of the same name in pwritev2(2).
 .RE
 .TP
 .I aio_lio_opcode
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-09-29  9:37 ` [PATCH 2/4] readv.2: Document RWF_ATOMIC flag John Garry
@ 2023-10-03 19:25   ` Bart Van Assche
  2023-10-04  8:47     ` John Garry
  2023-10-09 17:44   ` Darrick J. Wong
  1 sibling, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2023-10-03 19:25 UTC (permalink / raw)
  To: John Garry, linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani

On 9/29/23 02:37, John Garry wrote:
> +.BR RWF_ATOMIC " (since Linux 6.7)"
> +Allows block-based filesystems to indicate that write operations will be issued
> +with torn-write protection. Torn-write protection means that for a power or any
> +other hardware failure, all or none of the data from the write will be stored,
> +but never a mix of old and new data. This flag is meaningful only for
> +.BR pwritev2 (),
> +and its effect applies only to the data range written by the system call.
> +The total write length must be power-of-2 and must be sized between
> +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both inclusive. The
> +write must be at a natural offset within the file with respect to the total
> +write length. Torn-write protection only works with
> +.B O_DIRECT
> +flag, i.e. buffered writes are not supported. To guarantee consistency from
> +the write between a file's in-core state with the storage device,

It seems wrong to me to start the first sentence with "Allows". Atomic
behavior should be mandatory if RWF_ATOMIC has been set.

Additionally, shouldn't it be documented what value will be stored in
errno if the atomic write has been rejected?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-03 19:25   ` Bart Van Assche
@ 2023-10-04  8:47     ` John Garry
  2023-10-04 17:36       ` Bart Van Assche
  2023-10-04 22:48       ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: John Garry @ 2023-10-04  8:47 UTC (permalink / raw)
  To: Bart Van Assche, linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani

On 03/10/2023 20:25, Bart Van Assche wrote:
> On 9/29/23 02:37, John Garry wrote:
>> +.BR RWF_ATOMIC " (since Linux 6.7)"
>> +Allows block-based filesystems to indicate that write operations will 
>> be issued
>> +with torn-write protection. Torn-write protection means that for a 
>> power or any
>> +other hardware failure, all or none of the data from the write will 
>> be stored,
>> +but never a mix of old and new data. This flag is meaningful only for
>> +.BR pwritev2 (),
>> +and its effect applies only to the data range written by the system 
>> call.
>> +The total write length must be power-of-2 and must be sized between
>> +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both 
>> inclusive. The
>> +write must be at a natural offset within the file with respect to the 
>> total
>> +write length. Torn-write protection only works with
>> +.B O_DIRECT
>> +flag, i.e. buffered writes are not supported. To guarantee 
>> consistency from
>> +the write between a file's in-core state with the storage device,
> 
> It seems wrong to me to start the first sentence with "Allows". Atomic
> behavior should be mandatory if RWF_ATOMIC has been set.

Yes, I agree that this has been poorly worded. Flag RWF_ATOMIC does not 
indicate anything. I will fix it.

> 
> Additionally, shouldn't it be documented what value will be stored in
> errno if the atomic write has been rejected?

So I was treating all atomic writes errors which don't follow the 
"rules" as low-level I/O errors, which is -EIO. However, yes, I can 
document this. Further to that, based on description of an error for 
O_DIRECT, which is to return -EINVAL for misaligned, I think that 
-EINVAL may be better for any atomic write rule violations. OK?

Thanks,
John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-04  8:47     ` John Garry
@ 2023-10-04 17:36       ` Bart Van Assche
  2023-10-04 22:48       ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2023-10-04 17:36 UTC (permalink / raw)
  To: John Garry, linux-kernel, linux-api
  Cc: martin.petersen, djwong, david, himanshu.madhani

On 10/4/23 01:47, John Garry wrote:
> On 03/10/2023 20:25, Bart Van Assche wrote:
>> Additionally, shouldn't it be documented what value will be stored in
>> errno if the atomic write has been rejected?
> 
> So I was treating all atomic writes errors which don't follow the 
> "rules" as low-level I/O errors, which is -EIO. However, yes, I can 
> document this. Further to that, based on description of an error for 
> O_DIRECT, which is to return -EINVAL for misaligned, I think that 
> -EINVAL may be better for any atomic write rule violations. OK?

That sounds good to me.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-04  8:47     ` John Garry
  2023-10-04 17:36       ` Bart Van Assche
@ 2023-10-04 22:48       ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2023-10-04 22:48 UTC (permalink / raw)
  To: John Garry
  Cc: Bart Van Assche, linux-kernel, linux-api, martin.petersen,
	djwong, himanshu.madhani

On Wed, Oct 04, 2023 at 09:47:24AM +0100, John Garry wrote:
> On 03/10/2023 20:25, Bart Van Assche wrote:
> > On 9/29/23 02:37, John Garry wrote:
> > > +.BR RWF_ATOMIC " (since Linux 6.7)"
> > > +Allows block-based filesystems to indicate that write operations
> > > will be issued
> > > +with torn-write protection. Torn-write protection means that for a
> > > power or any
> > > +other hardware failure, all or none of the data from the write will
> > > be stored,
> > > +but never a mix of old and new data. This flag is meaningful only for
> > > +.BR pwritev2 (),
> > > +and its effect applies only to the data range written by the system
> > > call.
> > > +The total write length must be power-of-2 and must be sized between
> > > +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both
> > > inclusive. The
> > > +write must be at a natural offset within the file with respect to
> > > the total
> > > +write length. Torn-write protection only works with
> > > +.B O_DIRECT
> > > +flag, i.e. buffered writes are not supported. To guarantee
> > > consistency from
> > > +the write between a file's in-core state with the storage device,
> > 
> > It seems wrong to me to start the first sentence with "Allows". Atomic
> > behavior should be mandatory if RWF_ATOMIC has been set.
> 
> Yes, I agree that this has been poorly worded. Flag RWF_ATOMIC does not
> indicate anything. I will fix it.
> 
> > 
> > Additionally, shouldn't it be documented what value will be stored in
> > errno if the atomic write has been rejected?
> 
> So I was treating all atomic writes errors which don't follow the "rules" as
> low-level I/O errors, which is -EIO. However, yes, I can document this.
> Further to that, based on description of an error for O_DIRECT, which is to
> return -EINVAL for misaligned, I think that -EINVAL may be better for any
> atomic write rule violations. OK?

Agreed - I was going to make that comment myself about using EINVAL
instead of EIO...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-09-29  9:37 ` [PATCH 2/4] readv.2: Document RWF_ATOMIC flag John Garry
  2023-10-03 19:25   ` Bart Van Assche
@ 2023-10-09 17:44   ` Darrick J. Wong
  2023-10-09 20:39     ` Dave Chinner
  2023-10-24 12:30     ` John Garry
  1 sibling, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2023-10-09 17:44 UTC (permalink / raw)
  To: John Garry
  Cc: linux-kernel, linux-api, martin.petersen, david, himanshu.madhani

On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
> 
> Add RWF_ATOMIC flag description for pwritev2().
> 
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> #jpg: complete rewrite
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/man2/readv.2 b/man2/readv.2
> index fa9b0e4e44a2..ff09f3bc9792 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -193,6 +193,51 @@ which provides lower latency, but may use additional resources.
>  .B O_DIRECT
>  flag.)
>  .TP
> +.BR RWF_ATOMIC " (since Linux 6.7)"
> +Allows block-based filesystems to indicate that write operations will be issued

"Require regular file write operations to be issued with torn write
protection."

> +with torn-write protection. Torn-write protection means that for a power or any
> +other hardware failure, all or none of the data from the write will be stored,
> +but never a mix of old and new data. This flag is meaningful only for
> +.BR pwritev2 (),
> +and its effect applies only to the data range written by the system call.
> +The total write length must be power-of-2 and must be sized between
> +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both inclusive. The
> +write must be at a natural offset within the file with respect to the total

What is a "natural" offset?  That should be defined with more
specificity.  Does that mean that the position of a XX-KiB write must
also be aligned to XX-KiB?  e.g. a 32K untorn write can only start at a
multiple of 32K?  What if the device supports untorn writes between 4K
and 64K, does that mean I /cannot/ issue a 32K untorn write at offset
48K?

> +write length. Torn-write protection only works with
> +.B O_DIRECT
> +flag, i.e. buffered writes are not supported. To guarantee consistency from
> +the write between a file's in-core state with the storage device,
> +.BR fdatasync (2)
> +or
> +.BR fsync (2)
> +or
> +.BR open (2)
> +and
> +.B O_SYNC
> +or
> +.B O_DSYNC
> +or
> +.B pwritev2 ()
> +flag
> +.B RWF_SYNC
> +or
> +.B RWF_DSYNC
> +is required.

I'm starting to think that this manpage shouldn't be restating
durability information here.

"Application programs with data or file integrity completion
requirements must configure synchronous writes with the DSYNC
or SYNC flags, as explained above."

> +For when regular files are opened with
> +.BR open (2)
> +but without
> +.B O_SYNC
> +or
> +.B O_DSYNC
> +and the
> +.BR pwritev2()
> +call is made without
> +.B RWF_SYNC
> +or
> +.BR RWF_DSYNC
> +set, the range metadata must already be flushed to storage and the data range
> +must not be in unwritten state, shared, a preallocation, or a hole.

I think that we can drop all of these flags requirements, since the
contiguous small space allocation requirement means that the fs can
provide all-or-nothing writes even if metadata updates are needed:

If the file range is allocated and marked unwritten (i.e. a
preallocation), the ioend will clear the unwritten bit from the file
mapping atomically.  After a crash, the application sees either zeroes
or all the data that was written.

If the file range is shared, the ioend will map the COW staging extent
into the file atomically.  After a crash, the application sees either
the old contents from the old blocks, or the new contents from the new
blocks.

If the file range is a sparse hole, the directio setup will allocate
space and create an unwritten mapping before issuing the write bio.  The
rest of the process works the same as preallocations and has the same
behaviors.

If the file range is allocated and was previously written, the write is
issued and that's all that's needed from the fs.  After a crash, reads
of the storage device produce the old contents or the new contents.

Summarizing:

An (ATOMIC|SYNC) request provides the strongest guarantees (data
will not be torn, and all file metadata updates are persisted before
the write is returned to userspace.  Programs see either the old data or
the new data, even if there's a crash.

(ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
updates for just that region are persisted before the write is returned.

(ATOMIC) is the least strong -- data will not be torn.  Neither the
filesystem nor the device make guarantees that anything ended up on
stable storage, but if it does, programs see either the old data or the
new data.

Maybe we should rename the whole UAPI s/atomic/untorn/...

--D

> +.TP
>  .BR RWF_SYNC " (since Linux 4.7)"
>  .\" commit e864f39569f4092c2b2bc72c773b6e486c7e3bd9
>  Provide a per-write equivalent of the
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] io_submit.2: Document RWF_ATOMIC
  2023-09-29  9:37 ` [PATCH 4/4] io_submit.2: " John Garry
@ 2023-10-09 17:45   ` Darrick J. Wong
  2023-10-24 11:51     ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2023-10-09 17:45 UTC (permalink / raw)
  To: John Garry
  Cc: linux-kernel, linux-api, martin.petersen, david, himanshu.madhani

On Fri, Sep 29, 2023 at 09:37:17AM +0000, John Garry wrote:
> Document RWF_ATOMIC for asynchronous I/O.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  man2/io_submit.2 | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/man2/io_submit.2 b/man2/io_submit.2
> index 1030bb6cd965..beba8865ac2a 100644
> --- a/man2/io_submit.2
> +++ b/man2/io_submit.2
> @@ -140,6 +140,11 @@ as well the description of
>  .B O_SYNC
>  in
>  .BR open (2).
> +.TP
> +.BR RWF_ATOMIC " (since Linux 6.7)"
> +Write a block of data such that a write will never be
> +torn from power fail or similar. See See the description

Nit: Double 'See' here.

--D

> +of the flag of the same name in pwritev2(2).
>  .RE
>  .TP
>  .I aio_lio_opcode
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-09 17:44   ` Darrick J. Wong
@ 2023-10-09 20:39     ` Dave Chinner
  2023-10-09 21:05       ` Darrick J. Wong
  2023-10-24 12:30     ` John Garry
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2023-10-09 20:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, linux-kernel, linux-api, martin.petersen, himanshu.madhani

On Mon, Oct 09, 2023 at 10:44:38AM -0700, Darrick J. Wong wrote:
> On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
> > From: Himanshu Madhani <himanshu.madhani@oracle.com>
> > 
> > Add RWF_ATOMIC flag description for pwritev2().
> > 
> > Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> > #jpg: complete rewrite
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 45 insertions(+)
....
> > +For when regular files are opened with
> > +.BR open (2)
> > +but without
> > +.B O_SYNC
> > +or
> > +.B O_DSYNC
> > +and the
> > +.BR pwritev2()
> > +call is made without
> > +.B RWF_SYNC
> > +or
> > +.BR RWF_DSYNC
> > +set, the range metadata must already be flushed to storage and the data range
> > +must not be in unwritten state, shared, a preallocation, or a hole.
> 
> I think that we can drop all of these flags requirements, since the
> contiguous small space allocation requirement means that the fs can
> provide all-or-nothing writes even if metadata updates are needed:
> 
> If the file range is allocated and marked unwritten (i.e. a
> preallocation), the ioend will clear the unwritten bit from the file
> mapping atomically.  After a crash, the application sees either zeroes
> or all the data that was written.
> 
> If the file range is shared, the ioend will map the COW staging extent
> into the file atomically.  After a crash, the application sees either
> the old contents from the old blocks, or the new contents from the new
> blocks.
> 
> If the file range is a sparse hole, the directio setup will allocate
> space and create an unwritten mapping before issuing the write bio.  The
> rest of the process works the same as preallocations and has the same
> behaviors.
> 
> If the file range is allocated and was previously written, the write is
> issued and that's all that's needed from the fs.  After a crash, reads
> of the storage device produce the old contents or the new contents.

This is exactly what I explained when reviewing the code that
rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.

> Summarizing:
> 
> An (ATOMIC|SYNC) request provides the strongest guarantees (data
> will not be torn, and all file metadata updates are persisted before
> the write is returned to userspace.  Programs see either the old data or
> the new data, even if there's a crash.
> 
> (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> updates for just that region are persisted before the write is returned.
> 
> (ATOMIC) is the least strong -- data will not be torn.  Neither the
> filesystem nor the device make guarantees that anything ended up on
> stable storage, but if it does, programs see either the old data or the
> new data.

Yup, that makes sense to me.

> Maybe we should rename the whole UAPI s/atomic/untorn/...

Perhaps, though "torn writes" is nomenclature that nobody outside
storage and filesystem developers really knows about. All I ever
hear from userspace developers is "we want atomic/all-or-nothing
data writes"...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-09 20:39     ` Dave Chinner
@ 2023-10-09 21:05       ` Darrick J. Wong
  2023-10-24 12:35         ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2023-10-09 21:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, linux-kernel, linux-api, martin.petersen, himanshu.madhani

On Tue, Oct 10, 2023 at 07:39:17AM +1100, Dave Chinner wrote:
> On Mon, Oct 09, 2023 at 10:44:38AM -0700, Darrick J. Wong wrote:
> > On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
> > > From: Himanshu Madhani <himanshu.madhani@oracle.com>
> > > 
> > > Add RWF_ATOMIC flag description for pwritev2().
> > > 
> > > Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> > > #jpg: complete rewrite
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >  man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 45 insertions(+)
> ....
> > > +For when regular files are opened with
> > > +.BR open (2)
> > > +but without
> > > +.B O_SYNC
> > > +or
> > > +.B O_DSYNC
> > > +and the
> > > +.BR pwritev2()
> > > +call is made without
> > > +.B RWF_SYNC
> > > +or
> > > +.BR RWF_DSYNC
> > > +set, the range metadata must already be flushed to storage and the data range
> > > +must not be in unwritten state, shared, a preallocation, or a hole.
> > 
> > I think that we can drop all of these flags requirements, since the
> > contiguous small space allocation requirement means that the fs can
> > provide all-or-nothing writes even if metadata updates are needed:
> > 
> > If the file range is allocated and marked unwritten (i.e. a
> > preallocation), the ioend will clear the unwritten bit from the file
> > mapping atomically.  After a crash, the application sees either zeroes
> > or all the data that was written.
> > 
> > If the file range is shared, the ioend will map the COW staging extent
> > into the file atomically.  After a crash, the application sees either
> > the old contents from the old blocks, or the new contents from the new
> > blocks.
> > 
> > If the file range is a sparse hole, the directio setup will allocate
> > space and create an unwritten mapping before issuing the write bio.  The
> > rest of the process works the same as preallocations and has the same
> > behaviors.
> > 
> > If the file range is allocated and was previously written, the write is
> > issued and that's all that's needed from the fs.  After a crash, reads
> > of the storage device produce the old contents or the new contents.
> 
> This is exactly what I explained when reviewing the code that
> rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.

I'm glad we agree. :)

John, when you're back from vacation, can we get rid of this language
and all those checks under _is_dsync() in the iomap patch?

(That code is 100% the result of me handwaving and bellyaching 6 months
ago when the team was trying to get all the atomic writes bits working
prior to LSF and I was too burned out to think the xfs part through.
As a result, I decided that we'd only support strict overwrites for the
first iteration.)

> > Summarizing:
> > 
> > An (ATOMIC|SYNC) request provides the strongest guarantees (data
> > will not be torn, and all file metadata updates are persisted before
> > the write is returned to userspace.  Programs see either the old data or
> > the new data, even if there's a crash.
> > 
> > (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> > updates for just that region are persisted before the write is returned.
> > 
> > (ATOMIC) is the least strong -- data will not be torn.  Neither the
> > filesystem nor the device make guarantees that anything ended up on
> > stable storage, but if it does, programs see either the old data or the
> > new data.
> 
> Yup, that makes sense to me.

Perhaps this ^^ is what we should be documenting here.

> > Maybe we should rename the whole UAPI s/atomic/untorn/...
> 
> Perhaps, though "torn writes" is nomenclature that nobody outside
> storage and filesystem developers really knows about. All I ever
> hear from userspace developers is "we want atomic/all-or-nothing
> data writes"...

Fair 'enuf.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] io_submit.2: Document RWF_ATOMIC
  2023-10-09 17:45   ` Darrick J. Wong
@ 2023-10-24 11:51     ` John Garry
  0 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2023-10-24 11:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-api, martin.petersen, david, himanshu.madhani

On 09/10/2023 18:45, Darrick J. Wong wrote:
> On Fri, Sep 29, 2023 at 09:37:17AM +0000, John Garry wrote:
>> Document RWF_ATOMIC for asynchronous I/O.
>>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
>> ---
>>   man2/io_submit.2 | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/man2/io_submit.2 b/man2/io_submit.2
>> index 1030bb6cd965..beba8865ac2a 100644
>> --- a/man2/io_submit.2
>> +++ b/man2/io_submit.2
>> @@ -140,6 +140,11 @@ as well the description of
>>   .B O_SYNC
>>   in
>>   .BR open (2).
>> +.TP
>> +.BR RWF_ATOMIC " (since Linux 6.7)"
>> +Write a block of data such that a write will never be
>> +torn from power fail or similar. See See the description
> Nit: Double 'See' here.

A bit more than a nit, it's something that needs fixing :)

Thanks,
John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-09 17:44   ` Darrick J. Wong
  2023-10-09 20:39     ` Dave Chinner
@ 2023-10-24 12:30     ` John Garry
  2023-10-24 15:39       ` Darrick J. Wong
  1 sibling, 1 reply; 18+ messages in thread
From: John Garry @ 2023-10-24 12:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-api, martin.petersen, david, himanshu.madhani

On 09/10/2023 18:44, Darrick J. Wong wrote:
> On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
>> From: Himanshu Madhani <himanshu.madhani@oracle.com>
>>
>> Add RWF_ATOMIC flag description for pwritev2().
>>
>> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
>> #jpg: complete rewrite
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/man2/readv.2 b/man2/readv.2
>> index fa9b0e4e44a2..ff09f3bc9792 100644
>> --- a/man2/readv.2
>> +++ b/man2/readv.2
>> @@ -193,6 +193,51 @@ which provides lower latency, but may use additional resources.
>>   .B O_DIRECT
>>   flag.)
>>   .TP
>> +.BR RWF_ATOMIC " (since Linux 6.7)"
>> +Allows block-based filesystems to indicate that write operations will be issued
> 
> "Require regular file write operations to be issued with torn write
> protection."

ok

> 
>> +with torn-write protection. Torn-write protection means that for a power or any
>> +other hardware failure, all or none of the data from the write will be stored,
>> +but never a mix of old and new data. This flag is meaningful only for
>> +.BR pwritev2 (),
>> +and its effect applies only to the data range written by the system call.
>> +The total write length must be power-of-2 and must be sized between
>> +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both inclusive. The
>> +write must be at a natural offset within the file with respect to the total
> 
> What is a "natural" offset?

I really meant naturally-aligned offset

>  That should be defined with more
> specificity.  Does that mean that the position of a XX-KiB write must
> also be aligned to XX-KiB?

Yes

>  e.g. a 32K untorn write can only start at a
> multiple of 32K? 

Correct

> What if the device supports untorn writes between 4K
> and 64K, does that mean I /cannot/ issue a 32K untorn write at offset
> 48K?

Correct

Do you think that an example would help?

> 
>> +write length. Torn-write protection only works with
>> +.B O_DIRECT
>> +flag, i.e. buffered writes are not supported. To guarantee consistency from
>> +the write between a file's in-core state with the storage device,
>> +.BR fdatasync (2)
>> +or
>> +.BR fsync (2)
>> +or
>> +.BR open (2)
>> +and
>> +.B O_SYNC
>> +or
>> +.B O_DSYNC
>> +or
>> +.B pwritev2 ()
>> +flag
>> +.B RWF_SYNC
>> +or
>> +.B RWF_DSYNC
>> +is required.
> 
> I'm starting to think that this manpage shouldn't be restating
> durability information here.
> 
> "Application programs with data or file integrity completion
> requirements must configure synchronous writes with the DSYNC
> or SYNC flags, as explained above."

ok

> 
>> +For when regular files are opened with
>> +.BR open (2)
>> +but without
>> +.B O_SYNC
>> +or
>> +.B O_DSYNC
>> +and the
>> +.BR pwritev2()
>> +call is made without
>> +.B RWF_SYNC
>> +or
>> +.BR RWF_DSYNC
>> +set, the range metadata must already be flushed to storage and the data range
>> +must not be in unwritten state, shared, a preallocation, or a hole.
> 
> I think that we can drop all of these flags requirements, since the
> contiguous small space allocation requirement means that the fs can
> provide all-or-nothing writes even if metadata updates are needed:
> 
> If the file range is allocated and marked unwritten (i.e. a
> preallocation), the ioend will clear the unwritten bit from the file
> mapping atomically.  After a crash, the application sees either zeroes
> or all the data that was written.
> 
> If the file range is shared, the ioend will map the COW staging extent
> into the file atomically.  After a crash, the application sees either
> the old contents from the old blocks, or the new contents from the new
> blocks.
> 
> If the file range is a sparse hole, the directio setup will allocate
> space and create an unwritten mapping before issuing the write bio.  The
> rest of the process works the same as preallocations and has the same
> behaviors.
> 
> If the file range is allocated and was previously written, the write is
> issued and that's all that's needed from the fs.  After a crash, reads
> of the storage device produce the old contents or the new contents.
> 
> Summarizing:
> 
> An (ATOMIC|SYNC) request provides the strongest guarantees (data
> will not be torn, and all file metadata updates are persisted before
> the write is returned to userspace.  Programs see either the old data or
> the new data, even if there's a crash.
> 
> (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> updates for just that region are persisted before the write is returned.
> 
> (ATOMIC) is the least strong -- data will not be torn.  Neither the
> filesystem nor the device make guarantees that anything ended up on
> stable storage, but if it does, programs see either the old data or the
> new data.
> 


Will respond to later mail in thread.

> Maybe we should rename the whole UAPI s/atomic/untorn/...
>  > --D
> 
>> +.TP
>>   .BR RWF_SYNC " (since Linux 4.7)"
>>   .\" commit e864f39569f4092c2b2bc72c773b6e486c7e3bd9
>>   Provide a per-write equivalent of the
>> -- 
>> 2.31.1
>>

Thanks,
John


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-09 21:05       ` Darrick J. Wong
@ 2023-10-24 12:35         ` John Garry
  2023-10-24 15:39           ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2023-10-24 12:35 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner
  Cc: linux-kernel, linux-api, martin.petersen, himanshu.madhani

On 09/10/2023 22:05, Darrick J. Wong wrote:
>>> If the file range is a sparse hole, the directio setup will allocate
>>> space and create an unwritten mapping before issuing the write bio.  The
>>> rest of the process works the same as preallocations and has the same
>>> behaviors.
>>>
>>> If the file range is allocated and was previously written, the write is
>>> issued and that's all that's needed from the fs.  After a crash, reads
>>> of the storage device produce the old contents or the new contents.
>> This is exactly what I explained when reviewing the code that
>> rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.
> I'm glad we agree. 😄
> 
> John, when you're back from vacation, can we get rid of this language
> and all those checks under _is_dsync() in the iomap patch?
> 
> (That code is 100% the result of me handwaving and bellyaching 6 months
> ago when the team was trying to get all the atomic writes bits working
> prior to LSF and I was too burned out to think the xfs part through.
> As a result, I decided that we'd only support strict overwrites for the
> first iteration.)

So this following additive code in iomap_dio_bio_iter() should be dropped:

----8<-----

--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -275,10 +275,11 @@ static inline blk_opf_t 
iomap_dio_bio_opflags(struct iomap_dio *dio,
  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
  		struct iomap_dio *dio)
  {

...

@@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct 
iomap_iter *iter,
  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
  		return -EINVAL;

+	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
+		if (iomap->flags & IOMAP_F_DIRTY)
+			return -EIO;
+		if (iomap->type != IOMAP_MAPPED)
+			return -EIO;
+	}
+

---->8-----

ok?

> 
>>> Summarizing:
>>>
>>> An (ATOMIC|SYNC) request provides the strongest guarantees (data
>>> will not be torn, and all file metadata updates are persisted before
>>> the write is returned to userspace.  Programs see either the old data or
>>> the new data, even if there's a crash.
>>>
>>> (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
>>> updates for just that region are persisted before the write is returned.
>>>
>>> (ATOMIC) is the least strong -- data will not be torn.  Neither the
>>> filesystem nor the device make guarantees that anything ended up on
>>> stable storage, but if it does, programs see either the old data or the
>>> new data.
>> Yup, that makes sense to me.
> Perhaps this ^^ is what we should be documenting here.
> 
>>> Maybe we should rename the whole UAPI s/atomic/untorn/...
>> Perhaps, though "torn writes" is nomenclature that nobody outside
>> storage and filesystem developers really knows about. All I ever
>> hear from userspace developers is "we want atomic/all-or-nothing
>> data writes"...
> Fair 'enuf.


Thanks,
John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-24 12:35         ` John Garry
@ 2023-10-24 15:39           ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2023-10-24 15:39 UTC (permalink / raw)
  To: John Garry
  Cc: Dave Chinner, linux-kernel, linux-api, martin.petersen, himanshu.madhani

On Tue, Oct 24, 2023 at 01:35:33PM +0100, John Garry wrote:
> On 09/10/2023 22:05, Darrick J. Wong wrote:
> > > > If the file range is a sparse hole, the directio setup will allocate
> > > > space and create an unwritten mapping before issuing the write bio.  The
> > > > rest of the process works the same as preallocations and has the same
> > > > behaviors.
> > > > 
> > > > If the file range is allocated and was previously written, the write is
> > > > issued and that's all that's needed from the fs.  After a crash, reads
> > > > of the storage device produce the old contents or the new contents.
> > > This is exactly what I explained when reviewing the code that
> > > rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.
> > I'm glad we agree. 😄
> > 
> > John, when you're back from vacation, can we get rid of this language
> > and all those checks under _is_dsync() in the iomap patch?
> > 
> > (That code is 100% the result of me handwaving and bellyaching 6 months
> > ago when the team was trying to get all the atomic writes bits working
> > prior to LSF and I was too burned out to think the xfs part through.
> > As a result, I decided that we'd only support strict overwrites for the
> > first iteration.)
> 
> So this following additive code in iomap_dio_bio_iter() should be dropped:
> 
> ----8<-----
> 
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -275,10 +275,11 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct
> iomap_dio *dio,
>  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		struct iomap_dio *dio)
>  {
> 
> ...
> 
> @@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct
> iomap_iter *iter,
>  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>  		return -EINVAL;
> 
> +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
> +		if (iomap->flags & IOMAP_F_DIRTY)
> +			return -EIO;
> +		if (iomap->type != IOMAP_MAPPED)
> +			return -EIO;
> +	}
> +
> 
> ---->8-----
> 
> ok?

Yes.

> > 
> > > > Summarizing:
> > > > 
> > > > An (ATOMIC|SYNC) request provides the strongest guarantees (data
> > > > will not be torn, and all file metadata updates are persisted before
> > > > the write is returned to userspace.  Programs see either the old data or
> > > > the new data, even if there's a crash.
> > > > 
> > > > (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> > > > updates for just that region are persisted before the write is returned.
> > > > 
> > > > (ATOMIC) is the least strong -- data will not be torn.  Neither the
> > > > filesystem nor the device make guarantees that anything ended up on
> > > > stable storage, but if it does, programs see either the old data or the
> > > > new data.
> > > Yup, that makes sense to me.
> > Perhaps this ^^ is what we should be documenting here.
> > 
> > > > Maybe we should rename the whole UAPI s/atomic/untorn/...
> > > Perhaps, though "torn writes" is nomenclature that nobody outside
> > > storage and filesystem developers really knows about. All I ever
> > > hear from userspace developers is "we want atomic/all-or-nothing
> > > data writes"...

How about O_NOTEARS -> PWF_NOTEARS -> REQ_NOTEARS.

<obligatory "There's no crying in baseball" link, etc.>

--D

> > Fair 'enuf.
> 
> 
> Thanks,
> John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
  2023-10-24 12:30     ` John Garry
@ 2023-10-24 15:39       ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2023-10-24 15:39 UTC (permalink / raw)
  To: John Garry
  Cc: linux-kernel, linux-api, martin.petersen, david, himanshu.madhani

On Tue, Oct 24, 2023 at 01:30:03PM +0100, John Garry wrote:
> On 09/10/2023 18:44, Darrick J. Wong wrote:
> > On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
> > > From: Himanshu Madhani <himanshu.madhani@oracle.com>
> > > 
> > > Add RWF_ATOMIC flag description for pwritev2().
> > > 
> > > Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> > > #jpg: complete rewrite
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 45 insertions(+)
> > > 
> > > diff --git a/man2/readv.2 b/man2/readv.2
> > > index fa9b0e4e44a2..ff09f3bc9792 100644
> > > --- a/man2/readv.2
> > > +++ b/man2/readv.2
> > > @@ -193,6 +193,51 @@ which provides lower latency, but may use additional resources.
> > >   .B O_DIRECT
> > >   flag.)
> > >   .TP
> > > +.BR RWF_ATOMIC " (since Linux 6.7)"
> > > +Allows block-based filesystems to indicate that write operations will be issued
> > 
> > "Require regular file write operations to be issued with torn write
> > protection."
> 
> ok
> 
> > 
> > > +with torn-write protection. Torn-write protection means that for a power or any
> > > +other hardware failure, all or none of the data from the write will be stored,
> > > +but never a mix of old and new data. This flag is meaningful only for
> > > +.BR pwritev2 (),
> > > +and its effect applies only to the data range written by the system call.
> > > +The total write length must be power-of-2 and must be sized between
> > > +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both inclusive. The
> > > +write must be at a natural offset within the file with respect to the total
> > 
> > What is a "natural" offset?
> 
> I really meant naturally-aligned offset
> 
> >  That should be defined with more
> > specificity.  Does that mean that the position of a XX-KiB write must
> > also be aligned to XX-KiB?
> 
> Yes
> 
> >  e.g. a 32K untorn write can only start at a
> > multiple of 32K?
> 
> Correct
> 
> > What if the device supports untorn writes between 4K
> > and 64K, does that mean I /cannot/ issue a 32K untorn write at offset
> > 48K?
> 
> Correct
> 
> Do you think that an example would help?

Yes.

> > 
> > > +write length. Torn-write protection only works with
> > > +.B O_DIRECT
> > > +flag, i.e. buffered writes are not supported. To guarantee consistency from
> > > +the write between a file's in-core state with the storage device,
> > > +.BR fdatasync (2)
> > > +or
> > > +.BR fsync (2)
> > > +or
> > > +.BR open (2)
> > > +and
> > > +.B O_SYNC
> > > +or
> > > +.B O_DSYNC
> > > +or
> > > +.B pwritev2 ()
> > > +flag
> > > +.B RWF_SYNC
> > > +or
> > > +.B RWF_DSYNC
> > > +is required.
> > 
> > I'm starting to think that this manpage shouldn't be restating
> > durability information here.
> > 
> > "Application programs with data or file integrity completion
> > requirements must configure synchronous writes with the DSYNC
> > or SYNC flags, as explained above."
> 
> ok
> 
> > 
> > > +For when regular files are opened with
> > > +.BR open (2)
> > > +but without
> > > +.B O_SYNC
> > > +or
> > > +.B O_DSYNC
> > > +and the
> > > +.BR pwritev2()
> > > +call is made without
> > > +.B RWF_SYNC
> > > +or
> > > +.BR RWF_DSYNC
> > > +set, the range metadata must already be flushed to storage and the data range
> > > +must not be in unwritten state, shared, a preallocation, or a hole.
> > 
> > I think that we can drop all of these flags requirements, since the
> > contiguous small space allocation requirement means that the fs can
> > provide all-or-nothing writes even if metadata updates are needed:
> > 
> > If the file range is allocated and marked unwritten (i.e. a
> > preallocation), the ioend will clear the unwritten bit from the file
> > mapping atomically.  After a crash, the application sees either zeroes
> > or all the data that was written.
> > 
> > If the file range is shared, the ioend will map the COW staging extent
> > into the file atomically.  After a crash, the application sees either
> > the old contents from the old blocks, or the new contents from the new
> > blocks.
> > 
> > If the file range is a sparse hole, the directio setup will allocate
> > space and create an unwritten mapping before issuing the write bio.  The
> > rest of the process works the same as preallocations and has the same
> > behaviors.
> > 
> > If the file range is allocated and was previously written, the write is
> > issued and that's all that's needed from the fs.  After a crash, reads
> > of the storage device produce the old contents or the new contents.
> > 
> > Summarizing:
> > 
> > An (ATOMIC|SYNC) request provides the strongest guarantees (data
> > will not be torn, and all file metadata updates are persisted before
> > the write is returned to userspace.  Programs see either the old data or
> > the new data, even if there's a crash.
> > 
> > (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> > updates for just that region are persisted before the write is returned.
> > 
> > (ATOMIC) is the least strong -- data will not be torn.  Neither the
> > filesystem nor the device make guarantees that anything ended up on
> > stable storage, but if it does, programs see either the old data or the
> > new data.
> > 
> 
> 
> Will respond to later mail in thread.

Ok, thank you!

--D

> > Maybe we should rename the whole UAPI s/atomic/untorn/...
> >  > --D
> > 
> > > +.TP
> > >   .BR RWF_SYNC " (since Linux 4.7)"
> > >   .\" commit e864f39569f4092c2b2bc72c773b6e486c7e3bd9
> > >   Provide a per-write equivalent of the
> > > -- 
> > > 2.31.1
> > > 
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-10-24 15:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-29  9:37 [PATCH 0/4] man2: Document RWF_ATOMIC John Garry
2023-09-29  9:37 ` [PATCH 1/4] statx.2: Document STATX_WRITE_ATOMIC John Garry
2023-09-29  9:37 ` [PATCH 2/4] readv.2: Document RWF_ATOMIC flag John Garry
2023-10-03 19:25   ` Bart Van Assche
2023-10-04  8:47     ` John Garry
2023-10-04 17:36       ` Bart Van Assche
2023-10-04 22:48       ` Dave Chinner
2023-10-09 17:44   ` Darrick J. Wong
2023-10-09 20:39     ` Dave Chinner
2023-10-09 21:05       ` Darrick J. Wong
2023-10-24 12:35         ` John Garry
2023-10-24 15:39           ` Darrick J. Wong
2023-10-24 12:30     ` John Garry
2023-10-24 15:39       ` Darrick J. Wong
2023-09-29  9:37 ` [PATCH 3/4] man2/open.2: Document RWF_ATOMIC John Garry
2023-09-29  9:37 ` [PATCH 4/4] io_submit.2: " John Garry
2023-10-09 17:45   ` Darrick J. Wong
2023-10-24 11:51     ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).