linux-api.vger.kernel.org archive mirror
* [PATCH v6 00/11] fs: interface for directly reading/writing compressed data
@ 2020-11-18 19:18 Omar Sandoval
  2020-11-18 19:18 ` [PATCH man-pages v6] Document encoded I/O Omar Sandoval
                   ` (11 more replies)
  0 siblings, 12 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

This series adds an API for reading compressed data on a filesystem
without decompressing it as well as support for writing compressed data
directly to the filesystem. As with the previous submissions, I've
included a man page patch describing the API. I have test cases
(including fsstress support) and example programs which I'll send up
[1].

The main use-case is Btrfs send/receive: currently, when sending data
from one compressed filesystem to another, the sending side decompresses
the data and the receiving side recompresses it before writing it out.
This is wasteful and can be avoided if we can just send and write
compressed extents. The patches implementing the send/receive support
will be sent shortly.

Patches 1-3 add the VFS support and UAPI. Patches 4 and 5 are fixes for
patches in the Btrfs misc-next branch that conflicted with this series;
they can go into misc-next or be folded into the original patches
independently. Patches 6-9 are Btrfs prep patches. Patch 10 adds Btrfs
encoded read support and patch 11 adds Btrfs encoded write support.

These patches are based on Dave Sterba's Btrfs misc-next branch [2],
which is in turn currently based on v5.10-rc4.

Changes since v5 [3]:

- Made O_CLOEXEC mandatory in conjunction with O_ALLOW_ENCODED.
- Added _BTRFS to the ENCODED_IOV_COMPRESSION names (e.g.,
  ENCODED_IOV_COMPRESSION_ZSTD -> ENCODED_IOV_COMPRESSION_BTRFS_ZSTD).
- Split ENCODED_IOV_COMPRESSION_LZO compression mode into separate modes
  per page size. I missed that the ill-conceived Btrfs LZO format
  depends on PAGE_SIZE. Having a separate compression mode for each
  supported page size at least lets us detect mismatches.
- Fixed up other minor comments from v5.
- Added reviewed-bys.

1: https://github.com/osandov/xfstests/tree/rwf-encoded
2: https://github.com/kdave/btrfs-devel/tree/misc-next
3: https://lore.kernel.org/linux-btrfs/cover.1597993855.git.osandov@osandov.com/

Omar Sandoval (11):
  iov_iter: add copy_struct_from_iter()
  fs: add O_ALLOW_ENCODED open flag
  fs: add RWF_ENCODED for reading/writing compressed data
  btrfs: fix btrfs_write_check()
  btrfs: fix check_data_csum() error message for direct I/O
  btrfs: don't advance offset for compressed bios in
    btrfs_csum_one_bio()
  btrfs: add ram_bytes and offset to btrfs_ordered_extent
  btrfs: support different disk extent size for delalloc
  btrfs: optionally extend i_size in cow_file_range_inline()
  btrfs: implement RWF_ENCODED reads
  btrfs: implement RWF_ENCODED writes

 Documentation/filesystems/encoded_io.rst |  74 ++
 Documentation/filesystems/index.rst      |   1 +
 arch/alpha/include/uapi/asm/fcntl.h      |   1 +
 arch/parisc/include/uapi/asm/fcntl.h     |   1 +
 arch/sparc/include/uapi/asm/fcntl.h      |   1 +
 fs/btrfs/compression.c                   |  12 +-
 fs/btrfs/compression.h                   |   6 +-
 fs/btrfs/ctree.h                         |   9 +-
 fs/btrfs/delalloc-space.c                |  18 +-
 fs/btrfs/file-item.c                     |  35 +-
 fs/btrfs/file.c                          |  73 +-
 fs/btrfs/inode.c                         | 933 ++++++++++++++++++++---
 fs/btrfs/ordered-data.c                  |  80 +-
 fs/btrfs/ordered-data.h                  |  18 +-
 fs/btrfs/relocation.c                    |   4 +-
 fs/fcntl.c                               |  10 +-
 fs/namei.c                               |   4 +
 fs/open.c                                |   7 +
 fs/read_write.c                          | 167 +++-
 include/linux/fcntl.h                    |   2 +-
 include/linux/fs.h                       |  11 +
 include/linux/uio.h                      |   2 +
 include/uapi/asm-generic/fcntl.h         |   4 +
 include/uapi/linux/fs.h                  |  41 +-
 lib/iov_iter.c                           |  82 ++
 25 files changed, 1384 insertions(+), 212 deletions(-)
 create mode 100644 Documentation/filesystems/encoded_io.rst

-- 
2.29.2


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH man-pages v6] Document encoded I/O
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-19 23:29   ` Alejandro Colomar (mailing lists; readonly)
  2020-11-18 19:18 ` [PATCH v6 01/11] iov_iter: add copy_struct_from_iter() Omar Sandoval
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team, Michael Kerrisk, linux-man

From: Omar Sandoval <osandov@fb.com>

This adds a new page, encoded_io(7), providing an overview of encoded
I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
reference it.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: linux-man <linux-man@vger.kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
This feature is not yet upstream.
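
As a reading aid for the "Extent layout" section of encoded_io(7) below,
here is a minimal sketch of the documented relationship between len,
unencoded_len, and unencoded_offset. It is an illustration only: struct
encoded_iov_sketch is a local stand-in for the proposed struct encoded_iov
(which is not in released headers), and encoded_iov_layout_ok() is a
hypothetical helper, not something this series adds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local stand-in for the proposed struct encoded_iov; the field order
 * follows the man page text. Assumption: plain uint64_t/uint32_t are
 * close enough to __aligned_u64/__u32 for illustration.
 */
struct encoded_iov_sketch {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/*
 * Hypothetical helper checking the invariants stated in the man page:
 *   unencoded_offset <= unencoded_len
 *   len <= unencoded_len - unencoded_offset
 */
static bool encoded_iov_layout_ok(const struct encoded_iov_sketch *e)
{
	if (e->unencoded_offset > e->unencoded_len)
		return false;
	return e->len <= e->unencoded_len - e->unencoded_offset;
}
```

With a 4096-byte decompressed extent, the three cases the page describes
are {len=4096, unencoded_len=4096, unencoded_offset=0} for a whole extent,
{4046, 4096, 50} for a read starting 50 bytes in, and {500, 4096, 0} for
data padded to a 4096-byte block; all three satisfy the check.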

 man2/fcntl.2      |  10 +-
 man2/open.2       |  23 +++
 man2/readv.2      |  70 +++++++++
 man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 471 insertions(+), 1 deletion(-)
 create mode 100644 man7/encoded_io.7

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index 546016617..b0d7fa2c3 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -221,8 +221,9 @@ On Linux, this command can change only the
 .BR O_ASYNC ,
 .BR O_DIRECT ,
 .BR O_NOATIME ,
+.BR O_NONBLOCK ,
 and
-.B O_NONBLOCK
+.B O_ALLOW_ENCODED
 flags.
 It is not possible to change the
 .BR O_DSYNC
@@ -1820,6 +1821,13 @@ Attempted to clear the
 flag on a file that has the append-only attribute set.
 .TP
 .B EPERM
+Attempted to set the
+.B O_ALLOW_ENCODED
+flag and the calling process did not have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
+.B EPERM
 .I cmd
 was
 .BR F_ADD_SEALS ,
diff --git a/man2/open.2 b/man2/open.2
index f587b0d95..84697dfa8 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -437,6 +437,16 @@ was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.B O_ALLOW_ENCODED
+Open the file with encoded I/O permissions;
+see
+.BR encoded_io (7).
+.B O_CLOEXEC
+must be specified in conjunction with this flag.
+The caller must have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -1082,6 +1092,14 @@ is invalid
 (e.g., it contains characters not permitted by the underlying filesystem).
 .TP
 .B EINVAL
+.B O_ALLOW_ENCODED
+was specified in
+.IR flags ,
+but
+.B O_CLOEXEC
+was not specified.
+.TP
+.B EINVAL
 The final component ("basename") of
 .I pathname
 is invalid
@@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B O_ALLOW_ENCODED
+flag was specified, but the caller was not privileged.
+.TP
 .B EROFS
 .I pathname
 refers to a file on a read-only filesystem and write access was
diff --git a/man2/readv.2 b/man2/readv.2
index 5a8b74168..c9933acf0 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -264,6 +264,11 @@ the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.12)"
+Read or write encoded (e.g., compressed) data.
+See
+.BR encoded_io (7).
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -283,6 +288,13 @@ than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B
+RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the number of encoded bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
 The errors are as given for
@@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
 .TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the file was not opened with the
+.B O_ALLOW_ENCODED
+flag.
+.PP
+.BR preadv2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+is not large enough to return the encoding metadata.
+.TP
+.B ENOBUFS
+.B RWF_ENCODED
+is specified in
+.I flags
+and the buffers in
+.I iov
+are not big enough to return the encoded data.
+.PP
+.BR pwritev2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+contains non-zero fields
+after the kernel's
+.IR "sizeof(struct\ encoded_iov)" .
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the encoding is unknown or not supported by the filesystem.
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
 .SH VERSIONS
 .BR preadv ()
 and
diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
new file mode 100644
index 000000000..106fa587b
--- /dev/null
+++ b/man7/encoded_io.7
@@ -0,0 +1,369 @@
+.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\"
+.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
+.SH NAME
+encoded_io \- overview of encoded I/O
+.SH DESCRIPTION
+Several filesystems (e.g., Btrfs) support transparent encoding
+(e.g., compression, encryption) of data on disk:
+written data is encoded by the kernel before it is written to disk,
+and read data is decoded before being returned to the user.
+In some cases, it is useful to skip this encoding step.
+For example, the user may want to read the compressed contents of a file
+or write pre-compressed data directly to a file.
+This is referred to as "encoded I/O".
+.SS Encoded I/O API
+Encoded I/O is specified with the
+.B RWF_ENCODED
+flag to
+.BR preadv2 (2)
+and
+.BR pwritev2 (2).
+If
+.B RWF_ENCODED
+is specified, then
+.I iov[0].iov_base
+points to an
+.I
+encoded_iov
+structure, defined in
+.I <linux/fs.h>
+as:
+.PP
+.in +4n
+.EX
+struct encoded_iov {
+    __aligned_u64 len;
+    __aligned_u64 unencoded_len;
+    __aligned_u64 unencoded_offset;
+    __u32 compression;
+    __u32 encryption;
+};
+.EE
+.in
+.PP
+This may be extended in the future, so
+.I iov[0].iov_len
+must be set to
+.I "sizeof(struct\ encoded_iov)"
+for forward/backward compatibility.
+The remaining buffers contain the encoded data.
+.PP
+.I compression
+and
+.I encryption
+are the encoding fields.
+.I compression
+is
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero)
+or a filesystem-specific
+.B ENCODED_IOV_COMPRESSION
+constant;
+see
+.BR Filesystem\ support .
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.PP
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+.I unencoded_offset
+is the offset into the unencoded data where the data in the file begins
+(less than or equal to
+.IR unencoded_len ).
+.I len
+is the length of the data in the file
+(less than or equal to
+.I unencoded_len
+-
+.IR unencoded_offset ).
+See
+.B Extent layout
+below for some examples.
+.I
+.PP
+If the unencoded data is actually longer than
+.IR unencoded_len ,
+then it is truncated;
+if it is shorter, then it is extended with zeroes.
+.PP
+
+.BR pwritev2 ()
+uses the metadata specified in
+.IR iov[0] ,
+writes the encoded data from the remaining buffers,
+and returns the number of encoded bytes written
+(that is, the sum of
+.I iov[n].iov_len
+for 1 <=
+.I n
+<
+.IR iovcnt ;
+partial writes will not occur).
+At least one encoding field must be non-zero.
+Note that the encoded data is not validated when it is written;
+if it is not valid (e.g., it cannot be decompressed),
+then a subsequent read may return an error.
+If the
+.I offset
+argument to
+.BR pwritev2 ()
+is -1, then the file offset is incremented by
+.IR len .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel,
+then any fields unknown to userspace are treated as if they were zero;
+if it is greater and any fields unknown to the kernel are non-zero,
+then this returns -1 and sets
+.I errno
+to
+.BR E2BIG .
+.PP
+.BR preadv2 ()
+populates the metadata in
+.IR iov[0] ,
+the encoded data in the remaining buffers,
+and returns the number of encoded bytes read.
+This will only return one extent per call.
+This can also read data which is not encoded;
+all encoding fields will be zero in that case.
+If the
+.I offset
+argument to
+.BR preadv2 ()
+is -1, then the file offset is incremented by
+.IR len .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel and any fields unknown to userspace are non-zero,
+then
+.BR preadv2 ()
+returns -1 and sets
+.I errno
+to
+.BR E2BIG ;
+if it is greater,
+then any fields unknown to the kernel are returned as zero.
+If the provided buffers are not large enough to return an entire encoded
+extent,
+then
+.BR preadv2 ()
+returns -1 and sets
+.I errno
+to
+.BR ENOBUFS .
+.PP
+As the filesystem page cache typically contains decoded data,
+encoded I/O bypasses the page cache.
+.SS Extent layout
+By using
+.IR len ,
+.IR unencoded_len ,
+and
+.IR unencoded_offset ,
+it is possible to refer to a subset of an unencoded extent.
+.PP
+In the simplest case,
+.I len
+is equal to
+.I unencoded_len
+and
+.I unencoded_offset
+is zero.
+This means that the entire unencoded extent is used.
+.PP
+However, suppose we read 50 bytes into a file
+which contains a single compressed extent.
+The filesystem must still return the entire compressed extent
+for us to be able to decompress it,
+so
+.I unencoded_len
+would be the length of the entire decompressed extent.
+However, because the read was at offset 50,
+the first 50 bytes should be ignored.
+Therefore,
+.I unencoded_offset
+would be 50,
+and
+.I len
+would accordingly be
+.IR unencoded_len\ -\ 50 .
+.PP
+Additionally, suppose we want to create an encrypted file with length 500,
+but the file is encrypted with a block cipher using a block size of 4096.
+The unencoded data would therefore include the appropriate padding,
+and
+.I unencoded_len
+would be 4096.
+However, to represent the logical size of the file,
+.I len
+would be 500
+(and
+.I unencoded_offset
+would be 0).
+.PP
+Similar situations can arise in other cases:
+.IP * 3
+If the filesystem pads data to the filesystem block size before compressing,
+then compressed files with a size unaligned to the filesystem block size will
+end with an extent with
+.I len
+<
+.IR unencoded_len .
+.IP *
+Extents cloned from the middle of a larger encoded extent with
+.B FICLONERANGE
+may have a non-zero
+.I unencoded_offset
+and/or
+.I len
+<
+.IR unencoded_len .
+.IP *
+If the middle of an encoded extent is overwritten,
+the filesystem may create extents with a non-zero
+.I unencoded_offset
+and/or
+.I len
+<
+.I unencoded_len
+for the parts that were not overwritten.
+.SS Security
+Encoded I/O creates the potential for some security issues:
+.IP * 3
+Encoded writes allow writing arbitrary data which the kernel will decode on
+a subsequent read. Decompression algorithms are complex and may have bugs
+which can be exploited by maliciously crafted data.
+.IP *
+Encoded reads may return data which is not logically present in the file
+(see the discussion of
+.I len
+vs.
+.I unencoded_len
+above).
+It may not be intended for this data to be readable.
+.PP
+Therefore, encoded I/O requires privilege.
+Namely, the
+.B RWF_ENCODED
+flag may only be used when the file was opened with the
+.B O_ALLOW_ENCODED
+flag to
+.BR open (2),
+which requires the
+.B CAP_SYS_ADMIN
+capability.
+The
+.B O_CLOEXEC
+flag must be specified in conjunction with
+.BR O_ALLOW_ENCODED .
+This avoids accidentally leaking the encoded I/O privilege
+(it is not cleared on
+.BR fork (2)
+or
+.BR execve (2)
+otherwise).
+If
+.B O_ALLOW_ENCODED
+without
+.B O_CLOEXEC
+is desired,
+.B O_CLOEXEC
+can be cleared afterwards with
+.BR fcntl (2).
+.BR fcntl (2)
+can also clear or set
+.B O_ALLOW_ENCODED
+(including without
+.BR O_CLOEXEC ).
+.SS Filesystem support
+Encoded I/O is supported on the following filesystems:
+.TP
+Btrfs (since Linux 5.12)
+.IP
+Btrfs supports encoded reads and writes of compressed data.
+The data is encoded as follows:
+.RS
+.IP * 3
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
+then the encoded data is a single zlib stream.
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
+then the encoded data is a single zstd frame compressed with the
+.I windowLog
+compression parameter set to no more than 17.
+.IP *
+If
+.I compression
+is one of
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
+or
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
+then the encoded data is compressed page by page
+(using the page size indicated by the name of the constant)
+with LZO1X
+and wrapped in the format documented in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.RE
+.IP
+Additionally, there are some restrictions on
+.BR pwritev2 ():
+.RS
+.IP * 3
+.I offset
+(or the current file offset if
+.I offset
+is -1) must be aligned to the sector size of the filesystem.
+.IP *
+.I len
+must be aligned to the sector size of the filesystem
+unless the data ends at or beyond the current end of the file.
+.IP *
+.I unencoded_len
+and the length of the encoded data must each be no more than 128 KiB.
+This limit may increase in the future.
+.IP *
+The length of the encoded data must be less than or equal to
+.IR unencoded_len .
+.IP *
+If using LZO, the filesystem's page size must match the compression page size.
+.RE
-- 
2.29.2



* [PATCH v6 01/11] iov_iter: add copy_struct_from_iter()
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
  2020-11-18 19:18 ` [PATCH man-pages v6] Document encoded I/O Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-18 19:18 ` [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

This is essentially copy_struct_from_user() but for an iov_iter.
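
For readers who have not used copy_struct_from_user(), the size
negotiation it implements can be modelled in userspace roughly as follows.
This is only a sketch under stated assumptions: copy_struct_model() is a
hypothetical name, it works on a flat buffer instead of an iov_iter, and
it is not the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model of the copy_struct_from_user() rules:
 *  - usize <= ksize: copy usize bytes and zero-fill the rest of dst
 *    (an older caller that does not know the newer fields).
 *  - usize >  ksize: copy ksize bytes; every byte past ksize must be
 *    zero, otherwise the caller set a field unknown to us: -E2BIG.
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t copy = usize < ksize ? usize : ksize;
	size_t i;

	/* Reject non-zero bytes in the part of src we cannot represent. */
	for (i = ksize; i < usize; i++) {
		if (s[i])
			return -E2BIG;
	}
	memcpy(dst, s, copy);
	memset((unsigned char *)dst + copy, 0, ksize - copy);
	return 0;
}
```

In this model a caller passing a shorter struct gets the missing fields
treated as zero, and a caller passing a longer struct succeeds only if the
extra tail is all zeroes.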

Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/uio.h |  2 ++
 lib/iov_iter.c      | 82 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 72d88566694e..f4e6ea85a269 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -121,6 +121,8 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
+int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i,
+			  size_t usize);
 
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..9696cc981590 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -948,6 +948,88 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+/**
+ * copy_struct_from_iter - copy a struct from an iov_iter
+ * @dst: Destination buffer.
+ * @ksize: Size of @dst struct.
+ * @i: Source iterator.
+ * @usize: (Alleged) size of struct in @i.
+ *
+ * Copies a struct from an iov_iter in a way that guarantees
+ * backwards-compatibility for struct arguments in an iovec (as long as the
+ * rules for copy_struct_from_user() are followed).
+ *
+ * The recommended usage is that @usize be taken from the current segment:
+ *
+ *   int do_foo(struct iov_iter *i)
+ *   {
+ *     size_t usize = iov_iter_single_seg_count(i);
+ *     struct foo karg;
+ *     int err;
+ *
+ *     if (usize > PAGE_SIZE)
+ *       return -E2BIG;
+ *     if (usize < FOO_SIZE_VER0)
+ *       return -EINVAL;
+ *     err = copy_struct_from_iter(&karg, sizeof(karg), i, usize);
+ *     if (err)
+ *       return err;
+ *
+ *     // ...
+ *   }
+ *
+ * Return: 0 on success, -errno on error (see copy_struct_from_user()).
+ *
+ * On success, the iterator is advanced @usize bytes. On error, the iterator is
+ * not advanced.
+ */
+int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i,
+			  size_t usize)
+{
+	if (usize <= ksize) {
+		if (!copy_from_iter_full(dst, usize, i))
+			return -EFAULT;
+		memset(dst + usize, 0, ksize - usize);
+	} else {
+		size_t copied = 0, copy;
+		int ret;
+
+		if (WARN_ON(iov_iter_is_pipe(i)) || unlikely(i->count < usize))
+			return -EFAULT;
+		if (iter_is_iovec(i))
+			might_fault();
+		iterate_all_kinds(i, usize, v, ({
+			copy = min(ksize - copied, v.iov_len);
+			if (copy && copyin(dst + copied, v.iov_base, copy))
+				return -EFAULT;
+			copied += copy;
+			ret = check_zeroed_user(v.iov_base + copy,
+						v.iov_len - copy);
+			if (ret <= 0)
+				return ret ?: -E2BIG;
+			0;}), ({
+			char *addr = kmap_atomic(v.bv_page);
+			copy = min_t(size_t, ksize - copied, v.bv_len);
+			memcpy(dst + copied, addr + v.bv_offset, copy);
+			copied += copy;
+			ret = memchr_inv(addr + v.bv_offset + copy, 0,
+					 v.bv_len - copy) ? -E2BIG : 0;
+			kunmap_atomic(addr);
+			if (ret)
+				return ret;
+			}), ({
+			copy = min(ksize - copied, v.iov_len);
+			memcpy(dst + copied, v.iov_base, copy);
+			if (memchr_inv(v.iov_base, 0, v.iov_len))
+				return -E2BIG;
+			})
+		)
+		iov_iter_advance(i, usize);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(copy_struct_from_iter);
+
 static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-- 
2.29.2



* [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
  2020-11-18 19:18 ` [PATCH man-pages v6] Document encoded I/O Omar Sandoval
  2020-11-18 19:18 ` [PATCH v6 01/11] iov_iter: add copy_struct_from_iter() Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-19  7:02   ` Amir Goldstein
  2020-11-18 19:18 ` [PATCH v6 03/11] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

The upcoming RWF_ENCODED operation introduces some security concerns:

1. Compressed writes will pass arbitrary data to decompression
   algorithms in the kernel.
2. Compressed reads can leak truncated/hole punched data.

Therefore, we need to require privilege for RWF_ENCODED. It's not
possible to do the permissions checks at the time of the read or write
because, e.g., io_uring submits IO from a worker thread. So, add an open
flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
fcntl(). The flag is not cleared in any way on fork or exec. It must be
combined with O_CLOEXEC when opening to avoid accidental leaks (if
needed, it may be set without O_CLOEXEC by using fcntl()).

Note that the usual issue that unknown open flags are ignored doesn't
really matter for O_ALLOW_ENCODED; if the kernel doesn't support
O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either.
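
The resulting rules can be summarized with a small userspace model. This
is a sketch, not the kernel code: check_open_flags() and check_setfl() are
hypothetical names, the cap_sys_admin argument stands in for
capable(CAP_SYS_ADMIN), and the O_ALLOW_ENCODED value is the asm-generic
one proposed in this patch (it is not in released headers).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* Assumption: asm-generic value proposed by this patch; not upstream. */
#ifndef O_ALLOW_ENCODED
#define O_ALLOW_ENCODED 040000000
#endif

/*
 * Model of the open()-time rules: O_ALLOW_ENCODED without O_CLOEXEC is
 * rejected with -EINVAL, and the flag itself needs privilege.
 */
static int check_open_flags(int flags, bool cap_sys_admin)
{
	if ((flags & (O_ALLOW_ENCODED | O_CLOEXEC)) == O_ALLOW_ENCODED)
		return -EINVAL;
	if ((flags & O_ALLOW_ENCODED) && !cap_sys_admin)
		return -EPERM;
	return 0;
}

/*
 * Model of the F_SETFL rule: only newly setting the flag needs
 * privilege; keeping or clearing it does not.
 */
static int check_setfl(int old_flags, int new_flags, bool cap_sys_admin)
{
	if ((new_flags & O_ALLOW_ENCODED) && !(old_flags & O_ALLOW_ENCODED) &&
	    !cap_sys_admin)
		return -EPERM;
	return 0;
}
```

For example, in this model O_RDONLY | O_ALLOW_ENCODED yields -EINVAL
regardless of privilege, while adding O_CLOEXEC yields 0 for a privileged
caller and -EPERM otherwise.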

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 arch/alpha/include/uapi/asm/fcntl.h  |  1 +
 arch/parisc/include/uapi/asm/fcntl.h |  1 +
 arch/sparc/include/uapi/asm/fcntl.h  |  1 +
 fs/fcntl.c                           | 10 ++++++++--
 fs/namei.c                           |  4 ++++
 fs/open.c                            |  7 +++++++
 include/linux/fcntl.h                |  2 +-
 include/uapi/asm-generic/fcntl.h     |  4 ++++
 8 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..391e0d112e41 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
 
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
+#define O_ALLOW_ENCODED	0200000000
 
 #define F_GETLK		7
 #define F_SETLK		8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03dee816cb13..72ea9bdf5f04 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -19,6 +19,7 @@
 
 #define O_PATH		020000000
 #define __O_TMPFILE	040000000
+#define O_ALLOW_ENCODED	0100000000
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..ac3e8c9cb32c 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_ALLOW_ENCODED	0x8000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 19ac5baad50f..9302f68fe698 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -30,7 +30,8 @@
 #include <asm/siginfo.h>
 #include <linux/uaccess.h>
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+		    O_ALLOW_ENCODED)
 
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
@@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 
+	/* O_ALLOW_ENCODED can only be set by superuser */
+	if ((arg & O_ALLOW_ENCODED) && !(filp->f_flags & O_ALLOW_ENCODED) &&
+	    !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	/* required for strict SunOS emulation */
 	if (O_NONBLOCK != O_NDELAY)
 	       if (arg & O_NDELAY)
@@ -1033,7 +1039,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index d4a6dd772303..fbf64ce61088 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2890,6 +2890,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
 	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
 		return -EPERM;
 
+	/* O_ALLOW_ENCODED can only be set by superuser */
+	if ((flag & O_ALLOW_ENCODED) && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	return 0;
 }
 
diff --git a/fs/open.c b/fs/open.c
index 9af548fb841b..f2863aaf78e7 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1040,6 +1040,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		acc_mode = 0;
 	}
 
+	/*
+	 * O_ALLOW_ENCODED must be combined with O_CLOEXEC to avoid accidentally
+	 * leaking encoded I/O privileges.
+	 */
+	if ((how->flags & (O_ALLOW_ENCODED | O_CLOEXEC)) == O_ALLOW_ENCODED)
+		return -EINVAL;
+
 	/*
 	 * O_SYNC is implemented as __O_SYNC|O_DSYNC.  As many places only
 	 * check for O_DSYNC if the need any syncing at all we enforce it's
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 921e750843e6..dc66c557b7d0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ALLOW_ENCODED)
 
 /* List of all valid flags for the how->upgrade_mask argument: */
 #define VALID_UPGRADE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..75321c7a66ac 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_ALLOW_ENCODED
+#define O_ALLOW_ENCODED	040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
-- 
2.29.2



* [PATCH v6 03/11] fs: add RWF_ENCODED for reading/writing compressed data
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (2 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-19  7:38   ` Amir Goldstein
  2020-11-18 19:18 ` [PATCH v6 04/11] btrfs: fix btrfs_write_check() Omar Sandoval
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

Btrfs supports transparent compression: data written by the user can be
compressed when written to disk and decompressed when read back.
However, we'd like to add an interface to write pre-compressed data
directly to the filesystem, and the matching interface to read
compressed data without decompressing it. This adds support for
so-called "encoded I/O" via preadv2() and pwritev2().

A new RWF_ENCODED flag indicates that a read or write is "encoded". If
this flag is set, iov[0].iov_base points to a struct encoded_iov which
is used for metadata: namely, the compression algorithm, unencoded
(i.e., decompressed) length, and what subrange of the unencoded data
should be used (needed for truncated or hole-punched extents and when
reading in the middle of an extent). For reads, the filesystem returns
this information; for writes, the caller provides it to the filesystem.
iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
used to extend the interface in the future a la copy_struct_from_user().
The remaining iovecs contain the encoded extent.
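
As a rough userspace illustration of the layout described above, the
sketch below fills a two-element iovec array for an encoded write. The
struct and constants are local copies of the definitions proposed in
this patch (they are not in released headers), and
fill_encoded_iovecs() is a hypothetical helper name, not part of the
series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local copy of the layout proposed in this patch. */
struct encoded_iov {
	uint64_t len;              /* length of the data in the file */
	uint64_t unencoded_len;    /* length after decoding */
	uint64_t unencoded_offset; /* start of the used subrange */
	uint32_t compression;
	uint32_t encryption;
};

#define ENCODED_IOV_SIZE_VER0 32
/* Enum value taken from this patch. */
#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB 1

static_assert(sizeof(struct encoded_iov) == ENCODED_IOV_SIZE_VER0,
	      "encoded_iov must match the version 0 size");

/*
 * Hypothetical helper: iov[0] carries the metadata, iov[1] the
 * already-compressed bytes of a whole, untruncated extent.
 */
static void fill_encoded_iovecs(struct iovec iov[2], struct encoded_iov *meta,
				void *compressed, size_t compressed_len,
				uint64_t unencoded_len)
{
	memset(meta, 0, sizeof(*meta));
	meta->len = unencoded_len;
	meta->unencoded_len = unencoded_len;
	meta->unencoded_offset = 0;
	meta->compression = ENCODED_IOV_COMPRESSION_BTRFS_ZLIB;

	iov[0].iov_base = meta;
	iov[0].iov_len = sizeof(*meta);
	iov[1].iov_base = compressed;
	iov[1].iov_len = compressed_len;
}
```

On a kernel carrying this series, the array would then be passed to
pwritev2() with RWF_ENCODED on a file descriptor opened with
O_ALLOW_ENCODED.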

This adds the VFS helpers for supporting encoded I/O and documentation
for filesystem support.
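
For reference, the sanity checks that copy_encoded_iov_from_iter()
applies to the metadata can be mirrored in userspace before submitting
a write. This is only a sketch: the struct is a local copy of the one
in this patch, the two limits are the enum maxima from this patch, and
encoded_iov_check() is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

/* Local copy of the layout proposed in this patch. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* Highest values defined by the enums in this patch. */
#define COMPRESSION_TYPES_MAX 7 /* ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K */
#define ENCRYPTION_TYPES_MAX  0 /* ENCODED_IOV_ENCRYPTION_NONE */

/* Returns 0 if the metadata passes the generic checks, -EINVAL otherwise. */
static int encoded_iov_check(const struct encoded_iov *e)
{
	/* At least one encoding must be in effect. */
	if (e->compression == 0 && e->encryption == 0)
		return -EINVAL;
	/* Unknown encodings are rejected. */
	if (e->compression > COMPRESSION_TYPES_MAX ||
	    e->encryption > ENCRYPTION_TYPES_MAX)
		return -EINVAL;
	/* The used subrange must lie inside the unencoded data. */
	if (e->unencoded_offset > e->unencoded_len)
		return -EINVAL;
	if (e->len > e->unencoded_len - e->unencoded_offset)
		return -EINVAL;
	return 0;
}
```

Checking the offset before subtracting keeps the last comparison free
of unsigned wraparound, which is the same ordering the kernel helper
uses.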

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 Documentation/filesystems/encoded_io.rst |  74 ++++++++++
 Documentation/filesystems/index.rst      |   1 +
 fs/read_write.c                          | 167 +++++++++++++++++++++--
 include/linux/fs.h                       |  11 ++
 include/uapi/linux/fs.h                  |  41 +++++-
 5 files changed, 280 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/encoded_io.rst

diff --git a/Documentation/filesystems/encoded_io.rst b/Documentation/filesystems/encoded_io.rst
new file mode 100644
index 000000000000..50405276d866
--- /dev/null
+++ b/Documentation/filesystems/encoded_io.rst
@@ -0,0 +1,74 @@
+===========
+Encoded I/O
+===========
+
+Encoded I/O is a mechanism for reading and writing encoded (e.g., compressed
+and/or encrypted) data directly from/to the filesystem. The userspace interface
+is thoroughly described in the :manpage:`encoded_io(7)` man page; this document
+describes the requirements for filesystem support.
+
+First of all, a filesystem supporting encoded I/O must indicate this by setting
+the ``FMODE_ENCODED_IO`` flag in its ``open`` file operation::
+
+    static int foo_file_open(struct inode *inode, struct file *filp)
+    {
+            ...
+            filp->f_mode |= FMODE_ENCODED_IO;
+            ...
+    }
+
+Encoded I/O goes through ``read_iter`` and ``write_iter``, designated by the
+``IOCB_ENCODED`` flag in ``kiocb->ki_flags``.
+
+Reads
+=====
+
+Encoded ``read_iter`` should:
+
+1. Call ``generic_encoded_read_checks()`` to validate the file and buffers
+   provided by userspace.
+2. Initialize the ``encoded_iov`` appropriately.
+3. Copy it to the user with ``copy_encoded_iov_to_iter()``.
+4. Copy the encoded data to the user.
+5. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
+6. Return the size of the encoded data read, not including the ``encoded_iov``.
+
+There are a few details to be aware of:
+
+* Encoded ``read_iter`` should support reading unencoded data if the extent is
+  not encoded.
+* If the buffers provided by the user are not large enough to contain an entire
+  encoded extent, then ``read_iter`` should return ``-ENOBUFS``. This is to
+  avoid confusing userspace with truncated data that cannot be properly
+  decoded.
+* Reads in the middle of an encoded extent can be returned by setting
+  ``encoded_iov->unencoded_offset`` to non-zero.
+* Truncated unencoded data (e.g., because the file does not end on a block
+  boundary) may be returned by setting ``encoded_iov->len`` to a value smaller
+  than ``encoded_iov->unencoded_len - encoded_iov->unencoded_offset``.
+
+Writes
+======
+
+Encoded ``write_iter`` should (in addition to the usual accounting/checks done
+by ``write_iter``):
+
+1. Call ``copy_encoded_iov_from_iter()`` to get and validate the
+   ``encoded_iov``.
+2. Call ``generic_encoded_write_checks()`` instead of
+   ``generic_write_checks()``.
+3. Check that the provided encoding in ``encoded_iov`` is supported.
+4. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
+5. Return the size of the encoded data written.
+
+Again, there are a few details:
+
+* Encoded ``write_iter`` doesn't need to support writing unencoded data.
+* ``write_iter`` should either write all of the encoded data or none of it; it
+  must not do partial writes.
+* ``write_iter`` doesn't need to validate the encoded data; a subsequent read
+  may return, e.g., ``-EIO`` if the data is not valid.
+* The user may lie about the unencoded size of the data; a subsequent read
+  should truncate or zero-extend the unencoded data rather than returning an
+  error.
+* Be careful of page cache coherency.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 98f59a864242..6d9e3ff0a455 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -53,6 +53,7 @@ filesystem implementations.
    journalling
    fscrypt
    fsverity
+   encoded_io
 
 Filesystems
 ===========
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..e2ad418d2987 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1625,24 +1625,15 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 	return 0;
 }
 
-/*
- * Performs necessary checks before doing a write
- *
- * Can adjust writing position or amount of bytes to write.
- * Returns appropriate error code that caller should return or
- * zero in case that write should be allowed.
- */
-ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	loff_t count;
-	int ret;
 
 	if (IS_SWAPFILE(inode))
 		return -ETXTBSY;
 
-	if (!iov_iter_count(from))
+	if (!*count)
 		return 0;
 
 	/* FIXME: this is for backwards compatibility with 2.4 */
@@ -1652,8 +1643,22 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EINVAL;
 
-	count = iov_iter_count(from);
-	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
+}
+
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Can adjust writing position or amount of bytes to write.
+ * Returns appropriate error code that caller should return or
+ * zero in case that write should be allowed.
+ */
+ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	loff_t count = iov_iter_count(from);
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
 	if (ret)
 		return ret;
 
@@ -1684,3 +1689,139 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
 
 	return 0;
 }
+
+/**
+ * generic_encoded_write_checks() - check an encoded write
+ * @iocb: I/O context.
+ * @encoded: Encoding metadata.
+ *
+ * This should be called by RWF_ENCODED write implementations rather than
+ * generic_write_checks(). Unlike generic_write_checks(), it returns -EFBIG
+ * instead of adjusting the size of the write.
+ *
+ * Return: 0 on success, -errno on error.
+ */
+int generic_encoded_write_checks(struct kiocb *iocb,
+				 const struct encoded_iov *encoded)
+{
+	loff_t count = encoded->len;
+	int ret;
+
+	if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
+		return -EPERM;
+
+	ret = generic_write_checks_common(iocb, &count);
+	if (ret)
+		return ret;
+
+	if (count != encoded->len) {
+		/*
+		 * The write got truncated by generic_write_checks_common(). We
+		 * can't do a partial encoded write.
+		 */
+		return -EFBIG;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(generic_encoded_write_checks);
+
+/**
+ * copy_encoded_iov_from_iter() - copy a &struct encoded_iov from userspace
+ * @encoded: Returned encoding metadata.
+ * @from: Source iterator.
+ *
+ * This copies in the &struct encoded_iov and does some basic sanity checks.
+ * This should always be used rather than a plain copy_from_iter(), as it does
+ * the proper handling for backward- and forward-compatibility.
+ *
+ * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if the
+ *         copied structure contained non-zero fields that this kernel doesn't
+ *         support, -EINVAL if the copied structure was invalid.
+ */
+int copy_encoded_iov_from_iter(struct encoded_iov *encoded,
+			       struct iov_iter *from)
+{
+	size_t usize;
+	int ret;
+
+	usize = iov_iter_single_seg_count(from);
+	if (usize > PAGE_SIZE)
+		return -E2BIG;
+	if (usize < ENCODED_IOV_SIZE_VER0)
+		return -EINVAL;
+	ret = copy_struct_from_iter(encoded, sizeof(*encoded), from, usize);
+	if (ret)
+		return ret;
+
+	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
+	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
+		return -EINVAL;
+	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
+	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
+		return -EINVAL;
+	if (encoded->unencoded_offset > encoded->unencoded_len)
+		return -EINVAL;
+	if (encoded->len > encoded->unencoded_len - encoded->unencoded_offset)
+		return -EINVAL;
+	return 0;
+}
+EXPORT_SYMBOL(copy_encoded_iov_from_iter);
+
+/**
+ * generic_encoded_read_checks() - sanity check an RWF_ENCODED read
+ * @iocb: I/O context.
+ * @iter: Destination iterator for read.
+ *
+ * This should always be called by RWF_ENCODED read implementations before
+ * returning any data.
+ *
+ * Return: Number of bytes available to return encoded data in @iter on success,
+ *         -EPERM if the file was not opened with O_ALLOW_ENCODED, -EINVAL if
+ *         the size of the &struct encoded_iov iovec is invalid.
+ */
+ssize_t generic_encoded_read_checks(struct kiocb *iocb, struct iov_iter *iter)
+{
+	size_t usize;
+
+	if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
+		return -EPERM;
+	usize = iov_iter_single_seg_count(iter);
+	if (usize > PAGE_SIZE || usize < ENCODED_IOV_SIZE_VER0)
+		return -EINVAL;
+	return iov_iter_count(iter) - usize;
+}
+EXPORT_SYMBOL(generic_encoded_read_checks);
+
+/**
+ * copy_encoded_iov_to_iter() - copy a &struct encoded_iov to userspace
+ * @encoded: Encoding metadata to return.
+ * @to: Destination iterator.
+ *
+ * This should always be used by RWF_ENCODED read implementations rather than a
+ * plain copy_to_iter(), as it does the proper handling for backward- and
+ * forward-compatibility. The iterator must be sanity-checked with
+ * generic_encoded_read_checks() before this is called.
+ *
+ * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if there
+ *         were non-zero fields in @encoded that the user buffer could not
+ *         accommodate.
+ */
+int copy_encoded_iov_to_iter(const struct encoded_iov *encoded,
+			     struct iov_iter *to)
+{
+	size_t ksize = sizeof(*encoded);
+	size_t usize = iov_iter_single_seg_count(to);
+	size_t size = min(ksize, usize);
+
+	/* We already sanity-checked usize in generic_encoded_read_checks(). */
+
+	if (usize < ksize &&
+	    memchr_inv((char *)encoded + usize, 0, ksize - usize))
+		return -E2BIG;
+	if (copy_to_iter(encoded, size, to) != size ||
+	    (usize > ksize &&
+	     iov_iter_zero(usize - ksize, to) != usize - ksize))
+		return -EFAULT;
+	return 0;
+}
+EXPORT_SYMBOL(copy_encoded_iov_to_iter);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8667d0cdc71e..67810bf6fb1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -178,6 +178,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports async buffered reads */
 #define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
 
+/* File supports encoded IO */
+#define FMODE_ENCODED_IO	((__force fmode_t)0x80000000)
+
 /*
  * Attribute flags.  These should be or-ed together to figure out what
  * has been changed!
@@ -308,6 +311,7 @@ enum rw_hint {
 #define IOCB_SYNC		(__force int) RWF_SYNC
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
+#define IOCB_ENCODED		(__force int) RWF_ENCODED
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -2964,6 +2968,13 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_write_check_limits(struct file *file, loff_t pos,
 		loff_t *count);
+struct encoded_iov;
+extern int generic_encoded_write_checks(struct kiocb *,
+					const struct encoded_iov *);
+extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *);
+extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *);
+extern int copy_encoded_iov_to_iter(const struct encoded_iov *,
+				    struct iov_iter *);
 extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
 extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		struct iov_iter *to, ssize_t already_read);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f44eb0a04afd..95493420117a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -279,6 +279,42 @@ struct fsxattr {
 					 SYNC_FILE_RANGE_WAIT_BEFORE | \
 					 SYNC_FILE_RANGE_WAIT_AFTER)
 
+enum {
+	ENCODED_IOV_COMPRESSION_NONE,
+#define ENCODED_IOV_COMPRESSION_NONE ENCODED_IOV_COMPRESSION_NONE
+	ENCODED_IOV_COMPRESSION_BTRFS_ZLIB,
+#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ENCODED_IOV_COMPRESSION_BTRFS_ZLIB
+	ENCODED_IOV_COMPRESSION_BTRFS_ZSTD,
+#define ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ENCODED_IOV_COMPRESSION_BTRFS_ZSTD
+	ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K,
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K
+	ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K,
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K
+	ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K,
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K
+	ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K,
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K
+	ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K
+	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
+};
+
+enum {
+	ENCODED_IOV_ENCRYPTION_NONE,
+#define ENCODED_IOV_ENCRYPTION_NONE ENCODED_IOV_ENCRYPTION_NONE
+	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
+};
+
+struct encoded_iov {
+	__aligned_u64 len;
+	__aligned_u64 unencoded_len;
+	__aligned_u64 unencoded_offset;
+	__u32 compression;
+	__u32 encryption;
+};
+
+#define ENCODED_IOV_SIZE_VER0 32
+
 /*
  * Flags for preadv2/pwritev2:
  */
@@ -300,8 +336,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* encoded (e.g., compressed and/or encrypted) IO */
+#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ENCODED)
 
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.29.2



* [PATCH v6 04/11] btrfs: fix btrfs_write_check()
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (3 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 03/11] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-23 17:08   ` David Sterba
  2020-11-18 19:18 ` [PATCH v6 05/11] btrfs: fix check_data_csum() error message for direct I/O Omar Sandoval
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

btrfs_write_check() has two related bugs:

1. It gets the iov_iter count before calling generic_write_checks(), but
   generic_write_checks() may truncate the iov_iter.
2. It returns the count or negative errno as a size_t, which the callers
   cast to an int. If the count is greater than INT_MAX, this overflows.
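
The second bug can be illustrated outside the kernel with a small
model. This is not the btrfs code; count_fits_in_int() and
store_in_int() are hypothetical names that only mimic returning a
size_t and storing it in an int.

```c
#include <limits.h>
#include <stddef.h>

/* True if a byte count can be returned through an int without loss. */
static int count_fits_in_int(size_t count)
{
	return count <= (size_t)INT_MAX;
}

/*
 * Model of the old call chain: the helper returns a size_t and the
 * caller stores it in an int. For out-of-range values the result of
 * the conversion is implementation-defined; on common ABIs it wraps.
 */
static int store_in_int(size_t ret)
{
	return (int)ret;
}
```

On typical Linux ABIs, store_in_int((size_t)INT_MAX + 1) yields a
negative number, so a caller that tests the result with "<= 0" would
treat a large but valid write as a failure.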

To fix both of these, pull the call to generic_write_checks() out of
btrfs_write_check(), use the new iov_iter count returned from
generic_write_checks(), and have btrfs_write_check() return 0 or a
negative errno as an int instead of the count. This rearrangement also
paves the way for RWF_ENCODED write support.

Fixes: f945968ff64c ("btrfs: introduce btrfs_write_check()")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/file.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d217b739b164..7225b63b62a9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1583,21 +1583,17 @@ static void update_time_for_write(struct inode *inode)
 		inode_inc_iversion(inode);
 }
 
-static size_t btrfs_write_check(struct kiocb *iocb, struct iov_iter *from)
+static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
+			     size_t count)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	loff_t pos = iocb->ki_pos;
-	size_t count = iov_iter_count(from);
 	int err;
 	loff_t oldsize;
 	loff_t start_pos;
 
-	err = generic_write_checks(iocb, from);
-	if (err <= 0)
-		return err;
-
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		size_t nocow_bytes = count;
 
@@ -1639,7 +1635,7 @@ static size_t btrfs_write_check(struct kiocb *iocb, struct iov_iter *from)
 		}
 	}
 
-	return count;
+	return 0;
 }
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
@@ -1656,7 +1652,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 	u64 lockend;
 	size_t num_written = 0;
 	int nrptrs;
-	int ret = 0;
+	ssize_t ret;
 	bool only_release_metadata = false;
 	bool force_page_uptodate = false;
 	loff_t old_isize = i_size_read(inode);
@@ -1669,10 +1665,14 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 	if (ret < 0)
 		return ret;
 
-	ret = btrfs_write_check(iocb, i);
+	ret = generic_write_checks(iocb, i);
 	if (ret <= 0)
 		goto out;
 
+	ret = btrfs_write_check(iocb, i, ret);
+	if (ret < 0)
+		goto out;
+
 	pos = iocb->ki_pos;
 	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
 			PAGE_SIZE / (sizeof(struct page *)));
@@ -1904,7 +1904,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t written = 0;
 	ssize_t written_buffered;
 	loff_t endbyte;
-	int err;
+	ssize_t err;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio = NULL;
 
@@ -1920,8 +1920,14 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (err < 0)
 		return err;
 
-	err = btrfs_write_check(iocb, from);
+	err = generic_write_checks(iocb, from);
 	if (err <= 0) {
+		btrfs_inode_unlock(inode, ilock_flags);
+		return err;
+	}
+
+	err = btrfs_write_check(iocb, from, err);
+	if (err < 0) {
 		btrfs_inode_unlock(inode, ilock_flags);
 		goto out;
 	}
-- 
2.29.2



* [PATCH v6 05/11] btrfs: fix check_data_csum() error message for direct I/O
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (4 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 04/11] btrfs: fix btrfs_write_check() Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-23 17:09   ` David Sterba
  2020-11-18 19:18 ` [PATCH v6 06/11] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to
check_data_csum()") replaced the start parameter to check_data_csum()
with page_offset(), but page_offset() is not meaningful for direct I/O
pages. Bring back the start parameter.

Fixes: 1dae796aabf6 ("btrfs: inode: sink parameter start and len to check_data_csum()")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index abc0fd162f6c..c5fa1bd3dfe7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2939,11 +2939,12 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
  * @icsum:	checksum index in the io_bio->csum array, size of csum_size
  * @page:	page where is the data to be verified
  * @pgoff:	offset inside the page
+ * @start:	logical offset in the file
  *
  * The length of such check is always one sector size.
  */
 static int check_data_csum(struct inode *inode, struct btrfs_io_bio *io_bio,
-			   int icsum, struct page *page, int pgoff)
+			   int icsum, struct page *page, int pgoff, u64 start)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
@@ -2968,8 +2969,8 @@ static int check_data_csum(struct inode *inode, struct btrfs_io_bio *io_bio,
 	kunmap_atomic(kaddr);
 	return 0;
 zeroit:
-	btrfs_print_data_csum_error(BTRFS_I(inode), page_offset(page) + pgoff,
-				    csum, csum_expected, io_bio->mirror_num);
+	btrfs_print_data_csum_error(BTRFS_I(inode), start, csum, csum_expected,
+				    io_bio->mirror_num);
 	if (io_bio->device)
 		btrfs_dev_stat_inc_and_print(io_bio->device,
 					     BTRFS_DEV_STAT_CORRUPTION_ERRS);
@@ -3010,7 +3011,7 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
 	}
 
 	phy_offset >>= root->fs_info->sectorsize_bits;
-	return check_data_csum(inode, io_bio, phy_offset, page, offset);
+	return check_data_csum(inode, io_bio, phy_offset, page, offset, start);
 }
 
 /*
@@ -7733,7 +7734,8 @@ static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
 			ASSERT(pgoff < PAGE_SIZE);
 			if (uptodate &&
 			    (!csum || !check_data_csum(inode, io_bio, icsum,
-						       bvec.bv_page, pgoff))) {
+						       bvec.bv_page, pgoff,
+						       start))) {
 				clean_io_failure(fs_info, failure_tree, io_tree,
 						 start, bvec.bv_page,
 						 btrfs_ino(BTRFS_I(inode)),
-- 
2.29.2



* [PATCH v6 06/11] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (5 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 05/11] btrfs: fix check_data_csum() error message for direct I/O Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-18 19:18 ` [PATCH v6 07/11] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

btrfs_csum_one_bio() loops over each filesystem block in the bio while
keeping a cursor of its current logical position in the file in order to
look up the ordered extent to add the checksums to. However, this
doesn't make much sense for compressed extents, as a sector on disk does
not correspond to a sector of decompressed file data. It happens to work
because 1) the compressed bio always covers one ordered extent and 2)
the size of the bio is always less than the size of the ordered extent.
However, the second point will not always be true for encoded writes.

Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
it can assume that the bio only covers one ordered extent. Since we're
already changing the signature, let's get rid of the contig parameter
and make it implied by the offset parameter, similar to the change we
recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
nr_sectors to blockcount to make it clear that it's the number of
filesystem blocks, not the number of 512-byte sectors.
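
To make the problem concrete, here is a toy userspace model of the
cursor logic. It is a simplification, not the kernel function: struct
ordered_model and count_out_of_range() are hypothetical, and the model
only counts how many blocks would fall outside the ordered extent (and
so trigger a new lookup) rather than performing one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the two ordered extent fields the check uses. */
struct ordered_model {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/*
 * Walk a bio of bio_bytes in sectorsize steps, advancing the file
 * offset cursor once per on-disk block unless one_ordered is set.
 */
static unsigned int count_out_of_range(const struct ordered_model *ordered,
				       uint64_t offset, uint64_t bio_bytes,
				       uint32_t sectorsize, bool one_ordered)
{
	unsigned int misses = 0;
	uint64_t done;

	assert(sectorsize != 0);
	for (done = 0; done < bio_bytes; done += sectorsize) {
		if (!one_ordered &&
		    (offset >= ordered->file_offset + ordered->num_bytes ||
		     offset < ordered->file_offset))
			misses++;
		if (!one_ordered)
			offset += sectorsize;
	}
	return misses;
}
```

With an ordered extent of 4096 file bytes and an 8192-byte compressed
bio, which is the kind of case the message says encoded writes can
produce, the second block lands outside the extent unless one_ordered
is true.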

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/compression.c |  5 +++--
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/file-item.c   | 35 ++++++++++++++++-------------------
 fs/btrfs/inode.c       |  8 ++++----
 4 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4e022ed72d2f..eaa6fe21c08e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -436,7 +436,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			BUG_ON(ret); /* -ENOMEM */
 
 			if (!skip_sum) {
-				ret = btrfs_csum_one_bio(inode, bio, start, 1);
+				ret = btrfs_csum_one_bio(inode, bio, start,
+							 true);
 				BUG_ON(ret); /* -ENOMEM */
 			}
 
@@ -468,7 +469,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	BUG_ON(ret); /* -ENOMEM */
 
 	if (!skip_sum) {
-		ret = btrfs_csum_one_bio(inode, bio, start, 1);
+		ret = btrfs_csum_one_bio(inode, bio, start, true);
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index cb90a870b235..09536ecd62c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3027,7 +3027,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct btrfs_ordered_sum *sums);
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-				u64 file_start, int contig);
+				u64 offset, bool one_ordered);
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 			     struct list_head *list, int search_commit);
 void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1a5651bebbaf..200cf1be774d 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -516,28 +516,28 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
  * btrfs_csum_one_bio - Calculates checksums of the data contained inside a bio
  * @inode:	 Owner of the data inside the bio
  * @bio:	 Contains the data to be checksummed
- * @file_start:  offset in file this bio begins to describe
- * @contig:	 Boolean. If true/1 means all bio vecs in this bio are
- *		 contiguous and they begin at @file_start in the file. False/0
- *		 means this bio can contains potentially discontigous bio vecs
- *		 so the logical offset of each should be calculated separately.
+ * @offset:      If (u64)-1, @bio may contain discontiguous bio vecs, so the
+ *               file offsets are determined from the page offsets in the bio.
+ *               Otherwise, this is the starting file offset of the bio vecs in
+ *               @bio, which must be contiguous.
+ * @one_ordered: If true, @bio only refers to one ordered extent.
  */
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-		       u64 file_start, int contig)
+				u64 offset, bool one_ordered)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	struct btrfs_ordered_sum *sums;
 	struct btrfs_ordered_extent *ordered = NULL;
+	const bool page_offsets = (offset == (u64)-1);
 	char *data;
 	struct bvec_iter iter;
 	struct bio_vec bvec;
 	int index;
-	int nr_sectors;
+	int blockcount;
 	unsigned long total_bytes = 0;
 	unsigned long this_sum_bytes = 0;
 	int i;
-	u64 offset;
 	unsigned nofs_flag;
 
 	nofs_flag = memalloc_nofs_save();
@@ -551,18 +551,13 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 	sums->len = bio->bi_iter.bi_size;
 	INIT_LIST_HEAD(&sums->list);
 
-	if (contig)
-		offset = file_start;
-	else
-		offset = 0; /* shut up gcc */
-
 	sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
 	index = 0;
 
 	shash->tfm = fs_info->csum_shash;
 
 	bio_for_each_segment(bvec, bio, iter) {
-		if (!contig)
+		if (page_offsets)
 			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
 
 		if (!ordered) {
@@ -570,13 +565,14 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 			BUG_ON(!ordered); /* Logic error */
 		}
 
-		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info,
+		blockcount = BTRFS_BYTES_TO_BLKS(fs_info,
 						 bvec.bv_len + fs_info->sectorsize
 						 - 1);
 
-		for (i = 0; i < nr_sectors; i++) {
-			if (offset >= ordered->file_offset + ordered->num_bytes ||
-			    offset < ordered->file_offset) {
+		for (i = 0; i < blockcount; i++) {
+			if (!one_ordered &&
+			    (offset >= ordered->file_offset + ordered->num_bytes ||
+			     offset < ordered->file_offset)) {
 				unsigned long bytes_left;
 
 				sums->len = this_sum_bytes;
@@ -607,7 +603,8 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 					    sums->sums + index);
 			kunmap_atomic(data);
 			index += fs_info->csum_size;
-			offset += fs_info->sectorsize;
+			if (!one_ordered)
+				offset += fs_info->sectorsize;
 			this_sum_bytes += fs_info->sectorsize;
 			total_bytes += fs_info->sectorsize;
 		}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c5fa1bd3dfe7..31e8f3cd18c2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2206,7 +2206,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
 					   u64 bio_offset)
 {
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
+	return btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false);
 }
 
 /*
@@ -2274,7 +2274,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 					  0, btrfs_submit_bio_start);
 		goto out;
 	} else if (!skip_sum) {
-		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
+		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false);
 		if (ret)
 			goto out;
 	}
@@ -7808,7 +7808,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
 static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
 						     struct bio *bio, u64 offset)
 {
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, offset, 1);
+	return btrfs_csum_one_bio(BTRFS_I(inode), bio, offset, false);
 }
 
 static void btrfs_end_dio_bio(struct bio *bio)
@@ -7867,7 +7867,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 		 * If we aren't doing async submit, calculate the csum of the
 		 * bio now.
 		 */
-		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, 1);
+		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, false);
 		if (ret)
 			goto err;
 	} else {
-- 
2.29.2



* [PATCH v6 07/11] btrfs: add ram_bytes and offset to btrfs_ordered_extent
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (6 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 06/11] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-18 19:18 ` [PATCH v6 08/11] btrfs: support different disk extent size for delalloc Omar Sandoval
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, we only create ordered extents when ram_bytes == num_bytes
and offset == 0. However, RWF_ENCODED writes may create extents which
only refer to a subset of the full unencoded extent, so we need to plumb
these fields through the ordered extent infrastructure and pass them
down to insert_reserved_file_extent().
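
As a rough illustration of how these fields relate (a sketch only: the
struct and helper names below are hypothetical and are not the kernel
definitions), a whole extent has offset == 0 and num_bytes == ram_bytes,
while a partial reference is a slice that must lie inside the unencoded
data:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the fields named in the patch; not the kernel struct. */
struct extent_fields {
	uint64_t file_offset;    /* logical offset in the file */
	uint64_t num_bytes;      /* logical length in the file */
	uint64_t ram_bytes;      /* full length of the unencoded data */
	uint64_t disk_num_bytes; /* size of the (possibly encoded) extent on disk */
	uint64_t offset;         /* where the file data starts in the unencoded data */
};

/* The only shape the commit message says was created before this change. */
static bool extent_is_whole(const struct extent_fields *e)
{
	return e->offset == 0 && e->num_bytes == e->ram_bytes;
}

/* The referenced range [offset, offset + num_bytes) must fit in ram_bytes. */
static bool extent_slice_is_valid(const struct extent_fields *e)
{
	return e->offset <= e->ram_bytes &&
	       e->num_bytes <= e->ram_bytes - e->offset;
}

static void extent_fields_example(void)
{
	/* 128 KiB of data stored in 16 KiB on disk, referenced in full. */
	struct extent_fields whole = {
		.file_offset = 0, .num_bytes = 131072, .ram_bytes = 131072,
		.disk_num_bytes = 16384, .offset = 0,
	};
	/* Same data, but the file refers only to 8 KiB starting 4 KiB in. */
	struct extent_fields slice = {
		.file_offset = 0, .num_bytes = 8192, .ram_bytes = 131072,
		.disk_num_bytes = 16384, .offset = 4096,
	};

	assert(extent_is_whole(&whole) && extent_slice_is_valid(&whole));
	assert(!extent_is_whole(&slice) && extent_slice_is_valid(&slice));
}
```

The sizes in the example are made up; the point is only that the second
shape cannot be described without ram_bytes and offset.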

Since we're changing the btrfs_add_ordered_extent* signature, let's get
rid of the trivial wrappers and add a kernel-doc comment.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c        | 56 ++++++++++++++++++---------------
 fs/btrfs/ordered-data.c | 68 ++++++++++++++++-------------------------
 fs/btrfs/ordered-data.h | 16 ++++------
 3 files changed, 64 insertions(+), 76 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 31e8f3cd18c2..8f261be36d1b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -913,13 +913,12 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			goto out_free_reserve;
 		free_extent_map(em);
 
-		ret = btrfs_add_ordered_extent_compress(inode,
-						async_extent->start,
-						ins.objectid,
-						async_extent->ram_size,
-						ins.offset,
-						BTRFS_ORDERED_COMPRESSED,
-						async_extent->compress_type);
+		ret = btrfs_add_ordered_extent(inode, async_extent->start,
+					       async_extent->ram_size,
+					       async_extent->ram_size,
+					       ins.objectid, ins.offset, 0,
+					       1 << BTRFS_ORDERED_COMPRESSED,
+					       async_extent->compress_type);
 		if (ret) {
 			btrfs_drop_extent_cache(inode, async_extent->start,
 						async_extent->start +
@@ -1127,8 +1126,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		}
 		free_extent_map(em);
 
-		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
-					       ram_size, cur_alloc_size, 0);
+		ret = btrfs_add_ordered_extent(inode, start, ram_size, ram_size,
+					       ins.objectid, cur_alloc_size, 0,
+					       0, BTRFS_COMPRESS_NONE);
 		if (ret)
 			goto out_drop_extent_cache;
 
@@ -1760,10 +1760,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 				goto error;
 			}
 			free_extent_map(em);
-			ret = btrfs_add_ordered_extent(inode, cur_offset,
-						       disk_bytenr, num_bytes,
-						       num_bytes,
-						       BTRFS_ORDERED_PREALLOC);
+			ret = btrfs_add_ordered_extent(inode,
+					cur_offset, num_bytes, num_bytes,
+					disk_bytenr, num_bytes, 0,
+					1 << BTRFS_ORDERED_PREALLOC,
+					BTRFS_COMPRESS_NONE);
 			if (ret) {
 				btrfs_drop_extent_cache(inode, cur_offset,
 							cur_offset + num_bytes - 1,
@@ -1772,9 +1773,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			}
 		} else {
 			ret = btrfs_add_ordered_extent(inode, cur_offset,
+						       num_bytes, num_bytes,
 						       disk_bytenr, num_bytes,
-						       num_bytes,
-						       BTRFS_ORDERED_NOCOW);
+						       0,
+						       1 << BTRFS_ORDERED_NOCOW,
+						       BTRFS_COMPRESS_NONE);
 			if (ret)
 				goto error;
 		}
@@ -2578,6 +2581,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_key ins;
 	u64 disk_num_bytes = btrfs_stack_file_extent_disk_num_bytes(stack_fi);
 	u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi);
+	u64 offset = btrfs_stack_file_extent_offset(stack_fi);
 	u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi);
 	u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi);
 	struct btrfs_drop_extents_args drop_args = { 0 };
@@ -2652,7 +2656,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 		goto out;
 
 	ret = btrfs_alloc_reserved_file_extent(trans, root, btrfs_ino(inode),
-					       file_pos, qgroup_reserved, &ins);
+					       file_pos - offset,
+					       qgroup_reserved, &ins);
 out:
 	btrfs_free_path(path);
 
@@ -2678,20 +2683,20 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 					     struct btrfs_ordered_extent *oe)
 {
 	struct btrfs_file_extent_item stack_fi;
-	u64 logical_len;
 	bool update_inode_bytes;
+	u64 num_bytes = oe->num_bytes;
+	u64 ram_bytes = oe->ram_bytes;
 
 	memset(&stack_fi, 0, sizeof(stack_fi));
 	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG);
 	btrfs_set_stack_file_extent_disk_bytenr(&stack_fi, oe->disk_bytenr);
 	btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi,
 						   oe->disk_num_bytes);
+	btrfs_set_stack_file_extent_offset(&stack_fi, oe->offset);
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags))
-		logical_len = oe->truncated_len;
-	else
-		logical_len = oe->num_bytes;
-	btrfs_set_stack_file_extent_num_bytes(&stack_fi, logical_len);
-	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, logical_len);
+		num_bytes = ram_bytes = oe->truncated_len;
+	btrfs_set_stack_file_extent_num_bytes(&stack_fi, num_bytes);
+	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, ram_bytes);
 	btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type);
 	/* Encryption and other encoding is reserved and all 0 */
 
@@ -7042,8 +7047,11 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		if (IS_ERR(em))
 			goto out;
 	}
-	ret = btrfs_add_ordered_extent_dio(inode, start, block_start, len,
-					   block_len, type);
+	ret = btrfs_add_ordered_extent(inode, start, len, len, block_start,
+				       block_len, 0,
+				       (1 << type) |
+				       (1 << BTRFS_ORDERED_DIRECT),
+				       BTRFS_COMPRESS_NONE);
 	if (ret) {
 		if (em) {
 			free_extent_map(em);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0d61f9fefc02..76dc47315016 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -153,16 +153,27 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
 	return ret;
 }
 
-/*
- * Allocate and add a new ordered_extent into the per-inode tree.
+/**
+ * btrfs_add_ordered_extent - Add an ordered extent to the per-inode tree.
+ * @inode: inode that this extent is for.
+ * @file_offset: Logical offset in file where the extent starts.
+ * @num_bytes: Logical length of extent in file.
+ * @ram_bytes: Full length of unencoded data.
+ * @disk_bytenr: Offset of extent on disk.
+ * @disk_num_bytes: Size of extent on disk.
+ * @offset: Offset into unencoded data where file data starts.
+ * @flags: Flags specifying type of extent (1 << BTRFS_ORDERED_*).
+ * @compress_type: Compression algorithm used for data.
  *
- * The tree is given a single reference on the ordered extent that was
- * inserted.
+ * Most of these parameters correspond to &struct btrfs_file_extent_item. The
+ * tree is given a single reference on the ordered extent that was inserted.
+ *
+ * Return: 0 or -ENOMEM.
  */
-static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int type, int dio,
-				      int compress_type)
+int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, int flags,
+			     int compress_type)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -171,7 +182,8 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	struct btrfs_ordered_extent *entry;
 	int ret;
 
-	if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
+	if (flags &
+	    ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC))) {
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
 		if (ret < 0)
@@ -191,21 +203,21 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 		return -ENOMEM;
 
 	entry->file_offset = file_offset;
-	entry->disk_bytenr = disk_bytenr;
 	entry->num_bytes = num_bytes;
+	entry->ram_bytes = ram_bytes;
+	entry->disk_bytenr = disk_bytenr;
 	entry->disk_num_bytes = disk_num_bytes;
+	entry->offset = offset;
 	entry->bytes_left = num_bytes;
 	entry->inode = igrab(&inode->vfs_inode);
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
 	entry->qgroup_rsv = ret;
-	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
-		set_bit(type, &entry->flags);
 
-	if (dio) {
+	entry->flags = flags;
+	if (flags & (1 << BTRFS_ORDERED_DIRECT)) {
 		percpu_counter_add_batch(&fs_info->dio_bytes, num_bytes,
 					 fs_info->delalloc_batch);
-		set_bit(BTRFS_ORDERED_DIRECT, &entry->flags);
 	}
 
 	/* one ref for the tree */
@@ -252,34 +264,6 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	return 0;
 }
 
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
-			     int type)
-{
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes, type, 0,
-					  BTRFS_COMPRESS_NONE);
-}
-
-int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset,
-				 u64 disk_bytenr, u64 num_bytes,
-				 u64 disk_num_bytes, int type)
-{
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes, type, 1,
-					  BTRFS_COMPRESS_NONE);
-}
-
-int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int type,
-				      int compress_type)
-{
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes, type, 0,
-					  compress_type);
-}
-
 /*
  * Add a struct btrfs_ordered_sum into the list of checksums to be inserted
  * when an ordered extent is finished.  If the list covers more than one
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 367269effd6a..f2798c1fc7b0 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -72,9 +72,11 @@ struct btrfs_ordered_extent {
 	 * These fields directly correspond to the same fields in
 	 * btrfs_file_extent_item.
 	 */
-	u64 disk_bytenr;
 	u64 num_bytes;
+	u64 ram_bytes;
+	u64 disk_bytenr;
 	u64 disk_num_bytes;
+	u64 offset;
 
 	/* number of bytes that still need writing */
 	u64 bytes_left;
@@ -160,15 +162,9 @@ int btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
 				   u64 *file_offset, u64 io_size,
 				   int uptodate);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
-			     int type);
-int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset,
-				 u64 disk_bytenr, u64 num_bytes,
-				 u64 disk_num_bytes, int type);
-int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int type,
-				      int compress_type);
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, int flags,
+			     int compress_type);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
-- 
2.29.2


* [PATCH v6 08/11] btrfs: support different disk extent size for delalloc
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (7 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 07/11] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-18 19:18 ` [PATCH v6 09/11] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For RWF_ENCODED writes, we know the exact size of the extent on
disk, which may be less than or greater than (for bookends) the size in
the file. Add a disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.
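
To see why the two sizes matter, here is a back-of-the-envelope sketch.
It assumes a 4 KiB sector size and a 4-byte checksum per sector, and the
helper names are hypothetical. The real code goes one step further and
converts the byte count to leaves via btrfs_csum_bytes_to_leaves(),
which is not modelled here:

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_SECTORSIZE 4096u /* assumed sector size */
#define EXAMPLE_CSUM_SIZE  4u    /* assumed checksum bytes per sector */

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Checksum bytes needed for an extent occupying disk_num_bytes on disk. */
static uint64_t csum_bytes_for(uint64_t disk_num_bytes)
{
	uint64_t sectors = align_up(disk_num_bytes, EXAMPLE_SECTORSIZE) /
			   EXAMPLE_SECTORSIZE;

	return sectors * EXAMPLE_CSUM_SIZE;
}

static void csum_reservation_example(void)
{
	/* 128 KiB in the file, compressed to 16 KiB on disk. */
	assert(csum_bytes_for(131072) == 128); /* sized by the file length */
	assert(csum_bytes_for(16384) == 16);   /* sized by the disk length */

	/* Bookend: 4 KiB in the file backed by a 16 KiB extent on disk. */
	assert(csum_bytes_for(4096) == 4);
	assert(csum_bytes_for(16384) > csum_bytes_for(4096));
}
```

Existing callers pass the same value for both sizes, which is why
nothing changes for them.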

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h          |  3 ++-
 fs/btrfs/delalloc-space.c | 18 ++++++++++--------
 fs/btrfs/file.c           |  3 ++-
 fs/btrfs/inode.c          |  2 +-
 fs/btrfs/relocation.c     |  4 ++--
 5 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 09536ecd62c7..6ab2ab002bf6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2737,7 +2737,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 				      struct btrfs_block_rsv *rsv);
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
 
-int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
+				    u64 disk_num_bytes);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
 				   u64 start, u64 end);
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index bacee09b7bfd..948b78f03f63 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -265,11 +265,11 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 }
 
 static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
-				    u64 num_bytes, u64 *meta_reserve,
-				    u64 *qgroup_reserve)
+				    u64 num_bytes, u64 disk_num_bytes,
+				    u64 *meta_reserve, u64 *qgroup_reserve)
 {
 	u64 nr_extents = count_max_extents(num_bytes);
-	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
+	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes);
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
 	*meta_reserve = btrfs_calc_insert_metadata_size(fs_info,
@@ -283,7 +283,8 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
 	*qgroup_reserve = nr_extents * fs_info->nodesize;
 }
 
-int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
+int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
+				    u64 disk_num_bytes)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -313,6 +314,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	}
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
+	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
 
 	/*
 	 * We always want to do it this way, every other way is wrong and ends
@@ -324,8 +326,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	 * everything out and try again, which is bad.  This way we just
 	 * over-reserve slightly, and clean up the mess when we are done.
 	 */
-	calc_inode_reservations(fs_info, num_bytes, &meta_reserve,
-				&qgroup_reserve);
+	calc_inode_reservations(fs_info, num_bytes, disk_num_bytes,
+				&meta_reserve, &qgroup_reserve);
 	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true);
 	if (ret)
 		return ret;
@@ -344,7 +346,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	spin_lock(&inode->lock);
 	nr_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
-	inode->csum_bytes += num_bytes;
+	inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
@@ -448,7 +450,7 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode,
 	ret = btrfs_check_data_free_space(inode, reserved, start, len);
 	if (ret < 0)
 		return ret;
-	ret = btrfs_delalloc_reserve_metadata(inode, len);
+	ret = btrfs_delalloc_reserve_metadata(inode, len, len);
 	if (ret < 0)
 		btrfs_free_reserved_data_space(inode, *reserved, start, len);
 	return ret;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7225b63b62a9..224295f8f1e1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1734,7 +1734,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 					 fs_info->sectorsize);
 		WARN_ON(reserve_bytes == 0);
 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				reserve_bytes);
+						      reserve_bytes,
+						      reserve_bytes);
 		if (ret) {
 			if (!only_release_metadata)
 				btrfs_free_reserved_data_space(BTRFS_I(inode),
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8f261be36d1b..8e5ceeb4c686 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4697,7 +4697,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 			goto out;
 		}
 	}
-	ret = btrfs_delalloc_reserve_metadata(inode, blocksize);
+	ret = btrfs_delalloc_reserve_metadata(inode, blocksize, blocksize);
 	if (ret < 0) {
 		if (!only_release_metadata)
 			btrfs_free_reserved_data_space(inode, data_reserved,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index c5774a8e6ff7..038182f8233d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2678,8 +2678,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
 	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				PAGE_SIZE);
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE,
+						      PAGE_SIZE);
 		if (ret)
 			goto out;
 
-- 
2.29.2


* [PATCH v6 09/11] btrfs: optionally extend i_size in cow_file_range_inline()
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (8 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 08/11] btrfs: support different disk extent size for delalloc Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-11-18 19:18 ` [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads Omar Sandoval
  2020-11-18 19:18 ` [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes Omar Sandoval
  11 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we have successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size. Since the start parameter is always passed
as 0, get rid of it and simplify the logic in these two functions. While
we're here, let's document the requirements for creating an inline
extent.
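
The documented requirements can be restated as a standalone predicate.
This is a sketch, not the kernel function: the struct, its field names,
and the limit values used in the example are assumptions, standing in
for fs_info->sectorsize, BTRFS_MAX_INLINE_DATA_SIZE() and
fs_info->max_inline:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three limits the check compares against. */
struct inline_limits {
	uint64_t sectorsize;
	uint64_t max_leaf_inline; /* stand-in for BTRFS_MAX_INLINE_DATA_SIZE() */
	uint64_t max_inline;      /* stand-in for fs_info->max_inline */
};

/*
 * size is the decompressed length of the extent, which always starts at
 * file offset 0; compressed_size is 0 for uncompressed data.
 */
static bool can_make_inline(uint64_t size, uint64_t compressed_size,
			    uint64_t i_size, const struct inline_limits *lim)
{
	uint64_t data_len = compressed_size ? compressed_size : size;

	if (size < i_size ||               /* must end at or beyond i_size */
	    size > lim->sectorsize ||      /* at most one sector, decompressed */
	    data_len > lim->max_leaf_inline ||
	    data_len > lim->max_inline)
		return false;
	return true;
}

static void inline_example(void)
{
	/* Illustrative limits only. */
	const struct inline_limits lim = {
		.sectorsize = 4096, .max_leaf_inline = 16000, .max_inline = 2048,
	};

	assert(can_make_inline(100, 0, 100, &lim));      /* small file */
	assert(!can_make_inline(100, 0, 200, &lim));     /* ends before i_size */
	assert(can_make_inline(4096, 1000, 4096, &lim)); /* compresses well */
	assert(!can_make_inline(3000, 0, 3000, &lim));   /* over max_inline */
}
```

Note that the kernel function returns 1 when the extent cannot be
inlined, whereas this sketch returns a boolean for readability.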

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 100 +++++++++++++++++++++++------------------------
 1 file changed, 48 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e5ceeb4c686..1ff903f5c5a4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -204,9 +204,10 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
 static int insert_inline_extent(struct btrfs_trans_handle *trans,
 				struct btrfs_path *path, bool extent_inserted,
 				struct btrfs_root *root, struct inode *inode,
-				u64 start, size_t size, size_t compressed_size,
+				size_t size, size_t compressed_size,
 				int compress_type,
-				struct page **compressed_pages)
+				struct page **compressed_pages,
+				bool update_i_size)
 {
 	struct extent_buffer *leaf;
 	struct page *page = NULL;
@@ -215,7 +216,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item *ei;
 	int ret;
 	size_t cur_size = size;
-	unsigned long offset;
+	u64 i_size;
 
 	ASSERT((compressed_size > 0 && compressed_pages) ||
 	       (compressed_size == 0 && !compressed_pages));
@@ -228,7 +229,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		size_t datasize;
 
 		key.objectid = btrfs_ino(BTRFS_I(inode));
-		key.offset = start;
+		key.offset = 0;
 		key.type = BTRFS_EXTENT_DATA_KEY;
 
 		datasize = btrfs_file_extent_calc_inline_size(cur_size);
@@ -266,12 +267,10 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		btrfs_set_file_extent_compression(leaf, ei,
 						  compress_type);
 	} else {
-		page = find_get_page(inode->i_mapping,
-				     start >> PAGE_SHIFT);
+		page = find_get_page(inode->i_mapping, 0);
 		btrfs_set_file_extent_compression(leaf, ei, 0);
 		kaddr = kmap_atomic(page);
-		offset = offset_in_page(start);
-		write_extent_buffer(leaf, kaddr + offset, ptr, size);
+		write_extent_buffer(leaf, kaddr, ptr, size);
 		kunmap_atomic(kaddr);
 		put_page(page);
 	}
@@ -282,8 +281,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * We align size to sectorsize for inline extents just for simplicity
 	 * sake.
 	 */
-	size = ALIGN(size, root->fs_info->sectorsize);
-	ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), start, size);
+	ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), 0,
+					ALIGN(size, root->fs_info->sectorsize));
 	if (ret)
 		goto fail;
 
@@ -296,7 +295,13 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * before we unlock the pages.  Otherwise we
 	 * could end up racing with unlink.
 	 */
-	BTRFS_I(inode)->disk_i_size = inode->i_size;
+	i_size = i_size_read(inode);
+	if (update_i_size && size > i_size) {
+		i_size_write(inode, size);
+		i_size = size;
+	}
+	BTRFS_I(inode)->disk_i_size = i_size;
+
 fail:
 	return ret;
 }
@@ -307,35 +312,31 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
  * does the checks required to make sure the data is small enough
  * to fit as an inline extent.
  */
-static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
-					  u64 end, size_t compressed_size,
+static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size,
+					  size_t compressed_size,
 					  int compress_type,
-					  struct page **compressed_pages)
+					  struct page **compressed_pages,
+					  bool update_i_size)
 {
 	struct btrfs_drop_extents_args drop_args = { 0 };
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
-	u64 isize = i_size_read(&inode->vfs_inode);
-	u64 actual_end = min(end + 1, isize);
-	u64 inline_len = actual_end - start;
-	u64 aligned_end = ALIGN(end, fs_info->sectorsize);
-	u64 data_len = inline_len;
+	u64 data_len = compressed_size ? compressed_size : size;
 	int ret;
 	struct btrfs_path *path;
 
-	if (compressed_size)
-		data_len = compressed_size;
-
-	if (start > 0 ||
-	    actual_end > fs_info->sectorsize ||
+	/*
+	 * We can create an inline extent if it ends at or beyond the current
+	 * i_size, is no larger than a sector (decompressed), and the (possibly
+	 * compressed) data fits in a leaf and the configured maximum inline
+	 * size.
+	 */
+	if (size < i_size_read(&inode->vfs_inode) ||
+	    size > fs_info->sectorsize ||
 	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) ||
-	    (!compressed_size &&
-	    (actual_end & (fs_info->sectorsize - 1)) == 0) ||
-	    end + 1 < isize ||
-	    data_len > fs_info->max_inline) {
+	    data_len > fs_info->max_inline)
 		return 1;
-	}
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -349,30 +350,21 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 	trans->block_rsv = &inode->block_rsv;
 
 	drop_args.path = path;
-	drop_args.start = start;
-	drop_args.end = aligned_end;
+	drop_args.start = 0;
+	drop_args.end = fs_info->sectorsize;
 	drop_args.drop_cache = true;
 	drop_args.replace_extent = true;
-
-	if (compressed_size && compressed_pages)
-		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
-		   compressed_size);
-	else
-		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
-		    inline_len);
-
+	drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(data_len);
 	ret = btrfs_drop_extents(trans, root, inode, &drop_args);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
 	}
 
-	if (isize > actual_end)
-		inline_len = min_t(u64, isize, actual_end);
-	ret = insert_inline_extent(trans, path, drop_args.extent_inserted,
-				   root, &inode->vfs_inode, start,
-				   inline_len, compressed_size,
-				   compress_type, compressed_pages);
+	ret = insert_inline_extent(trans, path, drop_args.extent_inserted, root,
+				   &inode->vfs_inode, size, compressed_size,
+				   compress_type, compressed_pages,
+				   update_i_size);
 	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
@@ -381,7 +373,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 		goto out;
 	}
 
-	btrfs_update_inode_bytes(inode, inline_len, drop_args.bytes_found);
+	btrfs_update_inode_bytes(inode, size, drop_args.bytes_found);
 	ret = btrfs_update_inode(trans, root, inode);
 	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, ret);
@@ -662,14 +654,15 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 			/* we didn't compress the entire range, try
 			 * to make an uncompressed inline extent.
 			 */
-			ret = cow_file_range_inline(BTRFS_I(inode), start, end,
+			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    0, BTRFS_COMPRESS_NONE,
-						    NULL);
+						    NULL, false);
 		} else {
 			/* try making a compressed inline extent */
-			ret = cow_file_range_inline(BTRFS_I(inode), start, end,
+			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    total_compressed,
-						    compress_type, pages);
+						    compress_type, pages,
+						    false);
 		}
 		if (ret <= 0) {
 			unsigned long clear_flags = EXTENT_DELALLOC |
@@ -1057,9 +1050,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
 	if (start == 0) {
+		u64 actual_end = min_t(u64, i_size_read(&inode->vfs_inode),
+				       end + 1);
+
 		/* lets try to make an inline extent */
-		ret = cow_file_range_inline(inode, start, end, 0,
-					    BTRFS_COMPRESS_NONE, NULL);
+		ret = cow_file_range_inline(inode, actual_end, 0,
+					    BTRFS_COMPRESS_NONE, NULL, false);
 		if (ret == 0) {
 			/*
 			 * We use DO_ACCOUNTING here because we need the
-- 
2.29.2


* [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (9 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 09/11] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-12-03 14:32   ` Josef Bacik
  2020-11-18 19:18 ` [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes Omar Sandoval
  11 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.
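
The four cases can be sketched as a small classifier, followed by the
arithmetic that describes which subset of the decompressed data is in
the file. All names here are hypothetical. The arithmetic mirrors what
the compressed branch of btrfs_encoded_read_inline() does in this patch;
whether the regular compressed path computes it the same way is an
assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum extent_kind { EXTENT_INLINE, EXTENT_HOLE, EXTENT_PREALLOC, EXTENT_REGULAR };

/* Values match the numbering of the cases in the list above. */
enum read_case {
	READ_COPY_INLINE = 1,
	READ_ZERO_FILL = 2,
	READ_SECTORS = 3,
	READ_WHOLE_COMPRESSED = 4,
};

static enum read_case classify_encoded_read(enum extent_kind kind,
					    bool compressed)
{
	switch (kind) {
	case EXTENT_INLINE:
		return READ_COPY_INLINE;
	case EXTENT_HOLE:
	case EXTENT_PREALLOC:
		return READ_ZERO_FILL;
	default:
		return compressed ? READ_WHOLE_COMPRESSED : READ_SECTORS;
	}
}

/* Hypothetical subset of the metadata returned alongside the encoded data. */
struct encoded_desc {
	uint64_t len;              /* bytes of file data this read covers */
	uint64_t unencoded_len;    /* full decompressed length */
	uint64_t unencoded_offset; /* where the file data starts within it */
};

static struct encoded_desc describe_compressed(uint64_t extent_start,
						uint64_t ram_bytes,
						uint64_t i_size, uint64_t pos)
{
	uint64_t end = extent_start + ram_bytes;
	struct encoded_desc d;

	if (end > i_size)
		end = i_size;
	/* The read position must fall inside the visible part of the extent. */
	assert(pos >= extent_start && pos < end);

	d.len = end - pos;
	d.unencoded_len = ram_bytes;
	d.unencoded_offset = pos - extent_start;
	return d;
}
```

For example, with an extent at file offset 0 that decompresses to 4096
bytes, an i_size of 3000 and a read position of 1000, the description
is len = 2000, unencoded_len = 4096 and unencoded_offset = 1000.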

This initial implementation simplifies a few things that can be improved
in the future:

- We hold the inode lock during the operation.
- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- We don't do read repair, because it turns out that read repair is
  currently broken for compressed data.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h |   2 +
 fs/btrfs/file.c  |   5 +
 fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 503 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6ab2ab002bf6..ce78424f1d98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3133,6 +3133,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
 extern const struct iomap_dio_ops btrfs_dio_ops;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 224295f8f1e1..193477565200 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3629,6 +3629,11 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
 
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EOPNOTSUPP;
+		return btrfs_encoded_read(iocb, to);
+	}
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0 || !iov_iter_count(to) ||
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1ff903f5c5a4..b0e800897b3b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9936,6 +9936,502 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
 	}
 }
 
+static int encoded_iov_compression_from_btrfs(unsigned int compress_type)
+{
+	switch (compress_type) {
+	case BTRFS_COMPRESS_NONE:
+		return ENCODED_IOV_COMPRESSION_NONE;
+	case BTRFS_COMPRESS_ZLIB:
+		return ENCODED_IOV_COMPRESSION_BTRFS_ZLIB;
+	case BTRFS_COMPRESS_LZO:
+		/*
+		 * The LZO format depends on the page size. 64k is the maximum
+		 * sectorsize (and thus page size) that we support.
+		 */
+		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
+			return -EINVAL;
+		return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (PAGE_SHIFT - 12);
+	case BTRFS_COMPRESS_ZSTD:
+		return ENCODED_IOV_COMPRESSION_BTRFS_ZSTD;
+	default:
+		return -EUCLEAN;
+	}
+}
+
+static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
+					 struct iov_iter *iter, u64 start,
+					 u64 lockend,
+					 struct extent_state **cached_state,
+					 u64 extent_start, size_t count,
+					 struct encoded_iov *encoded,
+					 bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *item;
+	u64 ram_bytes;
+	unsigned long ptr;
+	void *tmp;
+	ssize_t ret;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
+				       btrfs_ino(BTRFS_I(inode)), extent_start,
+				       0);
+	if (ret) {
+		if (ret > 0) {
+			/* The extent item disappeared? */
+			ret = -EIO;
+		}
+		goto out;
+	}
+	leaf = path->nodes[0];
+	item = btrfs_item_ptr(leaf, path->slots[0],
+			      struct btrfs_file_extent_item);
+
+	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
+	ptr = btrfs_file_extent_inline_start(item);
+
+	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
+			iocb->ki_pos);
+	ret = encoded_iov_compression_from_btrfs(
+				 btrfs_file_extent_compression(leaf, item));
+	if (ret < 0)
+		goto out;
+	encoded->compression = ret;
+	if (encoded->compression) {
+		size_t inline_size;
+
+		inline_size = btrfs_file_extent_inline_item_len(leaf,
+						btrfs_item_nr(path->slots[0]));
+		if (inline_size > count) {
+			ret = -ENOBUFS;
+			goto out;
+		}
+		count = inline_size;
+		encoded->unencoded_len = ram_bytes;
+		encoded->unencoded_offset = iocb->ki_pos - extent_start;
+	} else {
+		encoded->len = encoded->unencoded_len = count =
+			min_t(u64, count, encoded->len);
+		ptr += iocb->ki_pos - extent_start;
+	}
+
+	tmp = kmalloc(count, GFP_NOFS);
+	if (!tmp) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	read_extent_buffer(leaf, tmp, ptr, count);
+	btrfs_release_path(path);
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock_shared(inode);
+	*unlocked = true;
+
+	ret = copy_encoded_iov_to_iter(encoded, iter);
+	if (ret)
+		goto out_free;
+	ret = copy_to_iter(tmp, count, iter);
+	if (ret != count)
+		ret = -EFAULT;
+out_free:
+	kfree(tmp);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+struct btrfs_encoded_read_private {
+	struct inode *inode;
+	wait_queue_head_t wait;
+	atomic_t pending;
+	blk_status_t status;
+	bool skip_csum;
+};
+
+static blk_status_t submit_encoded_read_bio(struct inode *inode,
+					    struct bio *bio, int mirror_num,
+					    unsigned long bio_flags)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	blk_status_t ret;
+
+	if (!priv->skip_csum) {
+		ret = btrfs_lookup_bio_sums(inode, bio, io_bio->logical, NULL);
+		if (ret)
+			return ret;
+	}
+
+	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
+	if (ret) {
+		btrfs_io_bio_free_csum(io_bio);
+		return ret;
+	}
+
+	atomic_inc(&priv->pending);
+	ret = btrfs_map_bio(fs_info, bio, mirror_num);
+	if (ret) {
+		atomic_dec(&priv->pending);
+		btrfs_io_bio_free_csum(io_bio);
+	}
+	return ret;
+}
+
+static blk_status_t btrfs_encoded_read_check_bio(struct btrfs_io_bio *io_bio)
+{
+	const bool uptodate = io_bio->bio.bi_status == BLK_STS_OK;
+	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
+	struct inode *inode = priv->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u32 sectorsize = fs_info->sectorsize;
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+	u64 start = io_bio->logical;
+	int icsum = 0;
+
+	if (priv->skip_csum || !uptodate)
+		return io_bio->bio.bi_status;
+
+	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
+		unsigned int i, nr_sectors, pgoff;
+
+		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
+		pgoff = bvec->bv_offset;
+		for (i = 0; i < nr_sectors; i++) {
+			ASSERT(pgoff < PAGE_SIZE);
+			if (check_data_csum(inode, io_bio, icsum, bvec->bv_page,
+					    pgoff, start))
+				return BLK_STS_IOERR;
+			start += sectorsize;
+			icsum++;
+			pgoff += sectorsize;
+		}
+	}
+	return BLK_STS_OK;
+}
+
+static void btrfs_encoded_read_endio(struct bio *bio)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	blk_status_t status;
+
+	status = btrfs_encoded_read_check_bio(io_bio);
+	if (status) {
+		/*
+		 * The memory barrier implied by the atomic_dec_return() here
+		 * pairs with the memory barrier implied by the
+		 * atomic_dec_return() or io_wait_event() in
+		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
+		 * write is observed before the load of status in
+		 * btrfs_encoded_read_regular_fill_pages().
+		 */
+		WRITE_ONCE(priv->status, status);
+	}
+	if (!atomic_dec_return(&priv->pending))
+		wake_up(&priv->wait);
+	btrfs_io_bio_free_csum(io_bio);
+	bio_put(bio);
+}
+
+static int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 offset,
+						 u64 disk_io_size, struct page **pages)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_encoded_read_private priv = {
+		.inode = inode,
+		.pending = ATOMIC_INIT(1),
+		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
+	};
+	unsigned long i = 0;
+	u64 cur = 0;
+	int ret;
+
+	init_waitqueue_head(&priv.wait);
+	/*
+	 * Submit bios for the extent, splitting due to bio or stripe limits as
+	 * necessary.
+	 */
+	while (cur < disk_io_size) {
+		struct btrfs_io_geometry geom;
+		struct bio *bio = NULL;
+		u64 remaining;
+
+		ret = btrfs_get_io_geometry(fs_info, BTRFS_MAP_READ,
+					    offset + cur, disk_io_size - cur,
+					    &geom);
+		if (ret) {
+			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
+			break;
+		}
+		remaining = min(geom.len, disk_io_size - cur);
+		while (bio || remaining) {
+			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
+
+			if (!bio) {
+				bio = btrfs_bio_alloc(offset + cur);
+				bio->bi_end_io = btrfs_encoded_read_endio;
+				bio->bi_private = &priv;
+				bio->bi_opf = REQ_OP_READ;
+			}
+
+			if (!bytes ||
+			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
+				blk_status_t status;
+
+				status = submit_encoded_read_bio(inode, bio, 0,
+								 0);
+				if (status) {
+					WRITE_ONCE(priv.status, status);
+					bio_put(bio);
+					goto out;
+				}
+				bio = NULL;
+				continue;
+			}
+
+			i++;
+			cur += bytes;
+			remaining -= bytes;
+		}
+	}
+
+out:
+	if (atomic_dec_return(&priv.pending))
+		io_wait_event(priv.wait, !atomic_read(&priv.pending));
+	/* See btrfs_encoded_read_endio() for ordering. */
+	return blk_status_to_errno(READ_ONCE(priv.status));
+}
+
+static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
+					  struct iov_iter *iter,
+					  u64 start, u64 lockend,
+					  struct extent_state **cached_state,
+					  u64 offset, u64 disk_io_size,
+					  size_t count,
+					  const struct encoded_iov *encoded,
+					  bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct page **pages;
+	unsigned long nr_pages, i;
+	u64 cur;
+	size_t page_offset;
+	ssize_t ret;
+
+	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_io_size,
+						    pages);
+	if (ret)
+		goto out;
+
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock_shared(inode);
+	*unlocked = true;
+
+	ret = copy_encoded_iov_to_iter(encoded, iter);
+	if (ret)
+		goto out;
+	if (encoded->compression) {
+		i = 0;
+		page_offset = 0;
+	} else {
+		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
+		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
+	}
+	cur = 0;
+	while (cur < count) {
+		size_t bytes = min_t(size_t, count - cur,
+				     PAGE_SIZE - page_offset);
+
+		if (copy_page_to_iter(pages[i], page_offset, bytes,
+				      iter) != bytes) {
+			ret = -EFAULT;
+			goto out;
+		}
+		i++;
+		cur += bytes;
+		page_offset = 0;
+	}
+	ret = count;
+out:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kfree(pages);
+	return ret;
+}
+
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	ssize_t ret;
+	size_t count;
+	u64 start, lockend, offset, disk_io_size;
+	struct extent_state *cached_state = NULL;
+	struct extent_map *em;
+	struct encoded_iov encoded = {};
+	bool unlocked = false;
+
+	ret = generic_encoded_read_checks(iocb, iter);
+	if (ret < 0)
+		return ret;
+	if (ret == 0)
+		return copy_encoded_iov_to_iter(&encoded, iter);
+	count = ret;
+
+	file_accessed(iocb->ki_filp);
+
+	inode_lock_shared(inode);
+
+	if (iocb->ki_pos >= inode->i_size) {
+		inode_unlock_shared(inode);
+		return copy_encoded_iov_to_iter(&encoded, iter);
+	}
+	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
+	/*
+	 * We don't know how long the extent containing iocb->ki_pos is, but if
+	 * it's compressed we know that it won't be longer than this.
+	 */
+	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start,
+					       lockend - start + 1);
+		if (ret)
+			goto out_unlock_inode;
+		lock_extent_bits(io_tree, start, lockend, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     lockend - start + 1);
+		if (!ordered)
+			break;
+		btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		cond_resched();
+	}
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
+			      lockend - start + 1);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_unlock_extent;
+	}
+
+	if (em->block_start == EXTENT_MAP_INLINE) {
+		u64 extent_start = em->start;
+
+		/*
+		 * For inline extents we get everything we need out of the
+		 * extent item.
+		 */
+		free_extent_map(em);
+		em = NULL;
+		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
+						&cached_state, extent_start,
+						count, &encoded, &unlocked);
+		goto out;
+	}
+
+	/*
+	 * We only want to return up to EOF even if the extent extends beyond
+	 * that.
+	 */
+	encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) -
+		       iocb->ki_pos);
+	if (em->block_start == EXTENT_MAP_HOLE ||
+	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
+		offset = EXTENT_MAP_HOLE;
+		encoded.len = encoded.unencoded_len = count =
+			min_t(u64, count, encoded.len);
+	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+		offset = em->block_start;
+		/*
+		 * Bail if the buffer isn't large enough to return the whole
+		 * compressed extent.
+		 */
+		if (em->block_len > count) {
+			ret = -ENOBUFS;
+			goto out_em;
+		}
+		disk_io_size = count = em->block_len;
+		encoded.unencoded_len = em->ram_bytes;
+		encoded.unencoded_offset = iocb->ki_pos - em->orig_start;
+		ret = encoded_iov_compression_from_btrfs(em->compress_type);
+		if (ret < 0)
+			goto out_em;
+		encoded.compression = ret;
+	} else {
+		offset = em->block_start + (start - em->start);
+		if (encoded.len > count)
+			encoded.len = count;
+		/*
+		 * Don't read beyond what we locked. This also limits the page
+		 * allocations that we'll do.
+		 */
+		disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start;
+		encoded.len = encoded.unencoded_len = count =
+			start + disk_io_size - iocb->ki_pos;
+		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
+	}
+	free_extent_map(em);
+	em = NULL;
+
+	if (offset == EXTENT_MAP_HOLE) {
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		inode_unlock_shared(inode);
+		unlocked = true;
+		ret = copy_encoded_iov_to_iter(&encoded, iter);
+		if (ret)
+			goto out;
+		ret = iov_iter_zero(count, iter);
+		if (ret != count)
+			ret = -EFAULT;
+	} else {
+		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
+						 &cached_state, offset,
+						 disk_io_size, count, &encoded,
+						 &unlocked);
+	}
+
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded.len;
+out_em:
+	free_extent_map(em);
+out_unlock_extent:
+	if (!unlocked)
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+out_unlock_inode:
+	if (!unlocked)
+		inode_unlock_shared(inode);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
-- 
2.29.2
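
Note on the LZO handling in the patch above: the read side derives the reported LZO variant from the page size, and the write side in the following patch rejects a variant that does not match the local page size. The following is a minimal userspace sketch of that arithmetic, not the kernel code. The enum values are hypothetical placeholders (the real constants live in the UAPI header added earlier in the series); the only property relied on is that the LZO variants are consecutive, which the patch itself assumes. The function names `lzo_variant_from_page_shift` and `lzo_variant_matches` are made up for illustration.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical placeholder values. Only the consecutive ordering of the
 * LZO variants matters for this sketch.
 */
enum {
	ENCODED_IOV_COMPRESSION_NONE,
	ENCODED_IOV_COMPRESSION_BTRFS_ZLIB,
	ENCODED_IOV_COMPRESSION_BTRFS_ZSTD,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
};

/* Read side: pick the LZO variant for a page shift of 12 (4K) to 16 (64K). */
int lzo_variant_from_page_shift(int page_shift)
{
	if (page_shift < 12 || page_shift > 16)
		return -EINVAL;
	return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (page_shift - 12);
}

/* Write side: nonzero only if the LZO variant matches the local page shift. */
int lzo_variant_matches(int compression, int page_shift)
{
	/* Callers are expected to pass one of the LZO variants. */
	assert(compression >= ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K &&
	       compression <= ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K);
	return compression - ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + 12 ==
	       page_shift;
}
```

Under these assumptions, data read on a 4K-page machine is tagged as the 4K variant and would be refused by the write-side check on a 64K-page machine, since the LZO format depends on the page size.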



* [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes
  2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (10 preceding siblings ...)
  2020-11-18 19:18 ` [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads Omar Sandoval
@ 2020-11-18 19:18 ` Omar Sandoval
  2020-12-02 22:03   ` Josef Bacik
  2020-12-03 14:37   ` Josef Bacik
  11 siblings, 2 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-18 19:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

From: Omar Sandoval <osandov@fb.com>

The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.

Now that read and write are implemented, this also sets the
FMODE_ENCODED_IO flag in btrfs_file_open().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/compression.c  |   7 +-
 fs/btrfs/compression.h  |   6 +-
 fs/btrfs/ctree.h        |   2 +
 fs/btrfs/file.c         |  37 +++++-
 fs/btrfs/inode.c        | 259 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ordered-data.c |  12 +-
 fs/btrfs/ordered-data.h |   2 +
 7 files changed, 313 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index eaa6fe21c08e..015c9e5d75b9 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -336,7 +336,8 @@ static void end_compressed_bio_write(struct bio *bio)
 			bio->bi_status == BLK_STS_OK);
 	cb->compressed_pages[0]->mapping = NULL;
 
-	end_compressed_writeback(inode, cb);
+	if (cb->writeback)
+		end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
 
 	/*
@@ -372,7 +373,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 				 struct page **compressed_pages,
 				 unsigned long nr_pages,
 				 unsigned int write_flags,
-				 struct cgroup_subsys_state *blkcg_css)
+				 struct cgroup_subsys_state *blkcg_css,
+				 bool writeback)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct bio *bio = NULL;
@@ -396,6 +398,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	cb->mirror_num = 0;
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
+	cb->writeback = writeback;
 	cb->orig_bio = NULL;
 	cb->nr_pages = nr_pages;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 8001b700ea3a..f95cdc16f503 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -49,6 +49,9 @@ struct compressed_bio {
 	/* the compression algorithm for this bio */
 	int compress_type;
 
+	/* Whether this is a write for writeback. */
+	bool writeback;
+
 	/* number of compressed pages in the array */
 	unsigned long nr_pages;
 
@@ -96,7 +99,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 				  struct page **compressed_pages,
 				  unsigned long nr_pages,
 				  unsigned int write_flags,
-				  struct cgroup_subsys_state *blkcg_css);
+				  struct cgroup_subsys_state *blkcg_css,
+				  bool writeback);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ce78424f1d98..9b585ac9c7a9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3134,6 +3134,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			       struct encoded_iov *encoded);
 
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 193477565200..f815ffb93d43 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1994,6 +1994,32 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	return written ? written : err;
 }
 
+static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	struct encoded_iov encoded;
+	ssize_t ret;
+
+	ret = copy_encoded_iov_from_iter(&encoded, from);
+	if (ret)
+		return ret;
+
+	btrfs_inode_lock(inode, 0);
+	ret = generic_encoded_write_checks(iocb, &encoded);
+	if (ret || encoded.len == 0)
+		goto out;
+
+	ret = btrfs_write_check(iocb, from, encoded.len);
+	if (ret < 0)
+		goto out;
+
+	ret = btrfs_do_encoded_write(iocb, from, &encoded);
+out:
+	btrfs_inode_unlock(inode, 0);
+	return ret;
+}
+
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 				    struct iov_iter *from)
 {
@@ -2012,14 +2038,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state))
 		return -EROFS;
 
-	if (!(iocb->ki_flags & IOCB_DIRECT) &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+	    (!(iocb->ki_flags & IOCB_DIRECT) ||
+	     (iocb->ki_flags & IOCB_ENCODED)))
 		return -EOPNOTSUPP;
 
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT)
+	if (iocb->ki_flags & IOCB_ENCODED)
+		num_written = btrfs_encoded_write(iocb, from);
+	else if (iocb->ki_flags & IOCB_DIRECT)
 		num_written = btrfs_direct_write(iocb, from);
 	else
 		num_written = btrfs_buffered_write(iocb, from);
@@ -3586,7 +3615,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_ENCODED_IO;
 	return generic_file_open(inode, filp);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b0e800897b3b..2bf7b487939f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -935,7 +935,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
 				    async_chunk->write_flags,
-				    async_chunk->blkcg_css)) {
+				    async_chunk->blkcg_css, true)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -2703,6 +2703,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 	 * except if the ordered extent was truncated.
 	 */
 	update_inode_bytes = test_bit(BTRFS_ORDERED_DIRECT, &oe->flags) ||
+			     test_bit(BTRFS_ORDERED_ENCODED, &oe->flags) ||
 			     test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags);
 
 	return insert_reserved_file_extent(trans, BTRFS_I(oe->inode),
@@ -2737,7 +2738,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
-	    !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags))
+	    !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags) &&
+	    !test_bit(BTRFS_ORDERED_ENCODED, &ordered_extent->flags))
 		clear_bits |= EXTENT_DELALLOC_NEW;
 
 	freespace_inode = btrfs_is_free_space_inode(inode);
@@ -10432,6 +10434,259 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
+ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			       struct encoded_iov *encoded)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_state *cached_state = NULL;
+	int compression;
+	size_t orig_count;
+	u64 start, end;
+	u64 num_bytes, ram_bytes, disk_num_bytes;
+	unsigned long nr_pages, i;
+	struct page **pages;
+	struct btrfs_key ins;
+	bool extent_reserved = false;
+	struct extent_map *em;
+	ssize_t ret;
+
+	switch (encoded->compression) {
+	case ENCODED_IOV_COMPRESSION_BTRFS_ZLIB:
+		compression = BTRFS_COMPRESS_ZLIB;
+		break;
+	case ENCODED_IOV_COMPRESSION_BTRFS_ZSTD:
+		compression = BTRFS_COMPRESS_ZSTD;
+		break;
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K:
+		/* The page size must match for LZO. */
+		if (encoded->compression -
+		    ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + 12 != PAGE_SHIFT)
+			return -EINVAL;
+		compression = BTRFS_COMPRESS_LZO;
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (encoded->encryption != ENCODED_IOV_ENCRYPTION_NONE)
+		return -EINVAL;
+
+	orig_count = iov_iter_count(from);
+
+	/* The extent size must be sane. */
+	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
+	    orig_count > BTRFS_MAX_COMPRESSED || orig_count == 0)
+		return -EINVAL;
+
+	/*
+	 * The compressed data must be smaller than the decompressed data.
+	 *
+	 * It's of course possible for data to compress to larger or the same
+	 * size, but the buffered I/O path falls back to no compression for such
+	 * data, and we don't want to break any assumptions by creating these
+	 * extents.
+	 *
+	 * Note that this is less strict than the current check we have that the
+	 * compressed data must be at least one sector smaller than the
+	 * decompressed data. We only want to enforce the weaker requirement
+	 * from old kernels that it is at least one byte smaller.
+	 */
+	if (orig_count >= encoded->unencoded_len)
+		return -EINVAL;
+
+	/* The extent must start on a sector boundary. */
+	start = iocb->ki_pos;
+	if (!IS_ALIGNED(start, fs_info->sectorsize))
+		return -EINVAL;
+
+	/*
+	 * The extent must end on a sector boundary. However, we allow a write
+	 * which ends at or extends i_size to have an unaligned length; we round
+	 * up the extent size and set i_size to the unaligned end.
+	 */
+	if (start + encoded->len < inode->i_size &&
+	    !IS_ALIGNED(start + encoded->len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* Finally, the offset in the unencoded data must be sector-aligned. */
+	if (!IS_ALIGNED(encoded->unencoded_offset, fs_info->sectorsize))
+		return -EINVAL;
+
+	num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
+	ram_bytes = ALIGN(encoded->unencoded_len, fs_info->sectorsize);
+	end = start + num_bytes - 1;
+
+	/*
+	 * If the extent cannot be inline, the compressed data on disk must be
+	 * sector-aligned. For convenience, we extend it with zeroes if it
+	 * isn't.
+	 */
+	disk_num_bytes = ALIGN(orig_count, fs_info->sectorsize);
+	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
+	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		if (copy_from_iter(kaddr, bytes, from) != bytes) {
+			kunmap(pages[i]);
+			ret = -EFAULT;
+			goto out_pages;
+		}
+		if (bytes < PAGE_SIZE)
+			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start, num_bytes);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     num_bytes);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, end, &cached_state);
+		cond_resched();
+	}
+
+	/*
+	 * We don't use the higher-level delalloc space functions because our
+	 * num_bytes and disk_num_bytes are different.
+	 */
+	ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), disk_num_bytes);
+	if (ret)
+		goto out_unlock;
+	ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, start,
+					num_bytes);
+	if (ret)
+		goto out_free_data_space;
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), num_bytes,
+					      disk_num_bytes);
+	if (ret)
+		goto out_qgroup_free_data;
+
+	/* Try an inline extent first. */
+	if (start == 0 && encoded->unencoded_len == encoded->len &&
+	    encoded->unencoded_offset == 0) {
+		ret = cow_file_range_inline(BTRFS_I(inode), encoded->len,
+					    orig_count, compression, pages,
+					    true);
+		if (ret <= 0) {
+			if (ret == 0)
+				ret = orig_count;
+			goto out_delalloc_release;
+		}
+	}
+
+	ret = btrfs_reserve_extent(root, disk_num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+	extent_reserved = true;
+
+	em = create_io_em(BTRFS_I(inode), start, num_bytes,
+			  start - encoded->unencoded_offset, ins.objectid,
+			  ins.offset, ins.offset, ram_bytes, compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserved;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent(BTRFS_I(inode), start, num_bytes,
+				       ram_bytes, ins.objectid, ins.offset,
+				       encoded->unencoded_offset,
+				       (1 << BTRFS_ORDERED_ENCODED) |
+				       (1 << BTRFS_ORDERED_COMPRESSED),
+				       compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserved;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + encoded->len > inode->i_size)
+		i_size_write(inode, start + encoded->len);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+
+	if (btrfs_submit_compressed_write(BTRFS_I(inode), start, num_bytes,
+					  ins.objectid, ins.offset, pages,
+					  nr_pages, 0, NULL, false)) {
+		struct page *page = pages[0];
+
+		page->mapping = inode->i_mapping;
+		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
+		page->mapping = NULL;
+		ret = -EIO;
+		goto out_pages;
+	}
+	ret = orig_count;
+	goto out;
+
+out_free_reserved:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), disk_num_bytes,
+					ret < 0);
+out_qgroup_free_data:
+	if (ret < 0) {
+		btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, start,
+				       num_bytes);
+	}
+out_free_data_space:
+	/*
+	 * If btrfs_reserve_extent() succeeded, then we already decremented
+	 * bytes_may_use.
+	 */
+	if (!extent_reserved)
+		btrfs_free_reserved_data_space_noquota(fs_info, disk_num_bytes);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kvfree(pages);
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded->len;
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 76dc47315016..34f5fb548fb5 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -460,9 +460,15 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	spin_lock(&btrfs_inode->lock);
 	btrfs_mod_outstanding_extents(btrfs_inode, -1);
 	spin_unlock(&btrfs_inode->lock);
-	if (root != fs_info->tree_root)
-		btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes,
-						false);
+	if (root != fs_info->tree_root) {
+		u64 release;
+
+		if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
+			release = entry->disk_num_bytes;
+		else
+			release = entry->num_bytes;
+		btrfs_delalloc_release_metadata(btrfs_inode, release, false);
+	}
 
 	if (test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
 		percpu_counter_add_batch(&fs_info->dio_bytes, -entry->num_bytes,
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index f2798c1fc7b0..19b7ea0354c0 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -62,6 +62,8 @@ enum {
 	BTRFS_ORDERED_LOGGED_CSUM,
 	/* We wait for this extent to complete in the current transaction */
 	BTRFS_ORDERED_PENDING,
+	/* RWF_ENCODED I/O */
+	BTRFS_ORDERED_ENCODED,
 };
 
 struct btrfs_ordered_extent {
-- 
2.29.2
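
Note on the argument checks in btrfs_do_encoded_write() above: before any locking or reservation, the function rejects extents whose sizes or alignment are not acceptable. The following is a rough userspace model of those checks, written as a sketch rather than a copy of the kernel logic. The function name `encoded_write_check`, its flat parameter list, and the two 128K limits are assumptions made for illustration; in the patch the limits are BTRFS_MAX_UNCOMPRESSED and BTRFS_MAX_COMPRESSED, and the inputs come from the iocb, the iov_iter and struct encoded_iov.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed limits for illustration only. */
#define MAX_UNCOMPRESSED (128 * 1024)
#define MAX_COMPRESSED   (128 * 1024)

/*
 * pos              file offset of the write (iocb->ki_pos)
 * len              bytes of the file covered by the extent (encoded->len)
 * unencoded_len    size of the data once decompressed
 * unencoded_offset offset of pos within the decompressed data
 * compressed_len   bytes of compressed data supplied (orig_count)
 * i_size           current file size
 * sectorsize       filesystem sector size, a power of two
 */
int encoded_write_check(uint64_t pos, uint64_t len, uint64_t unencoded_len,
			uint64_t unencoded_offset, uint64_t compressed_len,
			uint64_t i_size, uint32_t sectorsize)
{
	uint64_t mask;

	assert(sectorsize && !(sectorsize & (sectorsize - 1)));
	mask = sectorsize - 1;

	/* The extent size must be sane. */
	if (unencoded_len > MAX_UNCOMPRESSED ||
	    compressed_len > MAX_COMPRESSED || compressed_len == 0)
		return -EINVAL;
	/* Compressed data must be at least one byte smaller. */
	if (compressed_len >= unencoded_len)
		return -EINVAL;
	/* The extent must start on a sector boundary. */
	if (pos & mask)
		return -EINVAL;
	/* An unaligned end is only allowed at or past i_size. */
	if (pos + len < i_size && ((pos + len) & mask))
		return -EINVAL;
	/* The offset into the unencoded data must be sector-aligned. */
	if (unencoded_offset & mask)
		return -EINVAL;
	return 0;
}
```

For example, with a 4096-byte sector size, a 5000-byte extent written at offset 0 into an empty file passes these checks, while the same write into a file that is already 8192 bytes long fails because its unaligned end falls inside the file.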



* Re: [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag
  2020-11-18 19:18 ` [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
@ 2020-11-19  7:02   ` Amir Goldstein
  2020-11-20 23:41     ` Jann Horn
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2020-11-19  7:02 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, Linux Btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Aleksa Sarai, Linux API, kernel-team

On Wed, Nov 18, 2020 at 9:18 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> The upcoming RWF_ENCODED operation introduces some security concerns:
>
> 1. Compressed writes will pass arbitrary data to decompression
>    algorithms in the kernel.
> 2. Compressed reads can leak truncated/hole punched data.
>
> Therefore, we need to require privilege for RWF_ENCODED. It's not
> possible to do the permissions checks at the time of the read or write
> because, e.g., io_uring submits IO from a worker thread. So, add an open
> flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> fcntl(). The flag is not cleared in any way on fork or exec. It must be
> combined with O_CLOEXEC when opening to avoid accidental leaks (if
> needed, it may be set without O_CLOEXEC by using fcntl()).
>
> Note that the usual issue that unknown open flags are ignored doesn't
> really matter for O_ALLOW_ENCODED; if the kernel doesn't support
> O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  arch/alpha/include/uapi/asm/fcntl.h  |  1 +
>  arch/parisc/include/uapi/asm/fcntl.h |  1 +
>  arch/sparc/include/uapi/asm/fcntl.h  |  1 +
>  fs/fcntl.c                           | 10 ++++++++--
>  fs/namei.c                           |  4 ++++
>  fs/open.c                            |  7 +++++++
>  include/linux/fcntl.h                |  2 +-
>  include/uapi/asm-generic/fcntl.h     |  4 ++++
>  8 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
> index 50bdc8e8a271..391e0d112e41 100644
> --- a/arch/alpha/include/uapi/asm/fcntl.h
> +++ b/arch/alpha/include/uapi/asm/fcntl.h
> @@ -34,6 +34,7 @@
>
>  #define O_PATH         040000000
>  #define __O_TMPFILE    0100000000
> +#define O_ALLOW_ENCODED        0200000000
>
>  #define F_GETLK                7
>  #define F_SETLK                8
> diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
> index 03dee816cb13..72ea9bdf5f04 100644
> --- a/arch/parisc/include/uapi/asm/fcntl.h
> +++ b/arch/parisc/include/uapi/asm/fcntl.h
> @@ -19,6 +19,7 @@
>
>  #define O_PATH         020000000
>  #define __O_TMPFILE    040000000
> +#define O_ALLOW_ENCODED        100000000
>
>  #define F_GETLK64      8
>  #define F_SETLK64      9
> diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
> index 67dae75e5274..ac3e8c9cb32c 100644
> --- a/arch/sparc/include/uapi/asm/fcntl.h
> +++ b/arch/sparc/include/uapi/asm/fcntl.h
> @@ -37,6 +37,7 @@
>
>  #define O_PATH         0x1000000
>  #define __O_TMPFILE    0x2000000
> +#define O_ALLOW_ENCODED        0x8000000
>
>  #define F_GETOWN       5       /*  for sockets. */
>  #define F_SETOWN       6       /*  for sockets. */
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 19ac5baad50f..9302f68fe698 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -30,7 +30,8 @@
>  #include <asm/siginfo.h>
>  #include <linux/uaccess.h>
>
> -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> +                   O_ALLOW_ENCODED)
>
>  static int setfl(int fd, struct file * filp, unsigned long arg)
>  {
> @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
>                 if (!inode_owner_or_capable(inode))
>                         return -EPERM;
>
> +       /* O_ALLOW_ENCODED can only be set by superuser */
> +       if ((arg & O_ALLOW_ENCODED) && !(filp->f_flags & O_ALLOW_ENCODED) &&
> +           !capable(CAP_SYS_ADMIN))
> +               return -EPERM;
> +
>         /* required for strict SunOS emulation */
>         if (O_NONBLOCK != O_NDELAY)
>                if (arg & O_NDELAY)
> @@ -1033,7 +1039,7 @@ static int __init fcntl_init(void)
>          * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>          * is defined as O_NONBLOCK on some platforms and not on others.
>          */
> -       BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> +       BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
>                 HWEIGHT32(
>                         (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
>                         __FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/namei.c b/fs/namei.c
> index d4a6dd772303..fbf64ce61088 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2890,6 +2890,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
>         if (flag & O_NOATIME && !inode_owner_or_capable(inode))
>                 return -EPERM;
>
> +       /* O_ALLOW_ENCODED can only be set by superuser */
> +       if ((flag & O_ALLOW_ENCODED) && !capable(CAP_SYS_ADMIN))
> +               return -EPERM;
> +
>         return 0;
>  }
>
> diff --git a/fs/open.c b/fs/open.c
> index 9af548fb841b..f2863aaf78e7 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1040,6 +1040,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>                 acc_mode = 0;
>         }
>
> +       /*
> +        * O_ALLOW_ENCODED must be combined with O_CLOEXEC to avoid accidentally
> +        * leaking encoded I/O privileges.
> +        */
> +       if ((how->flags & (O_ALLOW_ENCODED | O_CLOEXEC)) == O_ALLOW_ENCODED)
> +               return -EINVAL;
> +


dup() can also result in an accidental leak, because the new descriptor
it returns does not have close-on-exec set.
We could fail dup() of fd without O_CLOEXEC. Should we?

If we should, then what error code should it be? We could return EPERM,
but since we do allow clearing O_CLOEXEC or setting O_ALLOW_ENCODED
after open, EPERM seems a tad harsh.
EINVAL seems inappropriate because the error has nothing to do with the
input args of dup(), and EBADF would also be confusing.
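
To make the dup() concern concrete, here is a minimal userspace sketch.
It uses only existing calls (dup(), fcntl() with F_GETFD and
F_DUPFD_CLOEXEC); the helper names are made up for illustration, and
/dev/null stands in for a file opened with O_ALLOW_ENCODED, since that
flag is not upstream. The file status flags live on the shared open file
description, so a dup()'d descriptor would keep the encoded I/O
permission while losing close-on-exec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if close-on-exec is set on fd, 0 if it is clear, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Duplicates fd and sets close-on-exec on the copy in the same call. */
static int dup_keep_cloexec(int fd)
{
	return fcntl(fd, F_DUPFD_CLOEXEC, 0);
}

/* Shows that plain dup() drops close-on-exec while F_DUPFD_CLOEXEC keeps it. */
static void demo_dup_cloexec(void)
{
	int fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
	int plain, kept;

	assert(fd >= 0);
	plain = dup(fd);
	kept = dup_keep_cloexec(fd);
	assert(plain >= 0 && kept >= 0);

	assert(fd_is_cloexec(fd) == 1);    /* set by O_CLOEXEC at open */
	assert(fd_is_cloexec(plain) == 0); /* dup() cleared it */
	assert(fd_is_cloexec(kept) == 1);  /* F_DUPFD_CLOEXEC preserved it */

	close(kept);
	close(plain);
	close(fd);
}
```

Callers that need a second descriptor can already avoid the leak with
F_DUPFD_CLOEXEC; the open question is only whether plain dup() should be
refused.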

Thanks,
Amir.


* Re: [PATCH v6 03/11] fs: add RWF_ENCODED for reading/writing compressed data
  2020-11-18 19:18 ` [PATCH v6 03/11] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
@ 2020-11-19  7:38   ` Amir Goldstein
  2021-01-11 23:06     ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2020-11-19  7:38 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, Linux Btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Aleksa Sarai, Linux API, kernel-team

On Wed, Nov 18, 2020 at 9:18 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
>
> A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data
> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future a la copy_struct_from_user().
> The remaining iovecs contain the encoded extent.
>
> This adds the VFS helpers for supporting encoded I/O and documentation
> for filesystem support.
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  Documentation/filesystems/encoded_io.rst |  74 ++++++++++
>  Documentation/filesystems/index.rst      |   1 +
>  fs/read_write.c                          | 167 +++++++++++++++++++++--
>  include/linux/fs.h                       |  11 ++
>  include/uapi/linux/fs.h                  |  41 +++++-
>  5 files changed, 280 insertions(+), 14 deletions(-)
>  create mode 100644 Documentation/filesystems/encoded_io.rst
>
> diff --git a/Documentation/filesystems/encoded_io.rst b/Documentation/filesystems/encoded_io.rst
> new file mode 100644
> index 000000000000..50405276d866
> --- /dev/null
> +++ b/Documentation/filesystems/encoded_io.rst
> @@ -0,0 +1,74 @@
> +===========
> +Encoded I/O
> +===========
> +
> +Encoded I/O is a mechanism for reading and writing encoded (e.g., compressed
> +and/or encrypted) data directly from/to the filesystem. The userspace interface
> +is thoroughly described in the :manpage:`encoded_io(7)` man page; this document
> +describes the requirements for filesystem support.
> +
> +First of all, a filesystem supporting encoded I/O must indicate this by setting
> +the ``FMODE_ENCODED_IO`` flag in its ``file_open`` file operation::
> +

Should this be FMODE_ALLOW_ENCODED_IO?
How come I see no checks for this flag in vfs code?
You seem to only be checking the O_ flag.
Do we really want to allow setting the O_ flag after open or should we
deny that?
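
For illustration only, a generic check that looks at both the f_mode bit
and the O_ flag could be modelled as below. This is a userspace sketch,
not kernel code: the SKETCH_* bit values and the function name are
stand-ins invented here, and the error codes are the ones the man page
patch in this series documents (EOPNOTSUPP when the filesystem does not
implement encoded I/O, EPERM when the file was not opened with
O_ALLOW_ENCODED).

```c
#include <assert.h>
#include <errno.h>

/* Stand-in bit values for this sketch only; not the kernel's definitions. */
#define SKETCH_FMODE_ENCODED_IO	0x1u
#define SKETCH_O_ALLOW_ENCODED	0x2u

/*
 * Models a check that requires both filesystem support (f_mode) and the
 * privileged open flag (f_flags) before allowing an encoded read or write.
 */
static int sketch_encoded_io_allowed(unsigned int f_mode, unsigned int f_flags)
{
	if (!(f_mode & SKETCH_FMODE_ENCODED_IO))
		return -EOPNOTSUPP;
	if (!(f_flags & SKETCH_O_ALLOW_ENCODED))
		return -EPERM;
	return 0;
}

/* Exercises the three possible outcomes. */
static void sketch_encoded_io_selfcheck(void)
{
	assert(sketch_encoded_io_allowed(0, SKETCH_O_ALLOW_ENCODED) == -EOPNOTSUPP);
	assert(sketch_encoded_io_allowed(SKETCH_FMODE_ENCODED_IO, 0) == -EPERM);
	assert(sketch_encoded_io_allowed(SKETCH_FMODE_ENCODED_IO,
					 SKETCH_O_ALLOW_ENCODED) == 0);
}
```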

> +    static int foo_file_open(struct inode *inode, struct file *filp)
> +    {
> +            ...
> +            filep->f_mode |= FMODE_ENCODED_IO;
> +            ...
> +    }
> +
> +Encoded I/O goes through ``read_iter`` and ``write_iter``, designated by the
> +``IOCB_ENCODED`` flag in ``kiocb->ki_flags``.
> +
> +Reads
> +=====
> +
> +Encoded ``read_iter`` should:
> +
> +1. Call ``generic_encoded_read_checks()`` to validate the file and buffers
> +   provided by userspace.
> +2. Initialize the ``encoded_iov`` appropriately.
> +3. Copy it to the user with ``copy_encoded_iov_to_iter()``.
> +4. Copy the encoded data to the user.
> +5. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
> +6. Return the size of the encoded data read, not including the ``encoded_iov``.
> +
> +There are a few details to be aware of:
> +
> +* Encoded ``read_iter`` should support reading unencoded data if the extent is
> +  not encoded.
> +* If the buffers provided by the user are not large enough to contain an entire
> +  encoded extent, then ``read_iter`` should return ``-ENOBUFS``. This is to
> +  avoid confusing userspace with truncated data that cannot be properly
> +  decoded.
> +* Reads in the middle of an encoded extent can be returned by setting
> +  ``encoded_iov->unencoded_offset`` to non-zero.
> +* Truncated unencoded data (e.g., because the file does not end on a block
> +  boundary) may be returned by setting ``encoded_iov->len`` to a value smaller
> +  value than ``encoded_iov->unencoded_len - encoded_iov->unencoded_offset``.
> +
> +Writes
> +======
> +
> +Encoded ``write_iter`` should (in addition to the usual accounting/checks done
> +by ``write_iter``):
> +
> +1. Call ``copy_encoded_iov_from_iter()`` to get and validate the
> +   ``encoded_iov``.
> +2. Call ``generic_encoded_write_checks()`` instead of
> +   ``generic_write_checks()``.
> +3. Check that the provided encoding in ``encoded_iov`` is supported.
> +4. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
> +5. Return the size of the encoded data written.
> +
> +Again, there are a few details:
> +
> +* Encoded ``write_iter`` doesn't need to support writing unencoded data.
> +* ``write_iter`` should either write all of the encoded data or none of it; it
> +  must not do partial writes.
> +* ``write_iter`` doesn't need to validate the encoded data; a subsequent read
> +  may return, e.g., ``-EIO`` if the data is not valid.
> +* The user may lie about the unencoded size of the data; a subsequent read
> +  should truncate or zero-extend the unencoded data rather than returning an
> +  error.
> +* Be careful of page cache coherency.
> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> index 98f59a864242..6d9e3ff0a455 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -53,6 +53,7 @@ filesystem implementations.
>     journalling
>     fscrypt
>     fsverity
> +   encoded_io
>
>  Filesystems
>  ===========
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 75f764b43418..e2ad418d2987 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1625,24 +1625,15 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
>         return 0;
>  }
>
> -/*
> - * Performs necessary checks before doing a write
> - *
> - * Can adjust writing position or amount of bytes to write.
> - * Returns appropriate error code that caller should return or
> - * zero in case that write should be allowed.
> - */
> -ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
>  {
>         struct file *file = iocb->ki_filp;
>         struct inode *inode = file->f_mapping->host;
> -       loff_t count;
> -       int ret;
>
>         if (IS_SWAPFILE(inode))
>                 return -ETXTBSY;
>
> -       if (!iov_iter_count(from))
> +       if (!*count)
>                 return 0;
>
>         /* FIXME: this is for backwards compatibility with 2.4 */
> @@ -1652,8 +1643,22 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>         if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
>                 return -EINVAL;
>
> -       count = iov_iter_count(from);
> -       ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> +       return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> +}
> +
> +/*
> + * Performs necessary checks before doing a write
> + *
> + * Can adjust writing position or amount of bytes to write.
> + * Returns appropriate error code that caller should return or
> + * zero in case that write should be allowed.
> + */
> +ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +{
> +       loff_t count = iov_iter_count(from);
> +       int ret;
> +
> +       ret = generic_write_checks_common(iocb, &count);
>         if (ret)
>                 return ret;
>
> @@ -1684,3 +1689,139 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>
>         return 0;
>  }
> +
> +/**
> + * generic_encoded_write_checks() - check an encoded write
> + * @iocb: I/O context.
> + * @encoded: Encoding metadata.
> + *
> + * This should be called by RWF_ENCODED write implementations rather than
> + * generic_write_checks(). Unlike generic_write_checks(), it returns -EFBIG
> + * instead of adjusting the size of the write.
> + *
> + * Return: 0 on success, -errno on error.
> + */
> +int generic_encoded_write_checks(struct kiocb *iocb,
> +                                const struct encoded_iov *encoded)
> +{
> +       loff_t count = encoded->len;
> +       int ret;
> +
> +       if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
> +               return -EPERM;
> +
> +       ret = generic_write_checks_common(iocb, &count);
> +       if (ret)
> +               return ret;
> +
> +       if (count != encoded->len) {
> +               /*
> +                * The write got truncated by generic_write_checks_common(). We
> +                * can't do a partial encoded write.
> +                */
> +               return -EFBIG;
> +       }
> +       return 0;
> +}
> +EXPORT_SYMBOL(generic_encoded_write_checks);
> +
> +/**
> + * copy_encoded_iov_from_iter() - copy a &struct encoded_iov from userspace
> + * @encoded: Returned encoding metadata.
> + * @from: Source iterator.
> + *
> + * This copies in the &struct encoded_iov and does some basic sanity checks.
> + * This should always be used rather than a plain copy_from_iter(), as it does
> + * the proper handling for backward- and forward-compatibility.
> + *
> + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if the
> + *         copied structure contained non-zero fields that this kernel doesn't
> + *         support, -EINVAL if the copied structure was invalid.
> + */
> +int copy_encoded_iov_from_iter(struct encoded_iov *encoded,
> +                              struct iov_iter *from)
> +{
> +       size_t usize;
> +       int ret;
> +
> +       usize = iov_iter_single_seg_count(from);
> +       if (usize > PAGE_SIZE)
> +               return -E2BIG;
> +       if (usize < ENCODED_IOV_SIZE_VER0)
> +               return -EINVAL;
> +       ret = copy_struct_from_iter(encoded, sizeof(*encoded), from, usize);
> +       if (ret)
> +               return ret;
> +
> +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> +               return -EINVAL;
> +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> +               return -EINVAL;
> +       if (encoded->unencoded_offset > encoded->unencoded_len)
> +               return -EINVAL;
> +       if (encoded->len > encoded->unencoded_len - encoded->unencoded_offset)
> +               return -EINVAL;
> +       return 0;
> +}
> +EXPORT_SYMBOL(copy_encoded_iov_from_iter);
> +
> +/**
> + * generic_encoded_read_checks() - sanity check an RWF_ENCODED read
> + * @iocb: I/O context.
> + * @iter: Destination iterator for read.
> + *
> + * This should always be called by RWF_ENCODED read implementations before
> + * returning any data.
> + *
> + * Return: Number of bytes available to return encoded data in @iter on success,
> + *         -EPERM if the file was not opened with O_ALLOW_ENCODED, -EINVAL if
> + *         the size of the &struct encoded_iov iovec is invalid.
> + */
> +ssize_t generic_encoded_read_checks(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +       size_t usize;
> +
> +       if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
> +               return -EPERM;
> +       usize = iov_iter_single_seg_count(iter);
> +       if (usize > PAGE_SIZE || usize < ENCODED_IOV_SIZE_VER0)
> +               return -EINVAL;
> +       return iov_iter_count(iter) - usize;
> +}
> +EXPORT_SYMBOL(generic_encoded_read_checks);
> +
> +/**
> + * copy_encoded_iov_to_iter() - copy a &struct encoded_iov to userspace
> + * @encoded: Encoding metadata to return.
> + * @to: Destination iterator.
> + *
> + * This should always be used by RWF_ENCODED read implementations rather than a
> + * plain copy_to_iter(), as it does the proper handling for backward- and
> + * forward-compatibility. The iterator must be sanity-checked with
> + * generic_encoded_read_checks() before this is called.
> + *
> + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if there
> + *         were non-zero fields in @encoded that the user buffer could not
> + *         accommodate.
> + */
> +int copy_encoded_iov_to_iter(const struct encoded_iov *encoded,
> +                            struct iov_iter *to)
> +{
> +       size_t ksize = sizeof(*encoded);
> +       size_t usize = iov_iter_single_seg_count(to);
> +       size_t size = min(ksize, usize);
> +
> +       /* We already sanity-checked usize in generic_encoded_read_checks(). */
> +
> +       if (usize < ksize &&
> +           memchr_inv((char *)encoded + usize, 0, ksize - usize))
> +               return -E2BIG;
> +       if (copy_to_iter(encoded, size, to) != size ||
> +           (usize > ksize &&
> +            iov_iter_zero(usize - ksize, to) != usize - ksize))
> +               return -EFAULT;
> +       return 0;
> +}
> +EXPORT_SYMBOL(copy_encoded_iov_to_iter);
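
The size handling in copy_encoded_iov_to_iter() follows the usual
extensible-struct pattern, which can be modelled in userspace as below.
This is only a sketch: plain memory buffers stand in for the iov_iter,
sketch_any_nonzero() stands in for the kernel's memchr_inv(), and both
function names are invented here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if any of the n bytes at p is non-zero, else 0. */
static int sketch_any_nonzero(const unsigned char *p, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i])
			return 1;
	return 0;
}

/*
 * Copies a ksize-byte kernel-side struct into a usize-byte user buffer:
 * - usize < ksize: fail with -E2BIG if the fields that do not fit are
 *   non-zero, otherwise copy only the first usize bytes;
 * - usize > ksize: copy ksize bytes and zero-fill the remainder.
 */
static int sketch_copy_struct_out(void *ubuf, size_t usize,
				  const void *ksrc, size_t ksize)
{
	size_t size = usize < ksize ? usize : ksize;

	assert(ubuf != NULL && ksrc != NULL);

	if (usize < ksize &&
	    sketch_any_nonzero((const unsigned char *)ksrc + usize,
			       ksize - usize))
		return -E2BIG;
	memcpy(ubuf, ksrc, size);
	if (usize > ksize)
		memset((unsigned char *)ubuf + ksize, 0, usize - ksize);
	return 0;
}
```
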
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8667d0cdc71e..67810bf6fb1c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -178,6 +178,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File supports async buffered reads */
>  #define FMODE_BUF_RASYNC       ((__force fmode_t)0x40000000)
>
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO       ((__force fmode_t)0x80000000)
> +
>  /*
>   * Attribute flags.  These should be or-ed together to figure out what
>   * has been changed!
> @@ -308,6 +311,7 @@ enum rw_hint {
>  #define IOCB_SYNC              (__force int) RWF_SYNC
>  #define IOCB_NOWAIT            (__force int) RWF_NOWAIT
>  #define IOCB_APPEND            (__force int) RWF_APPEND
> +#define IOCB_ENCODED           (__force int) RWF_ENCODED
>
>  /* non-RWF related bits - start at 16 */
>  #define IOCB_EVENTFD           (1 << 16)
> @@ -2964,6 +2968,13 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
>  extern int generic_write_check_limits(struct file *file, loff_t pos,
>                 loff_t *count);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *,
> +                                       const struct encoded_iov *);
> +extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *);
> +extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *);
> +extern int copy_encoded_iov_to_iter(const struct encoded_iov *,
> +                                   struct iov_iter *);
>  extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
>  extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
>                 struct iov_iter *to, ssize_t already_read);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..95493420117a 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -279,6 +279,42 @@ struct fsxattr {
>                                          SYNC_FILE_RANGE_WAIT_BEFORE | \
>                                          SYNC_FILE_RANGE_WAIT_AFTER)
>
> +enum {
> +       ENCODED_IOV_COMPRESSION_NONE,
> +#define ENCODED_IOV_COMPRESSION_NONE ENCODED_IOV_COMPRESSION_NONE
> +       ENCODED_IOV_COMPRESSION_BTRFS_ZLIB,
> +#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ENCODED_IOV_COMPRESSION_BTRFS_ZLIB
> +       ENCODED_IOV_COMPRESSION_BTRFS_ZSTD,
> +#define ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ENCODED_IOV_COMPRESSION_BTRFS_ZSTD
> +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K,
> +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K
> +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K,
> +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K
> +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K,
> +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K
> +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K,
> +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K
> +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
> +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K
> +       ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
> +};
> +

I am not a fan of this trick.
There is no shortage of enums in uapi headers, but I think that if we want
to set values in stone, the values should be set explicitly and not
auto-assigned by the compiler.

If anybody ever adds a line, say ENCODED_IOV_COMPRESSION_BTRFS_ZLIB_V2,
in the middle of the enum list, it won't be obvious that it's a uapi breakage.

In principle, we could have partitioned the encoding types by domain
(e.g. btrfs), and the btrfs-specific encodings would have been part of a
btrfs header, but it's not that important.

However, please move all encoded_io stuff to a new uapi header and do
not include it from fs.h, to avoid having to compile most filesystems every
time a new btrfs private encoding type is added.
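
As a sketch of the explicit-value suggestion (not a proposed final
layout), the enum could pin each constant to the value the compiler
currently auto-assigns in the quoted patch, with a compile-time check
that makes an accidental renumbering fail the build. The #define lines
from the patch are omitted here for brevity.

```c
#include <assert.h>

enum {
	ENCODED_IOV_COMPRESSION_NONE = 0,
	ENCODED_IOV_COMPRESSION_BTRFS_ZLIB = 1,
	ENCODED_IOV_COMPRESSION_BTRFS_ZSTD = 2,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K = 3,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K = 4,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K = 5,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K = 6,
	ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K = 7,
	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
};

/* Inserting a new entry mid-list can no longer shift these silently. */
_Static_assert(ENCODED_IOV_COMPRESSION_BTRFS_ZSTD == 2,
	       "uapi value must not change");
_Static_assert(ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K == 7,
	       "uapi value must not change");
```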

Thanks,
Amir.


* Re: [PATCH man-pages v6] Document encoded I/O
  2020-11-18 19:18 ` [PATCH man-pages v6] Document encoded I/O Omar Sandoval
@ 2020-11-19 23:29   ` Alejandro Colomar (mailing lists; readonly)
  2020-11-20 14:06     ` Alejandro Colomar (man-pages)
  0 siblings, 1 reply; 43+ messages in thread
From: Alejandro Colomar (mailing lists; readonly) @ 2020-11-19 23:29 UTC (permalink / raw)
  To: Omar Sandoval, Michael Kerrisk
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team, linux-man

Hi Omar,

Please, see some fixes below:

Michael, I also have some questions for you below
(you can grep for mtk to find those).

Thanks,

Alex

On 11/18/20 8:18 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This adds a new page, encoded_io(7), providing an overview of encoded
> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> reference it.
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: linux-man <linux-man@vger.kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> This feature is not yet upstream.
> 
>  man2/fcntl.2      |  10 +-
>  man2/open.2       |  23 +++
>  man2/readv.2      |  70 +++++++++
>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 471 insertions(+), 1 deletion(-)
>  create mode 100644 man7/encoded_io.7
> 
> diff --git a/man2/fcntl.2 b/man2/fcntl.2
> index 546016617..b0d7fa2c3 100644
> --- a/man2/fcntl.2
> +++ b/man2/fcntl.2
> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>  .BR O_ASYNC ,
>  .BR O_DIRECT ,
>  .BR O_NOATIME ,
> +.BR O_NONBLOCK ,
>  and
> -.B O_NONBLOCK
> +.B O_ALLOW_ENCODED
>  flags.
>  It is not possible to change the
>  .BR O_DSYNC
> @@ -1820,6 +1821,13 @@ Attempted to clear the
>  flag on a file that has the append-only attribute set.
>  .TP
>  .B EPERM
> +Attempted to set the
> +.B O_ALLOW_ENCODED
> +flag and the calling process did not have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
> +.B EPERM
>  .I cmd
>  was
>  .BR F_ADD_SEALS ,
> diff --git a/man2/open.2 b/man2/open.2
> index f587b0d95..84697dfa8 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -437,6 +437,16 @@ was followed by a call to
>  .BR fdatasync (2)).
>  .IR "See NOTES below" .
>  .TP
> +.B O_ALLOW_ENCODED

The list is alphabetically sorted;
please, follow that
(O_ALLOW_ENCODED should be the first one).

> +Open the file with encoded I/O permissions;
> +see
> +.BR encoded_io (7).
> +.B O_CLOEXEC
> +must be specified in conjuction with this flag.
> +The caller must have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
>  .B O_EXCL
>  Ensure that this call creates the file:
>  if this flag is specified in conjunction with
> @@ -1082,6 +1092,14 @@ is invalid
>  (e.g., it contains characters not permitted by the underlying filesystem).
>  .TP
>  .B EINVAL
> +.B O_ALLOW_ENCODED
> +was specified in
> +.IR flags ,
> +but
> +.B O_CLOEXEC
> +was not specified.
> +.TP
> +.B EINVAL
>  The final component ("basename") of
>  .I pathname
>  is invalid
> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>  The operation was prevented by a file seal; see
>  .BR fcntl (2).
>  .TP
> +.B EPERM
> +The
> +.B O_ALLOW_ENCODED
> +flag was specified, but the caller was not privileged.
> +.TP
>  .B EROFS
>  .I pathname
>  refers to a file on a read-only filesystem and write access was
> diff --git a/man2/readv.2 b/man2/readv.2
> index 5a8b74168..c9933acf0 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>  However, if the
>  .I offset
>  argument is \-1, the current file offset is updated.
> +.TP
> +.BR RWF_ENCODED " (since Linux 5.12)"
> +Read or write encoded (e.g., compressed) data.
> +See
> +.BR encoded_io (7).
>  .SH RETURN VALUE
>  On success,
>  .BR readv (),
> @@ -283,6 +288,13 @@ than requested (see
>  and
>  .BR write (2)).
>  .PP
> +If
> +.B
> +RWF_ENCODED

RWF_ENCODED should go in the same line as .B:

[
.B RWF_ENCODED
]

> +was specified in
> +.IR flags ,
> +then the return value is the number of encoded bytes.
> +.PP
>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>  .SH ERRORS
>  The errors are as given for
> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>  .TP
>  .B EOPNOTSUPP
>  An unknown flag is specified in \fIflags\fP.
> +.TP
> +.B EOPNOTSUPP
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the filesystem does not implement encoded I/O.
> +.TP
> +.B EPERM
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the file was not opened with the
> +.B O_ALLOW_ENCODED
> +flag.
> +.PP
> +.BR preadv2 ()
> +can fail for the following reasons:

The wording is a bit unclear:

Above your additions (old text, not yours),
it says that some errors apply to preadv2
(as well as to other functions):

[
ERRORS
       The errors are as given for read(2) and write(2).  Furthermore,
       preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
       for the same reasons as lseek(2).  Additionally, the  following
       errors are defined:

       EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
              value.

       EINVAL The vector count, iovcnt, is less than zero  or  greater
              than the permitted maximum.

       EOPNOTSUPP
              An unknown flag is specified in flags.

       EOPNOTSUPP
              RWF_ENCODED  is  specified  in  flags and the filesystem
              does not implement encoded I/O.

       EPERM  RWF_ENCODED is specified in flags and the file  was  not
              opened with the O_ALLOW_ENCODED flag.
]

And then you added a line that says:

[
       preadv2() can fail for the following reasons:
]

Read strictly, that says that [only] the following errors apply.

Did you mean that
"preadv2() can _additionally_ fail for the following reasons"?

Could you please be a bit more specific?

The same applies for pwritev2() below.

> +.TP
> +.B E2BIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and
> +.I iov[0]
> +is not large enough to return the encoding metadata.
> +.TP
> +.B ENOBUFS
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the buffers in
> +.I iov
> +are not big enough to return the encoded data.
> +.PP
> +.BR pwritev2 ()
> +can fail for the following reasons:
> +.TP
> +.B E2BIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and
> +.I iov[0]
> +contains non-zero fields
> +after the kernel's
> +.IR "sizeof(struct\ encoded_iov)" .

Don't escape the space, if the string is already in "".

> +.TP
> +.B EINVAL
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the encoding is unknown or not supported by the filesystem.
> +.TP
> +.B EINVAL
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the alignment and/or size requirements are not met.
>  .SH VERSIONS
>  .BR preadv ()
>  and
> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
> new file mode 100644
> index 000000000..106fa587b
> --- /dev/null
> +++ b/man7/encoded_io.7
> @@ -0,0 +1,369 @@
> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\"
> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +encoded_io \- overview of encoded I/O
> +.SH DESCRIPTION
> +Several filesystems (e.g., Btrfs) support transparent encoding
> +(e.g., compression, encryption) of data on disk:
> +written data is encoded by the kernel before it is written to disk,
> +and read data is decoded before being returned to the user.
> +In some cases, it is useful to skip this encoding step.

Here I would use ';' instead of '.'
(and next letter would be lowercase, then).

> +For example, the user may want to read the compressed contents of a file
> +or write pre-compressed data directly to a file.
> +This is referred to as "encoded I/O".
> +.SS Encoded I/O API
> +Encoded I/O is specified with the
> +.B RWF_ENCODED
> +flag to
> +.BR preadv2 (2)
> +and
> +.BR pwritev2 (2).
> +If
> +.B RWF_ENCODED
> +is specified, then
> +.I iov[0].iov_base
> +points to an
> +.I
> +encoded_iov

On the same line, please.

> +structure, defined in
> +.I <linux/fs.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct encoded_iov {
> +    __aligned_u64 len;
> +    __aligned_u64 unencoded_len;
> +    __aligned_u64 unencoded_offset;
> +    __u32 compression;
> +    __u32 encryption;
> +};
> +.EE
> +.in
> +.PP
> +This may be extended in the future, so
> +.I iov[0].iov_len
> +must be set to
> +.I "sizeof(struct\ encoded_iov)"
> +for forward/backward compatibility.
> +The remaining buffers contain the encoded data.
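
A reader might find it helpful to see how a caller would lay out the
iovec array described here. The sketch below only fills in the
structures; it does not call pwritev2(), because RWF_ENCODED and struct
encoded_iov are not upstream. The struct is a stand-in using fixed-width
types in place of __aligned_u64/__u32, the helper name is invented, and
the example assumes a write of one whole extent (unencoded_offset of 0
and len equal to unencoded_len).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Stand-in for the struct encoded_iov shown in the man page patch. */
struct sketch_encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/*
 * Fills iov[0] with the metadata and iov[1] with the encoded data for a
 * write covering a complete extent of unencoded_len bytes.
 */
static void sketch_fill_encoded_iov(struct iovec iov[2],
				    struct sketch_encoded_iov *meta,
				    void *data, size_t data_len,
				    uint64_t unencoded_len,
				    uint32_t compression)
{
	assert(iov != NULL && meta != NULL);

	memset(meta, 0, sizeof(*meta));
	meta->len = unencoded_len;
	meta->unencoded_len = unencoded_len;
	meta->unencoded_offset = 0;
	meta->compression = compression;	/* encryption stays zero */

	iov[0].iov_base = meta;
	iov[0].iov_len = sizeof(*meta);		/* required by the interface */
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}
```

With the patch applied, the array would then be passed to pwritev2()
with an iovcnt of 2 and RWF_ENCODED in flags.
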
> +.PP
> +.I compression
> +and
> +.I encryption
> +are the encoding fields.
> +.I compression
> +is
> +.B ENCODED_IOV_COMPRESSION_NONE
> +(zero)
> +or a filesystem-specific
> +.B ENCODED_IOV_COMPRESSION

Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/

> +constant;
> +see
> +.BR Filesystem\ support .

Please, write it as [.BR "Filesystem support" .]

and maybe I would change it, to be more specific, to the following:

[
see
.B Filesystem support
below.
]

So that the reader clearly understands it's on the same page.

> +.I encryption
> +is currently always
> +.B ENCODED_IOV_ENCRYPTION_NONE
> +(zero).
> +.PP
> +.I unencoded_len
> +is the length of the unencoded (i.e., decrypted and decompressed) data.
> +.I unencoded_offset
> +is the offset into the unencoded data where the data in the file begins

The above wording is a bit unclear to me.

I suggest the following:

[
.I unencoded_offset
is the offset from the beginning of the file
to the first byte of the unencoded data
]

> +(less than or equal to
> +.IR unencoded_len ).
> +.I len
> +is the length of the data in the file
> +(less than or equal to
> +.I unencoded_len
> +-

Here's a question for Michael (mtk):

I've seen (many) cases where these math operations
are written without spaces,
and in the same line (e.g., [.IR a + b]).

I'd like to know your preferences on this,
or which form is actually more widespread in the manual pages,
to stick with only one of them.

> +.IR unencoded_offset ).
> +See
> +.B Extent layout
> +below for some examples.
> +.I

Were you maybe going to add something there?

If not, please remove that [.I].

> +.PP
> +If the unencoded data is actually longer than
> +.IR unencoded_len ,
> +then it is truncated;
> +if it is shorter, then it is extended with zeroes.
> +.PP
> +

Please, remove that blank line.

> +.BR pwritev2 ()

Should be [.BR pwritev2 (2)]

Michael (mtk),

Am I right in that?  Please, confirm.

> +uses the metadata specified in
> +.IR iov[0] ,
> +writes the encoded data from the remaining buffers,
> +and returns the number of encoded bytes written
> +(that is, the sum of
> +.I iov[n].iov_len
> +for 1 <=
> +.I n
> +<
> +.IR iovcnt ;
> +partial writes will not occur).
> +At least one encoding field must be non-zero.
> +Note that the encoded data is not validated when it is written;
> +if it is not valid (e.g., it cannot be decompressed),
> +then a subsequent read may return an error.
> +If the
> +.I offset
> +argument to
> +.BR pwritev2 ()

Same as above: specify (2).

> +is -1, then the file offset is incremented by
> +.IR len .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"

[.I] allows spaces, so it should be:

[
.I sizeof(struct encoded_iov)
]

> +in the kernel,
> +then any fields unknown to userspace are treated as if they were zero;

s/userspace/user space/

See man-pages(7)::STYLE GUIDE::Preferred terms

> +if it is greater and any fields unknown to the kernel are non-zero,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG .
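
The size rule above for pwritev2(2) can be modelled in plain user-space C. This is only a sketch of the behaviour as the page words it, not the kernel code from the series; model_copy_encoded_iov() is a hypothetical name, and it returns a negative errno value (kernel style) where the system call would return -1 and set errno:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Copy a usize-byte structure supplied by user space into a ksize-byte
 * structure as known to the kernel.
 * - usize < ksize: the fields user space does not know about become zero.
 * - usize > ksize: the extra bytes must all be zero, otherwise -E2BIG. */
static int model_copy_encoded_iov(void *kdst, size_t ksize,
                                  const void *usrc, size_t usize)
{
    const unsigned char *src = usrc;
    size_t common = usize < ksize ? usize : ksize;

    assert(kdst != NULL && usrc != NULL);

    /* Reject non-zero fields that this "kernel" does not understand. */
    for (size_t i = ksize; i < usize; i++)
        if (src[i] != 0)
            return -E2BIG;

    memcpy(kdst, src, common);
    /* Zero-fill whatever the caller's smaller structure did not cover. */
    memset((unsigned char *)kdst + common, 0, ksize - common);
    return 0;
}
```

With an 8-byte "kernel" structure, a 4-byte caller structure is accepted and zero-extended, a 12-byte one with zero trailing bytes is accepted, and a 12-byte one with any non-zero trailing byte fails with E2BIG.
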
> +.PP
> +.BR preadv2 ()

Same as above: specify (2).

> +populates the metadata in
> +.IR iov[0] ,
> +the encoded data in the remaining buffers,
> +and returns the number of encoded bytes read.
> +This will only return one extent per call.
> +This can also read data which is not encoded;
> +all encoding fields will be zero in that case.
> +If the
> +.I offset
> +argument to
> +.BR preadv2 ()

Same as above: specify (2).

> +is -1, then the file offset is incremented by
> +.IR len .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"

Don't need '"' nor '\', as above.

> +in the kernel and any fields unknown to userspace are non-zero,

s/userspace/user space/

> +then
> +.BR preadv2 ()

(2)

> +returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG ;
> +if it is greater,
> +then any fields unknown to the kernel are returned as zero.
> +If the provided buffers are not large enough to return an entire encoded
> +extent,

Please use semantic newlines.
I haven't checked that in the text above,
so if you happen to find that there's any other line
that should also be fixed in that sense, please do so.

To understand 'semantic newlines',
please have a look at
man-pages(7)::STYLE GUIDE::Use semantic newlines

Basically, split lines at the most natural separation point,
instead of just when the line gets over the margin.

> +then
> +.BR preadv2 ()

(2)

> +returns -1 and sets
> +.I errno
> +to
> +.BR ENOBUFS .
> +.PP
> +As the filesystem page cache typically contains decoded data,
> +encoded I/O bypasses the page cache.
> +.SS Extent layout
> +By using
> +.IR len ,
> +.IR unencoded_len ,
> +and
> +.IR unencoded_offset ,
> +it is possible to refer to a subset of an unencoded extent.
> +.PP
> +In the simplest case,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +This means that the entire unencoded extent is used.
> +.PP
> +However, suppose we read 50 bytes into a file
> +which contains a single compressed extent.
> +The filesystem must still return the entire compressed extent
> +for us to be able to decompress it,
> +so
> +.I unencoded_len
> +would be the length of the entire decompressed extent.
> +However, because the read was at offset 50,
> +the first 50 bytes should be ignored.
> +Therefore,
> +.I unencoded_offset
> +would be 50,
> +and
> +.I len
> +would accordingly be
> +.IR unencoded_len\ -\ 50 .

This formats everything as I, except for the last dot.
Replace by:

[
.I unencoded_len
- 50.
]

Michael (mtk), same as above:
to space, or not to space?  That is the question :p

Personally, I find spaces more clear.

> +.PP
> +Additionally, suppose we want to create an encrypted file with length 500,
> +but the file is encrypted with a block cipher using a block size of 4096.
> +The unencoded data would therefore include the appropriate padding,
> +and
> +.I unencoded_len
> +would be 4096.
> +However, to represent the logical size of the file,
> +.I len
> +would be 500
> +(and
> +.I unencoded_offset
> +would be 0).
> +.PP
> +Similar situations can arise in other cases:
> +.IP * 3
> +If the filesystem pads data to the filesystem block size before compressing,
> +then compressed files with a size unaligned to the filesystem block size will
> +end with an extent with
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +Extents cloned from the middle of a larger encoded extent with
> +.B FICLONERANGE
> +may have a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +If the middle of an encoded extent is overwritten,
> +the filesystem may create extents with a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.I unencoded_len
> +for the parts that were not overwritten.
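
The bounds stated earlier in the page (unencoded_offset <= unencoded_len, and len <= unencoded_len - unencoded_offset) and the two worked examples in this section can be checked with a small sketch. The struct is again a local copy of the proposed definition, and both function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the structure proposed in this patch. */
struct encoded_iov {
    uint64_t len;
    uint64_t unencoded_len;
    uint64_t unencoded_offset;
    uint32_t compression;
    uint32_t encryption;
};

/* True if the three length fields describe a valid window
 * into the unencoded extent. */
static bool extent_layout_ok(const struct encoded_iov *e)
{
    return e->unencoded_offset <= e->unencoded_len &&
           e->len <= e->unencoded_len - e->unencoded_offset;
}

/* Given the fully decoded extent (unencoded_len bytes), return the part
 * that is logically in the file, or NULL if the layout is invalid. */
static const unsigned char *extent_logical_data(const unsigned char *decoded,
                                                const struct encoded_iov *e,
                                                size_t *len_out)
{
    assert(decoded != NULL && e != NULL && len_out != NULL);
    if (!extent_layout_ok(e))
        return NULL;
    *len_out = (size_t)e->len;
    return decoded + e->unencoded_offset;
}
```

For the read at offset 50 into a 4096-byte extent, the fields would be unencoded_len = 4096, unencoded_offset = 50 and len = 4046; for the encrypted 500-byte file they would be 4096, 0 and 500.
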
> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by maliciously crafted data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.

Please, s/vs./vs/
See the reasons below:

Michael (mtk),

Here the renderer outputs a double space
(as for separating two sentences).

Are you okay with that?

I haven't found any other "\<vs\>\.".
However, I've found a few "\<vs\>[^\.]".

> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ALLOW_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +The
> +.B O_CLOEXEC
> +flag must be specified in conjunction with
> +.BR O_ALLOW_ENCODED .
> +This avoids accidentally leaking the encoded I/O privilege
> +(it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2)
> +otherwise).
> +If
> +.B O_ALLOW_ENCODED
> +without
> +.B O_CLOEXEC
> +is desired,
> +.B O_CLOEXEC
> +can be cleared afterwards with
> +.BR fcntl (2).
> +.BR fcntl (2)
> +can also clear or set
> +.B O_ALLOW_ENCODED
> +(including without
> +.BR O_CLOEXEC ).
> +.SS Filesystem support
> +Encoded I/O is supported on the following filesystems:
> +.TP
> +Btrfs (since Linux 5.12)
> +.IP
> +Btrfs supports encoded reads and writes of compressed data.
> +The data is encoded as follows:
> +.RS
> +.IP * 3
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
> +then the encoded data is a single zlib stream.
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
> +then the encoded data is a single zstd frame compressed with the
> +.I windowLog
> +compression parameter set to no more than 17.
> +.IP *
> +If
> +.I compression
> +is one of
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
> +or
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
> +then the encoded data is compressed page by page
> +(using the page size indicated by the name of the constant)
> +with LZO1X
> +and wrapped in the format documented in the Linux kernel source file
> +.IR fs/btrfs/lzo.c .
> +.RE
> +.IP
> +Additionally, there are some restrictions on
> +.BR pwritev2 ():

(2)

> +.RS
> +.IP * 3
> +.I offset
> +(or the current file offset if
> +.I offset
> +is -1) must be aligned to the sector size of the filesystem.
> +.IP *
> +.I len
> +must be aligned to the sector size of the filesystem
> +unless the data ends at or beyond the current end of the file.
> +.IP *
> +.I unencoded_len
> +and the length of the encoded data must each be no more than 128 KiB.
> +This limit may increase in the future.
> +.IP *
> +The length of the encoded data must be less than or equal to
> +.IR unencoded_len .
> +.IP *
> +If using LZO, the filesystem's page size must match the compression page size.
> +.RE
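
The Btrfs restrictions on pwritev2(2) listed above could be pre-checked in user space roughly as follows. This is a sketch under assumptions: the struct is a local copy of the proposed definition, btrfs_encoded_write_ok() is a hypothetical name, the caller supplies the sector size and current file size, and the LZO page-size rule is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the structure proposed in this patch. */
struct encoded_iov {
    uint64_t len;
    uint64_t unencoded_len;
    uint64_t unencoded_offset;
    uint32_t compression;
    uint32_t encryption;
};

/* encoded_len is the total size of the data buffers (iov[1..]). */
static bool btrfs_encoded_write_ok(const struct encoded_iov *e,
                                   uint64_t offset, uint64_t encoded_len,
                                   uint64_t file_size, uint64_t sector_size)
{
    const uint64_t max_len = 128 * 1024;   /* 128 KiB, per the page */

    assert(e != NULL);
    if (sector_size == 0)
        return false;
    /* The write must start on a sector boundary. */
    if (offset % sector_size != 0)
        return false;
    /* len may be unaligned only if the data ends at or beyond EOF. */
    if (e->len % sector_size != 0 && offset + e->len < file_size)
        return false;
    if (e->unencoded_len > max_len || encoded_len > max_len)
        return false;
    /* Encoded data must not be larger than what it decodes to. */
    if (encoded_len > e->unencoded_len)
        return false;
    return true;
}
```

A check like this only mirrors the documented rules; the kernel remains the authority, and the page notes that the 128 KiB limit may increase.
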
> 

Please, add a SEE ALSO section, which should at least point to
preadv2(2) (or pwritev2(2), if you prefer):

[
.SH SEE ALSO
.BR preadv2 (2)
]

* Re: [PATCH man-pages v6] Document encoded I/O
  2020-11-19 23:29   ` Alejandro Colomar (mailing lists; readonly)
@ 2020-11-20 14:06     ` Alejandro Colomar (man-pages)
  2020-11-20 15:03       ` Alejandro Colomar (man-pages)
  0 siblings, 1 reply; 43+ messages in thread
From: Alejandro Colomar (man-pages) @ 2020-11-20 14:06 UTC (permalink / raw)
  To: Omar Sandoval, Michael Kerrisk
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team, linux-man

Hi Omar and Michael,

please, see below.

Thanks,

Alex

On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
> Hi Omar,
> 
> Please, see some fixes below:
> 
> Michael, I also have some questions for you below
> (you can grep for mtk to find those).
> 
> Thanks,
> 
> Alex
> 
> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> This adds a new page, encoded_io(7), providing an overview of encoded
>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>> reference it.
>>
>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: linux-man <linux-man@vger.kernel.org>
>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>> ---
>> This feature is not yet upstream.
>>
>>  man2/fcntl.2      |  10 +-
>>  man2/open.2       |  23 +++
>>  man2/readv.2      |  70 +++++++++
>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>  create mode 100644 man7/encoded_io.7
>>
>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>> index 546016617..b0d7fa2c3 100644
>> --- a/man2/fcntl.2
>> +++ b/man2/fcntl.2
>> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>>  .BR O_ASYNC ,
>>  .BR O_DIRECT ,
>>  .BR O_NOATIME ,
>> +.BR O_NONBLOCK ,
>>  and
>> -.B O_NONBLOCK
>> +.B O_ALLOW_ENCODED
>>  flags.
>>  It is not possible to change the
>>  .BR O_DSYNC
>> @@ -1820,6 +1821,13 @@ Attempted to clear the
>>  flag on a file that has the append-only attribute set.
>>  .TP
>>  .B EPERM
>> +Attempted to set the
>> +.B O_ALLOW_ENCODED
>> +flag and the calling process did not have the
>> +.B CAP_SYS_ADMIN
>> +capability.
>> +.TP
>> +.B EPERM
>>  .I cmd
>>  was
>>  .BR F_ADD_SEALS ,
>> diff --git a/man2/open.2 b/man2/open.2
>> index f587b0d95..84697dfa8 100644
>> --- a/man2/open.2
>> +++ b/man2/open.2
>> @@ -437,6 +437,16 @@ was followed by a call to
>>  .BR fdatasync (2)).
>>  .IR "See NOTES below" .
>>  .TP
>> +.B O_ALLOW_ENCODED
> 
> The list is alphabetically sorted;
> please, follow that
> (O_ALLOW_ENCODED should be the first one).
> 
>> +Open the file with encoded I/O permissions;
>> +see
>> +.BR encoded_io (7).
>> +.B O_CLOEXEC
>> must be specified in conjunction with this flag.
>> +The caller must have the
>> +.B CAP_SYS_ADMIN
>> +capability.
>> +.TP
>>  .B O_EXCL
>>  Ensure that this call creates the file:
>>  if this flag is specified in conjunction with
>> @@ -1082,6 +1092,14 @@ is invalid
>>  (e.g., it contains characters not permitted by the underlying filesystem).
>>  .TP
>>  .B EINVAL
>> +.B O_ALLOW_ENCODED
>> +was specified in
>> +.IR flags ,
>> +but
>> +.B O_CLOEXEC
>> +was not specified.
>> +.TP
>> +.B EINVAL
>>  The final component ("basename") of
>>  .I pathname
>>  is invalid
>> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>>  The operation was prevented by a file seal; see
>>  .BR fcntl (2).
>>  .TP
>> +.B EPERM
>> +The
>> +.B O_ALLOW_ENCODED
>> +flag was specified, but the caller was not privileged.
>> +.TP
>>  .B EROFS
>>  .I pathname
>>  refers to a file on a read-only filesystem and write access was
>> diff --git a/man2/readv.2 b/man2/readv.2
>> index 5a8b74168..c9933acf0 100644
>> --- a/man2/readv.2
>> +++ b/man2/readv.2
>> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>>  However, if the
>>  .I offset
>>  argument is \-1, the current file offset is updated.
>> +.TP
>> +.BR RWF_ENCODED " (since Linux 5.12)"
>> +Read or write encoded (e.g., compressed) data.
>> +See
>> +.BR encoded_io (7).
>>  .SH RETURN VALUE
>>  On success,
>>  .BR readv (),
>> @@ -283,6 +288,13 @@ than requested (see
>>  and
>>  .BR write (2)).
>>  .PP
>> +If
>> +.B
>> +RWF_ENCODED
> 
> RWF_ENCODED should go in the same line as .B:
> 
> [
> .B RWF_ENCODED
> ]
> 
>> +was specified in
>> +.IR flags ,
>> +then the return value is the number of encoded bytes.
>> +.PP
>>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>>  .SH ERRORS
>>  The errors are as given for
>> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>>  .TP
>>  .B EOPNOTSUPP
>>  An unknown flag is specified in \fIflags\fP.
>> +.TP
>> +.B EOPNOTSUPP
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the filesystem does not implement encoded I/O.
>> +.TP
>> +.B EPERM
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the file was not opened with the
>> +.B O_ALLOW_ENCODED
>> +flag.
>> +.PP
>> +.BR preadv2 ()
>> +can fail for the following reasons:
> 
> The wording is a bit unclear:
> 
> Above your additions (old text, not yours),
> it says that some errors apply to preadv2
> (as well as to other functions):
> 
> [
> ERRORS
>        The errors are as given for read(2) and write(2).  Furthermore,
>        preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
>        for the same reasons as lseek(2).  Additionally, the  following
>        errors are defined:
> 
>        EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
>               value.
> 
>        EINVAL The vector count, iovcnt, is less than zero  or  greater
>               than the permitted maximum.
> 
>        EOPNOTSUPP
>               An unknown flag is specified in flags.
> 
>        EOPNOTSUPP
>               RWF_ENCODED  is  specified  in  flags and the filesystem
>               does not implement encoded I/O.
> 
>        EPERM  RWF_ENCODED is specified in flags and the file  was  not
>               opened with the O_ALLOW_ENCODED flag.
> ]
> 
> And then you added a line that says:
> 
> [
>        preadv2() can fail for the following reasons:
> ]
> 
> Which, if read strictly, says that [only] the following errors apply.
> 
> Did you mean that
> "preadv2() can _additionally_ fail for the following reasons"?
> 
> Could you please be a bit more specific?
> 
> The same applies for pwritev2() below.
> 
>> +.TP
>> +.B E2BIG
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and
>> +.I iov[0]
>> +is not large enough to return the encoding metadata.
>> +.TP
>> +.B ENOBUFS
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the buffers in
>> +.I iov
>> +are not big enough to return the encoded data.
>> +.PP
>> +.BR pwritev2 ()
>> +can fail for the following reasons:
>> +.TP
>> +.B E2BIG
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and
>> +.I iov[0]
>> +contains non-zero fields
>> +after the kernel's
>> +.IR "sizeof(struct\ encoded_iov)" .
> 
> Don't escape the space, if the string is already in "".
> 
>> +.TP
>> +.B EINVAL
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the encoding is unknown or not supported by the filesystem.
>> +.TP
>> +.B EINVAL
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the alignment and/or size requirements are not met.
>>  .SH VERSIONS
>>  .BR preadv ()
>>  and
>> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
>> new file mode 100644
>> index 000000000..106fa587b
>> --- /dev/null
>> +++ b/man7/encoded_io.7
>> @@ -0,0 +1,369 @@
>> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
>> +.\"
>> +.\" %%%LICENSE_START(VERBATIM)
>> +.\" Permission is granted to make and distribute verbatim copies of this
>> +.\" manual provided the copyright notice and this permission notice are
>> +.\" preserved on all copies.
>> +.\"
>> +.\" Permission is granted to copy and distribute modified versions of this
>> +.\" manual under the conditions for verbatim copying, provided that the
>> +.\" entire resulting derived work is distributed under the terms of a
>> +.\" permission notice identical to this one.
>> +.\"
>> +.\" Since the Linux kernel and libraries are constantly changing, this
>> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
>> +.\" responsibility for errors or omissions, or for damages resulting from
>> +.\" the use of the information contained herein.  The author(s) may not
>> +.\" have taken the same level of care in the production of this manual,
>> +.\" which is licensed free of charge, as they might when working
>> +.\" professionally.
>> +.\"
>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>> +.\" the source, must acknowledge the copyright and authors of this work.
>> +.\" %%%LICENSE_END
>> +.\"
>> +.\"
>> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +encoded_io \- overview of encoded I/O
>> +.SH DESCRIPTION
>> +Several filesystems (e.g., Btrfs) support transparent encoding
>> +(e.g., compression, encryption) of data on disk:
>> +written data is encoded by the kernel before it is written to disk,
>> +and read data is decoded before being returned to the user.
>> +In some cases, it is useful to skip this encoding step.
> 
> Here I would use ';' instead of '.'
> (and next letter would be lowercase, then).
> 
>> +For example, the user may want to read the compressed contents of a file
>> +or write pre-compressed data directly to a file.
>> +This is referred to as "encoded I/O".
>> +.SS Encoded I/O API
>> +Encoded I/O is specified with the
>> +.B RWF_ENCODED
>> +flag to
>> +.BR preadv2 (2)
>> +and
>> +.BR pwritev2 (2).
>> +If
>> +.B RWF_ENCODED
>> +is specified, then
>> +.I iov[0].iov_base
>> +points to an
>> +.I
>> +encoded_iov
> 
> On the same line, please.
> 
>> +structure, defined in
>> +.I <linux/fs.h>
>> +as:
>> +.PP
>> +.in +4n
>> +.EX
>> +struct encoded_iov {
>> +    __aligned_u64 len;
>> +    __aligned_u64 unencoded_len;
>> +    __aligned_u64 unencoded_offset;
>> +    __u32 compression;
>> +    __u32 encryption;
>> +};
>> +.EE
>> +.in
>> +.PP
>> +This may be extended in the future, so
>> +.I iov[0].iov_len
>> +must be set to
>> +.I "sizeof(struct\ encoded_iov)"
>> +for forward/backward compatibility.
>> +The remaining buffers contain the encoded data.
>> +.PP
>> +.I compression
>> +and
>> +.I encryption
>> +are the encoding fields.
>> +.I compression
>> +is
>> +.B ENCODED_IOV_COMPRESSION_NONE
>> +(zero)
>> +or a filesystem-specific
>> +.B ENCODED_IOV_COMPRESSION
> 
> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/

Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/

I'm not sure about existing practice.

Michael (mtk), what would you do here?

> 
>> +constant;
>> +see
>> +.BR Filesystem\ support .
> 
> Please, write it as [.BR "Filesystem support" .]
> 
> and maybe I would change it, to be more specific, to the following:
> 
> [
> see
> .B Filesystem support
> below.
> ]
> 
> So that the reader clearly understands it's on the same page.
> 
>> +.I encryption
>> +is currently always
>> +.B ENCODED_IOV_ENCRYPTION_NONE
>> +(zero).
>> +.PP
>> +.I unencoded_len
>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>> +.I unencoded_offset
>> +is the offset into the unencoded data where the data in the file begins
> 
> The above wording is a bit unclear to me.
> 
> I suggest the following:
> 
> [
> .I unencoded_offset
> is the offset from the beginning of the file
> to the first byte of the unencoded data
> ]
> 
>> +(less than or equal to
>> +.IR unencoded_len ).
>> +.I len
>> +is the length of the data in the file
>> +(less than or equal to
>> +.I unencoded_len
>> +-
> 
> Here's a question for Michael (mtk):
> 
> I've seen (many) cases where these math operations
> are written without spaces,
> and in the same line (e.g., [.IR a + b]).
> 
> I'd like to know your preferences on this,
> or which form is actually more widespread in the manual pages,
> to stick with only one of them.
> 
>> +.IR unencoded_offset ).
>> +See
>> +.B Extent layout
>> +below for some examples.
>> +.I
> 
> Were you maybe going to add something there?
> 
> If not, please remove that [.I].
> 
>> +.PP
>> +If the unencoded data is actually longer than
>> +.IR unencoded_len ,
>> +then it is truncated;
>> +if it is shorter, then it is extended with zeroes.
>> +.PP
>> +
> 
> Please, remove that blank line.
> 
>> +.BR pwritev2 ()
> 
> Should be [.BR pwritev2 (2)]
> 
> Michael (mtk),
> 
> Am I right in that?  Please, confirm.
> 
>> +uses the metadata specified in
>> +.IR iov[0] ,
>> +writes the encoded data from the remaining buffers,
>> +and returns the number of encoded bytes written
>> +(that is, the sum of
>> +.I iov[n].iov_len
>> +for 1 <=
>> +.I n
>> +<
>> +.IR iovcnt ;
>> +partial writes will not occur).
>> +At least one encoding field must be non-zero.
>> +Note that the encoded data is not validated when it is written;
>> +if it is not valid (e.g., it cannot be decompressed),
>> +then a subsequent read may return an error.
>> +If the
>> +.I offset
>> +argument to
>> +.BR pwritev2 ()
> 
> Same as above: specify (2).
> 
>> +is -1, then the file offset is incremented by
>> +.IR len .
>> +If
>> +.I iov[0].iov_len
>> +is less than
>> +.I "sizeof(struct\ encoded_iov)"
> 
> [.I] allows spaces, so it should be:
> 
> [
> .I sizeof(struct encoded_iov)
> ]
> 
>> +in the kernel,
>> +then any fields unknown to userspace are treated as if they were zero;
> 
> s/userspace/user space/
> 
> See man-pages(7)::STYLE GUIDE::Preferred terms
> 
>> +if it is greater and any fields unknown to the kernel are non-zero,
>> +then this returns -1 and sets
>> +.I errno
>> +to
>> +.BR E2BIG .
>> +.PP
>> +.BR preadv2 ()
> 
> Same as above: specify (2).
> 
>> +populates the metadata in
>> +.IR iov[0] ,
>> +the encoded data in the remaining buffers,
>> +and returns the number of encoded bytes read.
>> +This will only return one extent per call.
>> +This can also read data which is not encoded;
>> +all encoding fields will be zero in that case.
>> +If the
>> +.I offset
>> +argument to
>> +.BR preadv2 ()
> 
> Same as above: specify (2).
> 
>> +is -1, then the file offset is incremented by
>> +.IR len .
>> +If
>> +.I iov[0].iov_len
>> +is less than
>> +.I "sizeof(struct\ encoded_iov)"
> 
> Don't need '"' nor '\', as above.
> 
>> +in the kernel and any fields unknown to userspace are non-zero,
> 
> s/userspace/user space/
> 
>> +then
>> +.BR preadv2 ()
> 
> (2)
> 
>> +returns -1 and sets
>> +.I errno
>> +to
>> +.BR E2BIG ;
>> +if it is greater,
>> +then any fields unknown to the kernel are returned as zero.
>> +If the provided buffers are not large enough to return an entire encoded
>> +extent,
> 
> Please use semantic newlines.
> I haven't checked that in the text above,
> so if you happen to find that there's any other line
> that should also be fixed in that sense, please do so.
> 
> To understand 'semantic newlines',
> please have a look at
> man-pages(7)::STYLE GUIDE::Use semantic newlines
> 
> Basically, split lines at the most natural separation point,
> instead of just when the line gets over the margin.
> 
>> +then
>> +.BR preadv2 ()
> 
> (2)
> 
>> +returns -1 and sets
>> +.I errno
>> +to
>> +.BR ENOBUFS .
>> +.PP
>> +As the filesystem page cache typically contains decoded data,
>> +encoded I/O bypasses the page cache.
>> +.SS Extent layout
>> +By using
>> +.IR len ,
>> +.IR unencoded_len ,
>> +and
>> +.IR unencoded_offset ,
>> +it is possible to refer to a subset of an unencoded extent.
>> +.PP
>> +In the simplest case,
>> +.I len
>> +is equal to
>> +.I unencoded_len
>> +and
>> +.I unencoded_offset
>> +is zero.
>> +This means that the entire unencoded extent is used.
>> +.PP
>> +However, suppose we read 50 bytes into a file
>> +which contains a single compressed extent.
>> +The filesystem must still return the entire compressed extent
>> +for us to be able to decompress it,
>> +so
>> +.I unencoded_len
>> +would be the length of the entire decompressed extent.
>> +However, because the read was at offset 50,
>> +the first 50 bytes should be ignored.
>> +Therefore,
>> +.I unencoded_offset
>> +would be 50,
>> +and
>> +.I len
>> +would accordingly be
>> +.IR unencoded_len\ -\ 50 .
> 
> This formats everything as I, except for the last dot.
> Replace by:
> 
> [
> .I unencoded_len
> - 50.
> ]
> 
> Michael (mtk), same as above:
> to space, or not to space?  That is the question :p
> 
> Personally, I find spaces more clear.
> 
>> +.PP
>> +Additionally, suppose we want to create an encrypted file with length 500,
>> +but the file is encrypted with a block cipher using a block size of 4096.
>> +The unencoded data would therefore include the appropriate padding,
>> +and
>> +.I unencoded_len
>> +would be 4096.
>> +However, to represent the logical size of the file,
>> +.I len
>> +would be 500
>> +(and
>> +.I unencoded_offset
>> +would be 0).
>> +.PP
>> +Similar situations can arise in other cases:
>> +.IP * 3
>> +If the filesystem pads data to the filesystem block size before compressing,
>> +then compressed files with a size unaligned to the filesystem block size will
>> +end with an extent with
>> +.I len
>> +<
>> +.IR unencoded_len .
>> +.IP *
>> +Extents cloned from the middle of a larger encoded extent with
>> +.B FICLONERANGE
>> +may have a non-zero
>> +.I unencoded_offset
>> +and/or
>> +.I len
>> +<
>> +.IR unencoded_len .
>> +.IP *
>> +If the middle of an encoded extent is overwritten,
>> +the filesystem may create extents with a non-zero
>> +.I unencoded_offset
>> +and/or
>> +.I len
>> +<
>> +.I unencoded_len
>> +for the parts that were not overwritten.
>> +.SS Security
>> +Encoded I/O creates the potential for some security issues:
>> +.IP * 3
>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>> +a subsequent read. Decompression algorithms are complex and may have bugs
>> +which can be exploited by maliciously crafted data.
>> +.IP *
>> +Encoded reads may return data which is not logically present in the file
>> +(see the discussion of
>> +.I len
>> +vs.
> 
> Please, s/vs./vs/
> See the reasons below:
> 
> Michael (mtk),
> 
> Here the renderer outputs a double space
> (as for separating two sentences).
> 
> Are you okay with that?
> 
> I haven't found any other "\<vs\>\.".
> However, I've found a few "\<vs\>[^\.]".
> 
>> +.I unencoded_len
>> +above).
>> +It may not be intended for this data to be readable.
>> +.PP
>> +Therefore, encoded I/O requires privilege.
>> +Namely, the
>> +.B RWF_ENCODED
>> +flag may only be used when the file was opened with the
>> +.B O_ALLOW_ENCODED
>> +flag to
>> +.BR open (2),
>> +which requires the
>> +.B CAP_SYS_ADMIN
>> +capability.
>> +The
>> +.B O_CLOEXEC
>> +flag must be specified in conjunction with
>> +.BR O_ALLOW_ENCODED .
>> +This avoids accidentally leaking the encoded I/O privilege
>> +(it is not cleared on
>> +.BR fork (2)
>> +or
>> +.BR execve (2)
>> +otherwise).
>> +If
>> +.B O_ALLOW_ENCODED
>> +without
>> +.B O_CLOEXEC
>> +is desired,
>> +.B O_CLOEXEC
>> +can be cleared afterwards with
>> +.BR fcntl (2).
>> +.BR fcntl (2)
>> +can also clear or set
>> +.B O_ALLOW_ENCODED
>> +(including without
>> +.BR O_CLOEXEC ).
>> +.SS Filesystem support
>> +Encoded I/O is supported on the following filesystems:
>> +.TP
>> +Btrfs (since Linux 5.12)
>> +.IP
>> +Btrfs supports encoded reads and writes of compressed data.
>> +The data is encoded as follows:
>> +.RS
>> +.IP * 3
>> +If
>> +.I compression
>> +is
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
>> +then the encoded data is a single zlib stream.
>> +.IP *
>> +If
>> +.I compression
>> +is
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
>> +then the encoded data is a single zstd frame compressed with the
>> +.I windowLog
>> +compression parameter set to no more than 17.
>> +.IP *
>> +If
>> +.I compression
>> +is one of
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
>> +or
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
>> +then the encoded data is compressed page by page
>> +(using the page size indicated by the name of the constant)
>> +with LZO1X
>> +and wrapped in the format documented in the Linux kernel source file
>> +.IR fs/btrfs/lzo.c .
>> +.RE
>> +.IP
>> +Additionally, there are some restrictions on
>> +.BR pwritev2 ():
> 
> (2)
> 
>> +.RS
>> +.IP * 3
>> +.I offset
>> +(or the current file offset if
>> +.I offset
>> +is -1) must be aligned to the sector size of the filesystem.
>> +.IP *
>> +.I len
>> +must be aligned to the sector size of the filesystem
>> +unless the data ends at or beyond the current end of the file.
>> +.IP *
>> +.I unencoded_len
>> +and the length of the encoded data must each be no more than 128 KiB.
>> +This limit may increase in the future.
>> +.IP *
>> +The length of the encoded data must be less than or equal to
>> +.IR unencoded_len .
>> +.IP *
>> +If using LZO, the filesystem's page size must match the compression page size.
>> +.RE
>>
> 
> Please, add a SEE ALSO section, which should at least point to
> preadv2(2) (or pwritev2(2), if you prefer):
> 
> [
> .SH SEE ALSO
> .BR preadv2 (2)
> ]
> 

* Re: [PATCH man-pages v6] Document encoded I/O
  2020-11-20 14:06     ` Alejandro Colomar (man-pages)
@ 2020-11-20 15:03       ` Alejandro Colomar (man-pages)
  2020-11-30 19:35         ` Omar Sandoval
                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Alejandro Colomar (man-pages) @ 2020-11-20 15:03 UTC (permalink / raw)
  To: Omar Sandoval, Michael Kerrisk
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team, linux-man

Hi Omar,

I found a wording of mine to be a bit confusing.
Please see below.

Thanks,

Alex

On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
> Hi Omar and Michael,
> 
> please, see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
>> Hi Omar,
>>
>> Please, see some fixes below:
>>
>> Michael, I also have some questions for you below
>> (you can grep for mtk to find those).
>>
>> Thanks,
>>
>> Alex
>>
>> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> This adds a new page, encoded_io(7), providing an overview of encoded
>>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>>> reference it.
>>>
>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>> Cc: linux-man <linux-man@vger.kernel.org>
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>> This feature is not yet upstream.
>>>
>>>  man2/fcntl.2      |  10 +-
>>>  man2/open.2       |  23 +++
>>>  man2/readv.2      |  70 +++++++++
>>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>>  create mode 100644 man7/encoded_io.7
>>>
>>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>>> index 546016617..b0d7fa2c3 100644
>>> --- a/man2/fcntl.2
>>> +++ b/man2/fcntl.2
>>> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>>>  .BR O_ASYNC ,
>>>  .BR O_DIRECT ,
>>>  .BR O_NOATIME ,
>>> +.BR O_NONBLOCK ,
>>>  and
>>> -.B O_NONBLOCK
>>> +.B O_ALLOW_ENCODED
>>>  flags.
>>>  It is not possible to change the
>>>  .BR O_DSYNC
>>> @@ -1820,6 +1821,13 @@ Attempted to clear the
>>>  flag on a file that has the append-only attribute set.
>>>  .TP
>>>  .B EPERM
>>> +Attempted to set the
>>> +.B O_ALLOW_ENCODED
>>> +flag and the calling process did not have the
>>> +.B CAP_SYS_ADMIN
>>> +capability.
>>> +.TP
>>> +.B EPERM
>>>  .I cmd
>>>  was
>>>  .BR F_ADD_SEALS ,
>>> diff --git a/man2/open.2 b/man2/open.2
>>> index f587b0d95..84697dfa8 100644
>>> --- a/man2/open.2
>>> +++ b/man2/open.2
>>> @@ -437,6 +437,16 @@ was followed by a call to
>>>  .BR fdatasync (2)).
>>>  .IR "See NOTES below" .
>>>  .TP
>>> +.B O_ALLOW_ENCODED
>>
>> The list is alphabetically sorted;
>> please, follow that
>> (O_ALLOW_ENCODED should be the first one).
>>
>>> +Open the file with encoded I/O permissions;
>>> +see
>>> +.BR encoded_io (7).
>>> +.B O_CLOEXEC
>>> +must be specified in conjunction with this flag.
>>> +The caller must have the
>>> +.B CAP_SYS_ADMIN
>>> +capability.
>>> +.TP
>>>  .B O_EXCL
>>>  Ensure that this call creates the file:
>>>  if this flag is specified in conjunction with
>>> @@ -1082,6 +1092,14 @@ is invalid
>>>  (e.g., it contains characters not permitted by the underlying filesystem).
>>>  .TP
>>>  .B EINVAL
>>> +.B O_ALLOW_ENCODED
>>> +was specified in
>>> +.IR flags ,
>>> +but
>>> +.B O_CLOEXEC
>>> +was not specified.
>>> +.TP
>>> +.B EINVAL
>>>  The final component ("basename") of
>>>  .I pathname
>>>  is invalid
>>> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>>>  The operation was prevented by a file seal; see
>>>  .BR fcntl (2).
>>>  .TP
>>> +.B EPERM
>>> +The
>>> +.B O_ALLOW_ENCODED
>>> +flag was specified, but the caller was not privileged.
>>> +.TP
>>>  .B EROFS
>>>  .I pathname
>>>  refers to a file on a read-only filesystem and write access was
>>> diff --git a/man2/readv.2 b/man2/readv.2
>>> index 5a8b74168..c9933acf0 100644
>>> --- a/man2/readv.2
>>> +++ b/man2/readv.2
>>> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>>>  However, if the
>>>  .I offset
>>>  argument is \-1, the current file offset is updated.
>>> +.TP
>>> +.BR RWF_ENCODED " (since Linux 5.12)"
>>> +Read or write encoded (e.g., compressed) data.
>>> +See
>>> +.BR encoded_io (7).
>>>  .SH RETURN VALUE
>>>  On success,
>>>  .BR readv (),
>>> @@ -283,6 +288,13 @@ than requested (see
>>>  and
>>>  .BR write (2)).
>>>  .PP
>>> +If
>>> +.B
>>> +RWF_ENCODED
>>
>> RWF_ENCODED should go in the same line as .B:
>>
>> [
>> .B RWF_ENCODED
>> ]
>>
>>> +was specified in
>>> +.IR flags ,
>>> +then the return value is the number of encoded bytes.
>>> +.PP
>>>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>>>  .SH ERRORS
>>>  The errors are as given for
>>> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>>>  .TP
>>>  .B EOPNOTSUPP
>>>  An unknown flag is specified in \fIflags\fP.
>>> +.TP
>>> +.B EOPNOTSUPP
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the filesystem does not implement encoded I/O.
>>> +.TP
>>> +.B EPERM
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the file was not opened with the
>>> +.B O_ALLOW_ENCODED
>>> +flag.
>>> +.PP
>>> +.BR preadv2 ()
>>> +can fail for the following reasons:
>>
>> The wording is a bit unclear:
>>
>> Above your additions (old text, not yours),
>> it says that some errors apply to preadv2
>> (as well as to other functions):
>>
>> [
>> ERRORS
>>        The errors are as given for read(2) and write(2).  Furthermore,
>>        preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
>>        for the same reasons as lseek(2).  Additionally, the  following
>>        errors are defined:
>>
>>        EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
>>               value.
>>
>>        EINVAL The vector count, iovcnt, is less than zero  or  greater
>>               than the permitted maximum.
>>
>>        EOPNOTSUPP
>>               An unknown flag is specified in flags.
>>
>>        EOPNOTSUPP
>>               RWF_ENCODED  is  specified  in  flags and the filesystem
>>               does not implement encoded I/O.
>>
>>        EPERM  RWF_ENCODED is specified in flags and the file  was  not
>>               opened with the O_ALLOW_ENCODED flag.
>> ]
>>
>> And then you added a line that says:
>>
>> [
>>        preadv2() can fail for the following reasons:
>> ]
>>
>> Which, if read strictly, says that [only] the following errors apply.
>>
>> Did you mean that
>> "preadv2() can _additionally_ fail for the following reasons"?
>>
>> Could you please be a bit more specific?
>>
>> The same applies for pwritev2() below.
>>
>>> +.TP
>>> +.B E2BIG
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and
>>> +.I iov[0]
>>> +is not large enough to return the encoding metadata.
>>> +.TP
>>> +.B ENOBUFS
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the buffers in
>>> +.I iov
>>> +are not big enough to return the encoded data.
>>> +.PP
>>> +.BR pwritev2 ()
>>> +can fail for the following reasons:
>>> +.TP
>>> +.B E2BIG
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and
>>> +.I iov[0]
>>> +contains non-zero fields
>>> +after the kernel's
>>> +.IR "sizeof(struct\ encoded_iov)" .
>>
>> Don't escape the space, if the string is already in "".
>>
>>> +.TP
>>> +.B EINVAL
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the encoding is unknown or not supported by the filesystem.
>>> +.TP
>>> +.B EINVAL
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the alignment and/or size requirements are not met.
>>>  .SH VERSIONS
>>>  .BR preadv ()
>>>  and
>>> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
>>> new file mode 100644
>>> index 000000000..106fa587b
>>> --- /dev/null
>>> +++ b/man7/encoded_io.7
>>> @@ -0,0 +1,369 @@
>>> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
>>> +.\"
>>> +.\" %%%LICENSE_START(VERBATIM)
>>> +.\" Permission is granted to make and distribute verbatim copies of this
>>> +.\" manual provided the copyright notice and this permission notice are
>>> +.\" preserved on all copies.
>>> +.\"
>>> +.\" Permission is granted to copy and distribute modified versions of this
>>> +.\" manual under the conditions for verbatim copying, provided that the
>>> +.\" entire resulting derived work is distributed under the terms of a
>>> +.\" permission notice identical to this one.
>>> +.\"
>>> +.\" Since the Linux kernel and libraries are constantly changing, this
>>> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
>>> +.\" responsibility for errors or omissions, or for damages resulting from
>>> +.\" the use of the information contained herein.  The author(s) may not
>>> +.\" have taken the same level of care in the production of this manual,
>>> +.\" which is licensed free of charge, as they might when working
>>> +.\" professionally.
>>> +.\"
>>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>>> +.\" the source, must acknowledge the copyright and authors of this work.
>>> +.\" %%%LICENSE_END
>>> +.\"
>>> +.\"
>>> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
>>> +.SH NAME
>>> +encoded_io \- overview of encoded I/O
>>> +.SH DESCRIPTION
>>> +Several filesystems (e.g., Btrfs) support transparent encoding
>>> +(e.g., compression, encryption) of data on disk:
>>> +written data is encoded by the kernel before it is written to disk,
>>> +and read data is decoded before being returned to the user.
>>> +In some cases, it is useful to skip this encoding step.
>>
>> Here I would use ';' instead of '.'
>> (and next letter would be lowercase, then).
>>
>>> +For example, the user may want to read the compressed contents of a file
>>> +or write pre-compressed data directly to a file.
>>> +This is referred to as "encoded I/O".
>>> +.SS Encoded I/O API
>>> +Encoded I/O is specified with the
>>> +.B RWF_ENCODED
>>> +flag to
>>> +.BR preadv2 (2)
>>> +and
>>> +.BR pwritev2 (2).
>>> +If
>>> +.B RWF_ENCODED
>>> +is specified, then
>>> +.I iov[0].iov_base
>>> +points to an
>>> +.I
>>> +encoded_iov
>>
>> On the same line, please.
>>
>>> +structure, defined in
>>> +.I <linux/fs.h>
>>> +as:
>>> +.PP
>>> +.in +4n
>>> +.EX
>>> +struct encoded_iov {
>>> +    __aligned_u64 len;
>>> +    __aligned_u64 unencoded_len;
>>> +    __aligned_u64 unencoded_offset;
>>> +    __u32 compression;
>>> +    __u32 encryption;
>>> +};
>>> +.EE
>>> +.in
>>> +.PP
>>> +This may be extended in the future, so
>>> +.I iov[0].iov_len
>>> +must be set to
>>> +.I "sizeof(struct\ encoded_iov)"
>>> +for forward/backward compatibility.
>>> +The remaining buffers contain the encoded data.
>>> +.PP
>>> +.I compression
>>> +and
>>> +.I encryption
>>> +are the encoding fields.
>>> +.I compression
>>> +is
>>> +.B ENCODED_IOV_COMPRESSION_NONE
>>> +(zero)
>>> +or a filesystem-specific
>>> +.B ENCODED_IOV_COMPRESSION
>>
>> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/
> 
> Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/
> 
> I'm not sure about existing practice.
> 
> Michael (mtk), what would you do here?
> 
>>
>>> +constant;
>>> +see
>>> +.BR Filesystem\ support .
>>
>> Please, write it as [.BR "Filesystem support" .]
>>
>> and maybe I would change it, to be more specific, to the following:
>>
>> [
>> see
>> .B Filesystem support
>> below.
>> ]
>>
>> So that the reader clearly understands it's on the same page.
>>
>>> +.I encryption
>>> +is currently always
>>> +.B ENCODED_IOV_ENCRYPTION_NONE
>>> +(zero).
>>> +.PP
>>> +.I unencoded_len
>>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>>> +.I unencoded_offset
>>> +is the offset into the unencoded data where the data in the file begins
>>
>> The above wording is a bit unclear to me.
>>
>> I suggest the following:
>>
>> [
>> .I unencoded_offset
>> is the offset from the beginning of the file
>> to the first byte of the unencoded data
>> ]

Now I've read it again, and my wording was even worse than yours.
I think yours can be understood after a few reads.

However, I'll still try to reword mine to see if I add some value:

[
.I unencoded_offset
is the offset from the first byte of the unencoded data
to the first byte of logical data.
]

If you prefer yours, or a mix, that's fine.

>>
>>> +(less than or equal to
>>> +.IR unencoded_len ).
>>> +.I len
>>> +is the length of the data in the file
>>> +(less than or equal to
>>> +.I unencoded_len
>>> +-
>>
>> Here's a question for Michael (mtk):
>>
>> I've seen (many) cases where these math operations
>> are written without spaces,
>> and in the same line (e.g., [.IR a + b]).
>>
>> I'd like to know your preferences on this,
>> or what is actually more extended in the manual pages,
>> to stick with only one of them.
>>
>>> +.IR unencoded_offset ).
>>> +See
>>> +.B Extent layout
>>> +below for some examples.
>>> +.I
>>
>> Were you maybe going to add something there?
>>
>> If not, please remove that [.I].
>>
>>> +.PP
>>> +If the unencoded data is actually longer than
>>> +.IR unencoded_len ,
>>> +then it is truncated;
>>> +if it is shorter, then it is extended with zeroes.
>>> +.PP
>>> +
>>
>> Please, remove that blank line.
>>
>>> +.BR pwritev2 ()
>>
>> Should be [.BR pwritev2 (2)]
>>
>> Michael (mtk),
>>
>> Am I right in that?  Please, confirm.
>>
>>> +uses the metadata specified in
>>> +.IR iov[0] ,
>>> +writes the encoded data from the remaining buffers,
>>> +and returns the number of encoded bytes written
>>> +(that is, the sum of
>>> +.I iov[n].iov_len
>>> +for 1 <=
>>> +.I n
>>> +<
>>> +.IR iovcnt ;
>>> +partial writes will not occur).
>>> +At least one encoding field must be non-zero.
>>> +Note that the encoded data is not validated when it is written;
>>> +if it is not valid (e.g., it cannot be decompressed),
>>> +then a subsequent read may return an error.
>>> +If the
>>> +.I offset
>>> +argument to
>>> +.BR pwritev2 ()
>>
>> Same as above: specify (2).
>>
>>> +is -1, then the file offset is incremented by
>>> +.IR len .
>>> +If
>>> +.I iov[0].iov_len
>>> +is less than
>>> +.I "sizeof(struct\ encoded_iov)"
>>
>> [.I] allows spaces, so it should be:
>>
>> [
>> .I sizeof(struct encoded_iov)
>> ]
>>
>>> +in the kernel,
>>> +then any fields unknown to userspace are treated as if they were zero;
>>
>> s/userspace/user space/
>>
>> See man-pages(7)::STYLE GUIDE::Preferred terms
>>
>>> +if it is greater and any fields unknown to the kernel are non-zero,
>>> +then this returns -1 and sets
>>> +.I errno
>>> +to
>>> +.BR E2BIG .
>>> +.PP
>>> +.BR preadv2 ()
>>
>> Same as above: specify (2).
>>
>>> +populates the metadata in
>>> +.IR iov[0] ,
>>> +the encoded data in the remaining buffers,
>>> +and returns the number of encoded bytes read.
>>> +This will only return one extent per call.
>>> +This can also read data which is not encoded;
>>> +all encoding fields will be zero in that case.
>>> +If the
>>> +.I offset
>>> +argument to
>>> +.BR preadv2 ()
>>
>> Same as above: specify (2).
>>
>>> +is -1, then the file offset is incremented by
>>> +.IR len .
>>> +If
>>> +.I iov[0].iov_len
>>> +is less than
>>> +.I "sizeof(struct\ encoded_iov)"
>>
>> Don't need '"' nor '\', as above.
>>
>>> +in the kernel and any fields unknown to userspace are non-zero,
>>
>> s/userspace/user space/
>>
>>> +then
>>> +.BR preadv2 ()
>>
>> (2)
>>
>>> +returns -1 and sets
>>> +.I errno
>>> +to
>>> +.BR E2BIG ;
>>> +if it is greater,
>>> +then any fields unknown to the kernel are returned as zero.
>>> +If the provided buffers are not large enough to return an entire encoded
>>> +extent,
>>
>> Please use semantic newlines.
>> I haven't checked that in the text above,
>> so if you happen to find that there's any other line
>> that should also be fixed in that sense, please do so.
>>
>> To understand 'semantic newlines',
>> please have a look at
>> man-pages(7)::STYLE GUIDE::Use semantic newlines
>>
>> Basically, split lines at the most natural separation point,
>> instead of just when the line gets over the margin.
>>
>>> +then
>>> +.BR preadv2 ()
>>
>> (2)
>>
>>> +returns -1 and sets
>>> +.I errno
>>> +to
>>> +.BR ENOBUFS .
>>> +.PP
>>> +As the filesystem page cache typically contains decoded data,
>>> +encoded I/O bypasses the page cache.
>>> +.SS Extent layout
>>> +By using
>>> +.IR len ,
>>> +.IR unencoded_len ,
>>> +and
>>> +.IR unencoded_offset ,
>>> +it is possible to refer to a subset of an unencoded extent.
>>> +.PP
>>> +In the simplest case,
>>> +.I len
>>> +is equal to
>>> +.I unencoded_len
>>> +and
>>> +.I unencoded_offset
>>> +is zero.
>>> +This means that the entire unencoded extent is used.
>>> +.PP
>>> +However, suppose we read 50 bytes into a file
>>> +which contains a single compressed extent.
>>> +The filesystem must still return the entire compressed extent
>>> +for us to be able to decompress it,
>>> +so
>>> +.I unencoded_len
>>> +would be the length of the entire decompressed extent.
>>> +However, because the read was at offset 50,
>>> +the first 50 bytes should be ignored.
>>> +Therefore,
>>> +.I unencoded_offset
>>> +would be 50,
>>> +and
>>> +.I len
>>> +would accordingly be
>>> +.IR unencoded_len\ -\ 50 .
>>
>> This formats everything as I, except for the last dot.
>> Replace by:
>>
>> [
>> .I unencoded_len
>> - 50.
>> ]
>>
>> Michael (mtk), same as above:
>> to space, or not to space?  That is the question :p
>>
>> Personally, I find spaces more clear.
>>
>>> +.PP
>>> +Additionally, suppose we want to create an encrypted file with length 500,
>>> +but the file is encrypted with a block cipher using a block size of 4096.
>>> +The unencoded data would therefore include the appropriate padding,
>>> +and
>>> +.I unencoded_len
>>> +would be 4096.
>>> +However, to represent the logical size of the file,
>>> +.I len
>>> +would be 500
>>> +(and
>>> +.I unencoded_offset
>>> +would be 0).
>>> +.PP
>>> +Similar situations can arise in other cases:
>>> +.IP * 3
>>> +If the filesystem pads data to the filesystem block size before compressing,
>>> +then compressed files with a size unaligned to the filesystem block size will
>>> +end with an extent with
>>> +.I len
>>> +<
>>> +.IR unencoded_len .
>>> +.IP *
>>> +Extents cloned from the middle of a larger encoded extent with
>>> +.B FICLONERANGE
>>> +may have a non-zero
>>> +.I unencoded_offset
>>> +and/or
>>> +.I len
>>> +<
>>> +.IR unencoded_len .
>>> +.IP *
>>> +If the middle of an encoded extent is overwritten,
>>> +the filesystem may create extents with a non-zero
>>> +.I unencoded_offset
>>> +and/or
>>> +.I len
>>> +<
>>> +.I unencoded_len
>>> +for the parts that were not overwritten.
>>> +.SS Security
>>> +Encoded I/O creates the potential for some security issues:
>>> +.IP * 3
>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>> +which can be exploited by maliciously crafted data.
>>> +.IP *
>>> +Encoded reads may return data which is not logically present in the file
>>> +(see the discussion of
>>> +.I len
>>> +vs.
>>
>> Please, s/vs./vs/
>> See the reasons below:
>>
>> Michael (mtk),
>>
>> Here the renderer outputs a double space
>> (as for separating two sentences).
>>
>> Are you okay with that?
>>
>> I haven't found any other "\<vs\>\.".
>> However, I've found a few "\<vs\>[^\.]".
>>
>>> +.I unencoded_len
>>> +above).
>>> +It may not be intended for this data to be readable.
>>> +.PP
>>> +Therefore, encoded I/O requires privilege.
>>> +Namely, the
>>> +.B RWF_ENCODED
>>> +flag may only be used when the file was opened with the
>>> +.B O_ALLOW_ENCODED
>>> +flag to
>>> +.BR open (2),
>>> +which requires the
>>> +.B CAP_SYS_ADMIN
>>> +capability.
>>> +The
>>> +.B O_CLOEXEC
>>> +flag must be specified in conjunction with
>>> +.BR O_ALLOW_ENCODED .
>>> +This avoids accidentally leaking the encoded I/O privilege
>>> +(it is not cleared on
>>> +.BR fork (2)
>>> +or
>>> +.BR execve (2)
>>> +otherwise).
>>> +If
>>> +.B O_ALLOW_ENCODED
>>> +without
>>> +.B O_CLOEXEC
>>> +is desired,
>>> +.B O_CLOEXEC
>>> +can be cleared afterwards with
>>> +.BR fcntl (2).
>>> +.BR fcntl (2)
>>> +can also clear or set
>>> +.B O_ALLOW_ENCODED
>>> +(including without
>>> +.BR O_CLOEXEC ).
>>> +.SS Filesystem support
>>> +Encoded I/O is supported on the following filesystems:
>>> +.TP
>>> +Btrfs (since Linux 5.12)
>>> +.IP
>>> +Btrfs supports encoded reads and writes of compressed data.
>>> +The data is encoded as follows:
>>> +.RS
>>> +.IP * 3
>>> +If
>>> +.I compression
>>> +is
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
>>> +then the encoded data is a single zlib stream.
>>> +.IP *
>>> +If
>>> +.I compression
>>> +is
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
>>> +then the encoded data is a single zstd frame compressed with the
>>> +.I windowLog
>>> +compression parameter set to no more than 17.
>>> +.IP *
>>> +If
>>> +.I compression
>>> +is one of
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
>>> +or
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
>>> +then the encoded data is compressed page by page
>>> +(using the page size indicated by the name of the constant)
>>> +with LZO1X
>>> +and wrapped in the format documented in the Linux kernel source file
>>> +.IR fs/btrfs/lzo.c .
>>> +.RE
>>> +.IP
>>> +Additionally, there are some restrictions on
>>> +.BR pwritev2 ():
>>
>> (2)
>>
>>> +.RS
>>> +.IP * 3
>>> +.I offset
>>> +(or the current file offset if
>>> +.I offset
>>> +is -1) must be aligned to the sector size of the filesystem.
>>> +.IP *
>>> +.I len
>>> +must be aligned to the sector size of the filesystem
>>> +unless the data ends at or beyond the current end of the file.
>>> +.IP *
>>> +.I unencoded_len
>>> +and the length of the encoded data must each be no more than 128 KiB.
>>> +This limit may increase in the future.
>>> +.IP *
>>> +The length of the encoded data must be less than or equal to
>>> +.IR unencoded_len .
>>> +.IP *
>>> +If using LZO, the filesystem's page size must match the compression page size.
>>> +.RE
>>>
>>
>> Please, add a SEE ALSO section, which should at least point to
>> preadv2(2) (or pwritev2(2), if you prefer):
>>
>> [
>> .SH SEE ALSO
>> .BR preadv2 (2)
>> ]
>>
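
As a worked illustration of the "Extent layout" text quoted above (the read at
offset 50 and the padded 500-byte file), here is a minimal sketch. The struct
is a local mirror of the proposed encoded_iov using plain uint64_t fields, and
describe_subrange() and layout_is_consistent() are hypothetical helpers that
exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the proposed struct; the real definition would come
 * from <linux/fs.h> once the feature is merged. */
struct encoded_iov_sketch {
    uint64_t len;
    uint64_t unencoded_len;
    uint64_t unencoded_offset;
    uint32_t compression;
    uint32_t encryption;
};

/*
 * Hypothetical helper: describe the logical data that starts `skip` bytes
 * into a decoded extent of `unencoded_len` bytes and runs to its end.
 */
struct encoded_iov_sketch describe_subrange(uint64_t unencoded_len,
                                            uint64_t skip)
{
    struct encoded_iov_sketch e = {0};

    assert(skip <= unencoded_len);
    e.unencoded_len = unencoded_len;
    e.unencoded_offset = skip;
    e.len = unencoded_len - skip;
    return e;
}

/* The two bounds stated in the page:
 * unencoded_offset <= unencoded_len and
 * len <= unencoded_len - unencoded_offset. */
int layout_is_consistent(const struct encoded_iov_sketch *e)
{
    return e->unencoded_offset <= e->unencoded_len &&
           e->len <= e->unencoded_len - e->unencoded_offset;
}
```

With a 4096-byte decoded extent read at offset 50, describe_subrange(4096, 50)
yields unencoded_offset 50 and len 4046. The padded-file case (len 500,
unencoded_len 4096, unencoded_offset 0) also satisfies both bounds.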

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag
  2020-11-19  7:02   ` Amir Goldstein
@ 2020-11-20 23:41     ` Jann Horn
  2020-11-30 19:26       ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Jann Horn @ 2020-11-20 23:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Omar Sandoval, linux-fsdevel, Linux Btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Aleksa Sarai, Linux API,
	Kernel Team

On Thu, Nov 19, 2020 at 8:03 AM Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Nov 18, 2020 at 9:18 PM Omar Sandoval <osandov@osandov.com> wrote:
> > The upcoming RWF_ENCODED operation introduces some security concerns:
> >
> > 1. Compressed writes will pass arbitrary data to decompression
> >    algorithms in the kernel.
> > 2. Compressed reads can leak truncated/hole punched data.
> >
> > Therefore, we need to require privilege for RWF_ENCODED. It's not
> > possible to do the permissions checks at the time of the read or write
> > because, e.g., io_uring submits IO from a worker thread. So, add an open
> > flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> > fcntl(). The flag is not cleared in any way on fork or exec. It must be
> > combined with O_CLOEXEC when opening to avoid accidental leaks (if
> > needed, it may be set without O_CLOEXEC by using fcntl()).
> >
> > Note that the usual issue that unknown open flags are ignored doesn't
> > really matter for O_ALLOW_ENCODED; if the kernel doesn't support
> > O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either.
[...]
> > diff --git a/fs/open.c b/fs/open.c
> > index 9af548fb841b..f2863aaf78e7 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -1040,6 +1040,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> >                 acc_mode = 0;
> >         }
> >
> > +       /*
> > +        * O_ALLOW_ENCODED must be combined with O_CLOEXEC to avoid accidentally
> > +        * leaking encoded I/O privileges.
> > +        */
> > +       if ((how->flags & (O_ALLOW_ENCODED | O_CLOEXEC)) == O_ALLOW_ENCODED)
> > +               return -EINVAL;
> > +
>
>
> dup() can also result in accidental leak.
> We could fail dup() of fd without O_CLOEXEC. Should we?
>
> If we should, then what error code should it be? We could return EPERM,
> but since we do allow clearing O_CLOEXEC or setting O_ALLOW_ENCODED
> after open, EPERM seems a tad harsh.
> EINVAL seems inappropriate because the error has nothing to do with
> input args of dup() and EBADF would also be confusing.

This seems very arbitrary to me. Sure, leaking these file descriptors
wouldn't be great, but there are plenty of other types of file
descriptors that are probably more sensitive. (Writable file
descriptors to databases, to important configuration files, to
io_uring instances, and so on.) So I don't see why this specific
feature should impose such special rules on it.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 04/11] btrfs: fix btrfs_write_check()
  2020-11-18 19:18 ` [PATCH v6 04/11] btrfs: fix btrfs_write_check() Omar Sandoval
@ 2020-11-23 17:08   ` David Sterba
  2020-11-30 19:18     ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: David Sterba @ 2020-11-23 17:08 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On Wed, Nov 18, 2020 at 11:18:11AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> btrfs_write_check() has two related bugs:
> 
> 1. It gets the iov_iter count before calling generic_write_checks(), but
>    generic_write_checks() may truncate the iov_iter.
> 2. It returns the count or negative errno as a size_t, which the callers
>    cast to an int. If the count is greater than INT_MAX, this overflows.
> 
> To fix both of these, pull the call to generic_write_checks() out of
> btrfs_write_check(), use the new iov_iter count returned from
> generic_write_checks(), and have btrfs_write_check() return 0 or a
> negative errno as an int instead of the count. This rearrangement also
> paves the way for RWF_ENCODED write support.
> 
> Fixes: f945968ff64c ("btrfs: introduce btrfs_write_check()")

The patch this fixes is still in misc-next and its commit id is unstable, so
this should rather be folded into that patch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 05/11] btrfs: fix check_data_csum() error message for direct I/O
  2020-11-18 19:18 ` [PATCH v6 05/11] btrfs: fix check_data_csum() error message for direct I/O Omar Sandoval
@ 2020-11-23 17:09   ` David Sterba
  2020-11-30 19:20     ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: David Sterba @ 2020-11-23 17:09 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On Wed, Nov 18, 2020 at 11:18:12AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to
> check_data_csum()") replaced the start parameter to check_data_csum()
> with page_offset(), but page_offset() is not meaningful for direct I/O
> pages. Bring back the start parameter.
> 
> Fixes: 1dae796aabf6 ("btrfs: inode: sink parameter start and len to check_data_csum()")

This is part of the subpage preparatory patches still in misc-next; I
can drop the part that removes the start parameter if you're going to
use it.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 04/11] btrfs: fix btrfs_write_check()
  2020-11-23 17:08   ` David Sterba
@ 2020-11-30 19:18     ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-30 19:18 UTC (permalink / raw)
  To: dsterba, linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On Mon, Nov 23, 2020 at 06:08:31PM +0100, David Sterba wrote:
> On Wed, Nov 18, 2020 at 11:18:11AM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > btrfs_write_check() has two related bugs:
> > 
> > 1. It gets the iov_iter count before calling generic_write_checks(), but
> >    generic_write_checks() may truncate the iov_iter.
> > 2. It returns the count or negative errno as a size_t, which the callers
> >    cast to an int. If the count is greater than INT_MAX, this overflows.
> > 
> > To fix both of these, pull the call to generic_write_checks() out of
> > btrfs_write_check(), use the new iov_iter count returned from
> > generic_write_checks(), and have btrfs_write_check() return 0 or a
> > negative errno as an int instead of the count. This rearrangement also
> > paves the way for RWF_ENCODED write support.
> > 
> > Fixes: f945968ff64c ("btrfs: introduce btrfs_write_check()")
> 
> The patch this fixes is still in misc-next and its commit id is unstable, so
> this should rather be folded into that patch.

Looks like you folded this in on misc-next, thanks!
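
For readers following the quoted description, the shape of the fix (status
returned as an int, byte count carried separately) can be sketched in plain C.
This is not the Btrfs code: write_check_sketch(), count_fits_in_int(), and the
`limit` parameter are hypothetical stand-ins, and the truncation only models
the statement that generic_write_checks() may shorten the iov_iter.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Nonzero if a byte count could be passed through an int without
 * overflowing; counts above INT_MAX are the problem case (bug 2). */
int count_fits_in_int(size_t count)
{
    return count <= (size_t)INT_MAX;
}

/*
 * Hypothetical stand-in for a write check.  The return value is only
 * 0 or a negative errno; the possibly truncated count is stored back
 * through *count, so callers read the count after the check (bug 1).
 */
int write_check_sketch(size_t *count, size_t limit)
{
    if (count == NULL)
        return -EINVAL;
    if (*count > limit)
        *count = limit;          /* models truncation of the iov_iter */
    assert(*count <= limit);     /* postcondition of the sketch */
    return 0;
}
```

Keeping the status and the count in separate variables means a large count is
never squeezed through an int, and the caller always sees the count as it
stands after truncation.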

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 05/11] btrfs: fix check_data_csum() error message for direct I/O
  2020-11-23 17:09   ` David Sterba
@ 2020-11-30 19:20     ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-30 19:20 UTC (permalink / raw)
  To: dsterba, linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On Mon, Nov 23, 2020 at 06:09:56PM +0100, David Sterba wrote:
> On Wed, Nov 18, 2020 at 11:18:12AM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to
> > check_data_csum()") replaced the start parameter to check_data_csum()
> > with page_offset(), but page_offset() is not meaningful for direct I/O
> > pages. Bring back the start parameter.
> > 
> > Fixes: 1dae796aabf6 ("btrfs: inode: sink parameter start and len to check_data_csum()")
> 
> This is part of the subpage preparatory patches still in misc-next; I
> can drop the part that removes the start parameter if you're going to
> use it.

To be clear, the original patch is buggy. It causes check_data_csum() to
print nonsense for checksum errors encountered during direct I/O. So,
this should probably be folded into the original patch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag
  2020-11-20 23:41     ` Jann Horn
@ 2020-11-30 19:26       ` Omar Sandoval
  2020-12-01  8:15         ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2020-11-30 19:26 UTC (permalink / raw)
  To: Jann Horn
  Cc: Amir Goldstein, linux-fsdevel, Linux Btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Aleksa Sarai, Linux API,
	Kernel Team

On Sat, Nov 21, 2020 at 12:41:23AM +0100, Jann Horn wrote:
> On Thu, Nov 19, 2020 at 8:03 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > On Wed, Nov 18, 2020 at 9:18 PM Omar Sandoval <osandov@osandov.com> wrote:
> > > The upcoming RWF_ENCODED operation introduces some security concerns:
> > >
> > > 1. Compressed writes will pass arbitrary data to decompression
> > >    algorithms in the kernel.
> > > 2. Compressed reads can leak truncated/hole punched data.
> > >
> > > Therefore, we need to require privilege for RWF_ENCODED. It's not
> > > possible to do the permissions checks at the time of the read or write
> > > because, e.g., io_uring submits IO from a worker thread. So, add an open
> > > flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> > > fcntl(). The flag is not cleared in any way on fork or exec. It must be
> > > combined with O_CLOEXEC when opening to avoid accidental leaks (if
> > > > needed, it may be set without O_CLOEXEC by using fcntl()).
> > >
> > > Note that the usual issue that unknown open flags are ignored doesn't
> > > really matter for O_ALLOW_ENCODED; if the kernel doesn't support
> > > O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either.
> [...]
> > > diff --git a/fs/open.c b/fs/open.c
> > > index 9af548fb841b..f2863aaf78e7 100644
> > > --- a/fs/open.c
> > > +++ b/fs/open.c
> > > @@ -1040,6 +1040,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> > >                 acc_mode = 0;
> > >         }
> > >
> > > +       /*
> > > +        * O_ALLOW_ENCODED must be combined with O_CLOEXEC to avoid accidentally
> > > +        * leaking encoded I/O privileges.
> > > +        */
> > > +       if ((how->flags & (O_ALLOW_ENCODED | O_CLOEXEC)) == O_ALLOW_ENCODED)
> > > +               return -EINVAL;
> > > +
> >
> >
> > dup() can also result in accidental leak.
> > We could fail dup() of fd without O_CLOEXEC. Should we?
> >
> > If we should, then what error code should it be? We could return EPERM,
> > but since we do allow to clear O_CLOEXEC or set O_ALLOW_ENCODED
> > after open, EPERM seems a tad harsh.
> > EINVAL seems inappropriate because the error has nothing to do with
> > input args of dup() and EBADF would also be confusing.
> 
> This seems very arbitrary to me. Sure, leaking these file descriptors
> wouldn't be great, but there are plenty of other types of file
> descriptors that are probably more sensitive. (Writable file
> descriptors to databases, to important configuration files, to
> io_uring instances, and so on.) So I don't see why this specific
> feature should impose such special rules on it.

I agree with Jann. I'm okay with the O_CLOEXEC-on-open requirement if it
makes people more comfortable, but I don't think we should be bending
over backwards to block it anywhere else.
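
For readers following the thread, the open-time rule being discussed can be modelled in user space. This is only a sketch under stated assumptions, not the kernel code: DEMO_O_CLOEXEC and DEMO_O_ALLOW_ENCODED are placeholder bit values invented for illustration (the real O_ALLOW_ENCODED value comes from the patch and is not in released headers), and demo_check_open_flags() is a hypothetical stand-in for the test added to build_open_flags().

```c
#include <assert.h>
#include <errno.h>

/* Placeholder bit values for illustration only; not the kernel's values. */
#define DEMO_O_CLOEXEC       0x1u
#define DEMO_O_ALLOW_ENCODED 0x2u

static_assert((DEMO_O_CLOEXEC & DEMO_O_ALLOW_ENCODED) == 0,
	      "demo flag bits must not overlap");

/*
 * Mirrors the quoted check: O_ALLOW_ENCODED on its own is rejected,
 * O_ALLOW_ENCODED together with O_CLOEXEC is accepted.
 * Returns 0 if the combination is acceptable, -EINVAL otherwise.
 */
int demo_check_open_flags(unsigned int flags)
{
	if ((flags & (DEMO_O_ALLOW_ENCODED | DEMO_O_CLOEXEC)) ==
	    DEMO_O_ALLOW_ENCODED)
		return -EINVAL;
	return 0;
}
```

The mask-and-compare form rejects exactly one combination (encoded allowed, close-on-exec absent) and leaves every other flag combination untouched.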


* Re: [PATCH man-pages v6] Document encoded I/O
  2020-11-20 15:03       ` Alejandro Colomar (man-pages)
@ 2020-11-30 19:35         ` Omar Sandoval
  2020-12-01 14:36         ` Ping: " Alejandro Colomar (man-pages)
  2020-12-01 20:12         ` Michael Kerrisk (man-pages)
  2 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-11-30 19:35 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages)
  Cc: Michael Kerrisk, linux-fsdevel, linux-btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

On Fri, Nov 20, 2020 at 04:03:44PM +0100, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> I found a wording of mine to be a bit confusing.
> Please see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
> > Hi Omar and Michael,
> > 
> > please, see below.
> > 
> > Thanks,
> > 
> > Alex
> > 
> > On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
> >> Hi Omar,
> >>
> >> Please, see some fixes below:
> >>
> >> Michael, I've also some questions for you below
> >> (you can grep for mtk to find those).
> >>
> >> Thanks,
> >>
> >> Alex

Thanks for the suggestions, I'll incorporate those into the next
submission.


* Re: [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag
  2020-11-30 19:26       ` Omar Sandoval
@ 2020-12-01  8:15         ` Amir Goldstein
  2020-12-01 20:31           ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2020-12-01  8:15 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jann Horn, linux-fsdevel, Linux Btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Aleksa Sarai, Linux API,
	Kernel Team

On Mon, Nov 30, 2020 at 9:26 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Sat, Nov 21, 2020 at 12:41:23AM +0100, Jann Horn wrote:
> > On Thu, Nov 19, 2020 at 8:03 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > On Wed, Nov 18, 2020 at 9:18 PM Omar Sandoval <osandov@osandov.com> wrote:
> > > > The upcoming RWF_ENCODED operation introduces some security concerns:
> > > >
> > > > 1. Compressed writes will pass arbitrary data to decompression
> > > >    algorithms in the kernel.
> > > > 2. Compressed reads can leak truncated/hole punched data.
> > > >
> > > > Therefore, we need to require privilege for RWF_ENCODED. It's not
> > > > possible to do the permissions checks at the time of the read or write
> > > > because, e.g., io_uring submits IO from a worker thread. So, add an open
> > > > flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> > > > fcntl(). The flag is not cleared in any way on fork or exec. It must be
> > > > combined with O_CLOEXEC when opening to avoid accidental leaks (if
> > > > needed, it may be set without O_CLOEXEC by using fcntl()).
> > > >
> > > > Note that the usual issue that unknown open flags are ignored doesn't
> > > > really matter for O_ALLOW_ENCODED; if the kernel doesn't support
> > > > O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either.
> > [...]
> > > > diff --git a/fs/open.c b/fs/open.c
> > > > index 9af548fb841b..f2863aaf78e7 100644
> > > > --- a/fs/open.c
> > > > +++ b/fs/open.c
> > > > @@ -1040,6 +1040,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> > > >                 acc_mode = 0;
> > > >         }
> > > >
> > > > +       /*
> > > > +        * O_ALLOW_ENCODED must be combined with O_CLOEXEC to avoid accidentally
> > > > +        * leaking encoded I/O privileges.
> > > > +        */
> > > > +       if ((how->flags & (O_ALLOW_ENCODED | O_CLOEXEC)) == O_ALLOW_ENCODED)
> > > > +               return -EINVAL;
> > > > +
> > >
> > >
> > > dup() can also result in accidental leak.
> > > We could fail dup() of fd without O_CLOEXEC. Should we?
> > >
> > > If we should, then what error code should it be? We could return EPERM,
> > > but since we do allow to clear O_CLOEXEC or set O_ALLOW_ENCODED
> > > after open, EPERM seems a tad harsh.
> > > EINVAL seems inappropriate because the error has nothing to do with
> > > input args of dup() and EBADF would also be confusing.
> >
> > This seems very arbitrary to me. Sure, leaking these file descriptors
> > wouldn't be great, but there are plenty of other types of file
> > descriptors that are probably more sensitive. (Writable file
> > descriptors to databases, to important configuration files, to
> > io_uring instances, and so on.) So I don't see why this specific
> > feature should impose such special rules on it.
>
> I agree with Jann. I'm okay with the O_CLOEXEC-on-open requirement if it
> makes people more comfortable, but I don't think we should be bending
> over backwards to block it anywhere else.

I'm fine with or without the O_CLOEXEC-on-open requirement.
Just pointing out the weirdness.

Thanks,
Amir.
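
The dup() behaviour raised in this subthread can be observed with existing interfaces, independently of this series. The sketch below assumes nothing from the patch: it marks a pipe descriptor close-on-exec, duplicates it with dup() and with fcntl(F_DUPFD_CLOEXEC), and reports the close-on-exec state of each. The helper names fd_is_cloexec() and demo_dup_cloexec() are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd has close-on-exec set, 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) != 0;
}

/*
 * Creates a close-on-exec descriptor, duplicates it two ways, and stores
 * the close-on-exec state of the original and of each duplicate.
 * Returns 0 on success, -1 on error.
 */
int demo_dup_cloexec(int *orig, int *via_dup, int *via_dupfd_cloexec)
{
	int pipefd[2];
	int d1 = -1, d2 = -1;
	int ret = -1;

	assert(orig && via_dup && via_dupfd_cloexec);

	if (pipe(pipefd) < 0)
		return -1;
	if (fcntl(pipefd[0], F_SETFD, FD_CLOEXEC) < 0)
		goto out;

	d1 = dup(pipefd[0]);                       /* plain duplicate */
	d2 = fcntl(pipefd[0], F_DUPFD_CLOEXEC, 0); /* duplicate with the flag set */
	if (d1 < 0 || d2 < 0)
		goto out;

	*orig = fd_is_cloexec(pipefd[0]);
	*via_dup = fd_is_cloexec(d1);
	*via_dupfd_cloexec = fd_is_cloexec(d2);
	ret = 0;
out:
	if (d1 >= 0)
		close(d1);
	if (d2 >= 0)
		close(d2);
	close(pipefd[0]);
	close(pipefd[1]);
	return ret;
}
```

Close-on-exec is a per-descriptor flag, so the dup() copy starts without it. File status flags, by contrast, belong to the open file description, so a status flag such as the proposed O_ALLOW_ENCODED would presumably be visible through both descriptors; that part is an inference from how F_SETFL flags behave, not something tested here.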


* Ping: [PATCH man-pages v6] Document encoded I/O
  2020-11-20 15:03       ` Alejandro Colomar (man-pages)
  2020-11-30 19:35         ` Omar Sandoval
@ 2020-12-01 14:36         ` Alejandro Colomar (man-pages)
  2020-12-01 20:12         ` Michael Kerrisk (man-pages)
  2 siblings, 0 replies; 43+ messages in thread
From: Alejandro Colomar (man-pages) @ 2020-12-01 14:36 UTC (permalink / raw)
  To: Omar Sandoval, Michael Kerrisk
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team, linux-man

Hello Michael,

Could you please have a look at a few doubts down there?
Just grep 'mtk' and you'll find them ;)

Thanks,

Alex

On 11/20/20 4:03 PM, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> I found a wording of mine to be a bit confusing.
> Please see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
>> Hi Omar and Michael,
>>
>> please, see below.
>>
>> Thanks,
>>
>> Alex
>>
>> On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
>>> Hi Omar,
>>>
>>> Please, see some fixes below:
>>>
>>> Michael, I've also some questions for you below
>>> (you can grep for mtk to find those).
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>>>> From: Omar Sandoval <osandov@fb.com>
>>>>
>>>> This adds a new page, encoded_io(7), providing an overview of encoded
>>>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>>>> reference it.
>>>>
>>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> Cc: linux-man <linux-man@vger.kernel.org>
>>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>>> ---
>>>> This feature is not yet upstream.
>>>>
>>>>  man2/fcntl.2      |  10 +-
>>>>  man2/open.2       |  23 +++
>>>>  man2/readv.2      |  70 +++++++++
>>>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>>>  create mode 100644 man7/encoded_io.7
>>>>
>>>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>>>> index 546016617..b0d7fa2c3 100644
>>>> --- a/man2/fcntl.2
>>>> +++ b/man2/fcntl.2
>>>> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>>>>  .BR O_ASYNC ,
>>>>  .BR O_DIRECT ,
>>>>  .BR O_NOATIME ,
>>>> +.BR O_NONBLOCK ,
>>>>  and
>>>> -.B O_NONBLOCK
>>>> +.B O_ALLOW_ENCODED
>>>>  flags.
>>>>  It is not possible to change the
>>>>  .BR O_DSYNC
>>>> @@ -1820,6 +1821,13 @@ Attempted to clear the
>>>>  flag on a file that has the append-only attribute set.
>>>>  .TP
>>>>  .B EPERM
>>>> +Attempted to set the
>>>> +.B O_ALLOW_ENCODED
>>>> +flag and the calling process did not have the
>>>> +.B CAP_SYS_ADMIN
>>>> +capability.
>>>> +.TP
>>>> +.B EPERM
>>>>  .I cmd
>>>>  was
>>>>  .BR F_ADD_SEALS ,
>>>> diff --git a/man2/open.2 b/man2/open.2
>>>> index f587b0d95..84697dfa8 100644
>>>> --- a/man2/open.2
>>>> +++ b/man2/open.2
>>>> @@ -437,6 +437,16 @@ was followed by a call to
>>>>  .BR fdatasync (2)).
>>>>  .IR "See NOTES below" .
>>>>  .TP
>>>> +.B O_ALLOW_ENCODED
>>>
>>> The list is alphabetically sorted;
>>> please, follow that
>>> (O_ALLOW_ENCODED should be the first one).
>>>
>>>> +Open the file with encoded I/O permissions;
>>>> +see
>>>> +.BR encoded_io (7).
>>>> +.B O_CLOEXEC
>>>> +must be specified in conjunction with this flag.
>>>> +The caller must have the
>>>> +.B CAP_SYS_ADMIN
>>>> +capability.
>>>> +.TP
>>>>  .B O_EXCL
>>>>  Ensure that this call creates the file:
>>>>  if this flag is specified in conjunction with
>>>> @@ -1082,6 +1092,14 @@ is invalid
>>>>  (e.g., it contains characters not permitted by the underlying filesystem).
>>>>  .TP
>>>>  .B EINVAL
>>>> +.B O_ALLOW_ENCODED
>>>> +was specified in
>>>> +.IR flags ,
>>>> +but
>>>> +.B O_CLOEXEC
>>>> +was not specified.
>>>> +.TP
>>>> +.B EINVAL
>>>>  The final component ("basename") of
>>>>  .I pathname
>>>>  is invalid
>>>> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>>>>  The operation was prevented by a file seal; see
>>>>  .BR fcntl (2).
>>>>  .TP
>>>> +.B EPERM
>>>> +The
>>>> +.B O_ALLOW_ENCODED
>>>> +flag was specified, but the caller was not privileged.
>>>> +.TP
>>>>  .B EROFS
>>>>  .I pathname
>>>>  refers to a file on a read-only filesystem and write access was
>>>> diff --git a/man2/readv.2 b/man2/readv.2
>>>> index 5a8b74168..c9933acf0 100644
>>>> --- a/man2/readv.2
>>>> +++ b/man2/readv.2
>>>> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>>>>  However, if the
>>>>  .I offset
>>>>  argument is \-1, the current file offset is updated.
>>>> +.TP
>>>> +.BR RWF_ENCODED " (since Linux 5.12)"
>>>> +Read or write encoded (e.g., compressed) data.
>>>> +See
>>>> +.BR encoded_io (7).
>>>>  .SH RETURN VALUE
>>>>  On success,
>>>>  .BR readv (),
>>>> @@ -283,6 +288,13 @@ than requested (see
>>>>  and
>>>>  .BR write (2)).
>>>>  .PP
>>>> +If
>>>> +.B
>>>> +RWF_ENCODED
>>>
>>> RWF_ENCODED should go in the same line as .B:
>>>
>>> [
>>> .B RWF_ENCODED
>>> ]
>>>
>>>> +was specified in
>>>> +.IR flags ,
>>>> +then the return value is the number of encoded bytes.
>>>> +.PP
>>>>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>>>>  .SH ERRORS
>>>>  The errors are as given for
>>>> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>>>>  .TP
>>>>  .B EOPNOTSUPP
>>>>  An unknown flag is specified in \fIflags\fP.
>>>> +.TP
>>>> +.B EOPNOTSUPP
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the filesystem does not implement encoded I/O.
>>>> +.TP
>>>> +.B EPERM
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the file was not opened with the
>>>> +.B O_ALLOW_ENCODED
>>>> +flag.
>>>> +.PP
>>>> +.BR preadv2 ()
>>>> +can fail for the following reasons:
>>>
>>> The wording is a bit unclear:
>>>
>>> Above your additions (old text, not yours),
>>> it says that some errors apply to preadv2
>>> (as well as to other functions):
>>>
>>> [
>>> ERRORS
>>>        The errors are as given for read(2) and write(2).  Furthermore,
>>>        preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
>>>        for the same reasons as lseek(2).  Additionally, the  following
>>>        errors are defined:
>>>
>>>        EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
>>>               value.
>>>
>>>        EINVAL The vector count, iovcnt, is less than zero  or  greater
>>>               than the permitted maximum.
>>>
>>>        EOPNOTSUPP
>>>               An unknown flag is specified in flags.
>>>
>>>        EOPNOTSUPP
>>>               RWF_ENCODED  is  specified  in  flags and the filesystem
>>>               does not implement encoded I/O.
>>>
>>>        EPERM  RWF_ENCODED is specified in flags and the file  was  not
>>>               opened with the O_ALLOW_ENCODED flag.
>>> ]
>>>
>>> And then you added a line that says:
>>>
>>> [
>>>        preadv2() can fail for the following reasons:
>>> ]
>>>
>>> Which if read strictly, it says that [only] the following errors apply.
>>>
>>> Did you mean that
>>> "preadv2() can _additionally_ fail for the following reasons"?
>>>
>>> Could you please be a bit more specific?
>>>
>>> The same applies for pwritev2() below.
>>>
>>>> +.TP
>>>> +.B E2BIG
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and
>>>> +.I iov[0]
>>>> +is not large enough to return the encoding metadata.
>>>> +.TP
>>>> +.B ENOBUFS
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the buffers in
>>>> +.I iov
>>>> +are not big enough to return the encoded data.
>>>> +.PP
>>>> +.BR pwritev2 ()
>>>> +can fail for the following reasons:
>>>> +.TP
>>>> +.B E2BIG
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and
>>>> +.I iov[0]
>>>> +contains non-zero fields
>>>> +after the kernel's
>>>> +.IR "sizeof(struct\ encoded_iov)" .
>>>
>>> Don't escape the space, if the string is already in "".
>>>
>>>> +.TP
>>>> +.B EINVAL
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the encoding is unknown or not supported by the filesystem.
>>>> +.TP
>>>> +.B EINVAL
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the alignment and/or size requirements are not met.
>>>>  .SH VERSIONS
>>>>  .BR preadv ()
>>>>  and
>>>> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
>>>> new file mode 100644
>>>> index 000000000..106fa587b
>>>> --- /dev/null
>>>> +++ b/man7/encoded_io.7
>>>> @@ -0,0 +1,369 @@
>>>> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
>>>> +.\"
>>>> +.\" %%%LICENSE_START(VERBATIM)
>>>> +.\" Permission is granted to make and distribute verbatim copies of this
>>>> +.\" manual provided the copyright notice and this permission notice are
>>>> +.\" preserved on all copies.
>>>> +.\"
>>>> +.\" Permission is granted to copy and distribute modified versions of this
>>>> +.\" manual under the conditions for verbatim copying, provided that the
>>>> +.\" entire resulting derived work is distributed under the terms of a
>>>> +.\" permission notice identical to this one.
>>>> +.\"
>>>> +.\" Since the Linux kernel and libraries are constantly changing, this
>>>> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
>>>> +.\" responsibility for errors or omissions, or for damages resulting from
>>>> +.\" the use of the information contained herein.  The author(s) may not
>>>> +.\" have taken the same level of care in the production of this manual,
>>>> +.\" which is licensed free of charge, as they might when working
>>>> +.\" professionally.
>>>> +.\"
>>>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>>>> +.\" the source, must acknowledge the copyright and authors of this work.
>>>> +.\" %%%LICENSE_END
>>>> +.\"
>>>> +.\"
>>>> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
>>>> +.SH NAME
>>>> +encoded_io \- overview of encoded I/O
>>>> +.SH DESCRIPTION
>>>> +Several filesystems (e.g., Btrfs) support transparent encoding
>>>> +(e.g., compression, encryption) of data on disk:
>>>> +written data is encoded by the kernel before it is written to disk,
>>>> +and read data is decoded before being returned to the user.
>>>> +In some cases, it is useful to skip this encoding step.
>>>
>>> Here I would use ';' instead of '.'
>>> (and next letter would be lowercase, then).
>>>
>>>> +For example, the user may want to read the compressed contents of a file
>>>> +or write pre-compressed data directly to a file.
>>>> +This is referred to as "encoded I/O".
>>>> +.SS Encoded I/O API
>>>> +Encoded I/O is specified with the
>>>> +.B RWF_ENCODED
>>>> +flag to
>>>> +.BR preadv2 (2)
>>>> +and
>>>> +.BR pwritev2 (2).
>>>> +If
>>>> +.B RWF_ENCODED
>>>> +is specified, then
>>>> +.I iov[0].iov_base
>>>> +points to an
>>>> +.I
>>>> +encoded_iov
>>>
>>> On the same line, please.
>>>
>>>> +structure, defined in
>>>> +.I <linux/fs.h>
>>>> +as:
>>>> +.PP
>>>> +.in +4n
>>>> +.EX
>>>> +struct encoded_iov {
>>>> +    __aligned_u64 len;
>>>> +    __aligned_u64 unencoded_len;
>>>> +    __aligned_u64 unencoded_offset;
>>>> +    __u32 compression;
>>>> +    __u32 encryption;
>>>> +};
>>>> +.EE
>>>> +.in
>>>> +.PP
>>>> +This may be extended in the future, so
>>>> +.I iov[0].iov_len
>>>> +must be set to
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>> +for forward/backward compatibility.
>>>> +The remaining buffers contain the encoded data.
>>>> +.PP
>>>> +.I compression
>>>> +and
>>>> +.I encryption
>>>> +are the encoding fields.
>>>> +.I compression
>>>> +is
>>>> +.B ENCODED_IOV_COMPRESSION_NONE
>>>> +(zero)
>>>> +or a filesystem-specific
>>>> +.B ENCODED_IOV_COMPRESSION
>>>
>>> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/
>>
>> Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/
>>
>> I'm not sure about existing practice.
>>
>> Michael (mtk), what would you do here?
>>
>>>
>>>> +constant;
>>>> +see
>>>> +.BR Filesystem\ support .
>>>
>>> Please, write it as [.BR "Filesystem support" .]
>>>
>>> and maybe I would change it, to be more specific, to the following:
>>>
>>> [
>>> see
>>> .B Filesystem support
>>> below.
>>> ]
>>>
>>> So that the reader clearly understands it's on the same page.
>>>
>>>> +.I encryption
>>>> +is currently always
>>>> +.B ENCODED_IOV_ENCRYPTION_NONE
>>>> +(zero).
>>>> +.PP
>>>> +.I unencoded_len
>>>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>>>> +.I unencoded_offset
>>>> +is the offset into the unencoded data where the data in the file begins
>>>
>>> The above wording is a bit unclear to me.
>>>
>>> I suggest the following:
>>>
>>> [
>>> .I unencoded_offset
>>> is the offset from the beginning of the file
>>> to the first byte of the unencoded data
>>> ]
> 
> Now I've read it again, and my wording was even worse than yours.
> I think yours can be understood after a few reads.
> 
> However, I'll still try to reword mine to see if I add some value:
> 
> [
> .I unencoded_offset
> is the offset from the first byte of the unencoded data
> to the first byte of logical data.
> ]
> 
> If you prefer yours, or a mix, that's fine.
> 
>>>
>>>> +(less than or equal to
>>>> +.IR unencoded_len ).
>>>> +.I len
>>>> +is the length of the data in the file
>>>> +(less than or equal to
>>>> +.I unencoded_len
>>>> +-
>>>
>>> Here's a question for Michael (mtk):
>>>
>>> I've seen (many) cases where these math operations
>>> are written without spaces,
>>> and in the same line (e.g., [.IR a + b]).
>>>
>>> I'd like to know your preferences on this,
>>> or what is actually more extended in the manual pages,
>>> to stick with only one of them.
>>>
>>>> +.IR unencoded_offset ).
>>>> +See
>>>> +.B Extent layout
>>>> +below for some examples.
>>>> +.I
>>>
>>> Were you maybe going to add something there?
>>>
>>> If not, please remove that [.I].
>>>
>>>> +.PP
>>>> +If the unencoded data is actually longer than
>>>> +.IR unencoded_len ,
>>>> +then it is truncated;
>>>> +if it is shorter, then it is extended with zeroes.
>>>> +.PP
>>>> +
>>>
>>> Please, remove that blank line.
>>>
>>>> +.BR pwritev2 ()
>>>
>>> Should be [.BR pwritev2 (2)]
>>>
>>> Michael (mtk),
>>>
>>> Am I right in that?  Please, confirm.
>>>
>>>> +uses the metadata specified in
>>>> +.IR iov[0] ,
>>>> +writes the encoded data from the remaining buffers,
>>>> +and returns the number of encoded bytes written
>>>> +(that is, the sum of
>>>> +.I iov[n].iov_len
>>>> +for 1 <=
>>>> +.I n
>>>> +<
>>>> +.IR iovcnt ;
>>>> +partial writes will not occur).
>>>> +At least one encoding field must be non-zero.
>>>> +Note that the encoded data is not validated when it is written;
>>>> +if it is not valid (e.g., it cannot be decompressed),
>>>> +then a subsequent read may return an error.
>>>> +If the
>>>> +.I offset
>>>> +argument to
>>>> +.BR pwritev2 ()
>>>
>>> Same as above: specify (2).
>>>
>>>> +is -1, then the file offset is incremented by
>>>> +.IR len .
>>>> +If
>>>> +.I iov[0].iov_len
>>>> +is less than
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>
>>> [.I] allows spaces, so it should be:
>>>
>>> [
>>> .I sizeof(struct encoded_iov)
>>> ]
>>>
>>>> +in the kernel,
>>>> +then any fields unknown to userspace are treated as if they were zero;
>>>
>>> s/userspace/user space/
>>>
>>> See man-pages(7)::STYLE GUIDE::Preferred terms
>>>
>>>> +if it is greater and any fields unknown to the kernel are non-zero,
>>>> +then this returns -1 and sets
>>>> +.I errno
>>>> +to
>>>> +.BR E2BIG .
>>>> +.PP
>>>> +.BR preadv2 ()
>>>
>>> Same as above: specify (2).
>>>
>>>> +populates the metadata in
>>>> +.IR iov[0] ,
>>>> +the encoded data in the remaining buffers,
>>>> +and returns the number of encoded bytes read.
>>>> +This will only return one extent per call.
>>>> +This can also read data which is not encoded;
>>>> +all encoding fields will be zero in that case.
>>>> +If the
>>>> +.I offset
>>>> +argument to
>>>> +.BR preadv2 ()
>>>
>>> Same as above: specify (2).
>>>
>>>> +is -1, then the file offset is incremented by
>>>> +.IR len .
>>>> +If
>>>> +.I iov[0].iov_len
>>>> +is less than
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>
>>> Don't need '"' nor '\', as above.
>>>
>>>> +in the kernel and any fields unknown to userspace are non-zero,
>>>
>>> s/userspace/user space/
>>>
>>>> +then
>>>> +.BR preadv2 ()
>>>
>>> (2)
>>>
>>>> +returns -1 and sets
>>>> +.I errno
>>>> +to
>>>> +.BR E2BIG ;
>>>> +if it is greater,
>>>> +then any fields unknown to the kernel are returned as zero.
>>>> +If the provided buffers are not large enough to return an entire encoded
>>>> +extent,
>>>
>>> Please use semantic newlines.
>>> I haven't checked that in the text above,
>>> so if you happen to find that there's any other line
>>> that should also be fixed in that sense, please do so.
>>>
>>> To understand 'semantic newlines',
>>> please have a look at
>>> man-pages(7)::STYLE GUIDE::Use semantic newlines
>>>
>>> Basically, split lines at the most natural separation point,
>>> instead of just when the line gets over the margin.
>>>
>>>> +then
>>>> +.BR preadv2 ()
>>>
>>> (2)
>>>
>>>> +returns -1 and sets
>>>> +.I errno
>>>> +to
>>>> +.BR ENOBUFS .
>>>> +.PP
>>>> +As the filesystem page cache typically contains decoded data,
>>>> +encoded I/O bypasses the page cache.
>>>> +.SS Extent layout
>>>> +By using
>>>> +.IR len ,
>>>> +.IR unencoded_len ,
>>>> +and
>>>> +.IR unencoded_offset ,
>>>> +it is possible to refer to a subset of an unencoded extent.
>>>> +.PP
>>>> +In the simplest case,
>>>> +.I len
>>>> +is equal to
>>>> +.I unencoded_len
>>>> +and
>>>> +.I unencoded_offset
>>>> +is zero.
>>>> +This means that the entire unencoded extent is used.
>>>> +.PP
>>>> +However, suppose we read 50 bytes into a file
>>>> +which contains a single compressed extent.
>>>> +The filesystem must still return the entire compressed extent
>>>> +for us to be able to decompress it,
>>>> +so
>>>> +.I unencoded_len
>>>> +would be the length of the entire decompressed extent.
>>>> +However, because the read was at offset 50,
>>>> +the first 50 bytes should be ignored.
>>>> +Therefore,
>>>> +.I unencoded_offset
>>>> +would be 50,
>>>> +and
>>>> +.I len
>>>> +would accordingly be
>>>> +.IR unencoded_len\ -\ 50 .
>>>
>>> This formats everything as I, except for the last dot.
>>> Replace by:
>>>
>>> [
>>> .I unencoded_len
>>> - 50.
>>> ]
>>>
>>> Michael (mtk), same as above:
>>> to space, or not to space?  That is the question :p
>>>
>>> Personally, I find spaces more clear.
>>>
>>>> +.PP
>>>> +Additionally, suppose we want to create an encrypted file with length 500,
>>>> +but the file is encrypted with a block cipher using a block size of 4096.
>>>> +The unencoded data would therefore include the appropriate padding,
>>>> +and
>>>> +.I unencoded_len
>>>> +would be 4096.
>>>> +However, to represent the logical size of the file,
>>>> +.I len
>>>> +would be 500
>>>> +(and
>>>> +.I unencoded_offset
>>>> +would be 0).
>>>> +.PP
>>>> +Similar situations can arise in other cases:
>>>> +.IP * 3
>>>> +If the filesystem pads data to the filesystem block size before compressing,
>>>> +then compressed files with a size unaligned to the filesystem block size will
>>>> +end with an extent with
>>>> +.I len
>>>> +<
>>>> +.IR unencoded_len .
>>>> +.IP *
>>>> +Extents cloned from the middle of a larger encoded extent with
>>>> +.B FICLONERANGE
>>>> +may have a non-zero
>>>> +.I unencoded_offset
>>>> +and/or
>>>> +.I len
>>>> +<
>>>> +.IR unencoded_len .
>>>> +.IP *
>>>> +If the middle of an encoded extent is overwritten,
>>>> +the filesystem may create extents with a non-zero
>>>> +.I unencoded_offset
>>>> +and/or
>>>> +.I len
>>>> +<
>>>> +.I unencoded_len
>>>> +for the parts that were not overwritten.
>>>> +.SS Security
>>>> +Encoded I/O creates the potential for some security issues:
>>>> +.IP * 3
>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>> +which can be exploited by maliciously crafted data.
>>>> +.IP *
>>>> +Encoded reads may return data which is not logically present in the file
>>>> +(see the discussion of
>>>> +.I len
>>>> +vs.
>>>
>>> Please, s/vs./vs/
>>> See the reasons below:
>>>
>>> Michael (mtk),
>>>
>>> Here the renderer outputs a double space
>>> (as for separating two sentences).
>>>
>>> Are you okay with that?
>>>
>>> I haven't found any other "\<vs\>\.".
>>> However, I've found a few "\<vs\>[^\.]".
>>>
>>>> +.I unencoded_len
>>>> +above).
>>>> +It may not be intended for this data to be readable.
>>>> +.PP
>>>> +Therefore, encoded I/O requires privilege.
>>>> +Namely, the
>>>> +.B RWF_ENCODED
>>>> +flag may only be used when the file was opened with the
>>>> +.B O_ALLOW_ENCODED
>>>> +flag to
>>>> +.BR open (2),
>>>> +which requires the
>>>> +.B CAP_SYS_ADMIN
>>>> +capability.
>>>> +The
>>>> +.B O_CLOEXEC
>>>> +flag must be specified in conjunction with
>>>> +.BR O_ALLOW_ENCODED .
>>>> +This avoids accidentally leaking the encoded I/O privilege
>>>> +(it is not cleared on
>>>> +.BR fork (2)
>>>> +or
>>>> +.BR execve (2)
>>>> +otherwise).
>>>> +If
>>>> +.B O_ALLOW_ENCODED
>>>> +without
>>>> +.B O_CLOEXEC
>>>> +is desired,
>>>> +.B O_CLOEXEC
>>>> +can be cleared afterwards with
>>>> +.BR fcntl (2).
>>>> +.BR fcntl (2)
>>>> +can also clear or set
>>>> +.B O_ALLOW_ENCODED
>>>> +(including without
>>>> +.BR O_CLOEXEC ).
>>>> +.SS Filesystem support
>>>> +Encoded I/O is supported on the following filesystems:
>>>> +.TP
>>>> +Btrfs (since Linux 5.12)
>>>> +.IP
>>>> +Btrfs supports encoded reads and writes of compressed data.
>>>> +The data is encoded as follows:
>>>> +.RS
>>>> +.IP * 3
>>>> +If
>>>> +.I compression
>>>> +is
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
>>>> +then the encoded data is a single zlib stream.
>>>> +.IP *
>>>> +If
>>>> +.I compression
>>>> +is
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
>>>> +then the encoded data is a single zstd frame compressed with the
>>>> +.I windowLog
>>>> +compression parameter set to no more than 17.
>>>> +.IP *
>>>> +If
>>>> +.I compression
>>>> +is one of
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
>>>> +or
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
>>>> +then the encoded data is compressed page by page
>>>> +(using the page size indicated by the name of the constant)
>>>> +with LZO1X
>>>> +and wrapped in the format documented in the Linux kernel source file
>>>> +.IR fs/btrfs/lzo.c .
>>>> +.RE
>>>> +.IP
>>>> +Additionally, there are some restrictions on
>>>> +.BR pwritev2 ():
>>>
>>> (2)
>>>
>>>> +.RS
>>>> +.IP * 3
>>>> +.I offset
>>>> +(or the current file offset if
>>>> +.I offset
>>>> +is -1) must be aligned to the sector size of the filesystem.
>>>> +.IP *
>>>> +.I len
>>>> +must be aligned to the sector size of the filesystem
>>>> +unless the data ends at or beyond the current end of the file.
>>>> +.IP *
>>>> +.I unencoded_len
>>>> +and the length of the encoded data must each be no more than 128 KiB.
>>>> +This limit may increase in the future.
>>>> +.IP *
>>>> +The length of the encoded data must be less than or equal to
>>>> +.IR unencoded_len .
>>>> +.IP *
>>>> +If using LZO, the filesystem's page size must match the compression page size.
>>>> +.RE
>>>>
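
The Btrfs pwritev2() restrictions listed above can be collected into one predicate. This is a sketch of the documented rules only, not of the Btrfs implementation: demo_btrfs_encoded_write_ok() and DEMO_MAX_ENCODED are hypothetical names, the LZO page-size rule is left out, and overflow of offset + len is not handled. The -EINVAL return follows the EINVAL entry in the readv.2 hunk for unmet alignment or size requirements.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ENCODED (128u * 1024u) /* 128 KiB limit from the text */

/*
 * Returns 0 if the parameters satisfy the documented restrictions,
 * -EINVAL otherwise. `offset` is the write position, `file_size` the
 * current end of file, `encoded_len` the total length of the data buffers.
 */
int demo_btrfs_encoded_write_ok(uint64_t offset, uint64_t file_size,
				uint64_t sectorsize, uint64_t len,
				uint64_t unencoded_len, uint64_t encoded_len)
{
	assert(sectorsize != 0);

	/* offset must be sector-aligned */
	if (offset % sectorsize != 0)
		return -EINVAL;
	/* len must be aligned unless the data ends at or beyond EOF */
	if (len % sectorsize != 0 && offset + len < file_size)
		return -EINVAL;
	/* both lengths are capped at 128 KiB */
	if (unencoded_len > DEMO_MAX_ENCODED || encoded_len > DEMO_MAX_ENCODED)
		return -EINVAL;
	/* encoded data may not be larger than the unencoded data */
	if (encoded_len > unencoded_len)
		return -EINVAL;
	return 0;
}
```

A caller would run such a check before building the iovec array, so that a misaligned request is caught in user space instead of costing a system call.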
>>>
>>> Please, add a SEE ALSO section, which should at least point to
>>> preadv2(2) (or pwritev2(2), if you prefer):
>>>
>>> [
>>> .SH SEE ALSO
>>> .BR preadv2 (2)
>>> ]
>>>
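
The size-compatibility rule the quoted page gives for iov[0].iov_len on pwritev2() (a shorter user structure is zero-extended; a longer one is accepted only if the extra bytes are zero, otherwise E2BIG) can be sketched in plain C. The helper demo_copy_struct() is hypothetical and works on ordinary memory; it is meant to resemble the extensible-struct convention of the kernel's copy_struct_from_user() helper, and it is an assumption that the series follows that convention exactly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies a caller-supplied structure of `usize` bytes into a destination
 * that understands `ksize` bytes.
 * - usize < ksize: the missing tail is treated as zero.
 * - usize > ksize: the extra tail must be all zero, else -E2BIG.
 * Returns 0 on success.
 */
int demo_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t n = usize < ksize ? usize : ksize;
	size_t i;

	assert(dst && src);

	/* Reject non-zero bytes in fields the destination does not know. */
	for (i = ksize; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	memcpy(dst, src, n);
	memset((unsigned char *)dst + n, 0, ksize - n);
	return 0;
}
```

With this convention an older program keeps working against a newer kernel, and a newer program works against an older kernel as long as it leaves the fields that kernel does not know set to zero.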


* Re: [PATCH man-pages v6] Document encoded I/O
  2020-11-20 15:03       ` Alejandro Colomar (man-pages)
  2020-11-30 19:35         ` Omar Sandoval
  2020-12-01 14:36         ` Ping: " Alejandro Colomar (man-pages)
@ 2020-12-01 20:12         ` Michael Kerrisk (man-pages)
  2020-12-01 20:20           ` Michael Kerrisk (man-pages)
       [not found]           ` <20201201202144.ulbfnawi2ljmm6mn@localhost.localdomain>
  2 siblings, 2 replies; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-12-01 20:12 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages), Omar Sandoval
  Cc: mtk.manpages, linux-fsdevel, linux-btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

Hello Alex,

On 11/20/20 4:03 PM, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> I found a wording of mine to be a bit confusing.
> Please see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
>> Hi Omar and Michael,
>>
>> please, see below.
>>
>> Thanks,
>>
>> Alex
>>
>> On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
>>> Hi Omar,
>>>
>>> Please, see some fixes below:
>>>
>>> Michael, I've also some questions for you below
>>> (you can grep for mtk to find those).
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>>>> From: Omar Sandoval <osandov@fb.com>
>>>>
>>>> This adds a new page, encoded_io(7), providing an overview of encoded
>>>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>>>> reference it.
>>>>
>>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> Cc: linux-man <linux-man@vger.kernel.org>
>>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>>> ---
>>>> This feature is not yet upstream.
>>>>
>>>>  man2/fcntl.2      |  10 +-
>>>>  man2/open.2       |  23 +++
>>>>  man2/readv.2      |  70 +++++++++
>>>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>>>  create mode 100644 man7/encoded_io.7
>>>>
>>>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>>>> index 546016617..b0d7fa2c3 100644
>>>> --- a/man2/fcntl.2
>>>> +++ b/man2/fcntl.2
>>>> @@ -221,8 +221,9 @@ On Linux, this command can change only the

[...]

>>>> +.PP
>>>> +This may be extended in the future, so
>>>> +.I iov[0].iov_len
>>>> +must be set to
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>> +for forward/backward compatibility.
>>>> +The remaining buffers contain the encoded data.
>>>> +.PP
>>>> +.I compression
>>>> +and
>>>> +.I encryption
>>>> +are the encoding fields.
>>>> +.I compression
>>>> +is
>>>> +.B ENCODED_IOV_COMPRESSION_NONE
>>>> +(zero)
>>>> +or a filesystem-specific
>>>> +.B ENCODED_IOV_COMPRESSION
>>>
>>> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/
>>
>> Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/
>>
>> I'm not sure about existing practice.
>>
>> Michael (mtk), what would you do here?

I think I've tended towards the former
(ENCODED_IOV_COMPRESSION_*) in the past.

>>
>>>
>>>> +constant;
>>>> +see
>>>> +.BR Filesystem\ support .
>>>
>>> Please, write it as [.BR "Filesystem support" .]
>>>
>>> and maybe I would change it, to be more specific, to the following:
>>>
>>> [
>>> see
>>> .B Filesystem support
>>> below.
>>> ]
>>>
>>> So that the reader clearly understands it's on the same page.
>>>
>>>> +.I encryption
>>>> +is currently always
>>>> +.B ENCODED_IOV_ENCRYPTION_NONE
>>>> +(zero).
>>>> +.PP
>>>> +.I unencoded_len
>>>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>>>> +.I unencoded_offset
>>>> +is the offset into the unencoded data where the data in the file begins
>>>
>>> The above wording is a bit unclear to me.
>>>
>>> I suggest the following:
>>>
>>> [
>>> .I unencoded_offset
>>> is the offset from the beginning of the file
>>> to the first byte of the unencoded data
>>> ]
> 
> Now I've read it again, and my wording was even worse than yours.
> I think yours can be understood after a few reads.
> 
> However, I'll still try to reword mine to see if I add some value:
> 
> [
> .I unencoded_offset
> is the offset from the first byte of the unencoded data
> to the first byte of logical data.
> ]
> 
> If you prefer yours, or a mix, that's fine.
> 
>>>
>>>> +(less than or equal to
>>>> +.IR unencoded_len ).
>>>> +.I len
>>>> +is the length of the data in the file
>>>> +(less than or equal to
>>>> +.I unencoded_len
>>>> +-
>>>
>>> Here's a question for Michael (mtk):
>>>
>>> I've seen (many) cases where these math operations
>>> are written without spaces,
>>> and in the same line (e.g., [.IR a + b]).
>>>
>>> I'd like to know your preferences on this,
>>> or which is actually more widespread in the manual pages,
>>> to stick with only one of them.

I suspect there's a lot of inconsistency across pages. For simple
cases like this, I think writing it without spaces is fine, and
perhaps even preferable.

>>>
>>>> +.IR unencoded_offset ).
>>>> +See
>>>> +.B Extent layout
>>>> +below for some examples.
>>>> +.I
>>>
>>> Were you maybe going to add something there?
>>>
>>> If not, please remove that [.I].
>>>
>>>> +.PP
>>>> +If the unencoded data is actually longer than
>>>> +.IR unencoded_len ,
>>>> +then it is truncated;
>>>> +if it is shorter, then it is extended with zeroes.
>>>> +.PP
>>>> +
>>>
>>> Please, remove that blank line.
>>>
>>>> +.BR pwritev2 ()
>>>
>>> Should be [.BR pwritev2 (2)]
>>>
>>> Michael (mtk),
>>>
>>> Am I right in that?  Please, confirm.

Yes. References to functions documented in other pages should
include the section number in parentheses.

[...]

>>>> +.PP
>>>> +However, suppose we read 50 bytes into a file
>>>> +which contains a single compressed extent.
>>>> +The filesystem must still return the entire compressed extent
>>>> +for us to be able to decompress it,
>>>> +so
>>>> +.I unencoded_len
>>>> +would be the length of the entire decompressed extent.
>>>> +However, because the read was at offset 50,
>>>> +the first 50 bytes should be ignored.
>>>> +Therefore,
>>>> +.I unencoded_offset
>>>> +would be 50,
>>>> +and
>>>> +.I len
>>>> +would accordingly be
>>>> +.IR unencoded_len\ -\ 50 .
>>>
>>> This formats everything as I, except for the last dot.
>>> Replace by:
>>>
>>> [
>>> .I unencoded
>>> - 50.
>>> ]
>>>
>>> Michael (mtk), same as above:
>>> to space, or not to space?  That is the question :p

In this case, perhaps

.IR unencoded \-1

[...]
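
For reference, the arithmetic in the quoted example can be sketched in
C. This is only an illustration under stated assumptions:
struct encoded_layout, layout_for_read() and layout_is_valid() are
hypothetical names, not part of the proposed UAPI, and the 4096-byte
extent size in the usage note is an assumed value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three length fields discussed above. */
struct encoded_layout {
	uint64_t len;              /* length of the data in the file */
	uint64_t unencoded_len;    /* length of the decompressed extent */
	uint64_t unencoded_offset; /* bytes of decompressed data to skip */
};

/*
 * Describe a read starting `pos` bytes into a single compressed extent
 * that decompresses to `extent_len` bytes (assumed: pos <= extent_len).
 */
static struct encoded_layout layout_for_read(uint64_t extent_len, uint64_t pos)
{
	struct encoded_layout l;

	assert(pos <= extent_len);
	l.unencoded_len = extent_len;
	l.unencoded_offset = pos;
	l.len = extent_len - pos;
	return l;
}

/* The two "less than or equal to" constraints stated in the page text. */
static int layout_is_valid(const struct encoded_layout *l)
{
	return l->unencoded_offset <= l->unencoded_len &&
	       l->len <= l->unencoded_len - l->unencoded_offset;
}
```

With an assumed 4096-byte extent, layout_for_read(4096, 50) gives
unencoded_offset 50 and len 4046, i.e. unencoded_len-50, and
layout_is_valid() accepts it.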


>>>> +.SS Security
>>>> +Encoded I/O creates the potential for some security issues:
>>>> +.IP * 3
>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>> +which can be exploited by maliciously crafted data.
>>>> +.IP *
>>>> +Encoded reads may return data which is not logically present in the file
>>>> +(see the discussion of
>>>> +.I len
>>>> +vs.
>>>
>>> Please, s/vs./vs/
>>> See the reasons below:
>>>
>>> Michael (mtk),
>>>
>>> Here the renderer outputs a double space
>>> (as for separating two sentences).
>>>
>>> Are you okay with that?

Yes, that should probably be avoided. I'm not sure what the
correct way is to prevent that in groff though. I mean, one
could write

.RI "vs.\ " unencoded_len

but I think that simply creates a nonbreaking space,
which is not exactly what is desired.

[...]

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH man-pages v6] Document encoded I/O
  2020-12-01 20:12         ` Michael Kerrisk (man-pages)
@ 2020-12-01 20:20           ` Michael Kerrisk (man-pages)
  2020-12-01 21:35             ` Alejandro Colomar (man-pages)
       [not found]           ` <20201201202144.ulbfnawi2ljmm6mn@localhost.localdomain>
  1 sibling, 1 reply; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-12-01 20:20 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages), Omar Sandoval
  Cc: mtk.manpages, linux-fsdevel, linux-btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

>>>>> +.SS Security
>>>>> +Encoded I/O creates the potential for some security issues:
>>>>> +.IP * 3
>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>> +which can be exploited by maliciously crafted data.
>>>>> +.IP *
>>>>> +Encoded reads may return data which is not logically present in the file
>>>>> +(see the discussion of
>>>>> +.I len
>>>>> +vs.
>>>>
>>>> Please, s/vs./vs/
>>>> See the reasons below:
>>>>
>>>> Michael (mtk),
>>>>
>>>> Here the renderer outputs a double space
>>>> (as for separating two sentences).
>>>>
>>>> Are you okay with that?
> 
> Yes, that should probably be avoided. I'm not sure what the
> correct way is to prevent that in groff though. I mean, one
> could write
> 
> .RI "vs.\ " unencoded_len
> 
> but I think that simply creates a nonbreaking space,
> which is not exactly what is desired.

Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
we can write:

vs.\&

to prevent the double space.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag
  2020-12-01  8:15         ` Amir Goldstein
@ 2020-12-01 20:31           ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2020-12-01 20:31 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jann Horn, linux-fsdevel, Linux Btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Aleksa Sarai, Linux API,
	Kernel Team

On Tue, Dec 01, 2020 at 10:15:58AM +0200, Amir Goldstein wrote:
> On Mon, Nov 30, 2020 at 9:26 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Sat, Nov 21, 2020 at 12:41:23AM +0100, Jann Horn wrote:
> > > On Thu, Nov 19, 2020 at 8:03 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > On Wed, Nov 18, 2020 at 9:18 PM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > The upcoming RWF_ENCODED operation introduces some security concerns:
> > > > >
> > > > > 1. Compressed writes will pass arbitrary data to decompression
> > > > >    algorithms in the kernel.
> > > > > 2. Compressed reads can leak truncated/hole punched data.
> > > > >
> > > > > Therefore, we need to require privilege for RWF_ENCODED. It's not
> > > > > possible to do the permissions checks at the time of the read or write
> > > > > because, e.g., io_uring submits IO from a worker thread. So, add an open
> > > > > flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> > > > > fcntl(). The flag is not cleared in any way on fork or exec. It must be
> > > > > combined with O_CLOEXEC when opening to avoid accidental leaks (if
> > > > > needed, it may be set without O_CLOEXEC by using fcntl()).
> > > > >
> > > > > Note that the usual issue that unknown open flags are ignored doesn't
> > > > > really matter for O_ALLOW_ENCODED; if the kernel doesn't support
> > > > > O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either.
> > > [...]
> > > > > diff --git a/fs/open.c b/fs/open.c
> > > > > index 9af548fb841b..f2863aaf78e7 100644
> > > > > --- a/fs/open.c
> > > > > +++ b/fs/open.c
> > > > > @@ -1040,6 +1040,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> > > > >                 acc_mode = 0;
> > > > >         }
> > > > >
> > > > > +       /*
> > > > > +        * O_ALLOW_ENCODED must be combined with O_CLOEXEC to avoid accidentally
> > > > > +        * leaking encoded I/O privileges.
> > > > > +        */
> > > > > +       if ((how->flags & (O_ALLOW_ENCODED | O_CLOEXEC)) == O_ALLOW_ENCODED)
> > > > > +               return -EINVAL;
> > > > > +
> > > >
> > > >
> > > > dup() can also result in accidental leak.
> > > > We could fail dup() of fd without O_CLOEXEC. Should we?
> > > >
> > > > If we should, then what error code should it be? We could return EPERM,
> > > > but since we do allow clearing O_CLOEXEC or setting O_ALLOW_ENCODED
> > > > after open, EPERM seems a tad harsh.
> > > > EINVAL seems inappropriate because the error has nothing to do with
> > > > input args of dup() and EBADF would also be confusing.
> > >
> > > This seems very arbitrary to me. Sure, leaking these file descriptors
> > > wouldn't be great, but there are plenty of other types of file
> > > descriptors that are probably more sensitive. (Writable file
> > > descriptors to databases, to important configuration files, to
> > > io_uring instances, and so on.) So I don't see why this specific
> > > feature should impose such special rules on it.
> >
> > I agree with Jann. I'm okay with the O_CLOEXEC-on-open requirement if it
> > makes people more comfortable, but I don't think we should be bending
> > over backwards to block it anywhere else.
> 
> I'm fine with or without the O_CLOEXEC-on-open requirement.
> Just pointing out the weirdness.

I agree, it's weird to enforce it in one place but not in others, so I
think I might as well drop the O_CLOEXEC requirement altogether.
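
For reference, the open-time check under discussion rejects only the
combination where O_ALLOW_ENCODED is set and O_CLOEXEC is clear. A
minimal userspace sketch of that mask test follows; the EX_* bit values
and check_open_flags() are hypothetical, chosen for illustration, and
are not the values or names used in the kernel.

```c
#include <errno.h>

/* Illustrative bit values only; not taken from the patch or UAPI headers. */
#define EX_O_CLOEXEC       0x1u
#define EX_O_ALLOW_ENCODED 0x2u

/*
 * Mirror the shape of the quoted build_open_flags() hunk: mask out the
 * two flags and fail only if the result is exactly O_ALLOW_ENCODED.
 * Returns 0 if the combination is accepted, -EINVAL otherwise.
 */
static int check_open_flags(unsigned int flags)
{
	if ((flags & (EX_O_ALLOW_ENCODED | EX_O_CLOEXEC)) == EX_O_ALLOW_ENCODED)
		return -EINVAL;
	return 0;
}
```

Because the test is made only at open time, later fcntl() or dup()
calls are not covered by it, which is the inconsistency Amir pointed
out.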


* Re: [PATCH man-pages v6] Document encoded I/O
       [not found]           ` <20201201202144.ulbfnawi2ljmm6mn@localhost.localdomain>
@ 2020-12-01 21:34             ` Alejandro Colomar (man-pages)
  2020-12-01 21:58             ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 43+ messages in thread
From: Alejandro Colomar (man-pages) @ 2020-12-01 21:34 UTC (permalink / raw)
  To: G. Branden Robinson, Michael Kerrisk (man-pages)
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

Hi Branden,

A good read, as always!

Thanks,

Alex

On 12/1/20 9:21 PM, G. Branden Robinson wrote:
> At 2020-12-01T21:12:47+0100, Michael Kerrisk (man-pages) wrote:
>>>>>> +vs.
>>>>>
>>>>> Please, s/vs./vs/
>>>>> See the reasons below:
>>>>>
>>>>> Michael (mtk),
>>>>>
>>>>> Here the renderer outputs a double space
>>>>> (as for separating two sentences).
>>>>>
>>>>> Are you okay with that?
>>
>> Yes, that should probably be avoided. I'm not sure what the
>> correct way is to prevent that in groff though. I mean, one
>> could write
>>
>> .RI "vs.\ " unencoded_len
>>
>> but I think that simply creates a nonbreaking space,
>> which is not exactly what is desired.
> 
> Use the non-printing input break escape sequence, "\&", to suppress
> end-of-sentence detection.  This is not a groffism, it goes back to
> 1970s nroff and troff.
> 
> I'm attaching a couple of pages from some introductory material I wrote
> for the groff Texinfo manual in the forthcoming 1.23.0.
> 
> Regards,
> Branden
> 


* Re: [PATCH man-pages v6] Document encoded I/O
  2020-12-01 20:20           ` Michael Kerrisk (man-pages)
@ 2020-12-01 21:35             ` Alejandro Colomar (man-pages)
  2020-12-01 21:56               ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 43+ messages in thread
From: Alejandro Colomar (man-pages) @ 2020-12-01 21:35 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team, linux-man

Hi Michael,

On 12/1/20 9:20 PM, Michael Kerrisk (man-pages) wrote:
>>>>>> +.SS Security
>>>>>> +Encoded I/O creates the potential for some security issues:
>>>>>> +.IP * 3
>>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>>> +which can be exploited by maliciously crafted data.
>>>>>> +.IP *
>>>>>> +Encoded reads may return data which is not logically present in the file
>>>>>> +(see the discussion of
>>>>>> +.I len
>>>>>> +vs.
>>>>>
>>>>> Please, s/vs./vs/
>>>>> See the reasons below:
>>>>>
>>>>> Michael (mtk),
>>>>>
>>>>> Here the renderer outputs a double space
>>>>> (as for separating two sentences).
>>>>>
>>>>> Are you okay with that?
>>
>> Yes, that should probably be avoided. I'm not sure what the
>> correct way is to prevent that in groff though. I mean, one
>> could write
>>
>> .RI "vs.\ " unencoded_len
>>
>> but I think that simply creates a nonbreaking space,
>> which is not exactly what is desired.
> 
> Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
> we can write:
> 
> vs.\&
> 
> to prevent the double space.

Nice to see it's possible.
However, I would argue for simplicity,
and use a simple 'vs',
which is already in use.

Cheers,

Alex

> 
> Thanks,
> 
> Michael
> 
> 


* Re: [PATCH man-pages v6] Document encoded I/O
  2020-12-01 21:35             ` Alejandro Colomar (man-pages)
@ 2020-12-01 21:56               ` Michael Kerrisk (man-pages)
  2020-12-18 10:32                 ` Ping: " Alejandro Colomar (man-pages)
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-12-01 21:56 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages), Omar Sandoval
  Cc: mtk.manpages, linux-fsdevel, linux-btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

Hi Alex,

On 12/1/20 10:35 PM, Alejandro Colomar (man-pages) wrote:
> Hi Michael,
> 
> On 12/1/20 9:20 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>> +.SS Security
>>>>>>> +Encoded I/O creates the potential for some security issues:
>>>>>>> +.IP * 3
>>>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>>>> +which can be exploited by maliciously crafted data.
>>>>>>> +.IP *
>>>>>>> +Encoded reads may return data which is not logically present in the file
>>>>>>> +(see the discussion of
>>>>>>> +.I len
>>>>>>> +vs.
>>>>>>
>>>>>> Please, s/vs./vs/
>>>>>> See the reasons below:
>>>>>>
>>>>>> Michael (mtk),
>>>>>>
>>>>>> Here the renderer outputs a double space
>>>>>> (as for separating two sentences).
>>>>>>
>>>>>> Are you okay with that?
>>>
>>> Yes, that should probably be avoided. I'm not sure what the
>>> correct way is to prevent that in groff though. I mean, one
>>> could write
>>>
>>> .RI "vs.\ " unencoded_len
>>>
>>> but I think that simply creates a nonbreaking space,
>>> which is not exactly what is desired.
>>
>> Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
>> we can write:
>>
>> vs.\&
>>
>> to prevent the double space.
> 
> Nice to see it's possible.
> However, I would argue for simplicity,
> and use a simple 'vs',
> which is already in use.

Indeed better. Thanks for noticing that.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH man-pages v6] Document encoded I/O
       [not found]           ` <20201201202144.ulbfnawi2ljmm6mn@localhost.localdomain>
  2020-12-01 21:34             ` Alejandro Colomar (man-pages)
@ 2020-12-01 21:58             ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-12-01 21:58 UTC (permalink / raw)
  To: G. Branden Robinson
  Cc: mtk.manpages, Alejandro Colomar (man-pages),
	Omar Sandoval, linux-fsdevel, linux-btrfs, Al Viro,
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

Hi Branden,

On 12/1/20 9:21 PM, G. Branden Robinson wrote:
> At 2020-12-01T21:12:47+0100, Michael Kerrisk (man-pages) wrote:
>>>>>> +vs.
>>>>>
>>>>> Please, s/vs./vs/
>>>>> See the reasons below:
>>>>>
>>>>> Michael (mtk),
>>>>>
>>>>> Here the renderer outputs a double space
>>>>> (as for separating two sentences).
>>>>>
>>>>> Are you okay with that?
>>
>> Yes, that should probably be avoided. I'm not sure what the
>> correct way is to prevent that in groff though. I mean, one
>> could write
>>
>> .RI "vs.\ " unencoded_len
>>
>> but I think that simply creates a nonbreaking space,
>> which is not exactly what is desired.
> 
> Use the non-printing input break escape sequence, "\&", to suppress
> end-of-sentence detection.  This is not a groffism, it goes back to
> 1970s nroff and troff.

Yes, I spotted it about two minutes before your mail. And before that, I 
was thinking, "should we really bother Branden with a question like 
this?" :-)

> I'm attaching a couple of pages from some introductory material I wrote
> for the groff Texinfo manual in the forthcoming 1.23.0.

As ever, thanks for jumping in, Branden.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes
  2020-11-18 19:18 ` [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes Omar Sandoval
@ 2020-12-02 22:03   ` Josef Bacik
  2020-12-03 14:37   ` Josef Bacik
  1 sibling, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2020-12-02 22:03 UTC (permalink / raw)
  To: Omar Sandoval, linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On 11/18/20 2:18 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The implementation resembles direct I/O: we have to flush any ordered
> extents, invalidate the page cache, and do the io tree/delalloc/extent
> map/ordered extent dance. From there, we can reuse the compression code
> with a minor modification to distinguish the write from writeback. This
> also creates inline extents when possible.
> 
> Now that read and write are implemented, this also sets the
> FMODE_ENCODED_IO flag in btrfs_file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>   fs/btrfs/compression.c  |   7 +-
>   fs/btrfs/compression.h  |   6 +-
>   fs/btrfs/ctree.h        |   2 +
>   fs/btrfs/file.c         |  37 +++++-
>   fs/btrfs/inode.c        | 259 +++++++++++++++++++++++++++++++++++++++-
>   fs/btrfs/ordered-data.c |  12 +-
>   fs/btrfs/ordered-data.h |   2 +
>   7 files changed, 313 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index eaa6fe21c08e..015c9e5d75b9 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -336,7 +336,8 @@ static void end_compressed_bio_write(struct bio *bio)
>   			bio->bi_status == BLK_STS_OK);
>   	cb->compressed_pages[0]->mapping = NULL;
>   
> -	end_compressed_writeback(inode, cb);
> +	if (cb->writeback)
> +		end_compressed_writeback(inode, cb);
>   	/* note, our inode could be gone now */
>   
>   	/*
> @@ -372,7 +373,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   				 struct page **compressed_pages,
>   				 unsigned long nr_pages,
>   				 unsigned int write_flags,
> -				 struct cgroup_subsys_state *blkcg_css)
> +				 struct cgroup_subsys_state *blkcg_css,
> +				 bool writeback)
>   {
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>   	struct bio *bio = NULL;
> @@ -396,6 +398,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   	cb->mirror_num = 0;
>   	cb->compressed_pages = compressed_pages;
>   	cb->compressed_len = compressed_len;
> +	cb->writeback = writeback;
>   	cb->orig_bio = NULL;
>   	cb->nr_pages = nr_pages;
>   
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 8001b700ea3a..f95cdc16f503 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -49,6 +49,9 @@ struct compressed_bio {
>   	/* the compression algorithm for this bio */
>   	int compress_type;
>   
> +	/* Whether this is a write for writeback. */
> +	bool writeback;
> +
>   	/* number of compressed pages in the array */
>   	unsigned long nr_pages;
>   
> @@ -96,7 +99,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   				  struct page **compressed_pages,
>   				  unsigned long nr_pages,
>   				  unsigned int write_flags,
> -				  struct cgroup_subsys_state *blkcg_css);
> +				  struct cgroup_subsys_state *blkcg_css,
> +				  bool writeback);
>   blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>   				 int mirror_num, unsigned long bio_flags);
>   
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index ce78424f1d98..9b585ac9c7a9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3134,6 +3134,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>   void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>   					  u64 end, int uptodate);
>   ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> +ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> +			       struct encoded_iov *encoded);
>   
>   extern const struct dentry_operations btrfs_dentry_operations;
>   extern const struct iomap_ops btrfs_dio_iomap_ops;
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 193477565200..f815ffb93d43 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1994,6 +1994,32 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>   	return written ? written : err;
>   }
>   
> +static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file_inode(file);
> +	struct encoded_iov encoded;
> +	ssize_t ret;
> +
> +	ret = copy_encoded_iov_from_iter(&encoded, from);
> +	if (ret)
> +		return ret;
> +
> +	btrfs_inode_lock(inode, 0);
> +	ret = generic_encoded_write_checks(iocb, &encoded);
> +	if (ret || encoded.len == 0)
> +		goto out;
> +
> +	ret = btrfs_write_check(iocb, from, encoded.len);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = btrfs_do_encoded_write(iocb, from, &encoded);
> +out:
> +	btrfs_inode_unlock(inode, 0);
> +	return ret;
> +}
> +
>   static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>   				    struct iov_iter *from)
>   {
> @@ -2012,14 +2038,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>   	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state))
>   		return -EROFS;
>   
> -	if (!(iocb->ki_flags & IOCB_DIRECT) &&
> -	    (iocb->ki_flags & IOCB_NOWAIT))
> +	if ((iocb->ki_flags & IOCB_NOWAIT) &&
> +	    (!(iocb->ki_flags & IOCB_DIRECT) ||
> +	     (iocb->ki_flags & IOCB_ENCODED)))
>   		return -EOPNOTSUPP;
>   
>   	if (sync)
>   		atomic_inc(&BTRFS_I(inode)->sync_writers);
>   
> -	if (iocb->ki_flags & IOCB_DIRECT)
> +	if (iocb->ki_flags & IOCB_ENCODED)
> +		num_written = btrfs_encoded_write(iocb, from);
> +	else if (iocb->ki_flags & IOCB_DIRECT)
>   		num_written = btrfs_direct_write(iocb, from);
>   	else
>   		num_written = btrfs_buffered_write(iocb, from);
> @@ -3586,7 +3615,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
>   
>   static int btrfs_file_open(struct inode *inode, struct file *filp)
>   {
> -	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
> +	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_ENCODED_IO;
>   	return generic_file_open(inode, filp);
>   }
>   
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b0e800897b3b..2bf7b487939f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -935,7 +935,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>   				    ins.offset, async_extent->pages,
>   				    async_extent->nr_pages,
>   				    async_chunk->write_flags,
> -				    async_chunk->blkcg_css)) {
> +				    async_chunk->blkcg_css, true)) {
>   			struct page *p = async_extent->pages[0];
>   			const u64 start = async_extent->start;
>   			const u64 end = start + async_extent->ram_size - 1;
> @@ -2703,6 +2703,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
>   	 * except if the ordered extent was truncated.
>   	 */
>   	update_inode_bytes = test_bit(BTRFS_ORDERED_DIRECT, &oe->flags) ||
> +	                     test_bit(BTRFS_ORDERED_ENCODED, &oe->flags) ||

Gotta use our git hooks, checkpatch caught the spaces here.  Thanks,

Josef


* Re: [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads
  2020-11-18 19:18 ` [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads Omar Sandoval
@ 2020-12-03 14:32   ` Josef Bacik
  2021-01-11 20:21     ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Bacik @ 2020-12-03 14:32 UTC (permalink / raw)
  To: Omar Sandoval, linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On 11/18/20 2:18 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There are 4 main cases:
> 
> 1. Inline extents: we copy the data straight out of the extent buffer.
> 2. Hole/preallocated extents: we fill in zeroes.
> 3. Regular, uncompressed extents: we read the sectors we need directly
>     from disk.
> 4. Regular, compressed extents: we read the entire compressed extent
>     from disk and indicate what subset of the decompressed extent is in
>     the file.
> 
> This initial implementation simplifies a few things that can be improved
> in the future:
> 
> - We hold the inode lock during the operation.
> - Cases 1, 3, and 4 allocate temporary memory to read into before
>    copying out to userspace.
> - We don't do read repair, because it turns out that read repair is
>    currently broken for compressed data.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>   fs/btrfs/ctree.h |   2 +
>   fs/btrfs/file.c  |   5 +
>   fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 503 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 6ab2ab002bf6..ce78424f1d98 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3133,6 +3133,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
>   int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>   void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>   					  u64 end, int uptodate);
> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> +
>   extern const struct dentry_operations btrfs_dentry_operations;
>   extern const struct iomap_ops btrfs_dio_iomap_ops;
>   extern const struct iomap_dio_ops btrfs_dio_ops;
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 224295f8f1e1..193477565200 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3629,6 +3629,11 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>   {
>   	ssize_t ret = 0;
>   
> +	if (iocb->ki_flags & IOCB_ENCODED) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EOPNOTSUPP;
> +		return btrfs_encoded_read(iocb, to);
> +	}
>   	if (iocb->ki_flags & IOCB_DIRECT) {
>   		ret = btrfs_direct_read(iocb, to);
>   		if (ret < 0 || !iov_iter_count(to) ||
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1ff903f5c5a4..b0e800897b3b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9936,6 +9936,502 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
>   	}
>   }
>   
> +static int encoded_iov_compression_from_btrfs(unsigned int compress_type)
> +{
> +	switch (compress_type) {
> +	case BTRFS_COMPRESS_NONE:
> +		return ENCODED_IOV_COMPRESSION_NONE;
> +	case BTRFS_COMPRESS_ZLIB:
> +		return ENCODED_IOV_COMPRESSION_BTRFS_ZLIB;
> +	case BTRFS_COMPRESS_LZO:
> +		/*
> +		 * The LZO format depends on the page size. 64k is the maximum
> +		 * sectorsize (and thus page size) that we support.
> +		 */
> +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> +			return -EINVAL;
> +		return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (PAGE_SHIFT - 12);
> +	case BTRFS_COMPRESS_ZSTD:
> +		return ENCODED_IOV_COMPRESSION_BTRFS_ZSTD;
> +	default:
> +		return -EUCLEAN;
> +	}
> +}
> +
> +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
> +					 struct iov_iter *iter, u64 start,
> +					 u64 lockend,
> +					 struct extent_state **cached_state,
> +					 u64 extent_start, size_t count,
> +					 struct encoded_iov *encoded,
> +					 bool *unlocked)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_path *path;
> +	struct extent_buffer *leaf;
> +	struct btrfs_file_extent_item *item;
> +	u64 ram_bytes;
> +	unsigned long ptr;
> +	void *tmp;
> +	ssize_t ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> +				       0);
> +	if (ret) {
> +		if (ret > 0) {
> +			/* The extent item disappeared? */
> +			ret = -EIO;
> +		}
> +		goto out;
> +	}
> +	leaf = path->nodes[0];
> +	item = btrfs_item_ptr(leaf, path->slots[0],
> +			      struct btrfs_file_extent_item);
> +
> +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> +	ptr = btrfs_file_extent_inline_start(item);
> +
> +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> +			iocb->ki_pos);
> +	ret = encoded_iov_compression_from_btrfs(
> +				 btrfs_file_extent_compression(leaf, item));
> +	if (ret < 0)
> +		goto out;
> +	encoded->compression = ret;
> +	if (encoded->compression) {
> +		size_t inline_size;
> +
> +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> +						btrfs_item_nr(path->slots[0]));
> +		if (inline_size > count) {
> +			ret = -ENOBUFS;
> +			goto out;
> +		}
> +		count = inline_size;
> +		encoded->unencoded_len = ram_bytes;
> +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> +	} else {
> +		encoded->len = encoded->unencoded_len = count =
> +			min_t(u64, count, encoded->len);
> +		ptr += iocb->ki_pos - extent_start;
> +	}
> +
> +	tmp = kmalloc(count, GFP_NOFS);
> +	if (!tmp) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	read_extent_buffer(leaf, tmp, ptr, count);
> +	btrfs_release_path(path);
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock_shared(inode);
> +	*unlocked = true;
> +
> +	ret = copy_encoded_iov_to_iter(encoded, iter);
> +	if (ret)
> +		goto out_free;
> +	ret = copy_to_iter(tmp, count, iter);
> +	if (ret != count)
> +		ret = -EFAULT;
> +out_free:
> +	kfree(tmp);
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +struct btrfs_encoded_read_private {
> +	struct inode *inode;
> +	wait_queue_head_t wait;
> +	atomic_t pending;
> +	blk_status_t status;
> +	bool skip_csum;
> +};
> +
> +static blk_status_t submit_encoded_read_bio(struct inode *inode,
> +					    struct bio *bio, int mirror_num,
> +					    unsigned long bio_flags)
> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	blk_status_t ret;
> +
> +	if (!priv->skip_csum) {
> +		ret = btrfs_lookup_bio_sums(inode, bio, io_bio->logical, NULL);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> +	if (ret) {
> +		btrfs_io_bio_free_csum(io_bio);
> +		return ret;
> +	}
> +
> +	atomic_inc(&priv->pending);
> +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> +	if (ret) {
> +		atomic_dec(&priv->pending);
> +		btrfs_io_bio_free_csum(io_bio);
> +	}
> +	return ret;
> +}
> +
> +static blk_status_t btrfs_encoded_read_check_bio(struct btrfs_io_bio *io_bio)
> +{
> +	const bool uptodate = io_bio->bio.bi_status == BLK_STS_OK;
> +	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
> +	struct inode *inode = priv->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	u32 sectorsize = fs_info->sectorsize;
> +	struct bio_vec *bvec;
> +	struct bvec_iter_all iter_all;
> +	u64 start = io_bio->logical;
> +	int icsum = 0;
> +
> +	if (priv->skip_csum || !uptodate)
> +		return io_bio->bio.bi_status;
> +
> +	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
> +		unsigned int i, nr_sectors, pgoff;
> +
> +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> +		pgoff = bvec->bv_offset;
> +		for (i = 0; i < nr_sectors; i++) {
> +			ASSERT(pgoff < PAGE_SIZE);
> +			if (check_data_csum(inode, io_bio, icsum, bvec->bv_page,
> +					    pgoff, start))
> +				return BLK_STS_IOERR;
> +			start += sectorsize;
> +			icsum++;
> +			pgoff += sectorsize;
> +		}
> +	}
> +	return BLK_STS_OK;
> +}
> +
> +static void btrfs_encoded_read_endio(struct bio *bio)
> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +	blk_status_t status;
> +
> +	status = btrfs_encoded_read_check_bio(io_bio);
> +	if (status) {
> +		/*
> +		 * The memory barrier implied by the atomic_dec_return() here
> +		 * pairs with the memory barrier implied by the
> +		 * atomic_dec_return() or io_wait_event() in
> +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> +		 * write is observed before the load of status in
> +		 * btrfs_encoded_read_regular_fill_pages().
> +		 */
> +		WRITE_ONCE(priv->status, status);
> +	}
> +	if (!atomic_dec_return(&priv->pending))
> +		wake_up(&priv->wait);
> +	btrfs_io_bio_free_csum(io_bio);
> +	bio_put(bio);
> +}
> +
> +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 offset,
> +						 u64 disk_io_size, struct page **pages)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_encoded_read_private priv = {
> +		.inode = inode,
> +		.pending = ATOMIC_INIT(1),
> +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> +	};
> +	unsigned long i = 0;
> +	u64 cur = 0;
> +	int ret;
> +
> +	init_waitqueue_head(&priv.wait);
> +	/*
> +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> +	 * necessary.
> +	 */
> +	while (cur < disk_io_size) {
> +		struct btrfs_io_geometry geom;
> +		struct bio *bio = NULL;
> +		u64 remaining;
> +
> +		ret = btrfs_get_io_geometry(fs_info, BTRFS_MAP_READ,
> +					    offset + cur, disk_io_size - cur,
> +					    &geom);
> +		if (ret) {
> +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> +			break;
> +		}
> +		remaining = min(geom.len, disk_io_size - cur);
> +		while (bio || remaining) {
> +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> +
> +			if (!bio) {
> +				bio = btrfs_bio_alloc(offset + cur);
> +				bio->bi_end_io = btrfs_encoded_read_endio;
> +				bio->bi_private = &priv;
> +				bio->bi_opf = REQ_OP_READ;
> +			}
> +
> +			if (!bytes ||
> +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> +				blk_status_t status;
> +
> +				status = submit_encoded_read_bio(inode, bio, 0,
> +								 0);
> +				if (status) {
> +					WRITE_ONCE(priv.status, status);
> +					bio_put(bio);
> +					goto out;
> +				}
> +				bio = NULL;
> +				continue;
> +			}
> +
> +			i++;
> +			cur += bytes;
> +			remaining -= bytes;
> +		}
> +	}
> +
> +out:
> +	if (atomic_dec_return(&priv.pending))
> +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> +	/* See btrfs_encoded_read_endio() for ordering. */
> +	return blk_status_to_errno(READ_ONCE(priv.status));
> +}
> +
> +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> +					  struct iov_iter *iter,
> +					  u64 start, u64 lockend,
> +					  struct extent_state **cached_state,
> +					  u64 offset, u64 disk_io_size,
> +					  size_t count,
> +					  const struct encoded_iov *encoded,
> +					  bool *unlocked)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct page **pages;
> +	unsigned long nr_pages, i;
> +	u64 cur;
> +	size_t page_offset;
> +	ssize_t ret;
> +
> +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> +	if (!pages)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_pages; i++) {
> +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> +		if (!pages[i]) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +	}
> +
> +	ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_io_size,
> +						    pages);
> +	if (ret)
> +		goto out;
> +
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock_shared(inode);
> +	*unlocked = true;
> +
> +	ret = copy_encoded_iov_to_iter(encoded, iter);
> +	if (ret)
> +		goto out;
> +	if (encoded->compression) {
> +		i = 0;
> +		page_offset = 0;
> +	} else {
> +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> +	}
> +	cur = 0;
> +	while (cur < count) {
> +		size_t bytes = min_t(size_t, count - cur,
> +				     PAGE_SIZE - page_offset);
> +
> +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> +				      iter) != bytes) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +		i++;
> +		cur += bytes;
> +		page_offset = 0;
> +	}
> +	ret = count;
> +out:
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pages[i])
> +			__free_page(pages[i]);
> +	}
> +	kfree(pages);
> +	return ret;
> +}
> +
> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	ssize_t ret;
> +	size_t count;
> +	u64 start, lockend, offset, disk_io_size;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_map *em;
> +	struct encoded_iov encoded = {};
> +	bool unlocked = false;
> +
> +	ret = generic_encoded_read_checks(iocb, iter);
> +	if (ret < 0)
> +		return ret;
> +	if (ret == 0)
> +		return copy_encoded_iov_to_iter(&encoded, iter);
> +	count = ret;
> +
> +	file_accessed(iocb->ki_filp);
> +
> +	inode_lock_shared(inode);
> +
> +	if (iocb->ki_pos >= inode->i_size) {
> +		inode_unlock_shared(inode);
> +		return copy_encoded_iov_to_iter(&encoded, iter);
> +	}
> +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> +	/*
> +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> +	 * it's compressed we know that it won't be longer than this.
> +	 */
> +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> +
> +	for (;;) {
> +		struct btrfs_ordered_extent *ordered;
> +
> +		ret = btrfs_wait_ordered_range(inode, start,
> +					       lockend - start + 1);
> +		if (ret)
> +			goto out_unlock_inode;
> +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +						     lockend - start + 1);
> +		if (!ordered)
> +			break;
> +		btrfs_put_ordered_extent(ordered);
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +		cond_resched();
> +	}

This can be replaced with btrfs_lock_and_flush_ordered_range().  Then you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes
  2020-11-18 19:18 ` [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes Omar Sandoval
  2020-12-02 22:03   ` Josef Bacik
@ 2020-12-03 14:37   ` Josef Bacik
  1 sibling, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2020-12-03 14:37 UTC (permalink / raw)
  To: Omar Sandoval, linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig
  Cc: Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On 11/18/20 2:18 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The implementation resembles direct I/O: we have to flush any ordered
> extents, invalidate the page cache, and do the io tree/delalloc/extent
> map/ordered extent dance. From there, we can reuse the compression code
> with a minor modification to distinguish the write from writeback. This
> also creates inline extents when possible.
> 
> Now that read and write are implemented, this also sets the
> FMODE_ENCODED_IO flag in btrfs_file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Fix up the spacing thing and then you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Ping: [PATCH man-pages v6] Document encoded I/O
  2020-12-01 21:56               ` Michael Kerrisk (man-pages)
@ 2020-12-18 10:32                 ` Alejandro Colomar (man-pages)
  2021-01-12  1:12                   ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Alejandro Colomar (man-pages) @ 2020-12-18 10:32 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Al Viro,
	Alejandro Colomar (man-pages), Michael Kerrisk (man-pages),
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

Hi Omar,

Linux 5.10 was released recently.
Do you have any updates for this patch?

Thanks,

Alex

On 12/1/20 10:56 PM, Michael Kerrisk (man-pages) wrote:
> Hi Alex,
> 
> On 12/1/20 10:35 PM, Alejandro Colomar (man-pages) wrote:
>> Hi Michael,
>>
>> On 12/1/20 9:20 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>>> +.SS Security
>>>>>>>> +Encoded I/O creates the potential for some security issues:
>>>>>>>> +.IP * 3
>>>>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>>>>> +which can be exploited by maliciously crafted data.
>>>>>>>> +.IP *
>>>>>>>> +Encoded reads may return data which is not logically present in the file
>>>>>>>> +(see the discussion of
>>>>>>>> +.I len
>>>>>>>> +vs.
>>>>>>>
>>>>>>> Please, s/vs./vs/
>>>>>>> See the reasons below:
>>>>>>>
>>>>>>> Michael (mtk),
>>>>>>>
>>>>>>> Here the renderer outputs a double space
>>>>>>> (as for separating two sentences).
>>>>>>>
>>>>>>> Are you okay with that?
>>>>
>>>> Yes, that should probably be avoided. I'm not sure what the
>>>> correct way is to prevent that in groff though. I mean, one
>>>> could write
>>>>
>>>> .RI "vs.\ " unencoded_len
>>>>
>>>> but I think that simply creates a nonbreaking space,
>>>> which is not exactly what is desired.
>>>
>>> Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
>>> we can write:
>>>
>>> vs.\&
>>>
>>> to prevent the double space.
>>
>> Nice to see it's possible.
>> However, I would argue for simplicity,
>> and use a simple 'vs',
>> which is already in use.
> 
> Indeed better. Thanks for noticing that.
> 
> Thanks,
> 
> Michael
> 
> 

-- 
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/


* Re: [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads
  2020-12-03 14:32   ` Josef Bacik
@ 2021-01-11 20:21     ` Omar Sandoval
  2021-01-11 20:35       ` Josef Bacik
  0 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2021-01-11 20:21 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On Thu, Dec 03, 2020 at 09:32:37AM -0500, Josef Bacik wrote:
> On 11/18/20 2:18 PM, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > There are 4 main cases:
> > 
> > 1. Inline extents: we copy the data straight out of the extent buffer.
> > 2. Hole/preallocated extents: we fill in zeroes.
> > 3. Regular, uncompressed extents: we read the sectors we need directly
> >     from disk.
> > 4. Regular, compressed extents: we read the entire compressed extent
> >     from disk and indicate what subset of the decompressed extent is in
> >     the file.
> > 
> > This initial implementation simplifies a few things that can be improved
> > in the future:
> > 
> > - We hold the inode lock during the operation.
> > - Cases 1, 3, and 4 allocate temporary memory to read into before
> >    copying out to userspace.
> > - We don't do read repair, because it turns out that read repair is
> >    currently broken for compressed data.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >   fs/btrfs/ctree.h |   2 +
> >   fs/btrfs/file.c  |   5 +
> >   fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
> >   3 files changed, 503 insertions(+)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 6ab2ab002bf6..ce78424f1d98 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3133,6 +3133,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
> >   int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> >   void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> >   					  u64 end, int uptodate);
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> > +
> >   extern const struct dentry_operations btrfs_dentry_operations;
> >   extern const struct iomap_ops btrfs_dio_iomap_ops;
> >   extern const struct iomap_dio_ops btrfs_dio_ops;
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 224295f8f1e1..193477565200 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3629,6 +3629,11 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >   {
> >   	ssize_t ret = 0;
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EOPNOTSUPP;
> > +		return btrfs_encoded_read(iocb, to);
> > +	}
> >   	if (iocb->ki_flags & IOCB_DIRECT) {
> >   		ret = btrfs_direct_read(iocb, to);
> >   		if (ret < 0 || !iov_iter_count(to) ||
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 1ff903f5c5a4..b0e800897b3b 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -9936,6 +9936,502 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
> >   	}
> >   }
> > +static int encoded_iov_compression_from_btrfs(unsigned int compress_type)
> > +{
> > +	switch (compress_type) {
> > +	case BTRFS_COMPRESS_NONE:
> > +		return ENCODED_IOV_COMPRESSION_NONE;
> > +	case BTRFS_COMPRESS_ZLIB:
> > +		return ENCODED_IOV_COMPRESSION_BTRFS_ZLIB;
> > +	case BTRFS_COMPRESS_LZO:
> > +		/*
> > +		 * The LZO format depends on the page size. 64k is the maximum
> > +		 * sectorsize (and thus page size) that we support.
> > +		 */
> > +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> > +			return -EINVAL;
> > +		return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (PAGE_SHIFT - 12);
> > +	case BTRFS_COMPRESS_ZSTD:
> > +		return ENCODED_IOV_COMPRESSION_BTRFS_ZSTD;
> > +	default:
> > +		return -EUCLEAN;
> > +	}
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
> > +					 struct iov_iter *iter, u64 start,
> > +					 u64 lockend,
> > +					 struct extent_state **cached_state,
> > +					 u64 extent_start, size_t count,
> > +					 struct encoded_iov *encoded,
> > +					 bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct btrfs_path *path;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_file_extent_item *item;
> > +	u64 ram_bytes;
> > +	unsigned long ptr;
> > +	void *tmp;
> > +	ssize_t ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> > +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> > +				       0);
> > +	if (ret) {
> > +		if (ret > 0) {
> > +			/* The extent item disappeared? */
> > +			ret = -EIO;
> > +		}
> > +		goto out;
> > +	}
> > +	leaf = path->nodes[0];
> > +	item = btrfs_item_ptr(leaf, path->slots[0],
> > +			      struct btrfs_file_extent_item);
> > +
> > +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> > +	ptr = btrfs_file_extent_inline_start(item);
> > +
> > +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> > +			iocb->ki_pos);
> > +	ret = encoded_iov_compression_from_btrfs(
> > +				 btrfs_file_extent_compression(leaf, item));
> > +	if (ret < 0)
> > +		goto out;
> > +	encoded->compression = ret;
> > +	if (encoded->compression) {
> > +		size_t inline_size;
> > +
> > +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> > +						btrfs_item_nr(path->slots[0]));
> > +		if (inline_size > count) {
> > +			ret = -ENOBUFS;
> > +			goto out;
> > +		}
> > +		count = inline_size;
> > +		encoded->unencoded_len = ram_bytes;
> > +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> > +	} else {
> > +		encoded->len = encoded->unencoded_len = count =
> > +			min_t(u64, count, encoded->len);
> > +		ptr += iocb->ki_pos - extent_start;
> > +	}
> > +
> > +	tmp = kmalloc(count, GFP_NOFS);
> > +	if (!tmp) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	read_extent_buffer(leaf, tmp, ptr, count);
> > +	btrfs_release_path(path);
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock_shared(inode);
> > +	*unlocked = true;
> > +
> > +	ret = copy_encoded_iov_to_iter(encoded, iter);
> > +	if (ret)
> > +		goto out_free;
> > +	ret = copy_to_iter(tmp, count, iter);
> > +	if (ret != count)
> > +		ret = -EFAULT;
> > +out_free:
> > +	kfree(tmp);
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> > +struct btrfs_encoded_read_private {
> > +	struct inode *inode;
> > +	wait_queue_head_t wait;
> > +	atomic_t pending;
> > +	blk_status_t status;
> > +	bool skip_csum;
> > +};
> > +
> > +static blk_status_t submit_encoded_read_bio(struct inode *inode,
> > +					    struct bio *bio, int mirror_num,
> > +					    unsigned long bio_flags)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	blk_status_t ret;
> > +
> > +	if (!priv->skip_csum) {
> > +		ret = btrfs_lookup_bio_sums(inode, bio, io_bio->logical, NULL);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> > +	if (ret) {
> > +		btrfs_io_bio_free_csum(io_bio);
> > +		return ret;
> > +	}
> > +
> > +	atomic_inc(&priv->pending);
> > +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> > +	if (ret) {
> > +		atomic_dec(&priv->pending);
> > +		btrfs_io_bio_free_csum(io_bio);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static blk_status_t btrfs_encoded_read_check_bio(struct btrfs_io_bio *io_bio)
> > +{
> > +	const bool uptodate = io_bio->bio.bi_status == BLK_STS_OK;
> > +	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
> > +	struct inode *inode = priv->inode;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	u32 sectorsize = fs_info->sectorsize;
> > +	struct bio_vec *bvec;
> > +	struct bvec_iter_all iter_all;
> > +	u64 start = io_bio->logical;
> > +	int icsum = 0;
> > +
> > +	if (priv->skip_csum || !uptodate)
> > +		return io_bio->bio.bi_status;
> > +
> > +	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
> > +		unsigned int i, nr_sectors, pgoff;
> > +
> > +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> > +		pgoff = bvec->bv_offset;
> > +		for (i = 0; i < nr_sectors; i++) {
> > +			ASSERT(pgoff < PAGE_SIZE);
> > +			if (check_data_csum(inode, io_bio, icsum, bvec->bv_page,
> > +					    pgoff, start))
> > +				return BLK_STS_IOERR;
> > +			start += sectorsize;
> > +			icsum++;
> > +			pgoff += sectorsize;
> > +		}
> > +	}
> > +	return BLK_STS_OK;
> > +}
> > +
> > +static void btrfs_encoded_read_endio(struct bio *bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > +	blk_status_t status;
> > +
> > +	status = btrfs_encoded_read_check_bio(io_bio);
> > +	if (status) {
> > +		/*
> > +		 * The memory barrier implied by the atomic_dec_return() here
> > +		 * pairs with the memory barrier implied by the
> > +		 * atomic_dec_return() or io_wait_event() in
> > +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> > +		 * write is observed before the load of status in
> > +		 * btrfs_encoded_read_regular_fill_pages().
> > +		 */
> > +		WRITE_ONCE(priv->status, status);
> > +	}
> > +	if (!atomic_dec_return(&priv->pending))
> > +		wake_up(&priv->wait);
> > +	btrfs_io_bio_free_csum(io_bio);
> > +	bio_put(bio);
> > +}
> > +
> > +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 offset,
> > +						 u64 disk_io_size, struct page **pages)
> > +{
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct btrfs_encoded_read_private priv = {
> > +		.inode = inode,
> > +		.pending = ATOMIC_INIT(1),
> > +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> > +	};
> > +	unsigned long i = 0;
> > +	u64 cur = 0;
> > +	int ret;
> > +
> > +	init_waitqueue_head(&priv.wait);
> > +	/*
> > +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> > +	 * necessary.
> > +	 */
> > +	while (cur < disk_io_size) {
> > +		struct btrfs_io_geometry geom;
> > +		struct bio *bio = NULL;
> > +		u64 remaining;
> > +
> > +		ret = btrfs_get_io_geometry(fs_info, BTRFS_MAP_READ,
> > +					    offset + cur, disk_io_size - cur,
> > +					    &geom);
> > +		if (ret) {
> > +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> > +			break;
> > +		}
> > +		remaining = min(geom.len, disk_io_size - cur);
> > +		while (bio || remaining) {
> > +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> > +
> > +			if (!bio) {
> > +				bio = btrfs_bio_alloc(offset + cur);
> > +				bio->bi_end_io = btrfs_encoded_read_endio;
> > +				bio->bi_private = &priv;
> > +				bio->bi_opf = REQ_OP_READ;
> > +			}
> > +
> > +			if (!bytes ||
> > +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> > +				blk_status_t status;
> > +
> > +				status = submit_encoded_read_bio(inode, bio, 0,
> > +								 0);
> > +				if (status) {
> > +					WRITE_ONCE(priv.status, status);
> > +					bio_put(bio);
> > +					goto out;
> > +				}
> > +				bio = NULL;
> > +				continue;
> > +			}
> > +
> > +			i++;
> > +			cur += bytes;
> > +			remaining -= bytes;
> > +		}
> > +	}
> > +
> > +out:
> > +	if (atomic_dec_return(&priv.pending))
> > +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> > +	/* See btrfs_encoded_read_endio() for ordering. */
> > +	return blk_status_to_errno(READ_ONCE(priv.status));
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> > +					  struct iov_iter *iter,
> > +					  u64 start, u64 lockend,
> > +					  struct extent_state **cached_state,
> > +					  u64 offset, u64 disk_io_size,
> > +					  size_t count,
> > +					  const struct encoded_iov *encoded,
> > +					  bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct page **pages;
> > +	unsigned long nr_pages, i;
> > +	u64 cur;
> > +	size_t page_offset;
> > +	ssize_t ret;
> > +
> > +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_io_size,
> > +						    pages);
> > +	if (ret)
> > +		goto out;
> > +
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock_shared(inode);
> > +	*unlocked = true;
> > +
> > +	ret = copy_encoded_iov_to_iter(encoded, iter);
> > +	if (ret)
> > +		goto out;
> > +	if (encoded->compression) {
> > +		i = 0;
> > +		page_offset = 0;
> > +	} else {
> > +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> > +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> > +	}
> > +	cur = 0;
> > +	while (cur < count) {
> > +		size_t bytes = min_t(size_t, count - cur,
> > +				     PAGE_SIZE - page_offset);
> > +
> > +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> > +				      iter) != bytes) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +		i++;
> > +		cur += bytes;
> > +		page_offset = 0;
> > +	}
> > +	ret = count;
> > +out:
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pages[i])
> > +			__free_page(pages[i]);
> > +	}
> > +	kfree(pages);
> > +	return ret;
> > +}
> > +
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	ssize_t ret;
> > +	size_t count;
> > +	u64 start, lockend, offset, disk_io_size;
> > +	struct extent_state *cached_state = NULL;
> > +	struct extent_map *em;
> > +	struct encoded_iov encoded = {};
> > +	bool unlocked = false;
> > +
> > +	ret = generic_encoded_read_checks(iocb, iter);
> > +	if (ret < 0)
> > +		return ret;
> > +	if (ret == 0)
> > +		return copy_encoded_iov_to_iter(&encoded, iter);
> > +	count = ret;
> > +
> > +	file_accessed(iocb->ki_filp);
> > +
> > +	inode_lock_shared(inode);
> > +
> > +	if (iocb->ki_pos >= inode->i_size) {
> > +		inode_unlock_shared(inode);
> > +		return copy_encoded_iov_to_iter(&encoded, iter);
> > +	}
> > +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> > +	/*
> > +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> > +	 * it's compressed we know that it won't be longer than this.
> > +	 */
> > +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> > +
> > +	for (;;) {
> > +		struct btrfs_ordered_extent *ordered;
> > +
> > +		ret = btrfs_wait_ordered_range(inode, start,
> > +					       lockend - start + 1);
> > +		if (ret)
> > +			goto out_unlock_inode;
> > +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > +						     lockend - start + 1);
> > +		if (!ordered)
> > +			break;
> > +		btrfs_put_ordered_extent(ordered);
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +		cond_resched();
> > +	}
> 
> This can be replaced with btrfs_lock_and_flush_ordered_range().  Then you can add

Sorry, finally getting back to this after the break. Please correct me
if I'm wrong, but I don't think btrfs_lock_and_flush_ordered_range() is
strong enough here.

An encoded read needs to make sure that any buffered writes are on disk
(since it's basically direct I/O). btrfs_lock_and_flush_ordered_range()
bails immediately if there aren't any ordered extents. As far as I can
tell, ordered extents aren't created until writepage, so if I do some
buffered writes and call btrfs_lock_and_flush_ordered_range() before
writepage creates the ordered extents, it won't flush the buffered
writes like I need it to. This loop with btrfs_wait_ordered_range()
does.
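
To make the difference concrete, here is a toy userspace model in C. It is
only a sketch, not kernel code: the struct and every toy_* name are made up
for illustration, and it assumes the behavior described above, namely that
the lock-and-flush helper only looks at ordered extents that already exist,
while waiting on the ordered range first starts writeback for dirty pages.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model (hypothetical, single-threaded). A file range can hold:
 * - dirty_pages:    buffered writes still only in the page cache
 * - inflight_pages: pages covered by ordered extents (writeback started)
 * - disk_pages:     pages that a read from disk would actually see
 */
struct toy_range {
	unsigned int dirty_pages;
	unsigned int inflight_pages;
	unsigned int disk_pages;
	bool locked;
};

/* Models writepage: dirty pages only now become ordered extents. */
static void toy_start_writeback(struct toy_range *r)
{
	r->inflight_pages += r->dirty_pages;
	r->dirty_pages = 0;
}

/* Models ordered extent completion: in-flight data reaches disk. */
static void toy_finish_ordered(struct toy_range *r)
{
	r->disk_pages += r->inflight_pages;
	r->inflight_pages = 0;
}

/* Assumed model of waiting on the ordered range: flush, then wait. */
static void toy_wait_ordered_range(struct toy_range *r)
{
	toy_start_writeback(r);
	toy_finish_ordered(r);
}

/*
 * Assumed model of the lock-and-flush helper: lock, and if there are no
 * ordered extents, return right away. Dirty pages are never looked at.
 */
static void toy_lock_and_flush_ordered(struct toy_range *r)
{
	for (;;) {
		r->locked = true;
		if (!r->inflight_pages)
			return;
		r->locked = false;
		toy_finish_ordered(r);
	}
}

/* Model of the loop in the patch: wait (which flushes), lock, recheck. */
static void toy_wait_then_lock(struct toy_range *r)
{
	for (;;) {
		toy_wait_ordered_range(r);
		r->locked = true;
		if (!r->inflight_pages)
			return;
		r->locked = false;
	}
}

/* Pages that a read straight from disk would miss. */
static unsigned int toy_stale_pages(const struct toy_range *r)
{
	return r->dirty_pages + r->inflight_pages;
}
```

In this model, starting from a range with three dirty pages and no ordered
extents, toy_lock_and_flush_ordered() returns with the range locked and all
three pages still only in the page cache, while toy_wait_then_lock() returns
with the range locked and all three pages on disk. If ordered extents
already exist, both helpers behave the same.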


* Re: [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads
  2021-01-11 20:21     ` Omar Sandoval
@ 2021-01-11 20:35       ` Josef Bacik
  2021-01-11 20:58         ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Bacik @ 2021-01-11 20:35 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On 1/11/21 3:21 PM, Omar Sandoval wrote:
> On Thu, Dec 03, 2020 at 09:32:37AM -0500, Josef Bacik wrote:
>> On 11/18/20 2:18 PM, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> There are 4 main cases:
>>>
>>> 1. Inline extents: we copy the data straight out of the extent buffer.
>>> 2. Hole/preallocated extents: we fill in zeroes.
>>> 3. Regular, uncompressed extents: we read the sectors we need directly
>>>      from disk.
>>> 4. Regular, compressed extents: we read the entire compressed extent
>>>      from disk and indicate what subset of the decompressed extent is in
>>>      the file.
>>>
>>> This initial implementation simplifies a few things that can be improved
>>> in the future:
>>>
>>> - We hold the inode lock during the operation.
>>> - Cases 1, 3, and 4 allocate temporary memory to read into before
>>>     copying out to userspace.
>>> - We don't do read repair, because it turns out that read repair is
>>>     currently broken for compressed data.
>>>
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>>    fs/btrfs/ctree.h |   2 +
>>>    fs/btrfs/file.c  |   5 +
>>>    fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
>>>    3 files changed, 503 insertions(+)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 6ab2ab002bf6..ce78424f1d98 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -3133,6 +3133,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
>>>    int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>>>    void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>>>    					  u64 end, int uptodate);
>>> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
>>> +
>>>    extern const struct dentry_operations btrfs_dentry_operations;
>>>    extern const struct iomap_ops btrfs_dio_iomap_ops;
>>>    extern const struct iomap_dio_ops btrfs_dio_ops;
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index 224295f8f1e1..193477565200 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -3629,6 +3629,11 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>>    {
>>>    	ssize_t ret = 0;
>>> +	if (iocb->ki_flags & IOCB_ENCODED) {
>>> +		if (iocb->ki_flags & IOCB_NOWAIT)
>>> +			return -EOPNOTSUPP;
>>> +		return btrfs_encoded_read(iocb, to);
>>> +	}
>>>    	if (iocb->ki_flags & IOCB_DIRECT) {
>>>    		ret = btrfs_direct_read(iocb, to);
>>>    		if (ret < 0 || !iov_iter_count(to) ||
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 1ff903f5c5a4..b0e800897b3b 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -9936,6 +9936,502 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
>>>    	}
>>>    }
>>> +static int encoded_iov_compression_from_btrfs(unsigned int compress_type)
>>> +{
>>> +	switch (compress_type) {
>>> +	case BTRFS_COMPRESS_NONE:
>>> +		return ENCODED_IOV_COMPRESSION_NONE;
>>> +	case BTRFS_COMPRESS_ZLIB:
>>> +		return ENCODED_IOV_COMPRESSION_BTRFS_ZLIB;
>>> +	case BTRFS_COMPRESS_LZO:
>>> +		/*
>>> +		 * The LZO format depends on the page size. 64k is the maximum
>>> +		 * sectorsize (and thus page size) that we support.
>>> +		 */
>>> +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
>>> +			return -EINVAL;
>>> +		return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (PAGE_SHIFT - 12);
>>> +	case BTRFS_COMPRESS_ZSTD:
>>> +		return ENCODED_IOV_COMPRESSION_BTRFS_ZSTD;
>>> +	default:
>>> +		return -EUCLEAN;
>>> +	}
>>> +}
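
The LZO arithmetic above can be checked in isolation. What follows is a
minimal user-space sketch, not kernel code: the EX_* values are placeholders
invented for illustration (the real ENCODED_IOV_COMPRESSION_* constants come
from the UAPI header added earlier in the series), and ex_lzo_variant() is a
hypothetical helper. Only the "base + (page shift - 12)" mapping and the
4K..64K bounds mirror the patch.

```c
#include <errno.h>

/* Placeholder values for illustration only; not the real UAPI constants. */
enum {
	EX_LZO_4K = 3,
	EX_LZO_8K,
	EX_LZO_16K,
	EX_LZO_32K,
	EX_LZO_64K,
};

/*
 * Map a page shift (12 for 4K pages up to 16 for 64K pages) to the
 * matching LZO variant, rejecting anything outside that range.
 */
int ex_lzo_variant(unsigned int page_shift)
{
	if (page_shift < 12 || page_shift > 16)
		return -EINVAL;
	return EX_LZO_4K + (int)(page_shift - 12);
}
```

Under these assumptions a 4K-page system reports the first LZO variant and a
64K-page system the fifth, which is why the variants have to be consecutive.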
>>> +
>>> +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
>>> +					 struct iov_iter *iter, u64 start,
>>> +					 u64 lockend,
>>> +					 struct extent_state **cached_state,
>>> +					 u64 extent_start, size_t count,
>>> +					 struct encoded_iov *encoded,
>>> +					 bool *unlocked)
>>> +{
>>> +	struct inode *inode = file_inode(iocb->ki_filp);
>>> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>>> +	struct btrfs_path *path;
>>> +	struct extent_buffer *leaf;
>>> +	struct btrfs_file_extent_item *item;
>>> +	u64 ram_bytes;
>>> +	unsigned long ptr;
>>> +	void *tmp;
>>> +	ssize_t ret;
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
>>> +				       btrfs_ino(BTRFS_I(inode)), extent_start,
>>> +				       0);
>>> +	if (ret) {
>>> +		if (ret > 0) {
>>> +			/* The extent item disappeared? */
>>> +			ret = -EIO;
>>> +		}
>>> +		goto out;
>>> +	}
>>> +	leaf = path->nodes[0];
>>> +	item = btrfs_item_ptr(leaf, path->slots[0],
>>> +			      struct btrfs_file_extent_item);
>>> +
>>> +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
>>> +	ptr = btrfs_file_extent_inline_start(item);
>>> +
>>> +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
>>> +			iocb->ki_pos);
>>> +	ret = encoded_iov_compression_from_btrfs(
>>> +				 btrfs_file_extent_compression(leaf, item));
>>> +	if (ret < 0)
>>> +		goto out;
>>> +	encoded->compression = ret;
>>> +	if (encoded->compression) {
>>> +		size_t inline_size;
>>> +
>>> +		inline_size = btrfs_file_extent_inline_item_len(leaf,
>>> +						btrfs_item_nr(path->slots[0]));
>>> +		if (inline_size > count) {
>>> +			ret = -ENOBUFS;
>>> +			goto out;
>>> +		}
>>> +		count = inline_size;
>>> +		encoded->unencoded_len = ram_bytes;
>>> +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
>>> +	} else {
>>> +		encoded->len = encoded->unencoded_len = count =
>>> +			min_t(u64, count, encoded->len);
>>> +		ptr += iocb->ki_pos - extent_start;
>>> +	}
>>> +
>>> +	tmp = kmalloc(count, GFP_NOFS);
>>> +	if (!tmp) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +	read_extent_buffer(leaf, tmp, ptr, count);
>>> +	btrfs_release_path(path);
>>> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
>>> +	inode_unlock_shared(inode);
>>> +	*unlocked = true;
>>> +
>>> +	ret = copy_encoded_iov_to_iter(encoded, iter);
>>> +	if (ret)
>>> +		goto out_free;
>>> +	ret = copy_to_iter(tmp, count, iter);
>>> +	if (ret != count)
>>> +		ret = -EFAULT;
>>> +out_free:
>>> +	kfree(tmp);
>>> +out:
>>> +	btrfs_free_path(path);
>>> +	return ret;
>>> +}
>>> +
>>> +struct btrfs_encoded_read_private {
>>> +	struct inode *inode;
>>> +	wait_queue_head_t wait;
>>> +	atomic_t pending;
>>> +	blk_status_t status;
>>> +	bool skip_csum;
>>> +};
>>> +
>>> +static blk_status_t submit_encoded_read_bio(struct inode *inode,
>>> +					    struct bio *bio, int mirror_num,
>>> +					    unsigned long bio_flags)
>>> +{
>>> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
>>> +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>> +	blk_status_t ret;
>>> +
>>> +	if (!priv->skip_csum) {
>>> +		ret = btrfs_lookup_bio_sums(inode, bio, io_bio->logical, NULL);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +
>>> +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
>>> +	if (ret) {
>>> +		btrfs_io_bio_free_csum(io_bio);
>>> +		return ret;
>>> +	}
>>> +
>>> +	atomic_inc(&priv->pending);
>>> +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
>>> +	if (ret) {
>>> +		atomic_dec(&priv->pending);
>>> +		btrfs_io_bio_free_csum(io_bio);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static blk_status_t btrfs_encoded_read_check_bio(struct btrfs_io_bio *io_bio)
>>> +{
>>> +	const bool uptodate = io_bio->bio.bi_status == BLK_STS_OK;
>>> +	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
>>> +	struct inode *inode = priv->inode;
>>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>> +	u32 sectorsize = fs_info->sectorsize;
>>> +	struct bio_vec *bvec;
>>> +	struct bvec_iter_all iter_all;
>>> +	u64 start = io_bio->logical;
>>> +	int icsum = 0;
>>> +
>>> +	if (priv->skip_csum || !uptodate)
>>> +		return io_bio->bio.bi_status;
>>> +
>>> +	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
>>> +		unsigned int i, nr_sectors, pgoff;
>>> +
>>> +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
>>> +		pgoff = bvec->bv_offset;
>>> +		for (i = 0; i < nr_sectors; i++) {
>>> +			ASSERT(pgoff < PAGE_SIZE);
>>> +			if (check_data_csum(inode, io_bio, icsum, bvec->bv_page,
>>> +					    pgoff, start))
>>> +				return BLK_STS_IOERR;
>>> +			start += sectorsize;
>>> +			icsum++;
>>> +			pgoff += sectorsize;
>>> +		}
>>> +	}
>>> +	return BLK_STS_OK;
>>> +}
>>> +
>>> +static void btrfs_encoded_read_endio(struct bio *bio)
>>> +{
>>> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
>>> +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>>> +	blk_status_t status;
>>> +
>>> +	status = btrfs_encoded_read_check_bio(io_bio);
>>> +	if (status) {
>>> +		/*
>>> +		 * The memory barrier implied by the atomic_dec_return() here
>>> +		 * pairs with the memory barrier implied by the
>>> +		 * atomic_dec_return() or io_wait_event() in
>>> +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
>>> +		 * write is observed before the load of status in
>>> +		 * btrfs_encoded_read_regular_fill_pages().
>>> +		 */
>>> +		WRITE_ONCE(priv->status, status);
>>> +	}
>>> +	if (!atomic_dec_return(&priv->pending))
>>> +		wake_up(&priv->wait);
>>> +	btrfs_io_bio_free_csum(io_bio);
>>> +	bio_put(bio);
>>> +}
>>> +
>>> +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 offset,
>>> +						 u64 disk_io_size, struct page **pages)
>>> +{
>>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>> +	struct btrfs_encoded_read_private priv = {
>>> +		.inode = inode,
>>> +		.pending = ATOMIC_INIT(1),
>>> +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
>>> +	};
>>> +	unsigned long i = 0;
>>> +	u64 cur = 0;
>>> +	int ret;
>>> +
>>> +	init_waitqueue_head(&priv.wait);
>>> +	/*
>>> +	 * Submit bios for the extent, splitting due to bio or stripe limits as
>>> +	 * necessary.
>>> +	 */
>>> +	while (cur < disk_io_size) {
>>> +		struct btrfs_io_geometry geom;
>>> +		struct bio *bio = NULL;
>>> +		u64 remaining;
>>> +
>>> +		ret = btrfs_get_io_geometry(fs_info, BTRFS_MAP_READ,
>>> +					    offset + cur, disk_io_size - cur,
>>> +					    &geom);
>>> +		if (ret) {
>>> +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
>>> +			break;
>>> +		}
>>> +		remaining = min(geom.len, disk_io_size - cur);
>>> +		while (bio || remaining) {
>>> +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
>>> +
>>> +			if (!bio) {
>>> +				bio = btrfs_bio_alloc(offset + cur);
>>> +				bio->bi_end_io = btrfs_encoded_read_endio;
>>> +				bio->bi_private = &priv;
>>> +				bio->bi_opf = REQ_OP_READ;
>>> +			}
>>> +
>>> +			if (!bytes ||
>>> +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
>>> +				blk_status_t status;
>>> +
>>> +				status = submit_encoded_read_bio(inode, bio, 0,
>>> +								 0);
>>> +				if (status) {
>>> +					WRITE_ONCE(priv.status, status);
>>> +					bio_put(bio);
>>> +					goto out;
>>> +				}
>>> +				bio = NULL;
>>> +				continue;
>>> +			}
>>> +
>>> +			i++;
>>> +			cur += bytes;
>>> +			remaining -= bytes;
>>> +		}
>>> +	}
>>> +
>>> +out:
>>> +	if (atomic_dec_return(&priv.pending))
>>> +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
>>> +	/* See btrfs_encoded_read_endio() for ordering. */
>>> +	return blk_status_to_errno(READ_ONCE(priv.status));
>>> +}
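
The reason "pending" starts at 1 rather than 0 may be easier to see in a toy
model. The sketch below is a single-threaded user-space replay using C11
atomics, assuming nothing about the kernel beyond what the function above
shows; every ex_* name is made up for illustration. It counts how often a
completion would wake the waiter while bios are still being submitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct btrfs_encoded_read_private. */
struct ex_read_private {
	atomic_int pending;	/* in-flight completions plus submitter bias */
	int wakeups;		/* times a completion saw pending reach zero */
};

void ex_init(struct ex_read_private *priv, int bias)
{
	atomic_init(&priv->pending, bias);
	priv->wakeups = 0;
}

/* Submission side: account for one more in-flight bio. */
void ex_submit(struct ex_read_private *priv)
{
	atomic_fetch_add(&priv->pending, 1);
}

/* Completion side: like the endio handler, wake the waiter on zero. */
void ex_complete(struct ex_read_private *priv)
{
	if (atomic_fetch_sub(&priv->pending, 1) == 1)
		priv->wakeups++;
}

/*
 * The submitter has issued everything and drops its bias. Returns true
 * if it would still have to sleep for outstanding completions.
 */
bool ex_finish_submitting(struct ex_read_private *priv)
{
	return atomic_fetch_sub(&priv->pending, 1) != 1;
}

/*
 * Replay "submit, complete, submit, complete" with the given bias and
 * report how many wakeups fired before submission had finished.
 */
int ex_early_wakeups(int bias)
{
	struct ex_read_private priv;

	ex_init(&priv, bias);
	ex_submit(&priv);
	ex_complete(&priv);
	ex_submit(&priv);
	ex_complete(&priv);
	return priv.wakeups;
}
```

With a bias of 1 no completion ever sees the counter reach zero during
submission; with a bias of 0 each early completion does, which in the real
code would mean waking the waiter (and possibly returning) with I/O still to
be issued.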
>>> +
>>> +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
>>> +					  struct iov_iter *iter,
>>> +					  u64 start, u64 lockend,
>>> +					  struct extent_state **cached_state,
>>> +					  u64 offset, u64 disk_io_size,
>>> +					  size_t count,
>>> +					  const struct encoded_iov *encoded,
>>> +					  bool *unlocked)
>>> +{
>>> +	struct inode *inode = file_inode(iocb->ki_filp);
>>> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>>> +	struct page **pages;
>>> +	unsigned long nr_pages, i;
>>> +	u64 cur;
>>> +	size_t page_offset;
>>> +	ssize_t ret;
>>> +
>>> +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
>>> +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
>>> +	if (!pages)
>>> +		return -ENOMEM;
>>> +	for (i = 0; i < nr_pages; i++) {
>>> +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
>>> +		if (!pages[i]) {
>>> +			ret = -ENOMEM;
>>> +			goto out;
>>> +		}
>>> +	}
>>> +
>>> +	ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_io_size,
>>> +						    pages);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
>>> +	inode_unlock_shared(inode);
>>> +	*unlocked = true;
>>> +
>>> +	ret = copy_encoded_iov_to_iter(encoded, iter);
>>> +	if (ret)
>>> +		goto out;
>>> +	if (encoded->compression) {
>>> +		i = 0;
>>> +		page_offset = 0;
>>> +	} else {
>>> +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
>>> +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
>>> +	}
>>> +	cur = 0;
>>> +	while (cur < count) {
>>> +		size_t bytes = min_t(size_t, count - cur,
>>> +				     PAGE_SIZE - page_offset);
>>> +
>>> +		if (copy_page_to_iter(pages[i], page_offset, bytes,
>>> +				      iter) != bytes) {
>>> +			ret = -EFAULT;
>>> +			goto out;
>>> +		}
>>> +		i++;
>>> +		cur += bytes;
>>> +		page_offset = 0;
>>> +	}
>>> +	ret = count;
>>> +out:
>>> +	for (i = 0; i < nr_pages; i++) {
>>> +		if (pages[i])
>>> +			__free_page(pages[i]);
>>> +	}
>>> +	kfree(pages);
>>> +	return ret;
>>> +}
>>> +
>>> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
>>> +{
>>> +	struct inode *inode = file_inode(iocb->ki_filp);
>>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>>> +	ssize_t ret;
>>> +	size_t count;
>>> +	u64 start, lockend, offset, disk_io_size;
>>> +	struct extent_state *cached_state = NULL;
>>> +	struct extent_map *em;
>>> +	struct encoded_iov encoded = {};
>>> +	bool unlocked = false;
>>> +
>>> +	ret = generic_encoded_read_checks(iocb, iter);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +	if (ret == 0)
>>> +		return copy_encoded_iov_to_iter(&encoded, iter);
>>> +	count = ret;
>>> +
>>> +	file_accessed(iocb->ki_filp);
>>> +
>>> +	inode_lock_shared(inode);
>>> +
>>> +	if (iocb->ki_pos >= inode->i_size) {
>>> +		inode_unlock_shared(inode);
>>> +		return copy_encoded_iov_to_iter(&encoded, iter);
>>> +	}
>>> +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
>>> +	/*
>>> +	 * We don't know how long the extent containing iocb->ki_pos is, but if
>>> +	 * it's compressed we know that it won't be longer than this.
>>> +	 */
>>> +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
>>> +
>>> +	for (;;) {
>>> +		struct btrfs_ordered_extent *ordered;
>>> +
>>> +		ret = btrfs_wait_ordered_range(inode, start,
>>> +					       lockend - start + 1);
>>> +		if (ret)
>>> +			goto out_unlock_inode;
>>> +		lock_extent_bits(io_tree, start, lockend, &cached_state);
>>> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
>>> +						     lockend - start + 1);
>>> +		if (!ordered)
>>> +			break;
>>> +		btrfs_put_ordered_extent(ordered);
>>> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
>>> +		cond_resched();
>>> +	}
>>
>> This can be replaced with btrfs_lock_and_flush_ordered_range().  Then you can add
> 
> Sorry, finally getting back to this after the break. Please correct me
> if I'm wrong, but I don't think btrfs_lock_and_flush_ordered_range() is
> strong enough here.
> 
> An encoded read needs to make sure that any buffered writes are on disk
> (since it's basically direct I/O). btrfs_lock_and_flush_ordered_range()
> bails immediately if there aren't any ordered extents. As far as I can
> tell, ordered extents aren't created until writepage, so if I do some
> buffered writes and call btrfs_lock_and_flush_ordered_range() before
> writepage creates the ordered extents, it won't flush the buffered
> writes like I need it to. This loop with btrfs_wait_ordered_range()
> does.
> 

I didn't realize that btrfs_wait_ordered_range() does the fdatawrite_range, 
awesome.  You can leave it then and add my reviewed-by.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads
  2021-01-11 20:35       ` Josef Bacik
@ 2021-01-11 20:58         ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2021-01-11 20:58 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai, linux-api,
	kernel-team

On Mon, Jan 11, 2021 at 03:35:24PM -0500, Josef Bacik wrote:
> On 1/11/21 3:21 PM, Omar Sandoval wrote:
> > On Thu, Dec 03, 2020 at 09:32:37AM -0500, Josef Bacik wrote:
> > > On 11/18/20 2:18 PM, Omar Sandoval wrote:
> > > > > [...]
> > > > +	for (;;) {
> > > > +		struct btrfs_ordered_extent *ordered;
> > > > +
> > > > +		ret = btrfs_wait_ordered_range(inode, start,
> > > > +					       lockend - start + 1);
> > > > +		if (ret)
> > > > +			goto out_unlock_inode;
> > > > +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> > > > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > > > +						     lockend - start + 1);
> > > > +		if (!ordered)
> > > > +			break;
> > > > +		btrfs_put_ordered_extent(ordered);
> > > > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > > > +		cond_resched();
> > > > +	}
> > > 
> > > This can be replaced with btrfs_lock_and_flush_ordered_range().  Then you can add
> > 
> > Sorry, finally getting back to this after the break. Please correct me
> > if I'm wrong, but I don't think btrfs_lock_and_flush_ordered_range() is
> > strong enough here.
> > 
> > An encoded read needs to make sure that any buffered writes are on disk
> > (since it's basically direct I/O). btrfs_lock_and_flush_ordered_range()
> > bails immediately if there aren't any ordered extents. As far as I can
> > tell, ordered extents aren't created until writepage, so if I do some
> > buffered writes and call btrfs_lock_and_flush_ordered_range() before
> > writepage creates the ordered extents, it won't flush the buffered
> > writes like I need it to. This loop with btrfs_wait_ordered_range()
> > does.
> > 
> 
> I didn't realize that btrfs_wait_ordered_range() does the fdatawrite_range,
> awesome.  You can leave it then and add my reviewed-by.  Thanks,

Yeah btrfs_wait_ordered_range() leaves something to be desired in
naming. Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v6 03/11] fs: add RWF_ENCODED for reading/writing compressed data
  2020-11-19  7:38   ` Amir Goldstein
@ 2021-01-11 23:06     ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2021-01-11 23:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Linux Btrfs, Al Viro, Christoph Hellwig,
	Dave Chinner, Jann Horn, Aleksa Sarai, Linux API, kernel-team

On Thu, Nov 19, 2020 at 09:38:17AM +0200, Amir Goldstein wrote:
> On Wed, Nov 18, 2020 at 9:18 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> >
> > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future a la copy_struct_from_user().
> > The remaining iovecs contain the encoded extent.
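
To make the layout described above concrete, here is a minimal userspace
sketch of how a caller might arrange the iovec array. The field types of
struct encoded_iov and the helper name setup_encoded_iovecs() are
assumptions for illustration only; the rule that iov[0] covers exactly
sizeof(struct encoded_iov) and that the remaining iovecs hold the encoded
extent is what comes from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Assumed layout: field names follow the patch, the types are a guess. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/*
 * Hypothetical helper: iov[0] carries the metadata, iov[1] the encoded
 * extent. The header is zeroed here; for a write, the caller fills it in
 * afterwards, and for a read the filesystem returns it.
 */
static void setup_encoded_iovecs(struct iovec iov[2], struct encoded_iov *hdr,
				 void *data, size_t data_len)
{
	assert(hdr != NULL);
	memset(hdr, 0, sizeof(*hdr));
	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);	/* must be the struct size */
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}
```

The resulting two-element array would then be passed to preadv2() or
pwritev2() together with RWF_ENCODED.
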
> >
> > This adds the VFS helpers for supporting encoded I/O and documentation
> > for filesystem support.
> >
> > Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  Documentation/filesystems/encoded_io.rst |  74 ++++++++++
> >  Documentation/filesystems/index.rst      |   1 +
> >  fs/read_write.c                          | 167 +++++++++++++++++++++--
> >  include/linux/fs.h                       |  11 ++
> >  include/uapi/linux/fs.h                  |  41 +++++-
> >  5 files changed, 280 insertions(+), 14 deletions(-)
> >  create mode 100644 Documentation/filesystems/encoded_io.rst
> >
> > diff --git a/Documentation/filesystems/encoded_io.rst b/Documentation/filesystems/encoded_io.rst
> > new file mode 100644
> > index 000000000000..50405276d866
> > --- /dev/null
> > +++ b/Documentation/filesystems/encoded_io.rst
> > @@ -0,0 +1,74 @@
> > +===========
> > +Encoded I/O
> > +===========
> > +
> > +Encoded I/O is a mechanism for reading and writing encoded (e.g., compressed
> > +and/or encrypted) data directly from/to the filesystem. The userspace interface
> > +is thoroughly described in the :manpage:`encoded_io(7)` man page; this document
> > +describes the requirements for filesystem support.
> > +
> > +First of all, a filesystem supporting encoded I/O must indicate this by setting
> > +the ``FMODE_ENCODED_IO`` flag in its ``file_open`` file operation::
> > +

Hi, Amir, I'm getting back to this now after the holidays.

> Should this be FMODE_ALLOW_ENCODED_IO?
> How come I see no checks for this flag in vfs code?

Thanks for catching that, apparently I dropped the check between v5 and
v6 when I was resolving a conflict with commit ce71bfea207b ("fs: align
IOCB_* flags with RWF_* flags"). I'll add it back for v7. (The flag
indicates support for encoded I/O, and it's checked at read/write time,
so I think FMODE_ENCODED_IO is still the best name for it.)
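
One possible shape for such a read/write-time check, as a userspace model
rather than the actual v7 code. The FMODE_ENCODED_IO value is the one from
this patch; the RWF_ENCODED_MODEL value, the struct file_model type, the
function name, and the -EOPNOTSUPP return code are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define FMODE_ENCODED_IO	0x80000000u	/* value from this patch */
#define RWF_ENCODED_MODEL	0x20u		/* placeholder value, assumed */

/* Hypothetical stand-in for the part of struct file that matters here. */
struct file_model {
	unsigned int f_mode;
};

/* Reject an encoded request on a file that never advertised support. */
static int encoded_io_allowed(const struct file_model *f, unsigned int rwf)
{
	assert(f != NULL);
	if ((rwf & RWF_ENCODED_MODEL) && !(f->f_mode & FMODE_ENCODED_IO))
		return -EOPNOTSUPP;	/* error code is an assumption */
	return 0;
}
```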

> You seem to only be checking the O_ flag.
> Do we really want to allow setting the O_ flag after open or should we
> deny that?

I believe the conclusion after the other thread was to give
O_ALLOW_ENCODED no special treatment, so yes, we should allow it.

> > +    static int foo_file_open(struct inode *inode, struct file *filp)
> > +    {
> > +            ...
> > +            filp->f_mode |= FMODE_ENCODED_IO;
> > +            ...
> > +    }
> > +
> > +Encoded I/O goes through ``read_iter`` and ``write_iter``, designated by the
> > +``IOCB_ENCODED`` flag in ``kiocb->ki_flags``.
> > +
> > +Reads
> > +=====
> > +
> > +Encoded ``read_iter`` should:
> > +
> > +1. Call ``generic_encoded_read_checks()`` to validate the file and buffers
> > +   provided by userspace.
> > +2. Initialize the ``encoded_iov`` appropriately.
> > +3. Copy it to the user with ``copy_encoded_iov_to_iter()``.
> > +4. Copy the encoded data to the user.
> > +5. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
> > +6. Return the size of the encoded data read, not including the ``encoded_iov``.
> > +
> > +There are a few details to be aware of:
> > +
> > +* Encoded ``read_iter`` should support reading unencoded data if the extent is
> > +  not encoded.
> > +* If the buffers provided by the user are not large enough to contain an entire
> > +  encoded extent, then ``read_iter`` should return ``-ENOBUFS``. This is to
> > +  avoid confusing userspace with truncated data that cannot be properly
> > +  decoded.
> > +* Reads in the middle of an encoded extent can be returned by setting
> > +  ``encoded_iov->unencoded_offset`` to non-zero.
> > +* Truncated unencoded data (e.g., because the file does not end on a block
> > +  boundary) may be returned by setting ``encoded_iov->len`` to a value smaller
> > +  than ``encoded_iov->unencoded_len - encoded_iov->unencoded_offset``.
> > +
> > +Writes
> > +======
> > +
> > +Encoded ``write_iter`` should (in addition to the usual accounting/checks done
> > +by ``write_iter``):
> > +
> > +1. Call ``copy_encoded_iov_from_iter()`` to get and validate the
> > +   ``encoded_iov``.
> > +2. Call ``generic_encoded_write_checks()`` instead of
> > +   ``generic_write_checks()``.
> > +3. Check that the provided encoding in ``encoded_iov`` is supported.
> > +4. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
> > +5. Return the size of the encoded data written.
> > +
> > +Again, there are a few details:
> > +
> > +* Encoded ``write_iter`` doesn't need to support writing unencoded data.
> > +* ``write_iter`` should either write all of the encoded data or none of it; it
> > +  must not do partial writes.
> > +* ``write_iter`` doesn't need to validate the encoded data; a subsequent read
> > +  may return, e.g., ``-EIO`` if the data is not valid.
> > +* The user may lie about the unencoded size of the data; a subsequent read
> > +  should truncate or zero-extend the unencoded data rather than returning an
> > +  error.
> > +* Be careful of page cache coherency.
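
The read-side rules above can be sketched as a small userspace model. It
only covers steps 2 to 6 plus the -ENOBUFS rule; the file and buffer checks
of step 1 are not modeled. struct extent_model, encoded_read_model(), and
the field types of struct encoded_iov are hypothetical names and guesses
for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Assumed layout: field names follow the patch, the types are a guess. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* Hypothetical in-memory stand-in for one extent on disk. */
struct extent_model {
	const void *encoded;		/* encoded (e.g., compressed) bytes */
	size_t encoded_size;		/* size of those bytes */
	uint64_t unencoded_len;		/* decoded length of the whole extent */
	uint64_t unencoded_offset;	/* where the requested range starts */
	uint64_t len;			/* value reported as encoded_iov->len */
	uint32_t compression;
};

static ssize_t encoded_read_model(const struct extent_model *ext,
				  struct encoded_iov *hdr,
				  void *buf, size_t buf_len, int64_t *pos)
{
	assert(ext && hdr && buf && pos);

	/* Never hand back a truncated extent: it could not be decoded. */
	if (ext->encoded_size > buf_len)
		return -ENOBUFS;

	/* Steps 2 and 3: fill in the metadata for the caller. */
	memset(hdr, 0, sizeof(*hdr));
	hdr->len = ext->len;
	hdr->unencoded_len = ext->unencoded_len;
	hdr->unencoded_offset = ext->unencoded_offset;
	hdr->compression = ext->compression;

	/* Step 4: copy the encoded data itself. */
	memcpy(buf, ext->encoded, ext->encoded_size);

	*pos += (int64_t)hdr->len;		/* step 5 */
	return (ssize_t)ext->encoded_size;	/* step 6: header excluded */
}
```
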
> > diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> > index 98f59a864242..6d9e3ff0a455 100644
> > --- a/Documentation/filesystems/index.rst
> > +++ b/Documentation/filesystems/index.rst
> > @@ -53,6 +53,7 @@ filesystem implementations.
> >     journalling
> >     fscrypt
> >     fsverity
> > +   encoded_io
> >
> >  Filesystems
> >  ===========
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 75f764b43418..e2ad418d2987 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1625,24 +1625,15 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
> >         return 0;
> >  }
> >
> > -/*
> > - * Performs necessary checks before doing a write
> > - *
> > - * Can adjust writing position or amount of bytes to write.
> > - * Returns appropriate error code that caller should return or
> > - * zero in case that write should be allowed.
> > - */
> > -ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> >  {
> >         struct file *file = iocb->ki_filp;
> >         struct inode *inode = file->f_mapping->host;
> > -       loff_t count;
> > -       int ret;
> >
> >         if (IS_SWAPFILE(inode))
> >                 return -ETXTBSY;
> >
> > -       if (!iov_iter_count(from))
> > +       if (!*count)
> >                 return 0;
> >
> >         /* FIXME: this is for backwards compatibility with 2.4 */
> > @@ -1652,8 +1643,22 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >         if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> >                 return -EINVAL;
> >
> > -       count = iov_iter_count(from);
> > -       ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > +       return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > +}
> > +
> > +/*
> > + * Performs necessary checks before doing a write
> > + *
> > + * Can adjust writing position or amount of bytes to write.
> > + * Returns appropriate error code that caller should return or
> > + * zero in case that write should be allowed.
> > + */
> > +ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +       loff_t count = iov_iter_count(from);
> > +       int ret;
> > +
> > +       ret = generic_write_checks_common(iocb, &count);
> >         if (ret)
> >                 return ret;
> >
> > @@ -1684,3 +1689,139 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> >
> >         return 0;
> >  }
> > +
> > +/**
> > + * generic_encoded_write_checks() - check an encoded write
> > + * @iocb: I/O context.
> > + * @encoded: Encoding metadata.
> > + *
> > + * This should be called by RWF_ENCODED write implementations rather than
> > + * generic_write_checks(). Unlike generic_write_checks(), it returns -EFBIG
> > + * instead of adjusting the size of the write.
> > + *
> > + * Return: 0 on success, -errno on error.
> > + */
> > +int generic_encoded_write_checks(struct kiocb *iocb,
> > +                                const struct encoded_iov *encoded)
> > +{
> > +       loff_t count = encoded->len;
> > +       int ret;
> > +
> > +       if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
> > +               return -EPERM;
> > +
> > +       ret = generic_write_checks_common(iocb, &count);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (count != encoded->len) {
> > +               /*
> > +                * The write got truncated by generic_write_checks_common(). We
> > +                * can't do a partial encoded write.
> > +                */
> > +               return -EFBIG;
> > +       }
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > +
> > +/**
> > + * copy_encoded_iov_from_iter() - copy a &struct encoded_iov from userspace
> > + * @encoded: Returned encoding metadata.
> > + * @from: Source iterator.
> > + *
> > + * This copies in the &struct encoded_iov and does some basic sanity checks.
> > + * This should always be used rather than a plain copy_from_iter(), as it does
> > + * the proper handling for backward- and forward-compatibility.
> > + *
> > + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if the
> > + *         copied structure contained non-zero fields that this kernel doesn't
> > + *         support, -EINVAL if the copied structure was invalid.
> > + */
> > +int copy_encoded_iov_from_iter(struct encoded_iov *encoded,
> > +                              struct iov_iter *from)
> > +{
> > +       size_t usize;
> > +       int ret;
> > +
> > +       usize = iov_iter_single_seg_count(from);
> > +       if (usize > PAGE_SIZE)
> > +               return -E2BIG;
> > +       if (usize < ENCODED_IOV_SIZE_VER0)
> > +               return -EINVAL;
> > +       ret = copy_struct_from_iter(encoded, sizeof(*encoded), from, usize);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > +               return -EINVAL;
> > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > +               return -EINVAL;
> > +       if (encoded->unencoded_offset > encoded->unencoded_len)
> > +               return -EINVAL;
> > +       if (encoded->len > encoded->unencoded_len - encoded->unencoded_offset)
> > +               return -EINVAL;
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL(copy_encoded_iov_from_iter);
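
The sanity checks in copy_encoded_iov_from_iter() can be exercised in
isolation with a userspace model like the one below. The field types of
struct encoded_iov and the value of ENCODED_IOV_ENCRYPTION_TYPES are
assumptions (neither is visible in this part of the patch), and
check_encoded_iov() is a hypothetical name; the compression values follow
the enum order in the UAPI hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: field names follow the patch, the types are a guess. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

#define ENCODED_IOV_COMPRESSION_NONE	0
#define ENCODED_IOV_COMPRESSION_TYPES	7	/* BTRFS_LZO_64K in the enum */
#define ENCODED_IOV_ENCRYPTION_NONE	0
#define ENCODED_IOV_ENCRYPTION_TYPES	0	/* assumed: none defined yet */

static int check_encoded_iov(const struct encoded_iov *e)
{
	assert(e != NULL);

	/* An encoded write with no encoding at all is rejected. */
	if (e->compression == ENCODED_IOV_COMPRESSION_NONE &&
	    e->encryption == ENCODED_IOV_ENCRYPTION_NONE)
		return -EINVAL;
	if (e->compression > ENCODED_IOV_COMPRESSION_TYPES ||
	    e->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
		return -EINVAL;
	/* Checked first so that the subtraction below cannot wrap around. */
	if (e->unencoded_offset > e->unencoded_len)
		return -EINVAL;
	if (e->len > e->unencoded_len - e->unencoded_offset)
		return -EINVAL;
	return 0;
}
```

The order of the last two checks matters: the fields are unsigned, so
comparing the offset against unencoded_len before subtracting keeps the
difference from underflowing.
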
> > +
> > +/**
> > + * generic_encoded_read_checks() - sanity check an RWF_ENCODED read
> > + * @iocb: I/O context.
> > + * @iter: Destination iterator for read.
> > + *
> > + * This should always be called by RWF_ENCODED read implementations before
> > + * returning any data.
> > + *
> > + * Return: Number of bytes available to return encoded data in @iter on success,
> > + *         -EPERM if the file was not opened with O_ALLOW_ENCODED, -EINVAL if
> > + *         the size of the &struct encoded_iov iovec is invalid.
> > + */
> > +ssize_t generic_encoded_read_checks(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +       size_t usize;
> > +
> > +       if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
> > +               return -EPERM;
> > +       usize = iov_iter_single_seg_count(iter);
> > +       if (usize > PAGE_SIZE || usize < ENCODED_IOV_SIZE_VER0)
> > +               return -EINVAL;
> > +       return iov_iter_count(iter) - usize;
> > +}
> > +EXPORT_SYMBOL(generic_encoded_read_checks);
> > +
> > +/**
> > + * copy_encoded_iov_to_iter() - copy a &struct encoded_iov to userspace
> > + * @encoded: Encoding metadata to return.
> > + * @to: Destination iterator.
> > + *
> > + * This should always be used by RWF_ENCODED read implementations rather than a
> > + * plain copy_to_iter(), as it does the proper handling for backward- and
> > + * forward-compatibility. The iterator must be sanity-checked with
> > + * generic_encoded_read_checks() before this is called.
> > + *
> > + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if there
> > + *         were non-zero fields in @encoded that the user buffer could not
> > + *         accommodate.
> > + */
> > +int copy_encoded_iov_to_iter(const struct encoded_iov *encoded,
> > +                            struct iov_iter *to)
> > +{
> > +       size_t ksize = sizeof(*encoded);
> > +       size_t usize = iov_iter_single_seg_count(to);
> > +       size_t size = min(ksize, usize);
> > +
> > +       /* We already sanity-checked usize in generic_encoded_read_checks(). */
> > +
> > +       if (usize < ksize &&
> > +           memchr_inv((char *)encoded + usize, 0, ksize - usize))
> > +               return -E2BIG;
> > +       if (copy_to_iter(encoded, size, to) != size ||
> > +           (usize > ksize &&
> > +            iov_iter_zero(usize - ksize, to) != usize - ksize))
> > +               return -EFAULT;
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL(copy_encoded_iov_to_iter);
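
The size negotiation in copy_encoded_iov_to_iter() is easier to follow
with plain buffers. The sketch below is a userspace model under that
simplification: copy_struct_to_buf() is a hypothetical name, memcpy() and
memset() stand in for copy_to_iter() and iov_iter_zero(), and the -EFAULT
path is therefore not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * ksize is the struct size this side knows about, usize is the size of
 * the caller's buffer.
 */
static int copy_struct_to_buf(void *dst, size_t usize,
			      const void *src, size_t ksize)
{
	size_t size = usize < ksize ? usize : ksize;

	assert(dst != NULL && src != NULL);

	/* Older caller: refuse rather than silently drop non-zero fields. */
	if (usize < ksize) {
		const unsigned char *tail = (const unsigned char *)src + usize;
		size_t i;

		for (i = 0; i < ksize - usize; i++)
			if (tail[i])
				return -E2BIG;
	}

	memcpy(dst, src, size);

	/* Newer caller: zero the fields this side does not know about. */
	if (usize > ksize)
		memset((unsigned char *)dst + ksize, 0, usize - ksize);
	return 0;
}
```
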
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8667d0cdc71e..67810bf6fb1c 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -178,6 +178,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File supports async buffered reads */
> >  #define FMODE_BUF_RASYNC       ((__force fmode_t)0x40000000)
> >
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO       ((__force fmode_t)0x80000000)
> > +
> >  /*
> >   * Attribute flags.  These should be or-ed together to figure out what
> >   * has been changed!
> > @@ -308,6 +311,7 @@ enum rw_hint {
> >  #define IOCB_SYNC              (__force int) RWF_SYNC
> >  #define IOCB_NOWAIT            (__force int) RWF_NOWAIT
> >  #define IOCB_APPEND            (__force int) RWF_APPEND
> > +#define IOCB_ENCODED           (__force int) RWF_ENCODED
> >
> >  /* non-RWF related bits - start at 16 */
> >  #define IOCB_EVENTFD           (1 << 16)
> > @@ -2964,6 +2968,13 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> >  extern int generic_write_check_limits(struct file *file, loff_t pos,
> >                 loff_t *count);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *,
> > +                                       const struct encoded_iov *);
> > +extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *);
> > +extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *);
> > +extern int copy_encoded_iov_to_iter(const struct encoded_iov *,
> > +                                   struct iov_iter *);
> >  extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
> >  extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
> >                 struct iov_iter *to, ssize_t already_read);
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index f44eb0a04afd..95493420117a 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -279,6 +279,42 @@ struct fsxattr {
> >                                          SYNC_FILE_RANGE_WAIT_BEFORE | \
> >                                          SYNC_FILE_RANGE_WAIT_AFTER)
> >
> > +enum {
> > +       ENCODED_IOV_COMPRESSION_NONE,
> > +#define ENCODED_IOV_COMPRESSION_NONE ENCODED_IOV_COMPRESSION_NONE
> > +       ENCODED_IOV_COMPRESSION_BTRFS_ZLIB,
> > +#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ENCODED_IOV_COMPRESSION_BTRFS_ZLIB
> > +       ENCODED_IOV_COMPRESSION_BTRFS_ZSTD,
> > +#define ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ENCODED_IOV_COMPRESSION_BTRFS_ZSTD
> > +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K,
> > +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K
> > +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K,
> > +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K
> > +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K,
> > +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K
> > +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K,
> > +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K
> > +       ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
> > +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K
> > +       ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K,
> > +};
> > +
> 
> I am not a fan of this trick.
> There is no shortage of enums in uapi headers, but I think that if we want
> to set values in stone, the values should be set explicitly and not
> auto assigned by compiler.
> 
> If anybody ever adds a line, say ENCODED_IOV_COMPRESSION_BTRFS_ZLIB_V2
> in the middle of the enum list, it won't be obvious that it's a uapi breakage.
> 
> In principle, we could have partitioned the encoding types by domains
> (e.g. btrfs), and the btrfs specific encodings would have been a part of a
> btrfs header, but it's not that important.
> 
> However, please move all encoded_io stuff to a new uapi header and do
> not include it from fs.h to avoid having to compile most filesystems every
> time a new btrfs private encoding type is added.

Fine with me, I'll make these #defines and move them to their own
header.

Thanks,
Omar
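
A sketch of what the explicit-value form could look like, keeping the
values implied by the order of the v6 enum. Where the definitions end up
and whether the numbering stays the same are assumptions; the helper
encoded_iov_compression_known() is hypothetical and only there to show how
the upper bound is used.

```c
#include <assert.h>

/* Sketch only: explicit values matching the order of the v6 enum. */
#define ENCODED_IOV_COMPRESSION_NONE		0
#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB	1
#define ENCODED_IOV_COMPRESSION_BTRFS_ZSTD	2
#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K	3
#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K	4
#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K	5
#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K	6
#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K	7
#define ENCODED_IOV_COMPRESSION_TYPES	ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K

/* Inserting a value mid-list now has to renumber visibly, or fail here. */
static_assert(ENCODED_IOV_COMPRESSION_TYPES == 7,
	      "compression values are ABI and must not be renumbered");

/* Hypothetical helper: is @c a compression type this header knows? */
static inline int encoded_iov_compression_known(unsigned int c)
{
	return c <= ENCODED_IOV_COMPRESSION_TYPES;
}
```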

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Ping: [PATCH man-pages v6] Document encoded I/O
  2020-12-18 10:32                 ` Ping: " Alejandro Colomar (man-pages)
@ 2021-01-12  1:12                   ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2021-01-12  1:12 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages)
  Cc: linux-fsdevel, linux-btrfs, Al Viro, Michael Kerrisk (man-pages),
	Christoph Hellwig, Dave Chinner, Jann Horn, Amir Goldstein,
	Aleksa Sarai, linux-api, kernel-team, linux-man

On Fri, Dec 18, 2020 at 11:32:17AM +0100, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> Linux 5.10 has been recently released.
> Do you have any updates for this patch?
> 
> Thanks,
> 
> Alex

Hi, Alex,

Now that the holidays are over I'm revisiting this series and plan to
send a new version this week or next.

Thanks,
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2021-01-12  1:13 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-18 19:18 [PATCH v6 00/11] fs: interface for directly reading/writing compressed data Omar Sandoval
2020-11-18 19:18 ` [PATCH man-pages v6] Document encoded I/O Omar Sandoval
2020-11-19 23:29   ` Alejandro Colomar (mailing lists; readonly)
2020-11-20 14:06     ` Alejandro Colomar (man-pages)
2020-11-20 15:03       ` Alejandro Colomar (man-pages)
2020-11-30 19:35         ` Omar Sandoval
2020-12-01 14:36         ` Ping: " Alejandro Colomar (man-pages)
2020-12-01 20:12         ` Michael Kerrisk (man-pages)
2020-12-01 20:20           ` Michael Kerrisk (man-pages)
2020-12-01 21:35             ` Alejandro Colomar (man-pages)
2020-12-01 21:56               ` Michael Kerrisk (man-pages)
2020-12-18 10:32                 ` Ping: " Alejandro Colomar (man-pages)
2021-01-12  1:12                   ` Omar Sandoval
     [not found]           ` <20201201202144.ulbfnawi2ljmm6mn@localhost.localdomain>
2020-12-01 21:34             ` Alejandro Colomar (man-pages)
2020-12-01 21:58             ` Michael Kerrisk (man-pages)
2020-11-18 19:18 ` [PATCH v6 01/11] iov_iter: add copy_struct_from_iter() Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 02/11] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
2020-11-19  7:02   ` Amir Goldstein
2020-11-20 23:41     ` Jann Horn
2020-11-30 19:26       ` Omar Sandoval
2020-12-01  8:15         ` Amir Goldstein
2020-12-01 20:31           ` Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 03/11] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
2020-11-19  7:38   ` Amir Goldstein
2021-01-11 23:06     ` Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 04/11] btrfs: fix btrfs_write_check() Omar Sandoval
2020-11-23 17:08   ` David Sterba
2020-11-30 19:18     ` Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 05/11] btrfs: fix check_data_csum() error message for direct I/O Omar Sandoval
2020-11-23 17:09   ` David Sterba
2020-11-30 19:20     ` Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 06/11] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 07/11] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 08/11] btrfs: support different disk extent size for delalloc Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 09/11] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 10/11] btrfs: implement RWF_ENCODED reads Omar Sandoval
2020-12-03 14:32   ` Josef Bacik
2021-01-11 20:21     ` Omar Sandoval
2021-01-11 20:35       ` Josef Bacik
2021-01-11 20:58         ` Omar Sandoval
2020-11-18 19:18 ` [PATCH v6 11/11] btrfs: implement RWF_ENCODED writes Omar Sandoval
2020-12-02 22:03   ` Josef Bacik
2020-12-03 14:37   ` Josef Bacik
