* [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data
@ 2021-06-17 23:51 Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter() Omar Sandoval
                   ` (8 more replies)
  0 siblings, 9 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

This series adds an API for reading compressed data on a filesystem
without decompressing it as well as support for writing compressed data
directly to the filesystem. I have test cases (including fsstress
support) and example programs which I'll send up once the dust settles
[1].

The main use-case is Btrfs send/receive: currently, when sending data
from one compressed filesystem to another, the sending side decompresses
the data and the receiving side recompresses it before writing it out.
This is wasteful and can be avoided if we can just send and write
compressed extents. The patches implementing the send/receive support
were sent with the last submission of this series [2].
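
For illustration only (the actual send/receive changes are in [2]), the
sending side boils down to something like the following hypothetical
userspace sketch, using the RWF_ENCODED interface and struct encoded_iov
layout introduced in patch 3; the constants are copied from that patch and
would normally come from the UAPI headers:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #define RWF_ENCODED 0x00000020          /* from patch 3 */

    struct encoded_iov {                    /* from <linux/encoded_io.h> */
            uint64_t len;
            uint64_t unencoded_len;
            uint64_t unencoded_offset;
            uint32_t compression;
            uint32_t encryption;
    };

    /*
     * Read one extent at @offset without decompressing it. @fd must have
     * O_ALLOW_ENCODED set (see patch 2).
     */
    static ssize_t read_encoded_extent(int fd, off_t offset,
                                       struct encoded_iov *encoded,
                                       void *buf, size_t buf_len)
    {
            struct iovec iov[2] = {
                    { encoded, sizeof(*encoded) }, /* iov[0]: metadata out */
                    { buf, buf_len },              /* iov[1..]: encoded data out */
            };

            /*
             * Returns the number of encoded bytes for one extent;
             * encoded->compression tells the receiver how to interpret them.
             */
            return preadv2(fd, iov, 2, offset, RWF_ENCODED);
    }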

Patches 1-3 add the VFS support, UAPI, and documentation. Patches 4-7
are Btrfs prep patches. Patch 8 adds Btrfs encoded read support and
patch 9 adds Btrfs encoded write support.

These patches are based on Dave Sterba's Btrfs misc-next branch [3],
which is in turn currently based on v5.13-rc6.

This is a _resend of a resend of a resend_ of v9 [4], rebased on the
latest kdave/misc-next branch.

In the last resend, there was some good discussion around how to support
encryption with this interface in the future. The conclusion was that
this interface should suffice for file data, and we would need separate
interface(s) for working with encrypted file names. So, this really just
needs review on the VFS side.

1: https://github.com/osandov/xfstests/tree/rwf-encoded
2: https://lore.kernel.org/linux-btrfs/cover.1615922753.git.osandov@fb.com/
3: https://github.com/kdave/btrfs-devel/tree/misc-next
4: https://lore.kernel.org/linux-fsdevel/cover.1621276134.git.osandov@fb.com/

Omar Sandoval (9):
  iov_iter: add copy_struct_from_iter()
  fs: add O_ALLOW_ENCODED open flag
  fs: add RWF_ENCODED for reading/writing compressed data
  btrfs: don't advance offset for compressed bios in
    btrfs_csum_one_bio()
  btrfs: add ram_bytes and offset to btrfs_ordered_extent
  btrfs: support different disk extent size for delalloc
  btrfs: optionally extend i_size in cow_file_range_inline()
  btrfs: implement RWF_ENCODED reads
  btrfs: implement RWF_ENCODED writes

 Documentation/filesystems/encoded_io.rst | 240 ++++++
 Documentation/filesystems/index.rst      |   1 +
 arch/alpha/include/uapi/asm/fcntl.h      |   1 +
 arch/parisc/include/uapi/asm/fcntl.h     |   1 +
 arch/sparc/include/uapi/asm/fcntl.h      |   1 +
 fs/btrfs/compression.c                   |  12 +-
 fs/btrfs/compression.h                   |   6 +-
 fs/btrfs/ctree.h                         |   9 +-
 fs/btrfs/delalloc-space.c                |  18 +-
 fs/btrfs/file-item.c                     |  35 +-
 fs/btrfs/file.c                          |  46 +-
 fs/btrfs/inode.c                         | 925 +++++++++++++++++++++--
 fs/btrfs/ordered-data.c                  | 124 +--
 fs/btrfs/ordered-data.h                  |  25 +-
 fs/btrfs/relocation.c                    |   4 +-
 fs/fcntl.c                               |  10 +-
 fs/namei.c                               |   4 +
 fs/read_write.c                          | 168 +++-
 include/linux/encoded_io.h               |  17 +
 include/linux/fcntl.h                    |   2 +-
 include/linux/fs.h                       |  13 +
 include/linux/uio.h                      |   1 +
 include/uapi/asm-generic/fcntl.h         |   4 +
 include/uapi/linux/encoded_io.h          |  30 +
 include/uapi/linux/fs.h                  |   5 +-
 lib/iov_iter.c                           |  91 +++
 26 files changed, 1559 insertions(+), 234 deletions(-)
 create mode 100644 Documentation/filesystems/encoded_io.rst
 create mode 100644 include/linux/encoded_io.h
 create mode 100644 include/uapi/linux/encoded_io.h

-- 
2.32.0



* [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-18 18:50   ` Linus Torvalds
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 2/9] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

This is essentially copy_struct_from_user() but for an iov_iter.

Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 91 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index d3ec87706d75..cbaf6b3bfcbc 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -129,6 +129,7 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
+int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i);
 
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c701b7a187f2..129f264416ff 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -995,6 +995,97 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+/**
+ * copy_struct_from_iter - copy a struct from an iov_iter
+ * @dst: Destination buffer.
+ * @ksize: Size of @dst struct.
+ * @i: Source iterator.
+ *
+ * Copies a struct from an iov_iter in a way that guarantees
+ * backwards-compatibility for struct arguments in an iovec (as long as the
+ * rules for copy_struct_from_user() are followed).
+ *
+ * The source struct is assumed to be stored in the current segment of the
+ * iov_iter, and its size is the size of the current segment. The iov_iter must
+ * be positioned at the beginning of the current segment.
+ *
+ * The recommended usage is something like the following:
+ *
+ *   int do_foo(struct iov_iter *i)
+ *   {
+ *     size_t usize = iov_iter_single_seg_count(i);
+ *     struct foo karg;
+ *     int err;
+ *
+ *     if (usize > PAGE_SIZE)
+ *       return -E2BIG;
+ *     if (usize < FOO_SIZE_VER0)
+ *       return -EINVAL;
+ *     err = copy_struct_from_iter(&karg, sizeof(karg), i);
+ *     if (err)
+ *       return err;
+ *
+ *     // ...
+ *   }
+ *
+ * Returns 0 on success or one of the following errors:
+ *  * -E2BIG:  (size of current segment > @ksize) and there are non-zero
+ *             trailing bytes in the current segment.
+ *  * -EFAULT: access to userspace failed.
+ *  * -EINVAL: the iterator is not at the beginning of the current segment.
+ *
+ * On success, the iterator is advanced to the next segment. On error, the
+ * iterator is not advanced.
+ */
+int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i)
+{
+	size_t usize;
+	int ret;
+
+	if (i->iov_offset != 0)
+		return -EINVAL;
+	if (iter_is_iovec(i)) {
+		usize = i->iov->iov_len;
+		might_fault();
+		if (copyin(dst, i->iov->iov_base, min(ksize, usize)))
+			return -EFAULT;
+		if (usize > ksize) {
+			ret = check_zeroed_user(i->iov->iov_base + ksize,
+						usize - ksize);
+			if (ret < 0)
+				return ret;
+			else if (ret == 0)
+				return -E2BIG;
+		}
+	} else if (iov_iter_is_kvec(i)) {
+		usize = i->kvec->iov_len;
+		memcpy(dst, i->kvec->iov_base, min(ksize, usize));
+		if (usize > ksize &&
+		    memchr_inv(i->kvec->iov_base + ksize, 0, usize - ksize))
+			return -E2BIG;
+	} else if (iov_iter_is_bvec(i)) {
+		char *p;
+
+		usize = i->bvec->bv_len;
+		p = kmap_local_page(i->bvec->bv_page);
+		memcpy(dst, p + i->bvec->bv_offset, min(ksize, usize));
+		if (usize > ksize &&
+		    memchr_inv(p + i->bvec->bv_offset + ksize, 0,
+			       usize - ksize)) {
+			kunmap_local(p);
+			return -E2BIG;
+		}
+		kunmap_local(p);
+	} else {
+		return -EFAULT;
+	}
+	if (usize < ksize)
+		memset(dst + usize, 0, ksize - usize);
+	iov_iter_advance(i, usize);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(copy_struct_from_iter);
+
 static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-- 
2.32.0



* [PATCH RESEND x3 v9 2/9] fs: add O_ALLOW_ENCODED open flag
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter() Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 3/9] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

The upcoming RWF_ENCODED operation introduces some security concerns:

1. Compressed writes will pass arbitrary data to decompression
   algorithms in the kernel.
2. Compressed reads can leak truncated/hole punched data.

Therefore, we need to require privilege for RWF_ENCODED. It's not
possible to do the permissions checks at the time of the read or write
because, e.g., io_uring submits IO from a worker thread. So, add an open
flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
fcntl(). The flag is not cleared in any way on fork or exec.

Note that the usual problem of unknown open flags being silently ignored
doesn't really matter for O_ALLOW_ENCODED: if the kernel doesn't support
O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either.
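
As a rough illustration (not part of this patch), a privileged userspace
program could acquire and drop the flag as in the sketch below; the
O_ALLOW_ENCODED value is the asm-generic one added here and would normally
come from the installed headers once this lands:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef O_ALLOW_ENCODED
    #define O_ALLOW_ENCODED 040000000       /* asm-generic value from this patch */
    #endif

    int main(void)
    {
            /* Setting the flag at open time requires CAP_SYS_ADMIN. */
            int fd = open("file", O_RDWR | O_ALLOW_ENCODED);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* fcntl() can also toggle it later; clearing it is unprivileged. */
            if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_ALLOW_ENCODED) == -1)
                    perror("fcntl");
            close(fd);
            return 0;
    }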

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 arch/alpha/include/uapi/asm/fcntl.h  |  1 +
 arch/parisc/include/uapi/asm/fcntl.h |  1 +
 arch/sparc/include/uapi/asm/fcntl.h  |  1 +
 fs/fcntl.c                           | 10 ++++++++--
 fs/namei.c                           |  4 ++++
 include/linux/fcntl.h                |  2 +-
 include/uapi/asm-generic/fcntl.h     |  4 ++++
 7 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..391e0d112e41 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
 
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
+#define O_ALLOW_ENCODED	0200000000
 
 #define F_GETLK		7
 #define F_SETLK		8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03dee816cb13..0feb31faaefa 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -19,6 +19,7 @@
 
 #define O_PATH		020000000
 #define __O_TMPFILE	040000000
+#define O_ALLOW_ENCODED	0100000000
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..ac3e8c9cb32c 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_ALLOW_ENCODED	0x8000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f15be7f..eca4eb008194 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -31,7 +31,8 @@
 #include <asm/siginfo.h>
 #include <linux/uaccess.h>
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+		    O_ALLOW_ENCODED)
 
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
@@ -50,6 +51,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 		if (!inode_owner_or_capable(file_mnt_user_ns(filp), inode))
 			return -EPERM;
 
+	/* O_ALLOW_ENCODED can only be set by superuser */
+	if ((arg & O_ALLOW_ENCODED) && !(filp->f_flags & O_ALLOW_ENCODED) &&
+	    !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	/* required for strict SunOS emulation */
 	if (O_NONBLOCK != O_NDELAY)
 	       if (arg & O_NDELAY)
@@ -1043,7 +1049,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index 79b0ff9b151e..b05f121b3947 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2997,6 +2997,10 @@ static int may_open(struct user_namespace *mnt_userns, const struct path *path,
 	if (flag & O_NOATIME && !inode_owner_or_capable(mnt_userns, inode))
 		return -EPERM;
 
+	/* O_ALLOW_ENCODED can only be set by superuser */
+	if ((flag & O_ALLOW_ENCODED) && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	return 0;
 }
 
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 766fcd973beb..2cd6a9185d4c 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ALLOW_ENCODED)
 
 /* List of all valid flags for the how->upgrade_mask argument: */
 #define VALID_UPGRADE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..75321c7a66ac 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_ALLOW_ENCODED
+#define O_ALLOW_ENCODED	040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
-- 
2.32.0



* [PATCH RESEND x3 v9 3/9] fs: add RWF_ENCODED for reading/writing compressed data
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter() Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 2/9] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 4/9] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Btrfs supports transparent compression: data written by the user can be
compressed when written to disk and decompressed when read back.
However, we'd like to add an interface to write pre-compressed data
directly to the filesystem, and the matching interface to read
compressed data without decompressing it. This adds support for
so-called "encoded I/O" via preadv2() and pwritev2().

A new RWF_ENCODED flag indicates that a read or write is "encoded". If
this flag is set, iov[0].iov_base points to a struct encoded_iov which
is used for metadata: namely, the compression algorithm, unencoded
(i.e., decompressed) length, and what subrange of the unencoded data
should be used (needed for truncated or hole-punched extents and when
reading in the middle of an extent). For reads, the filesystem returns
this information; for writes, the caller provides it to the filesystem.
iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
used to extend the interface in the future a la copy_struct_from_user().
The remaining iovecs contain the encoded extent.

This adds the VFS helpers for supporting encoded I/O and documentation.
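
For reference, a minimal (hypothetical) userspace sketch of an encoded
write follows; the RWF_ENCODED value and struct encoded_iov layout are
copied from this patch and would normally come from the UAPI headers, and
the file descriptor must have O_ALLOW_ENCODED set:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #define RWF_ENCODED 0x00000020
    #define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB 1

    struct encoded_iov {
            uint64_t len;
            uint64_t unencoded_len;
            uint64_t unencoded_offset;
            uint32_t compression;
            uint32_t encryption;
    };

    /*
     * Write one pre-compressed extent at file offset @offset. @data/@data_len
     * hold a single zlib stream; @unencoded_len is its decompressed size.
     */
    static ssize_t write_compressed_extent(int fd, off_t offset, const void *data,
                                           size_t data_len, uint64_t unencoded_len)
    {
            struct encoded_iov encoded = {
                    .len = unencoded_len,
                    .unencoded_len = unencoded_len,
                    .unencoded_offset = 0,
                    .compression = ENCODED_IOV_COMPRESSION_BTRFS_ZLIB,
            };
            struct iovec iov[2] = {
                    { &encoded, sizeof(encoded) },  /* iov[0]: metadata */
                    { (void *)data, data_len },     /* iov[1..]: encoded bytes */
            };

            /* Returns the number of encoded bytes written (here, data_len). */
            return pwritev2(fd, iov, 2, offset, RWF_ENCODED);
    }

The read side mirrors this: preadv2() fills iov[0] and the data buffers for
one extent and returns the number of encoded bytes, or fails with ENOBUFS
if the buffers cannot hold a whole extent.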

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 Documentation/filesystems/encoded_io.rst | 240 +++++++++++++++++++++++
 Documentation/filesystems/index.rst      |   1 +
 fs/read_write.c                          | 168 ++++++++++++++--
 include/linux/encoded_io.h               |  17 ++
 include/linux/fs.h                       |  13 ++
 include/uapi/linux/encoded_io.h          |  30 +++
 include/uapi/linux/fs.h                  |   5 +-
 7 files changed, 460 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/encoded_io.rst
 create mode 100644 include/linux/encoded_io.h
 create mode 100644 include/uapi/linux/encoded_io.h

diff --git a/Documentation/filesystems/encoded_io.rst b/Documentation/filesystems/encoded_io.rst
new file mode 100644
index 000000000000..38f1dc940331
--- /dev/null
+++ b/Documentation/filesystems/encoded_io.rst
@@ -0,0 +1,240 @@
+===========
+Encoded I/O
+===========
+
+Several filesystems (e.g., Btrfs) support transparent encoding (e.g.,
+compression, encryption) of data on disk: written data is encoded by the kernel
+before it is written to disk, and read data is decoded before being returned to
+the user. In some cases, it is useful to skip this encoding step. For example,
+the user may want to read the compressed contents of a file or write
+pre-compressed data directly to a file. This is referred to as "encoded I/O".
+
+User API
+========
+
+Encoded I/O is specified with the ``RWF_ENCODED`` flag to ``preadv2()`` and
+``pwritev2()``. If ``RWF_ENCODED`` is specified, then ``iov[0].iov_base``
+points to an ``encoded_iov`` structure, defined in ``<linux/encoded_io.h>``
+as::
+
+    struct encoded_iov {
+            __aligned_u64 len;
+            __aligned_u64 unencoded_len;
+            __aligned_u64 unencoded_offset;
+            __u32 compression;
+            __u32 encryption;
+    };
+
+This may be extended in the future, so ``iov[0].iov_len`` must be set to
+``sizeof(struct encoded_iov)`` for forward/backward compatibility. The
+remaining buffers contain the encoded data.
+
+``compression`` and ``encryption`` are the encoding fields. ``compression`` is
+``ENCODED_IOV_COMPRESSION_NONE`` (zero) or a filesystem-specific
+``ENCODED_IOV_COMPRESSION_*`` constant; see `Filesystem support`_ below.
+``encryption`` is currently always ``ENCODED_IOV_ENCRYPTION_NONE`` (zero).
+
+``unencoded_len`` is the length of the unencoded (i.e., decrypted and
+decompressed) data. ``unencoded_offset`` is the offset from the first byte of
+the unencoded data to the first byte of logical data in the file (less than or
+equal to ``unencoded_len``). ``len`` is the length of the data in the file
+(less than or equal to ``unencoded_len - unencoded_offset``). See `Extent
+layout`_ below for some examples.
+
+If the unencoded data is actually longer than ``unencoded_len``, then it is
+truncated; if it is shorter, then it is extended with zeroes.
+
+``pwritev2()`` uses the metadata specified in ``iov[0]``, writes the encoded
+data from the remaining buffers, and returns the number of encoded bytes
+written (that is, the sum of ``iov[n].iov_len for 1 <= n < iovcnt``; partial
+writes will not occur). At least one encoding field must be non-zero. Note that
+the encoded data is not validated when it is written; if it is not valid (e.g.,
+it cannot be decompressed), then a subsequent read may return an error. If the
+offset argument to ``pwritev2()`` is -1, then the file offset is incremented by
+``len``. If ``iov[0].iov_len`` is less than ``sizeof(struct encoded_iov)`` in
+the kernel, then any fields unknown to user space are treated as if they were
+zero; if it is greater and any fields unknown to the kernel are non-zero, then
+``pwritev2()`` returns -1 and sets errno to ``E2BIG``.
+
+``preadv2()`` populates the metadata in ``iov[0]``, the encoded data in the
+remaining buffers, and returns the number of encoded bytes read. This will only
+return one extent per call. This can also read data which is not encoded; all
+encoding fields will be zero in that case. If the offset argument to
+``preadv2()`` is -1, then the file offset is incremented by ``len``. If
+``iov[0].iov_len`` is less than ``sizeof(struct encoded_iov)`` in the kernel
+and any fields unknown to user space are non-zero, then ``preadv2()`` returns
+-1 and sets errno to ``E2BIG``; if it is greater, then any fields unknown to
+the kernel are returned as zero. If the provided buffers are not large enough
+to return an entire encoded extent, then ``preadv2()`` returns -1 and sets
+errno to ``ENOBUFS``.
+
+As the filesystem page cache typically contains decoded data, encoded I/O
+bypasses the page cache.
+
+Extent layout
+-------------
+
+By using ``len``, ``unencoded_len``, and ``unencoded_offset``, it is possible
+to refer to a subset of an unencoded extent.
+
+In the simplest case, ``len`` is equal to ``unencoded_len`` and
+``unencoded_offset`` is zero. This means that the entire unencoded extent is
+used.
+
+However, suppose we read 50 bytes into a file which contains a single
+compressed extent. The filesystem must still return the entire compressed
+extent for us to be able to decompress it, so ``unencoded_len`` would be the
+length of the entire decompressed extent. However, because the read was at
+offset 50, the first 50 bytes should be ignored. Therefore,
+``unencoded_offset`` would be 50, and ``len`` would accordingly be
+``unencoded_len - 50``.
+
+Additionally, suppose we want to create an encrypted file with length 500, but
+the file is encrypted with a block cipher using a block size of 4096. The
+unencoded data would therefore include the appropriate padding, and
+``unencoded_len`` would be 4096. However, to represent the logical size of the
+file, ``len`` would be 500 (and ``unencoded_offset`` would be 0).
+
+Similar situations can arise in other cases:
+
+* If the filesystem pads data to the filesystem block size before compressing,
+  then compressed files with a size unaligned to the filesystem block size will
+  end with an extent with ``len < unencoded_len``.
+
+* Extents cloned from the middle of a larger encoded extent with
+  ``FICLONERANGE`` may have a non-zero ``unencoded_offset`` and/or
+  ``len < unencoded_len``.
+
+* If the middle of an encoded extent is overwritten, the filesystem may create
+  extents with a non-zero ``unencoded_offset`` and/or ``len < unencoded_len``
+  for the parts that were not overwritten.
+
+Security
+--------
+
+Encoded I/O creates the potential for some security issues:
+
+* Encoded writes allow writing arbitrary data which the kernel will decode on a
+  subsequent read. Decompression algorithms are complex and may have bugs that
+  can be exploited by maliciously crafted data.
+* Encoded reads may return data that is not logically present in the file (see
+  the discussion of ``len`` vs ``unencoded_len`` above). It may not be intended
+  for this data to be readable.
+
+Therefore, encoded I/O requires privilege. Namely, the ``RWF_ENCODED`` flag may
+only be used if the file description has the ``O_ALLOW_ENCODED`` file status
+flag set, and the ``O_ALLOW_ENCODED`` flag may only be set by a thread with the
+``CAP_SYS_ADMIN`` capability. The ``O_ALLOW_ENCODED`` flag can be set by
+``open()`` or ``fcntl()``. It can also be cleared by ``fcntl()``; clearing it
+does not require ``CAP_SYS_ADMIN``. Note that it is not cleared on ``fork()``
+or ``execve()``. One may wish to use ``O_CLOEXEC`` with ``O_ALLOW_ENCODED``.
+
+Filesystem support
+------------------
+
+Encoded I/O is supported on the following filesystems:
+
+Btrfs (since Linux 5.14)
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Btrfs supports encoded reads and writes of compressed data. The data is encoded
+as follows:
+
+* If ``compression`` is ``ENCODED_IOV_COMPRESSION_BTRFS_ZLIB``, then the encoded
+  data is a single zlib stream.
+* If ``compression`` is ``ENCODED_IOV_COMPRESSION_BTRFS_ZSTD``, then the
+  encoded data is a single zstd frame compressed with the windowLog compression
+  parameter set to no more than 17.
+* If ``compression`` is one of ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K``,
+  ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K``,
+  ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K``,
+  ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K``, or
+  ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K``, then the encoded data is
+  compressed page by page (using the page size indicated by the name of the
+  constant) with LZO1X and wrapped in the format documented in the Linux kernel
+  source file ``fs/btrfs/lzo.c``.
+
+Additionally, there are some restrictions on ``pwritev2()``:
+
+* ``offset`` (or the current file offset if ``offset`` is -1) must be aligned
+  to the sector size of the filesystem.
+* ``len`` must be aligned to the sector size of the filesystem unless the data
+  ends at or beyond the current end of the file.
+* ``unencoded_len`` and the length of the encoded data must each be no more
+  than 128 KiB. This limit may increase in the future.
+* The length of the encoded data must be less than or equal to
+  ``unencoded_len``.
+* If using LZO, the filesystem's page size must match the compression page
+  size.
+
+Implementation
+==============
+
+This section describes the requirements for filesystems implementing encoded
+I/O.
+
+First of all, a filesystem supporting encoded I/O must indicate this by setting
+the ``FMODE_ENCODED_IO`` flag in its ``open`` file operation::
+
+    static int foo_file_open(struct inode *inode, struct file *filp)
+    {
+            ...
+            filp->f_mode |= FMODE_ENCODED_IO;
+            ...
+    }
+
+Encoded I/O goes through ``read_iter`` and ``write_iter``, designated by the
+``IOCB_ENCODED`` flag in ``kiocb->ki_flags``.
+
+Reads
+-----
+
+Encoded ``read_iter`` should:
+
+1. Call ``generic_encoded_read_checks()`` to validate the file and buffers
+   provided by userspace.
+2. Initialize the ``encoded_iov`` appropriately.
+3. Copy it to the user with ``copy_encoded_iov_to_iter()``.
+4. Copy the encoded data to the user.
+5. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
+6. Return the size of the encoded data read, not including the ``encoded_iov``.
+
+There are a few details to be aware of:
+
+* Encoded ``read_iter`` should support reading unencoded data if the extent is
+  not encoded.
+* If the buffers provided by the user are not large enough to contain an entire
+  encoded extent, then ``read_iter`` should return ``-ENOBUFS``. This is to
+  avoid confusing userspace with truncated data that cannot be properly
+  decoded.
+* Reads in the middle of an encoded extent can be returned by setting
+  ``encoded_iov->unencoded_offset`` to non-zero.
+* Truncated unencoded data (e.g., because the file does not end on a block
+  boundary) may be returned by setting ``encoded_iov->len`` to a smaller value
+  than ``encoded_iov->unencoded_len - encoded_iov->unencoded_offset``.
+
+Writes
+------
+
+Encoded ``write_iter`` should (in addition to the usual accounting/checks done
+by ``write_iter``):
+
+1. Call ``copy_encoded_iov_from_iter()`` to get and validate the
+   ``encoded_iov``.
+2. Call ``generic_encoded_write_checks()`` instead of
+   ``generic_write_checks()``.
+3. Check that the provided encoding in ``encoded_iov`` is supported.
+4. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``.
+5. Return the size of the encoded data written.
+
+Again, there are a few details:
+
+* Encoded ``write_iter`` doesn't need to support writing unencoded data.
+* ``write_iter`` should either write all of the encoded data or none of it; it
+  must not do partial writes.
+* ``write_iter`` doesn't need to validate the encoded data; a subsequent read
+  may return, e.g., ``-EIO`` if the data is not valid.
+* The user may lie about the unencoded size of the data; a subsequent read
+  should truncate or zero-extend the unencoded data rather than returning an
+  error.
+* Be careful of page cache coherency.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index d4853cb919d2..670c673c5956 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -54,6 +54,7 @@ filesystem implementations.
    fscrypt
    fsverity
    netfs_library
+   encoded_io
 
 Filesystems
 ===========
diff --git a/fs/read_write.c b/fs/read_write.c
index 9db7adf160d2..f8db16e01227 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -20,6 +20,7 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/fs.h>
+#include <linux/encoded_io.h>
 #include "internal.h"
 
 #include <linux/uaccess.h>
@@ -1632,24 +1633,15 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 	return 0;
 }
 
-/*
- * Performs necessary checks before doing a write
- *
- * Can adjust writing position or amount of bytes to write.
- * Returns appropriate error code that caller should return or
- * zero in case that write should be allowed.
- */
-ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	loff_t count;
-	int ret;
 
 	if (IS_SWAPFILE(inode))
 		return -ETXTBSY;
 
-	if (!iov_iter_count(from))
+	if (!*count)
 		return 0;
 
 	/* FIXME: this is for backwards compatibility with 2.4 */
@@ -1659,8 +1651,22 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EINVAL;
 
-	count = iov_iter_count(from);
-	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
+}
+
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Can adjust writing position or amount of bytes to write.
+ * Returns appropriate error code that caller should return or
+ * zero in case that write should be allowed.
+ */
+ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	loff_t count = iov_iter_count(from);
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
 	if (ret)
 		return ret;
 
@@ -1691,3 +1697,139 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
 
 	return 0;
 }
+
+/**
+ * generic_encoded_write_checks() - check an encoded write
+ * @iocb: I/O context.
+ * @encoded: Encoding metadata.
+ *
+ * This should be called by RWF_ENCODED write implementations rather than
+ * generic_write_checks(). Unlike generic_write_checks(), it returns -EFBIG
+ * instead of adjusting the size of the write.
+ *
+ * Return: 0 on success, -errno on error.
+ */
+int generic_encoded_write_checks(struct kiocb *iocb,
+				 const struct encoded_iov *encoded)
+{
+	loff_t count = encoded->len;
+	int ret;
+
+	if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
+		return -EPERM;
+
+	ret = generic_write_checks_common(iocb, &count);
+	if (ret)
+		return ret;
+
+	if (count != encoded->len) {
+		/*
+		 * The write got truncated by generic_write_checks_common(). We
+		 * can't do a partial encoded write.
+		 */
+		return -EFBIG;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(generic_encoded_write_checks);
+
+/**
+ * copy_encoded_iov_from_iter() - copy a &struct encoded_iov from userspace
+ * @encoded: Returned encoding metadata.
+ * @from: Source iterator.
+ *
+ * This copies in the &struct encoded_iov and does some basic sanity checks.
+ * This should always be used rather than a plain copy_from_iter(), as it does
+ * the proper handling for backward- and forward-compatibility.
+ *
+ * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if the
+ *         copied structure contained non-zero fields that this kernel doesn't
+ *         support, -EINVAL if the copied structure was invalid.
+ */
+int copy_encoded_iov_from_iter(struct encoded_iov *encoded,
+			       struct iov_iter *from)
+{
+	size_t usize;
+	int ret;
+
+	usize = iov_iter_single_seg_count(from);
+	if (usize > PAGE_SIZE)
+		return -E2BIG;
+	if (usize < ENCODED_IOV_SIZE_VER0)
+		return -EINVAL;
+	ret = copy_struct_from_iter(encoded, sizeof(*encoded), from);
+	if (ret)
+		return ret;
+
+	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
+	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
+		return -EINVAL;
+	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
+	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
+		return -EINVAL;
+	if (encoded->unencoded_offset > encoded->unencoded_len)
+		return -EINVAL;
+	if (encoded->len > encoded->unencoded_len - encoded->unencoded_offset)
+		return -EINVAL;
+	return 0;
+}
+EXPORT_SYMBOL(copy_encoded_iov_from_iter);
+
+/**
+ * generic_encoded_read_checks() - sanity check an RWF_ENCODED read
+ * @iocb: I/O context.
+ * @iter: Destination iterator for read.
+ *
+ * This should always be called by RWF_ENCODED read implementations before
+ * returning any data.
+ *
+ * Return: Number of bytes available to return encoded data in @iter on success,
+ *         -EPERM if the file was not opened with O_ALLOW_ENCODED, -EINVAL if
+ *         the size of the &struct encoded_iov iovec is invalid.
+ */
+ssize_t generic_encoded_read_checks(struct kiocb *iocb, struct iov_iter *iter)
+{
+	size_t usize;
+
+	if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED))
+		return -EPERM;
+	usize = iov_iter_single_seg_count(iter);
+	if (usize > PAGE_SIZE || usize < ENCODED_IOV_SIZE_VER0)
+		return -EINVAL;
+	return iov_iter_count(iter) - usize;
+}
+EXPORT_SYMBOL(generic_encoded_read_checks);
+
+/**
+ * copy_encoded_iov_to_iter() - copy a &struct encoded_iov to userspace
+ * @encoded: Encoding metadata to return.
+ * @to: Destination iterator.
+ *
+ * This should always be used by RWF_ENCODED read implementations rather than a
+ * plain copy_to_iter(), as it does the proper handling for backward- and
+ * forward-compatibility. The iterator must be sanity-checked with
+ * generic_encoded_read_checks() before this is called.
+ *
+ * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if there
+ *         were non-zero fields in @encoded that the user buffer could not
+ *         accommodate.
+ */
+int copy_encoded_iov_to_iter(const struct encoded_iov *encoded,
+			     struct iov_iter *to)
+{
+	size_t ksize = sizeof(*encoded);
+	size_t usize = iov_iter_single_seg_count(to);
+	size_t size = min(ksize, usize);
+
+	/* We already sanity-checked usize in generic_encoded_read_checks(). */
+
+	if (usize < ksize &&
+	    memchr_inv((char *)encoded + usize, 0, ksize - usize))
+		return -E2BIG;
+	if (copy_to_iter(encoded, size, to) != size ||
+	    (usize > ksize &&
+	     iov_iter_zero(usize - ksize, to) != usize - ksize))
+		return -EFAULT;
+	return 0;
+}
+EXPORT_SYMBOL(copy_encoded_iov_to_iter);
diff --git a/include/linux/encoded_io.h b/include/linux/encoded_io.h
new file mode 100644
index 000000000000..a8cfc0108ba0
--- /dev/null
+++ b/include/linux/encoded_io.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ENCODED_IO_H
+#define _LINUX_ENCODED_IO_H
+
+#include <uapi/linux/encoded_io.h>
+
+struct encoded_iov;
+struct iov_iter;
+struct kiocb;
+extern int generic_encoded_write_checks(struct kiocb *,
+					const struct encoded_iov *);
+extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *);
+extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *);
+extern int copy_encoded_iov_to_iter(const struct encoded_iov *,
+				    struct iov_iter *);
+
+#endif /* _LINUX_ENCODED_IO_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..2a9ab11baaed 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -181,6 +181,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports async buffered reads */
 #define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
 
+/* File supports encoded IO */
+#define FMODE_ENCODED_IO	((__force fmode_t)0x80000000)
+
 /*
  * Attribute flags.  These should be or-ed together to figure out what
  * has been changed!
@@ -311,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(__force int) RWF_SYNC
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
+#define IOCB_ENCODED		(__force int) RWF_ENCODED
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -3223,6 +3227,13 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_write_check_limits(struct file *file, loff_t pos,
 		loff_t *count);
+struct encoded_iov;
+extern int generic_encoded_write_checks(struct kiocb *,
+					const struct encoded_iov *);
+extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *);
+extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *);
+extern int copy_encoded_iov_to_iter(const struct encoded_iov *,
+				    struct iov_iter *);
 extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
 ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *to,
 		ssize_t already_read);
@@ -3528,6 +3539,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 			return -EOPNOTSUPP;
 		kiocb_flags |= IOCB_NOIO;
 	}
+	if ((flags & RWF_ENCODED) && !(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
+		return -EOPNOTSUPP;
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;
diff --git a/include/uapi/linux/encoded_io.h b/include/uapi/linux/encoded_io.h
new file mode 100644
index 000000000000..cf741453dba4
--- /dev/null
+++ b/include/uapi/linux/encoded_io.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_ENCODED_IO_H
+#define _UAPI_LINUX_ENCODED_IO_H
+
+#include <linux/types.h>
+
+#define ENCODED_IOV_COMPRESSION_NONE 0
+#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB 1
+#define ENCODED_IOV_COMPRESSION_BTRFS_ZSTD 2
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K 3
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K 4
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K 5
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K 6
+#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K 7
+#define ENCODED_IOV_COMPRESSION_TYPES 8
+
+#define ENCODED_IOV_ENCRYPTION_NONE 0
+#define ENCODED_IOV_ENCRYPTION_TYPES 1
+
+struct encoded_iov {
+	__aligned_u64 len;
+	__aligned_u64 unencoded_len;
+	__aligned_u64 unencoded_offset;
+	__u32 compression;
+	__u32 encryption;
+};
+
+#define ENCODED_IOV_SIZE_VER0 32
+
+#endif /* _UAPI_LINUX_ENCODED_IO_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 4c32e97dcdf0..0ef3a073c9b4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -300,8 +300,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* encoded (e.g., compressed and/or encrypted) IO */
+#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ENCODED)
 
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.32.0



* [PATCH RESEND x3 v9 4/9] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (2 preceding siblings ...)
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 3/9] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 5/9] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

btrfs_csum_one_bio() loops over each filesystem block in the bio while
keeping a cursor of its current logical position in the file in order to
look up the ordered extent to add the checksums to. However, this
doesn't make much sense for compressed extents, as a sector on disk does
not correspond to a sector of decompressed file data. It happens to work
because 1) the compressed bio always covers one ordered extent and 2)
the size of the bio is always less than the size of the ordered extent.
However, the second point will not always be true for encoded writes.

Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
it can assume that the bio only covers one ordered extent. Since we're
already changing the signature, let's get rid of the contig parameter
and make it implied by the offset parameter, similar to the change we
recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
nr_sectors to blockcount to make it clear that it's the number of
filesystem blocks, not the number of 512-byte sectors.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/compression.c |  5 +++--
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/file-item.c   | 35 ++++++++++++++++-------------------
 fs/btrfs/inode.c       |  8 ++++----
 4 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 9a023ae0f98b..59733573ed08 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -481,7 +481,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			BUG_ON(ret); /* -ENOMEM */
 
 			if (!skip_sum) {
-				ret = btrfs_csum_one_bio(inode, bio, start, 1);
+				ret = btrfs_csum_one_bio(inode, bio, start,
+							 true);
 				BUG_ON(ret); /* -ENOMEM */
 			}
 
@@ -517,7 +518,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	BUG_ON(ret); /* -ENOMEM */
 
 	if (!skip_sum) {
-		ret = btrfs_csum_one_bio(inode, bio, start, 1);
+		ret = btrfs_csum_one_bio(inode, bio, start, true);
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6131b58f779f..dd37ec16f22a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3094,7 +3094,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct btrfs_ordered_sum *sums);
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-				u64 file_start, int contig);
+				u64 offset, bool one_ordered);
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 			     struct list_head *list, int search_commit);
 void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index df6631eefc65..2ab1494c32d5 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -615,28 +615,28 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
  * btrfs_csum_one_bio - Calculates checksums of the data contained inside a bio
  * @inode:	 Owner of the data inside the bio
  * @bio:	 Contains the data to be checksummed
- * @file_start:  offset in file this bio begins to describe
- * @contig:	 Boolean. If true/1 means all bio vecs in this bio are
- *		 contiguous and they begin at @file_start in the file. False/0
- *		 means this bio can contain potentially discontiguous bio vecs
- *		 so the logical offset of each should be calculated separately.
+ * @offset:      If (u64)-1, @bio may contain discontiguous bio vecs, so the
+ *               file offsets are determined from the page offsets in the bio.
+ *               Otherwise, this is the starting file offset of the bio vecs in
+ *               @bio, which must be contiguous.
+ * @one_ordered: If true, @bio only refers to one ordered extent.
  */
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-		       u64 file_start, int contig)
+				u64 offset, bool one_ordered)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	struct btrfs_ordered_sum *sums;
 	struct btrfs_ordered_extent *ordered = NULL;
+	const bool page_offsets = (offset == (u64)-1);
 	char *data;
 	struct bvec_iter iter;
 	struct bio_vec bvec;
 	int index;
-	int nr_sectors;
+	int blockcount;
 	unsigned long total_bytes = 0;
 	unsigned long this_sum_bytes = 0;
 	int i;
-	u64 offset;
 	unsigned nofs_flag;
 
 	nofs_flag = memalloc_nofs_save();
@@ -650,18 +650,13 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 	sums->len = bio->bi_iter.bi_size;
 	INIT_LIST_HEAD(&sums->list);
 
-	if (contig)
-		offset = file_start;
-	else
-		offset = 0; /* shut up gcc */
-
 	sums->bytenr = bio->bi_iter.bi_sector << 9;
 	index = 0;
 
 	shash->tfm = fs_info->csum_shash;
 
 	bio_for_each_segment(bvec, bio, iter) {
-		if (!contig)
+		if (page_offsets)
 			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
 
 		if (!ordered) {
@@ -669,13 +664,14 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 			BUG_ON(!ordered); /* Logic error */
 		}
 
-		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info,
+		blockcount = BTRFS_BYTES_TO_BLKS(fs_info,
 						 bvec.bv_len + fs_info->sectorsize
 						 - 1);
 
-		for (i = 0; i < nr_sectors; i++) {
-			if (offset >= ordered->file_offset + ordered->num_bytes ||
-			    offset < ordered->file_offset) {
+		for (i = 0; i < blockcount; i++) {
+			if (!one_ordered &&
+			    (offset >= ordered->file_offset + ordered->num_bytes ||
+			     offset < ordered->file_offset)) {
 				unsigned long bytes_left;
 
 				sums->len = this_sum_bytes;
@@ -706,7 +702,8 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 					    sums->sums + index);
 			kunmap_atomic(data);
 			index += fs_info->csum_size;
-			offset += fs_info->sectorsize;
+			if (!one_ordered)
+				offset += fs_info->sectorsize;
 			this_sum_bytes += fs_info->sectorsize;
 			total_bytes += fs_info->sectorsize;
 		}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a2494c645681..994f9cea72cb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2268,7 +2268,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
 					   u64 dio_file_offset)
 {
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
+	return btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false);
 }
 
 static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
@@ -2431,7 +2431,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 					  0, btrfs_submit_bio_start);
 		goto out;
 	} else if (!skip_sum) {
-		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
+		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false);
 		if (ret)
 			goto out;
 	}
@@ -7994,7 +7994,7 @@ static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
 						     struct bio *bio,
 						     u64 dio_file_offset)
 {
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, 1);
+	return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, false);
 }
 
 static void btrfs_end_dio_bio(struct bio *bio)
@@ -8053,7 +8053,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 		 * If we aren't doing async submit, calculate the csum of the
 		 * bio now.
 		 */
-		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, 1);
+		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, false);
 		if (ret)
 			goto err;
 	} else {
-- 
2.32.0



* [PATCH RESEND x3 v9 5/9] btrfs: add ram_bytes and offset to btrfs_ordered_extent
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (3 preceding siblings ...)
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 4/9] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 6/9] btrfs: support different disk extent size for delalloc Omar Sandoval
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, we only create ordered extents when ram_bytes == num_bytes
and offset == 0. However, RWF_ENCODED writes may create extents which
only refer to a subset of the full unencoded extent, so we need to plumb
these fields through the ordered extent infrastructure and pass them
down to insert_reserved_file_extent().

Since we're changing the btrfs_add_ordered_extent* signature, let's get
rid of the trivial wrappers and add a kernel-doc.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c        |  56 +++++++++++---------
 fs/btrfs/ordered-data.c | 112 +++++++++++-----------------------------
 fs/btrfs/ordered-data.h |  22 ++++----
 3 files changed, 76 insertions(+), 114 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 994f9cea72cb..f59ad09abb61 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -938,12 +938,12 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			goto out_free_reserve;
 		free_extent_map(em);
 
-		ret = btrfs_add_ordered_extent_compress(inode,
-						async_extent->start,
-						ins.objectid,
-						async_extent->ram_size,
-						ins.offset,
-						async_extent->compress_type);
+		ret = btrfs_add_ordered_extent(inode, async_extent->start,
+					       async_extent->ram_size,
+					       async_extent->ram_size,
+					       ins.objectid, ins.offset, 0,
+					       1 << BTRFS_ORDERED_COMPRESSED,
+					       async_extent->compress_type);
 		if (ret) {
 			btrfs_drop_extent_cache(inode, async_extent->start,
 						async_extent->start +
@@ -1163,9 +1163,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		}
 		free_extent_map(em);
 
-		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
-					       ram_size, cur_alloc_size,
-					       BTRFS_ORDERED_REGULAR);
+		ret = btrfs_add_ordered_extent(inode, start, ram_size, ram_size,
+					       ins.objectid, cur_alloc_size, 0,
+					       0, BTRFS_COMPRESS_NONE);
 		if (ret)
 			goto out_drop_extent_cache;
 
@@ -1826,10 +1826,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 				goto error;
 			}
 			free_extent_map(em);
-			ret = btrfs_add_ordered_extent(inode, cur_offset,
-						       disk_bytenr, num_bytes,
-						       num_bytes,
-						       BTRFS_ORDERED_PREALLOC);
+			ret = btrfs_add_ordered_extent(inode,
+					cur_offset, num_bytes, num_bytes,
+					disk_bytenr, num_bytes, 0,
+					1 << BTRFS_ORDERED_PREALLOC,
+					BTRFS_COMPRESS_NONE);
 			if (ret) {
 				btrfs_drop_extent_cache(inode, cur_offset,
 							cur_offset + num_bytes - 1,
@@ -1838,9 +1839,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			}
 		} else {
 			ret = btrfs_add_ordered_extent(inode, cur_offset,
+						       num_bytes, num_bytes,
 						       disk_bytenr, num_bytes,
-						       num_bytes,
-						       BTRFS_ORDERED_NOCOW);
+						       0,
+						       1 << BTRFS_ORDERED_NOCOW,
+						       BTRFS_COMPRESS_NONE);
 			if (ret)
 				goto error;
 		}
@@ -2735,6 +2738,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_key ins;
 	u64 disk_num_bytes = btrfs_stack_file_extent_disk_num_bytes(stack_fi);
 	u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi);
+	u64 offset = btrfs_stack_file_extent_offset(stack_fi);
 	u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi);
 	u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi);
 	struct btrfs_drop_extents_args drop_args = { 0 };
@@ -2809,7 +2813,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 		goto out;
 
 	ret = btrfs_alloc_reserved_file_extent(trans, root, btrfs_ino(inode),
-					       file_pos, qgroup_reserved, &ins);
+					       file_pos - offset,
+					       qgroup_reserved, &ins);
 out:
 	btrfs_free_path(path);
 
@@ -2835,20 +2840,20 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 					     struct btrfs_ordered_extent *oe)
 {
 	struct btrfs_file_extent_item stack_fi;
-	u64 logical_len;
 	bool update_inode_bytes;
+	u64 num_bytes = oe->num_bytes;
+	u64 ram_bytes = oe->ram_bytes;
 
 	memset(&stack_fi, 0, sizeof(stack_fi));
 	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG);
 	btrfs_set_stack_file_extent_disk_bytenr(&stack_fi, oe->disk_bytenr);
 	btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi,
 						   oe->disk_num_bytes);
+	btrfs_set_stack_file_extent_offset(&stack_fi, oe->offset);
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags))
-		logical_len = oe->truncated_len;
-	else
-		logical_len = oe->num_bytes;
-	btrfs_set_stack_file_extent_num_bytes(&stack_fi, logical_len);
-	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, logical_len);
+		num_bytes = ram_bytes = oe->truncated_len;
+	btrfs_set_stack_file_extent_num_bytes(&stack_fi, num_bytes);
+	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, ram_bytes);
 	btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type);
 	/* Encryption and other encoding is reserved and all 0 */
 
@@ -7249,8 +7254,11 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		if (IS_ERR(em))
 			goto out;
 	}
-	ret = btrfs_add_ordered_extent_dio(inode, start, block_start, len,
-					   block_len, type);
+	ret = btrfs_add_ordered_extent(inode, start, len, len, block_start,
+				       block_len, 0,
+				       (1 << type) |
+				       (1 << BTRFS_ORDERED_DIRECT),
+				       BTRFS_COMPRESS_NONE);
 	if (ret) {
 		if (em) {
 			free_extent_map(em);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6eb41b7c0c84..96357d2c845e 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -143,16 +143,27 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
 	return ret;
 }
 
-/*
- * Allocate and add a new ordered_extent into the per-inode tree.
+/**
+ * btrfs_add_ordered_extent - Add an ordered extent to the per-inode tree.
+ * @inode: inode that this extent is for.
+ * @file_offset: Logical offset in file where the extent starts.
+ * @num_bytes: Logical length of extent in file.
+ * @ram_bytes: Full length of unencoded data.
+ * @disk_bytenr: Offset of extent on disk.
+ * @disk_num_bytes: Size of extent on disk.
+ * @offset: Offset into unencoded data where file data starts.
+ * @flags: Flags specifying type of extent (1 << BTRFS_ORDERED_*).
+ * @compress_type: Compression algorithm used for data.
  *
- * The tree is given a single reference on the ordered extent that was
- * inserted.
+ * Most of these parameters correspond to &struct btrfs_file_extent_item. The
+ * tree is given a single reference on the ordered extent that was inserted.
+ *
+ * Return: 0 or -ENOMEM.
  */
-static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int type, int dio,
-				      int compress_type)
+int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, int flags,
+			     int compress_type)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -161,7 +172,8 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	struct btrfs_ordered_extent *entry;
 	int ret;
 
-	if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
+	if (flags &
+	    ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC))) {
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
 		if (ret < 0)
@@ -181,9 +193,11 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 		return -ENOMEM;
 
 	entry->file_offset = file_offset;
-	entry->disk_bytenr = disk_bytenr;
 	entry->num_bytes = num_bytes;
+	entry->ram_bytes = ram_bytes;
+	entry->disk_bytenr = disk_bytenr;
 	entry->disk_num_bytes = disk_num_bytes;
+	entry->offset = offset;
 	entry->bytes_left = num_bytes;
 	entry->inode = igrab(&inode->vfs_inode);
 	entry->compress_type = compress_type;
@@ -193,18 +207,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	entry->disk = NULL;
 	entry->partno = (u8)-1;
 
-	ASSERT(type == BTRFS_ORDERED_REGULAR ||
-	       type == BTRFS_ORDERED_NOCOW ||
-	       type == BTRFS_ORDERED_PREALLOC ||
-	       type == BTRFS_ORDERED_COMPRESSED);
-	set_bit(type, &entry->flags);
+	ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
+	entry->flags = flags;
 
 	percpu_counter_add_batch(&fs_info->ordered_bytes, num_bytes,
 				 fs_info->delalloc_batch);
 
-	if (dio)
-		set_bit(BTRFS_ORDERED_DIRECT, &entry->flags);
-
 	/* one ref for the tree */
 	refcount_set(&entry->refs, 1);
 	init_waitqueue_head(&entry->wait);
@@ -249,41 +257,6 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	return 0;
 }
 
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
-			     int type)
-{
-	ASSERT(type == BTRFS_ORDERED_REGULAR ||
-	       type == BTRFS_ORDERED_NOCOW ||
-	       type == BTRFS_ORDERED_PREALLOC);
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes, type, 0,
-					  BTRFS_COMPRESS_NONE);
-}
-
-int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset,
-				 u64 disk_bytenr, u64 num_bytes,
-				 u64 disk_num_bytes, int type)
-{
-	ASSERT(type == BTRFS_ORDERED_REGULAR ||
-	       type == BTRFS_ORDERED_NOCOW ||
-	       type == BTRFS_ORDERED_PREALLOC);
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes, type, 1,
-					  BTRFS_COMPRESS_NONE);
-}
-
-int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int compress_type)
-{
-	ASSERT(compress_type != BTRFS_COMPRESS_NONE);
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes,
-					  BTRFS_ORDERED_COMPRESSED, 0,
-					  compress_type);
-}
-
 /*
  * Add a struct btrfs_ordered_sum into the list of checksums to be inserted
  * when an ordered extent is finished.  If the list covers more than one
@@ -1056,35 +1029,12 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
 	struct inode *inode = ordered->inode;
 	u64 file_offset = ordered->file_offset + pos;
 	u64 disk_bytenr = ordered->disk_bytenr + pos;
-	u64 num_bytes = len;
-	u64 disk_num_bytes = len;
-	int type;
-	unsigned long flags_masked = ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT);
-	int compress_type = ordered->compress_type;
-	unsigned long weight;
-	int ret;
+	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
 
-	weight = hweight_long(flags_masked);
-	WARN_ON_ONCE(weight > 1);
-	if (!weight)
-		type = 0;
-	else
-		type = __ffs(flags_masked);
-
-	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) {
-		WARN_ON_ONCE(1);
-		ret = btrfs_add_ordered_extent_compress(BTRFS_I(inode),
-				file_offset, disk_bytenr, num_bytes,
-				disk_num_bytes, compress_type);
-	} else if (test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
-		ret = btrfs_add_ordered_extent_dio(BTRFS_I(inode), file_offset,
-				disk_bytenr, num_bytes, disk_num_bytes, type);
-	} else {
-		ret = btrfs_add_ordered_extent(BTRFS_I(inode), file_offset,
-				disk_bytenr, num_bytes, disk_num_bytes, type);
-	}
-
-	return ret;
+	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
+	return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
+					disk_bytenr, len, 0, flags,
+					ordered->compress_type);
 }
 
 int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 566472004edd..81648a78f933 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -76,6 +76,13 @@ enum {
 	BTRFS_ORDERED_PENDING,
 };
 
+/* BTRFS_ORDERED_* flags that specify the type of the extent. */
+#define BTRFS_ORDERED_TYPE_FLAGS ((1UL << BTRFS_ORDERED_REGULAR) |	\
+				  (1UL << BTRFS_ORDERED_NOCOW) |	\
+				  (1UL << BTRFS_ORDERED_PREALLOC) |	\
+				  (1UL << BTRFS_ORDERED_COMPRESSED) |	\
+				  (1UL << BTRFS_ORDERED_DIRECT))
+
 struct btrfs_ordered_extent {
 	/* logical offset in the file */
 	u64 file_offset;
@@ -84,9 +91,11 @@ struct btrfs_ordered_extent {
 	 * These fields directly correspond to the same fields in
 	 * btrfs_file_extent_item.
 	 */
-	u64 disk_bytenr;
 	u64 num_bytes;
+	u64 ram_bytes;
+	u64 disk_bytenr;
 	u64 disk_num_bytes;
+	u64 offset;
 
 	/* number of bytes that still need writing */
 	u64 bytes_left;
@@ -180,14 +189,9 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u64 io_size, int uptodate);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
-			     int type);
-int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset,
-				 u64 disk_bytenr, u64 num_bytes,
-				 u64 disk_num_bytes, int type);
-int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int compress_type);
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, int flags,
+			     int compress_type);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RESEND x3 v9 6/9] btrfs: support different disk extent size for delalloc
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (4 preceding siblings ...)
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 5/9] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 7/9] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc, because the in-file size is the worst case for
the on-disk size. For RWF_ENCODED writes, we know the exact size of the
extent on disk, which may be smaller than the size in the file (when the
data is compressed) or larger than it (for bookend extents). Add a
disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.
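
To illustrate the intent, here is a rough sketch (not code from this
patch) of how the encoded write path added later in the series would use
the new parameter; encoded_len and compressed_size are illustrative names:

        u64 num_bytes = ALIGN(encoded_len, fs_info->sectorsize);          /* logical bytes in the file */
        u64 disk_num_bytes = ALIGN(compressed_size, fs_info->sectorsize); /* bytes that hit the disk */
        int ret;

        /* csum space is sized from disk_num_bytes; extent/inode items from num_bytes */
        ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), num_bytes, disk_num_bytes);
        if (ret)
                return ret;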

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h          |  3 ++-
 fs/btrfs/delalloc-space.c | 18 ++++++++++--------
 fs/btrfs/file.c           |  3 ++-
 fs/btrfs/inode.c          |  2 +-
 fs/btrfs/relocation.c     |  4 ++--
 5 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd37ec16f22a..86856889188d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2796,7 +2796,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 				      struct btrfs_block_rsv *rsv);
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
 
-int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
+				    u64 disk_num_bytes);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
 				   u64 start, u64 end);
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 2059d1504149..143650004991 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -267,11 +267,11 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 }
 
 static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
-				    u64 num_bytes, u64 *meta_reserve,
-				    u64 *qgroup_reserve)
+				    u64 num_bytes, u64 disk_num_bytes,
+				    u64 *meta_reserve, u64 *qgroup_reserve)
 {
 	u64 nr_extents = count_max_extents(num_bytes);
-	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
+	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes);
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
 	*meta_reserve = btrfs_calc_insert_metadata_size(fs_info,
@@ -285,7 +285,8 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
 	*qgroup_reserve = nr_extents * fs_info->nodesize;
 }
 
-int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
+int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
+				    u64 disk_num_bytes)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -315,6 +316,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	}
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
+	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
 
 	/*
 	 * We always want to do it this way, every other way is wrong and ends
@@ -326,8 +328,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	 * everything out and try again, which is bad.  This way we just
 	 * over-reserve slightly, and clean up the mess when we are done.
 	 */
-	calc_inode_reservations(fs_info, num_bytes, &meta_reserve,
-				&qgroup_reserve);
+	calc_inode_reservations(fs_info, num_bytes, disk_num_bytes,
+				&meta_reserve, &qgroup_reserve);
 	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true);
 	if (ret)
 		return ret;
@@ -346,7 +348,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	spin_lock(&inode->lock);
 	nr_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
-	inode->csum_bytes += num_bytes;
+	inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
@@ -451,7 +453,7 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode,
 	ret = btrfs_check_data_free_space(inode, reserved, start, len);
 	if (ret < 0)
 		return ret;
-	ret = btrfs_delalloc_reserve_metadata(inode, len);
+	ret = btrfs_delalloc_reserve_metadata(inode, len, len);
 	if (ret < 0)
 		btrfs_free_reserved_data_space(inode, *reserved, start, len);
 	return ret;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 28a05ba47060..4e13c481b21e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1730,7 +1730,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 					 fs_info->sectorsize);
 		WARN_ON(reserve_bytes == 0);
 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				reserve_bytes);
+						      reserve_bytes,
+						      reserve_bytes);
 		if (ret) {
 			if (!only_release_metadata)
 				btrfs_free_reserved_data_space(BTRFS_I(inode),
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f59ad09abb61..fcf685fd4eff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4898,7 +4898,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 			goto out;
 		}
 	}
-	ret = btrfs_delalloc_reserve_metadata(inode, blocksize);
+	ret = btrfs_delalloc_reserve_metadata(inode, blocksize, blocksize);
 	if (ret < 0) {
 		if (!only_release_metadata)
 			btrfs_free_reserved_data_space(inode, data_reserved,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 420a89869889..173d426ece30 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2922,8 +2922,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
 	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				PAGE_SIZE);
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE,
+						      PAGE_SIZE);
 		if (ret)
 			goto out;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RESEND x3 v9 7/9] btrfs: optionally extend i_size in cow_file_range_inline()
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (5 preceding siblings ...)
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 6/9] btrfs: support different disk extent size for delalloc Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 8/9] btrfs: implement RWF_ENCODED reads Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 9/9] btrfs: implement RWF_ENCODED writes Omar Sandoval
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, an inline extent is always created after i_size has already
been extended by btrfs_dirty_pages(). However, for encoded writes, we only
want to update i_size after we have successfully created the inline
extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size. Since the start parameter is always passed
as 0, get rid of it and simplify the logic in these two functions. While
we're here, let's document the requirements for creating an inline
extent.
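
For comparison, the two kinds of callers end up looking roughly like this
(the first call is from this patch, the second is the encoded write added
later in the series; only a sketch, not additional code):

        /* Buffered write path: i_size was already updated by btrfs_dirty_pages(). */
        ret = cow_file_range_inline(BTRFS_I(inode), actual_end, 0,
                                    BTRFS_COMPRESS_NONE, NULL, false);

        /* Encoded write: only extend i_size if the inline extent is created. */
        ret = cow_file_range_inline(BTRFS_I(inode), encoded->len, orig_count,
                                    compression, pages, true);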

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 100 +++++++++++++++++++++++------------------------
 1 file changed, 48 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fcf685fd4eff..15b06d6ac875 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -235,9 +235,10 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
 static int insert_inline_extent(struct btrfs_trans_handle *trans,
 				struct btrfs_path *path, bool extent_inserted,
 				struct btrfs_root *root, struct inode *inode,
-				u64 start, size_t size, size_t compressed_size,
+				size_t size, size_t compressed_size,
 				int compress_type,
-				struct page **compressed_pages)
+				struct page **compressed_pages,
+				bool update_i_size)
 {
 	struct extent_buffer *leaf;
 	struct page *page = NULL;
@@ -246,7 +247,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item *ei;
 	int ret;
 	size_t cur_size = size;
-	unsigned long offset;
+	u64 i_size;
 
 	ASSERT((compressed_size > 0 && compressed_pages) ||
 	       (compressed_size == 0 && !compressed_pages));
@@ -259,7 +260,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		size_t datasize;
 
 		key.objectid = btrfs_ino(BTRFS_I(inode));
-		key.offset = start;
+		key.offset = 0;
 		key.type = BTRFS_EXTENT_DATA_KEY;
 
 		datasize = btrfs_file_extent_calc_inline_size(cur_size);
@@ -297,12 +298,10 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		btrfs_set_file_extent_compression(leaf, ei,
 						  compress_type);
 	} else {
-		page = find_get_page(inode->i_mapping,
-				     start >> PAGE_SHIFT);
+		page = find_get_page(inode->i_mapping, 0);
 		btrfs_set_file_extent_compression(leaf, ei, 0);
 		kaddr = kmap_atomic(page);
-		offset = offset_in_page(start);
-		write_extent_buffer(leaf, kaddr + offset, ptr, size);
+		write_extent_buffer(leaf, kaddr, ptr, size);
 		kunmap_atomic(kaddr);
 		put_page(page);
 	}
@@ -313,8 +312,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * We align size to sectorsize for inline extents just for simplicity
 	 * sake.
 	 */
-	size = ALIGN(size, root->fs_info->sectorsize);
-	ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), start, size);
+	ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), 0,
+					ALIGN(size, root->fs_info->sectorsize));
 	if (ret)
 		goto fail;
 
@@ -327,7 +326,13 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * before we unlock the pages.  Otherwise we
 	 * could end up racing with unlink.
 	 */
-	BTRFS_I(inode)->disk_i_size = inode->i_size;
+	i_size = i_size_read(inode);
+	if (update_i_size && size > i_size) {
+		i_size_write(inode, size);
+		i_size = size;
+	}
+	BTRFS_I(inode)->disk_i_size = i_size;
+
 fail:
 	return ret;
 }
@@ -338,35 +343,31 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
  * does the checks required to make sure the data is small enough
  * to fit as an inline extent.
  */
-static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
-					  u64 end, size_t compressed_size,
+static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size,
+					  size_t compressed_size,
 					  int compress_type,
-					  struct page **compressed_pages)
+					  struct page **compressed_pages,
+					  bool update_i_size)
 {
 	struct btrfs_drop_extents_args drop_args = { 0 };
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
-	u64 isize = i_size_read(&inode->vfs_inode);
-	u64 actual_end = min(end + 1, isize);
-	u64 inline_len = actual_end - start;
-	u64 aligned_end = ALIGN(end, fs_info->sectorsize);
-	u64 data_len = inline_len;
+	u64 data_len = compressed_size ? compressed_size : size;
 	int ret;
 	struct btrfs_path *path;
 
-	if (compressed_size)
-		data_len = compressed_size;
-
-	if (start > 0 ||
-	    actual_end > fs_info->sectorsize ||
+	/*
+	 * We can create an inline extent if it ends at or beyond the current
+	 * i_size, is no larger than a sector (decompressed), and the (possibly
+	 * compressed) data fits in a leaf and the configured maximum inline
+	 * size.
+	 */
+	if (size < i_size_read(&inode->vfs_inode) ||
+	    size > fs_info->sectorsize ||
 	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) ||
-	    (!compressed_size &&
-	    (actual_end & (fs_info->sectorsize - 1)) == 0) ||
-	    end + 1 < isize ||
-	    data_len > fs_info->max_inline) {
+	    data_len > fs_info->max_inline)
 		return 1;
-	}
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -380,30 +381,21 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 	trans->block_rsv = &inode->block_rsv;
 
 	drop_args.path = path;
-	drop_args.start = start;
-	drop_args.end = aligned_end;
+	drop_args.start = 0;
+	drop_args.end = fs_info->sectorsize;
 	drop_args.drop_cache = true;
 	drop_args.replace_extent = true;
-
-	if (compressed_size && compressed_pages)
-		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
-		   compressed_size);
-	else
-		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
-		    inline_len);
-
+	drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(data_len);
 	ret = btrfs_drop_extents(trans, root, inode, &drop_args);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
 	}
 
-	if (isize > actual_end)
-		inline_len = min_t(u64, isize, actual_end);
-	ret = insert_inline_extent(trans, path, drop_args.extent_inserted,
-				   root, &inode->vfs_inode, start,
-				   inline_len, compressed_size,
-				   compress_type, compressed_pages);
+	ret = insert_inline_extent(trans, path, drop_args.extent_inserted, root,
+				   &inode->vfs_inode, size, compressed_size,
+				   compress_type, compressed_pages,
+				   update_i_size);
 	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
@@ -412,7 +404,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 		goto out;
 	}
 
-	btrfs_update_inode_bytes(inode, inline_len, drop_args.bytes_found);
+	btrfs_update_inode_bytes(inode, size, drop_args.bytes_found);
 	ret = btrfs_update_inode(trans, root, inode);
 	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, ret);
@@ -688,14 +680,15 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 			/* we didn't compress the entire range, try
 			 * to make an uncompressed inline extent.
 			 */
-			ret = cow_file_range_inline(BTRFS_I(inode), start, end,
+			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    0, BTRFS_COMPRESS_NONE,
-						    NULL);
+						    NULL, false);
 		} else {
 			/* try making a compressed inline extent */
-			ret = cow_file_range_inline(BTRFS_I(inode), start, end,
+			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    total_compressed,
-						    compress_type, pages);
+						    compress_type, pages,
+						    false);
 		}
 		if (ret <= 0) {
 			unsigned long clear_flags = EXTENT_DELALLOC |
@@ -1081,9 +1074,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
 	if (start == 0) {
+		u64 actual_end = min_t(u64, i_size_read(&inode->vfs_inode),
+				       end + 1);
+
 		/* lets try to make an inline extent */
-		ret = cow_file_range_inline(inode, start, end, 0,
-					    BTRFS_COMPRESS_NONE, NULL);
+		ret = cow_file_range_inline(inode, actual_end, 0,
+					    BTRFS_COMPRESS_NONE, NULL, false);
 		if (ret == 0) {
 			/*
 			 * We use DO_ACCOUNTING here because we need the
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RESEND x3 v9 8/9] btrfs: implement RWF_ENCODED reads
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (6 preceding siblings ...)
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 7/9] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 9/9] btrfs: implement RWF_ENCODED writes Omar Sandoval
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.

This initial implementation simplifies a few things that can be improved
in the future:

- We hold the inode lock during the operation.
- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- We don't do read repair, because it turns out that read repair is
  currently broken for compressed data.
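
As a rough illustration (not part of this patch), userspace might read one
encoded extent as in the sketch below. It assumes the UAPI from the
earlier patches in the series: the file is opened with O_ALLOW_ENCODED,
the read uses RWF_ENCODED, a struct encoded_iov leads the iovec array for
the kernel to fill in, and the (possibly compressed) payload lands in the
following iovec. The header path and the helper name are guesses.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/uio.h>
        #include <linux/encoded_io.h>   /* assumed UAPI header: struct encoded_iov, ENCODED_IOV_* */

        static ssize_t read_encoded(int fd, off_t off, void *buf, size_t len)
        {
                struct encoded_iov ei = {};
                struct iovec iov[2] = {
                        { .iov_base = &ei, .iov_len = sizeof(ei) },
                        { .iov_base = buf, .iov_len = len },
                };
                ssize_t ret = preadv2(fd, iov, 2, off, RWF_ENCODED);

                if (ret >= 0)
                        fprintf(stderr,
                                "len=%llu unencoded_len=%llu unencoded_offset=%llu compression=%u\n",
                                (unsigned long long)ei.len,
                                (unsigned long long)ei.unencoded_len,
                                (unsigned long long)ei.unencoded_offset,
                                ei.compression);
                return ret;
        }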

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h |   2 +
 fs/btrfs/file.c  |   5 +
 fs/btrfs/inode.c | 503 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 510 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 86856889188d..cd31dbc2e4a4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3201,6 +3201,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, int uptodate);
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
 extern const struct iomap_dio_ops btrfs_dio_ops;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4e13c481b21e..89b3dd3fa9e4 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3648,6 +3648,11 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
 
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EOPNOTSUPP;
+		return btrfs_encoded_read(iocb, to);
+	}
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0 || !iov_iter_count(to) ||
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 15b06d6ac875..e09af2c8083a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6,6 +6,7 @@
 #include <crypto/hash.h>
 #include <linux/kernel.h>
 #include <linux/bio.h>
+#include <linux/encoded_io.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
@@ -10270,6 +10271,508 @@ void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 	}
 }
 
+static int encoded_iov_compression_from_btrfs(unsigned int compress_type)
+{
+	switch (compress_type) {
+	case BTRFS_COMPRESS_NONE:
+		return ENCODED_IOV_COMPRESSION_NONE;
+	case BTRFS_COMPRESS_ZLIB:
+		return ENCODED_IOV_COMPRESSION_BTRFS_ZLIB;
+	case BTRFS_COMPRESS_LZO:
+		/*
+		 * The LZO format depends on the page size. 64k is the maximum
+		 * sectorsize (and thus page size) that we support.
+		 */
+		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
+			return -EINVAL;
+		return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (PAGE_SHIFT - 12);
+	case BTRFS_COMPRESS_ZSTD:
+		return ENCODED_IOV_COMPRESSION_BTRFS_ZSTD;
+	default:
+		return -EUCLEAN;
+	}
+}
+
+static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
+					 struct iov_iter *iter, u64 start,
+					 u64 lockend,
+					 struct extent_state **cached_state,
+					 u64 extent_start, size_t count,
+					 struct encoded_iov *encoded,
+					 bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *item;
+	u64 ram_bytes;
+	unsigned long ptr;
+	void *tmp;
+	ssize_t ret;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
+				       btrfs_ino(BTRFS_I(inode)), extent_start,
+				       0);
+	if (ret) {
+		if (ret > 0) {
+			/* The extent item disappeared? */
+			ret = -EIO;
+		}
+		goto out;
+	}
+	leaf = path->nodes[0];
+	item = btrfs_item_ptr(leaf, path->slots[0],
+			      struct btrfs_file_extent_item);
+
+	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
+	ptr = btrfs_file_extent_inline_start(item);
+
+	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
+			iocb->ki_pos);
+	ret = encoded_iov_compression_from_btrfs(
+				 btrfs_file_extent_compression(leaf, item));
+	if (ret < 0)
+		goto out;
+	encoded->compression = ret;
+	if (encoded->compression) {
+		size_t inline_size;
+
+		inline_size = btrfs_file_extent_inline_item_len(leaf,
+						btrfs_item_nr(path->slots[0]));
+		if (inline_size > count) {
+			ret = -ENOBUFS;
+			goto out;
+		}
+		count = inline_size;
+		encoded->unencoded_len = ram_bytes;
+		encoded->unencoded_offset = iocb->ki_pos - extent_start;
+	} else {
+		encoded->len = encoded->unencoded_len = count =
+			min_t(u64, count, encoded->len);
+		ptr += iocb->ki_pos - extent_start;
+	}
+
+	tmp = kmalloc(count, GFP_NOFS);
+	if (!tmp) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	read_extent_buffer(leaf, tmp, ptr, count);
+	btrfs_release_path(path);
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock_shared(inode);
+	*unlocked = true;
+
+	ret = copy_encoded_iov_to_iter(encoded, iter);
+	if (ret)
+		goto out_free;
+	ret = copy_to_iter(tmp, count, iter);
+	if (ret != count)
+		ret = -EFAULT;
+out_free:
+	kfree(tmp);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+struct btrfs_encoded_read_private {
+	struct inode *inode;
+	wait_queue_head_t wait;
+	atomic_t pending;
+	blk_status_t status;
+	bool skip_csum;
+};
+
+static blk_status_t submit_encoded_read_bio(struct inode *inode,
+					    struct bio *bio, int mirror_num,
+					    unsigned long bio_flags)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	blk_status_t ret;
+
+	if (!priv->skip_csum) {
+		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
+		if (ret)
+			return ret;
+	}
+
+	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
+	if (ret) {
+		btrfs_io_bio_free_csum(io_bio);
+		return ret;
+	}
+
+	atomic_inc(&priv->pending);
+	ret = btrfs_map_bio(fs_info, bio, mirror_num);
+	if (ret) {
+		atomic_dec(&priv->pending);
+		btrfs_io_bio_free_csum(io_bio);
+	}
+	return ret;
+}
+
+static blk_status_t btrfs_encoded_read_check_bio(struct btrfs_io_bio *io_bio)
+{
+	const bool uptodate = io_bio->bio.bi_status == BLK_STS_OK;
+	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
+	struct inode *inode = priv->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u32 sectorsize = fs_info->sectorsize;
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+	u64 start = io_bio->logical;
+	u32 bio_offset = 0;
+
+	if (priv->skip_csum || !uptodate)
+		return io_bio->bio.bi_status;
+
+	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
+		unsigned int i, nr_sectors, pgoff;
+
+		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
+		pgoff = bvec->bv_offset;
+		for (i = 0; i < nr_sectors; i++) {
+			ASSERT(pgoff < PAGE_SIZE);
+			if (check_data_csum(inode, io_bio, bio_offset,
+					    bvec->bv_page, pgoff, start))
+				return BLK_STS_IOERR;
+			start += sectorsize;
+			bio_offset += sectorsize;
+			pgoff += sectorsize;
+		}
+	}
+	return BLK_STS_OK;
+}
+
+static void btrfs_encoded_read_endio(struct bio *bio)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	blk_status_t status;
+
+	status = btrfs_encoded_read_check_bio(io_bio);
+	if (status) {
+		/*
+		 * The memory barrier implied by the atomic_dec_return() here
+		 * pairs with the memory barrier implied by the
+		 * atomic_dec_return() or io_wait_event() in
+		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
+		 * write is observed before the load of status in
+		 * btrfs_encoded_read_regular_fill_pages().
+		 */
+		WRITE_ONCE(priv->status, status);
+	}
+	if (!atomic_dec_return(&priv->pending))
+		wake_up(&priv->wait);
+	btrfs_io_bio_free_csum(io_bio);
+	bio_put(bio);
+}
+
+static int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 offset,
+						 u64 disk_io_size, struct page **pages)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_encoded_read_private priv = {
+		.inode = inode,
+		.pending = ATOMIC_INIT(1),
+		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
+	};
+	unsigned long i = 0;
+	u64 cur = 0;
+	int ret;
+
+	init_waitqueue_head(&priv.wait);
+	/*
+	 * Submit bios for the extent, splitting due to bio or stripe limits as
+	 * necessary.
+	 */
+	while (cur < disk_io_size) {
+		struct extent_map *em;
+		struct btrfs_io_geometry geom;
+		struct bio *bio = NULL;
+		u64 remaining;
+
+		em = btrfs_get_chunk_map(fs_info, offset + cur,
+					 disk_io_size - cur);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+		} else {
+			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
+						    offset + cur, &geom);
+		}
+		if (ret) {
+			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
+			break;
+		}
+		remaining = min(geom.len, disk_io_size - cur);
+		while (bio || remaining) {
+			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
+
+			if (!bio) {
+				bio = btrfs_bio_alloc(offset + cur);
+				bio->bi_end_io = btrfs_encoded_read_endio;
+				bio->bi_private = &priv;
+				bio->bi_opf = REQ_OP_READ;
+			}
+
+			if (!bytes ||
+			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
+				blk_status_t status;
+
+				status = submit_encoded_read_bio(inode, bio, 0,
+								 0);
+				if (status) {
+					WRITE_ONCE(priv.status, status);
+					bio_put(bio);
+					goto out;
+				}
+				bio = NULL;
+				continue;
+			}
+
+			i++;
+			cur += bytes;
+			remaining -= bytes;
+		}
+	}
+
+out:
+	if (atomic_dec_return(&priv.pending))
+		io_wait_event(priv.wait, !atomic_read(&priv.pending));
+	/* See btrfs_encoded_read_endio() for ordering. */
+	return blk_status_to_errno(READ_ONCE(priv.status));
+}
+
+static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
+					  struct iov_iter *iter,
+					  u64 start, u64 lockend,
+					  struct extent_state **cached_state,
+					  u64 offset, u64 disk_io_size,
+					  size_t count,
+					  const struct encoded_iov *encoded,
+					  bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct page **pages;
+	unsigned long nr_pages, i;
+	u64 cur;
+	size_t page_offset;
+	ssize_t ret;
+
+	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_io_size,
+						    pages);
+	if (ret)
+		goto out;
+
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock_shared(inode);
+	*unlocked = true;
+
+	ret = copy_encoded_iov_to_iter(encoded, iter);
+	if (ret)
+		goto out;
+	if (encoded->compression) {
+		i = 0;
+		page_offset = 0;
+	} else {
+		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
+		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
+	}
+	cur = 0;
+	while (cur < count) {
+		size_t bytes = min_t(size_t, count - cur,
+				     PAGE_SIZE - page_offset);
+
+		if (copy_page_to_iter(pages[i], page_offset, bytes,
+				      iter) != bytes) {
+			ret = -EFAULT;
+			goto out;
+		}
+		i++;
+		cur += bytes;
+		page_offset = 0;
+	}
+	ret = count;
+out:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kfree(pages);
+	return ret;
+}
+
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	ssize_t ret;
+	size_t count;
+	u64 start, lockend, offset, disk_io_size;
+	struct extent_state *cached_state = NULL;
+	struct extent_map *em;
+	struct encoded_iov encoded = {};
+	bool unlocked = false;
+
+	ret = generic_encoded_read_checks(iocb, iter);
+	if (ret < 0)
+		return ret;
+	if (ret == 0)
+		return copy_encoded_iov_to_iter(&encoded, iter);
+	count = ret;
+
+	file_accessed(iocb->ki_filp);
+
+	inode_lock_shared(inode);
+
+	if (iocb->ki_pos >= inode->i_size) {
+		inode_unlock_shared(inode);
+		return copy_encoded_iov_to_iter(&encoded, iter);
+	}
+	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
+	/*
+	 * We don't know how long the extent containing iocb->ki_pos is, but if
+	 * it's compressed we know that it won't be longer than this.
+	 */
+	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start,
+					       lockend - start + 1);
+		if (ret)
+			goto out_unlock_inode;
+		lock_extent_bits(io_tree, start, lockend, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     lockend - start + 1);
+		if (!ordered)
+			break;
+		btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		cond_resched();
+	}
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
+			      lockend - start + 1);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_unlock_extent;
+	}
+
+	if (em->block_start == EXTENT_MAP_INLINE) {
+		u64 extent_start = em->start;
+
+		/*
+		 * For inline extents we get everything we need out of the
+		 * extent item.
+		 */
+		free_extent_map(em);
+		em = NULL;
+		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
+						&cached_state, extent_start,
+						count, &encoded, &unlocked);
+		goto out;
+	}
+
+	/*
+	 * We only want to return up to EOF even if the extent extends beyond
+	 * that.
+	 */
+	encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) -
+		       iocb->ki_pos);
+	if (em->block_start == EXTENT_MAP_HOLE ||
+	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
+		offset = EXTENT_MAP_HOLE;
+		encoded.len = encoded.unencoded_len = count =
+			min_t(u64, count, encoded.len);
+	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+		offset = em->block_start;
+		/*
+		 * Bail if the buffer isn't large enough to return the whole
+		 * compressed extent.
+		 */
+		if (em->block_len > count) {
+			ret = -ENOBUFS;
+			goto out_em;
+		}
+		disk_io_size = count = em->block_len;
+		encoded.unencoded_len = em->ram_bytes;
+		encoded.unencoded_offset = iocb->ki_pos - em->orig_start;
+		ret = encoded_iov_compression_from_btrfs(em->compress_type);
+		if (ret < 0)
+			goto out_em;
+		encoded.compression = ret;
+	} else {
+		offset = em->block_start + (start - em->start);
+		if (encoded.len > count)
+			encoded.len = count;
+		/*
+		 * Don't read beyond what we locked. This also limits the page
+		 * allocations that we'll do.
+		 */
+		disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start;
+		encoded.len = encoded.unencoded_len = count =
+			start + disk_io_size - iocb->ki_pos;
+		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
+	}
+	free_extent_map(em);
+	em = NULL;
+
+	if (offset == EXTENT_MAP_HOLE) {
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		inode_unlock_shared(inode);
+		unlocked = true;
+		ret = copy_encoded_iov_to_iter(&encoded, iter);
+		if (ret)
+			goto out;
+		ret = iov_iter_zero(count, iter);
+		if (ret != count)
+			ret = -EFAULT;
+	} else {
+		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
+						 &cached_state, offset,
+						 disk_io_size, count, &encoded,
+						 &unlocked);
+	}
+
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded.len;
+out_em:
+	free_extent_map(em);
+out_unlock_extent:
+	if (!unlocked)
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+out_unlock_inode:
+	if (!unlocked)
+		inode_unlock_shared(inode);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RESEND x3 v9 9/9] btrfs: implement RWF_ENCODED writes
  2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (7 preceding siblings ...)
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 8/9] btrfs: implement RWF_ENCODED reads Omar Sandoval
@ 2021-06-17 23:51 ` Omar Sandoval
  8 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-17 23:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs, Al Viro
  Cc: Linus Torvalds, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.

Now that read and write are implemented, this also sets the
FMODE_ENCODED_IO flag in btrfs_file_open().
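
Conversely, a hypothetical userspace sketch (not from this patch) of
writing one pre-compressed extent could look like the following, under the
same UAPI assumptions as the read example in patch 8; the helper name and
header path are illustrative:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/uio.h>
        #include <linux/encoded_io.h>   /* assumed UAPI header */

        /*
         * Write one pre-compressed zstd extent at a sector-aligned offset.
         * btrfs requires compressed_len to be strictly smaller than
         * unencoded_len.
         */
        static ssize_t write_encoded_zstd(int fd, off_t off, const void *compressed,
                                          size_t compressed_len, size_t unencoded_len)
        {
                struct encoded_iov ei = {
                        .len = unencoded_len,           /* logical bytes in the file */
                        .unencoded_len = unencoded_len, /* full decompressed length */
                        .unencoded_offset = 0,          /* no bookend: file data starts at 0 */
                        .compression = ENCODED_IOV_COMPRESSION_BTRFS_ZSTD,
                };
                struct iovec iov[2] = {
                        { .iov_base = &ei, .iov_len = sizeof(ei) },
                        { .iov_base = (void *)compressed, .iov_len = compressed_len },
                };

                return pwritev2(fd, iov, 2, off, RWF_ENCODED);
        }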

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/compression.c  |   7 +-
 fs/btrfs/compression.h  |   6 +-
 fs/btrfs/ctree.h        |   2 +
 fs/btrfs/file.c         |  38 +++++-
 fs/btrfs/inode.c        | 256 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ordered-data.c |  12 +-
 fs/btrfs/ordered-data.h |   5 +-
 7 files changed, 313 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 59733573ed08..e8146012d84b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -354,7 +354,8 @@ static void end_compressed_bio_write(struct bio *bio)
 			cb->start, cb->start + cb->len - 1,
 			bio->bi_status == BLK_STS_OK);
 
-	end_compressed_writeback(inode, cb);
+	if (cb->writeback)
+		end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
 
 	/*
@@ -390,7 +391,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 				 struct page **compressed_pages,
 				 unsigned int nr_pages,
 				 unsigned int write_flags,
-				 struct cgroup_subsys_state *blkcg_css)
+				 struct cgroup_subsys_state *blkcg_css,
+				 bool writeback)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct bio *bio = NULL;
@@ -416,6 +418,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	cb->mirror_num = 0;
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
+	cb->writeback = writeback;
 	cb->orig_bio = NULL;
 	cb->nr_pages = nr_pages;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index c359f20920d0..86b70e320813 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -52,6 +52,9 @@ struct compressed_bio {
 	/* The compression algorithm for this bio */
 	u8 compress_type;
 
+	/* Whether this is a write for writeback. */
+	bool writeback;
+
 	/* IO errors */
 	u8 errors;
 	int mirror_num;
@@ -96,7 +99,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 				  struct page **compressed_pages,
 				  unsigned int nr_pages,
 				  unsigned int write_flags,
-				  struct cgroup_subsys_state *blkcg_css);
+				  struct cgroup_subsys_state *blkcg_css,
+				  bool writeback);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index cd31dbc2e4a4..2a435c9adaec 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3202,6 +3202,8 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, int uptodate);
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			       struct encoded_iov *encoded);
 
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 89b3dd3fa9e4..deaa14a08711 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2007 Oracle.  All rights reserved.
  */
 
+#include <linux/encoded_io.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
 #include <linux/time.h>
@@ -1990,6 +1991,32 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	return written ? written : err;
 }
 
+static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	struct encoded_iov encoded;
+	ssize_t ret;
+
+	ret = copy_encoded_iov_from_iter(&encoded, from);
+	if (ret)
+		return ret;
+
+	btrfs_inode_lock(inode, 0);
+	ret = generic_encoded_write_checks(iocb, &encoded);
+	if (ret || encoded.len == 0)
+		goto out;
+
+	ret = btrfs_write_check(iocb, from, encoded.len);
+	if (ret < 0)
+		goto out;
+
+	ret = btrfs_do_encoded_write(iocb, from, &encoded);
+out:
+	btrfs_inode_unlock(inode, 0);
+	return ret;
+}
+
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 				    struct iov_iter *from)
 {
@@ -2006,14 +2033,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (test_bit(BTRFS_FS_STATE_ERROR, &inode->root->fs_info->fs_state))
 		return -EROFS;
 
-	if (!(iocb->ki_flags & IOCB_DIRECT) &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+	    (!(iocb->ki_flags & IOCB_DIRECT) ||
+	     (iocb->ki_flags & IOCB_ENCODED)))
 		return -EOPNOTSUPP;
 
 	if (sync)
 		atomic_inc(&inode->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT)
+	if (iocb->ki_flags & IOCB_ENCODED)
+		num_written = btrfs_encoded_write(iocb, from);
+	else if (iocb->ki_flags & IOCB_DIRECT)
 		num_written = btrfs_direct_write(iocb, from);
 	else
 		num_written = btrfs_buffered_write(iocb, from);
@@ -3606,7 +3636,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_ENCODED_IO;
 	return generic_file_open(inode, filp);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e09af2c8083a..81e5edee4e99 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -960,7 +960,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
 				    async_chunk->write_flags,
-				    async_chunk->blkcg_css)) {
+				    async_chunk->blkcg_css, true)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -2861,6 +2861,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 	 * except if the ordered extent was truncated.
 	 */
 	update_inode_bytes = test_bit(BTRFS_ORDERED_DIRECT, &oe->flags) ||
+			     test_bit(BTRFS_ORDERED_ENCODED, &oe->flags) ||
 			     test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags);
 
 	return insert_reserved_file_extent(trans, BTRFS_I(oe->inode),
@@ -2895,7 +2896,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
-	    !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags))
+	    !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags) &&
+	    !test_bit(BTRFS_ORDERED_ENCODED, &ordered_extent->flags))
 		clear_bits |= EXTENT_DELALLOC_NEW;
 
 	freespace_inode = btrfs_is_free_space_inode(inode);
@@ -10773,6 +10775,256 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
+ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			       struct encoded_iov *encoded)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_state *cached_state = NULL;
+	int compression;
+	size_t orig_count;
+	u64 start, end;
+	u64 num_bytes, ram_bytes, disk_num_bytes;
+	unsigned long nr_pages, i;
+	struct page **pages;
+	struct btrfs_key ins;
+	bool extent_reserved = false;
+	struct extent_map *em;
+	ssize_t ret;
+
+	switch (encoded->compression) {
+	case ENCODED_IOV_COMPRESSION_BTRFS_ZLIB:
+		compression = BTRFS_COMPRESS_ZLIB;
+		break;
+	case ENCODED_IOV_COMPRESSION_BTRFS_ZSTD:
+		compression = BTRFS_COMPRESS_ZSTD;
+		break;
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K:
+	case ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K:
+		/* The page size must match for LZO. */
+		if (encoded->compression -
+		    ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + 12 != PAGE_SHIFT)
+			return -EINVAL;
+		compression = BTRFS_COMPRESS_LZO;
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (encoded->encryption != ENCODED_IOV_ENCRYPTION_NONE)
+		return -EINVAL;
+
+	orig_count = iov_iter_count(from);
+
+	/* The extent size must be sane. */
+	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
+	    orig_count > BTRFS_MAX_COMPRESSED || orig_count == 0)
+		return -EINVAL;
+
+	/*
+	 * The compressed data must be smaller than the decompressed data.
+	 *
+	 * It's of course possible for data to compress to larger or the same
+	 * size, but the buffered I/O path falls back to no compression for such
+	 * data, and we don't want to break any assumptions by creating these
+	 * extents.
+	 *
+	 * Note that this is less strict than the current check we have that the
+	 * compressed data must be at least one sector smaller than the
+	 * decompressed data. We only want to enforce the weaker requirement
+	 * from old kernels that it is at least one byte smaller.
+	 */
+	if (orig_count >= encoded->unencoded_len)
+		return -EINVAL;
+
+	/* The extent must start on a sector boundary. */
+	start = iocb->ki_pos;
+	if (!IS_ALIGNED(start, fs_info->sectorsize))
+		return -EINVAL;
+
+	/*
+	 * The extent must end on a sector boundary. However, we allow a write
+	 * which ends at or extends i_size to have an unaligned length; we round
+	 * up the extent size and set i_size to the unaligned end.
+	 */
+	if (start + encoded->len < inode->i_size &&
+	    !IS_ALIGNED(start + encoded->len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* Finally, the offset in the unencoded data must be sector-aligned. */
+	if (!IS_ALIGNED(encoded->unencoded_offset, fs_info->sectorsize))
+		return -EINVAL;
+
+	num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
+	ram_bytes = ALIGN(encoded->unencoded_len, fs_info->sectorsize);
+	end = start + num_bytes - 1;
+
+	/*
+	 * If the extent cannot be inline, the compressed data on disk must be
+	 * sector-aligned. For convenience, we extend it with zeroes if it
+	 * isn't.
+	 */
+	disk_num_bytes = ALIGN(orig_count, fs_info->sectorsize);
+	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
+	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		if (copy_from_iter(kaddr, bytes, from) != bytes) {
+			kunmap(pages[i]);
+			ret = -EFAULT;
+			goto out_pages;
+		}
+		if (bytes < PAGE_SIZE)
+			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start, num_bytes);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     num_bytes);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, end, &cached_state);
+		cond_resched();
+	}
+
+	/*
+	 * We don't use the higher-level delalloc space functions because our
+	 * num_bytes and disk_num_bytes are different.
+	 */
+	ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), disk_num_bytes);
+	if (ret)
+		goto out_unlock;
+	ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, start,
+					num_bytes);
+	if (ret)
+		goto out_free_data_space;
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), num_bytes,
+					      disk_num_bytes);
+	if (ret)
+		goto out_qgroup_free_data;
+
+	/* Try an inline extent first. */
+	if (start == 0 && encoded->unencoded_len == encoded->len &&
+	    encoded->unencoded_offset == 0) {
+		ret = cow_file_range_inline(BTRFS_I(inode), encoded->len,
+					    orig_count, compression, pages,
+					    true);
+		if (ret <= 0) {
+			if (ret == 0)
+				ret = orig_count;
+			goto out_delalloc_release;
+		}
+	}
+
+	ret = btrfs_reserve_extent(root, disk_num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+	extent_reserved = true;
+
+	em = create_io_em(BTRFS_I(inode), start, num_bytes,
+			  start - encoded->unencoded_offset, ins.objectid,
+			  ins.offset, ins.offset, ram_bytes, compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserved;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent(BTRFS_I(inode), start, num_bytes,
+				       ram_bytes, ins.objectid, ins.offset,
+				       encoded->unencoded_offset,
+				       (1 << BTRFS_ORDERED_ENCODED) |
+				       (1 << BTRFS_ORDERED_COMPRESSED),
+				       compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserved;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + encoded->len > inode->i_size)
+		i_size_write(inode, start + encoded->len);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+
+	if (btrfs_submit_compressed_write(BTRFS_I(inode), start, num_bytes,
+					  ins.objectid, ins.offset, pages,
+					  nr_pages, 0, NULL, false)) {
+		btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), pages[0],
+						     start, end, 0);
+		ret = -EIO;
+		goto out_pages;
+	}
+	ret = orig_count;
+	goto out;
+
+out_free_reserved:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), disk_num_bytes,
+					ret < 0);
+out_qgroup_free_data:
+	if (ret < 0) {
+		btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, start,
+				       num_bytes);
+	}
+out_free_data_space:
+	/*
+	 * If btrfs_reserve_extent() succeeded, then we already decremented
+	 * bytes_may_use.
+	 */
+	if (!extent_reserved)
+		btrfs_free_reserved_data_space_noquota(fs_info, disk_num_bytes);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kvfree(pages);
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded->len;
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 96357d2c845e..07be2775b895 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -526,9 +526,15 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	spin_lock(&btrfs_inode->lock);
 	btrfs_mod_outstanding_extents(btrfs_inode, -1);
 	spin_unlock(&btrfs_inode->lock);
-	if (root != fs_info->tree_root)
-		btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes,
-						false);
+	if (root != fs_info->tree_root) {
+		u64 release;
+
+		if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
+			release = entry->disk_num_bytes;
+		else
+			release = entry->num_bytes;
+		btrfs_delalloc_release_metadata(btrfs_inode, release, false);
+	}
 
 	percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
 				 fs_info->delalloc_batch);
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 81648a78f933..888b03689a9e 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -74,6 +74,8 @@ enum {
 	BTRFS_ORDERED_LOGGED_CSUM,
 	/* We wait for this extent to complete in the current transaction */
 	BTRFS_ORDERED_PENDING,
+	/* RWF_ENCODED I/O */
+	BTRFS_ORDERED_ENCODED,
 };
 
 /* BTRFS_ORDERED_* flags that specify the type of the extent. */
@@ -81,7 +83,8 @@ enum {
 				  (1UL << BTRFS_ORDERED_NOCOW) |	\
 				  (1UL << BTRFS_ORDERED_PREALLOC) |	\
 				  (1UL << BTRFS_ORDERED_COMPRESSED) |	\
-				  (1UL << BTRFS_ORDERED_DIRECT))
+				  (1UL << BTRFS_ORDERED_DIRECT) |	\
+				  (1UL << BTRFS_ORDERED_ENCODED))
 
 struct btrfs_ordered_extent {
 	/* logical offset in the file */
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-17 23:51 ` [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter() Omar Sandoval
@ 2021-06-18 18:50   ` Linus Torvalds
  2021-06-18 19:42     ` Al Viro
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2021-06-18 18:50 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-fsdevel, linux-btrfs, Al Viro, Linux API, Kernel Team

On Thu, Jun 17, 2021 at 4:51 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> This is essentially copy_struct_from_user() but for an iov_iter.

So I continue to think that this series looks fine - if we want this
interface at all.

I do note a few issues with this iov patch, though - partly probably
because I have been reading Al's cleanup patches that had some
optimizations in place.

And in particular, I now react to this:

> +       iov_iter_advance(i, usize);

at the end of copy_struct_from_iter().

It's very wasteful to use the generic iov_iter_advance() function,
when you just had special functions for each of the iterator cases.

Because that generic function will now just end up re-testing that
whole "what kind was it" and then do each kind separately.

So it would actually be a lot simpler and more efficient to just do
that "advance" part as you go through the cases, iow just do

        iov_iter_iovec_advance(i, usize);

at the end of the iter_is_iovec/iter_is_kvec cases, and

        iov_iter_bvec_advance(i, usize)

for the bvec case.

I think that you may need it to be based on Al's series for that to
work, which might be inconvenient, though.
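
Roughly, i.e. something like this inside copy_struct_from_iter() - just
a sketch, assuming the per-type advance helpers from that series:

	if (iter_is_iovec(i) || iov_iter_is_kvec(i)) {
		/* copy min(ksize, usize), verify any user tail is zeroed */
		...
		iov_iter_iovec_advance(i, usize);
	} else if (iov_iter_is_bvec(i)) {
		/* same, but working through the bvec pages */
		...
		iov_iter_bvec_advance(i, usize);
	}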

One other non-code issue: particularly since you only handle a subset
of the iov_iter cases, it would be nice to have an explanation for
_why_ those particular cases.

IOW, have some trivial explanation for each of the cases. "iovec" is
for regular read/write, what triggers the kvec and bvec cases?

But also, the other way around. Why doesn't the pipe case trigger? No
splice support?

              Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 18:50   ` Linus Torvalds
@ 2021-06-18 19:42     ` Al Viro
  2021-06-18 19:49       ` Al Viro
  2021-06-18 20:32       ` Omar Sandoval
  0 siblings, 2 replies; 50+ messages in thread
From: Al Viro @ 2021-06-18 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 11:50:25AM -0700, Linus Torvalds wrote:

> I think that you may need it to be based on Al's series for that to
> work, which might be inconvenient, though.
> 
> One other non-code issue: particularly since you only handle a subset
> of the iov_iter cases, it would be nice to have an explanation for
> _why_ those particular cases.
> 
> IOW, have some trivial explanation for each of the cases. "iovec" is
> for regular read/write, what triggers the kvec and bvec cases?
> 
> But also, the other way around. Why doesn't the pipe case trigger? No
> splice support?

Pipe ones are strictly destinations - they can't be sources.  So if you
see it called for one of those, you've a bug.

Xarray ones are *not* - they can be sources, and that's missing here.

Much more unpleasant, though, is that this thing has hard dependency on
nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
which is completely wrong.  That sucker has some weird users left (as
of #work.iov_iter), but all of them are actually due to API deficiencies
and I very much hope to kill that thing off.

Why not simply add iov_iter_check_zeroes(), that would be called after
copy_from_iter() and verified that all that's left in the iterator
consists of zeroes?  Then this copy_struct_from_...() would be
trivial to express through those two.  And check_zeroes would also
be trivial, especially on top of #work.iov_iter.  With no calls of
iov_iter_advance() at all, while we are at it...

IDGI... Omar, what semantics do you really want from that primitive?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 19:42     ` Al Viro
@ 2021-06-18 19:49       ` Al Viro
  2021-06-18 20:33         ` Omar Sandoval
  2021-06-18 20:32       ` Omar Sandoval
  1 sibling, 1 reply; 50+ messages in thread
From: Al Viro @ 2021-06-18 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 07:42:41PM +0000, Al Viro wrote:

> Pipe ones are strictly destinations - they can't be sources.  So if you
> see it called for one of those, you've a bug.
> 
> Xarray ones are *not* - they can be sources, and that's missing here.
> 
> Much more unpleasant, though, is that this thing has hard dependency on
> nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
> which is completely wrong.  That sucker has some weird users left (as
> of #work.iov_iter), but all of them are actually due to API deficiencies
> and I very much hope to kill that thing off.
> 
> Why not simply add iov_iter_check_zeroes(), that would be called after
> copy_from_iter() and verified that all that's left in the iterator
> consists of zeroes?  Then this copy_struct_from_...() would be
> trivial to express through those two.  And check_zeroes would also
> be trivial, especially on top of #work.iov_iter.  With no calls of
> iov_iter_advance() at all, while we are at it...
> 
> IDGI... Omar, what semantics do you really want from that primitive?

And for pity sake, let's not do that EXPORT_SYMBOL_GPL() posturing there.
If it's a sane general-purpose API, it doesn't matter who uses it;
if it's not, it shouldn't be exported in the first place.

It can be implemented via the already exported primitives, so it's
not as if we prevented anyone from doing an equivalent...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 19:42     ` Al Viro
  2021-06-18 19:49       ` Al Viro
@ 2021-06-18 20:32       ` Omar Sandoval
  2021-06-18 20:58         ` Al Viro
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-18 20:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 07:42:41PM +0000, Al Viro wrote:
> On Fri, Jun 18, 2021 at 11:50:25AM -0700, Linus Torvalds wrote:
> 
> > I think that you may need it to be based on Al's series for that to
> > work, which might be inconvenient, though.
> > 
> > One other non-code issue: particularly since you only handle a subset
> > of the iov_iter cases, it would be nice to have an explanation for
> > _why_ those particular cases.
> > 
> > IOW, have some trivial explanation for each of the cases. "iovec" is
> > for regular read/write, what triggers the kvec and bvec cases?
> > 
> > But also, the other way around. Why doesn't the pipe case trigger? No
> > splice support?
> 
> Pipe ones are strictly destinations - they can't be sources.  So if you
> see it called for one of those, you've a bug.
> 
> Xarray ones are *not* - they can be sources, and that's missing here.

Ah, ITER_XARRAY was added recently so I missed it.

> Much more unpleasant, though, is that this thing has hard dependency on
> nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
> which is completely wrong.  That sucker has some weird users left (as
> of #work.iov_iter), but all of them are actually due to API deficiencies
> and I very much hope to kill that thing off.
> 
> Why not simply add iov_iter_check_zeroes(), that would be called after
> copy_from_iter() and verified that all that's left in the iterator
> consists of zeroes?  Then this copy_struct_from_...() would be
> trivial to express through those two.  And check_zeroes would also
> be trivial, especially on top of #work.iov_iter.  With no calls of
> iov_iter_advance() at all, while we are at it...
> 
> IDGI... Omar, what semantics do you really want from that primitive?

RWF_ENCODED is intended to be used like this:

	struct encoded_iov encoded_iov = {
		/* compression metadata */ ...
	};
	char compressed_data[] = ...;
	struct iovec iov[] = {
		{ &encoded_iov, sizeof(encoded_iov) },
		{ compressed_data, sizeof(compressed_data) },
	};
	pwritev2(fd, iov, 2, -1, RWF_ENCODED);

Basically, we squirrel away the compression metadata in the first
element of the iovec array, and we use iov[0].iov_len so that we can
support future extensions of struct encoded_iov in the style of
copy_struct_from_user().

So this doesn't require nr_seg == 1. On the contrary, it's expected that
the rest of the iovec has the compressed payload. And to support the
copy_struct_from_user()-style versioning, we need to know the size of
the struct encoded_iov that userspace gave us, which is the reason for
the iov_iter_single_seg_count().
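
Purely to illustrate the calling convention, the write side would parse
that roughly like so (a sketch, not the actual patch code; the PAGE_SIZE
cap and the ENCODED_IOV_SIZE_VER0 lower bound are assumptions):

	struct encoded_iov encoded = {};
	/* iov[0].iov_len is how big userspace thinks the struct is */
	size_t usize = iov_iter_single_seg_count(from);

	if (usize > PAGE_SIZE)
		return -E2BIG;
	if (usize < ENCODED_IOV_SIZE_VER0)
		return -EINVAL;
	/* copy the part both sides know about; a shorter (older) user
	 * struct leaves the newer kernel fields zeroed */
	if (!copy_from_iter_full(&encoded, min(usize, sizeof(encoded)), from))
		return -EFAULT;
	/* a larger (newer) user struct is only accepted if the extra tail
	 * is all zeroes; those bytes are then skipped */
	...
	/* everything left in the iov_iter is the compressed payload */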

I know this interface isn't the prettiest. It started as a
Btrfs-specific ioctl, but this approach was suggested as a way to avoid
having a whole new I/O path:
https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
The copy_struct_from_iter() thing was proposed as a way to allow future
extensions here:
https://lore.kernel.org/linux-btrfs/20191022020215.csdwgi3ky27rfidf@yavin.dot.cyphar.com/

Please let me know if you have any suggestions for how to improve this.

Thanks,
Omar

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 19:49       ` Al Viro
@ 2021-06-18 20:33         ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-18 20:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 07:49:53PM +0000, Al Viro wrote:
> On Fri, Jun 18, 2021 at 07:42:41PM +0000, Al Viro wrote:
> 
> > Pipe ones are strictly destinations - they can't be sources.  So if you
> > see it called for one of those, you've a bug.
> > 
> > Xarray ones are *not* - they can be sources, and that's missing here.
> > 
> > Much more unpleasant, though, is that this thing has hard dependency on
> > nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
> > which is completely wrong.  That sucker has some weird users left (as
> > of #work.iov_iter), but all of them are actually due to API deficiencies
> > and I very much hope to kill that thing off.
> > 
> > Why not simply add iov_iter_check_zeroes(), that would be called after
> > copy_from_iter() and verified that all that's left in the iterator
> > consists of zeroes?  Then this copy_struct_from_...() would be
> > trivial to express through those two.  And check_zeroes would also
> > be trivial, especially on top of #work.iov_iter.  With no calls of
> > iov_iter_advance() at all, while we are at it...
> > 
> > IDGI... Omar, what semantics do you really want from that primitive?
> 
> And for pity sake, let's not do that EXPORT_SYMBOL_GPL() posturing there.
> If it's a sane general-purpose API, it doesn't matter who uses it;
> if it's not, it shouldn't be exported in the first place.
> 
> It can be implemented via the already exported primitives, so it's
> not as if we prevented anyone from doing an equivalent...

Fair enough, I'll fix that.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 20:32       ` Omar Sandoval
@ 2021-06-18 20:58         ` Al Viro
  2021-06-18 21:10           ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Al Viro @ 2021-06-18 20:58 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Linus Torvalds, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 01:32:26PM -0700, Omar Sandoval wrote:

> RWF_ENCODED is intended to be used like this:
> 
> 	struct encoded_iov encoded_iov = {
> 		/* compression metadata */ ...
> 	};
> 	char compressed_data[] = ...;
> 	struct iovec iov[] = {
> 		{ &encoded_iov, sizeof(encoded_iov) },
> 		{ compressed_data, sizeof(compressed_data) },
> 	};
> 	pwritev2(fd, iov, 2, -1, RWF_ENCODED);
> 
> Basically, we squirrel away the compression metadata in the first
> element of the iovec array, and we use iov[0].iov_len so that we can
> support future extensions of struct encoded_iov in the style of
> copy_struct_from_user().

Yecchhh...

> So this doesn't require nr_seg == 1. On the contrary, it's expected that
> the rest of the iovec has the compressed payload. And to support the
> copy_struct_from_user()-style versioning, we need to know the size of
> the struct encoded_iov that userspace gave us, which is the reason for
> the iov_iter_single_seg_count().
> 
> I know this interface isn't the prettiest. It started as a
> Btrfs-specific ioctl, but this approach was suggested as a way to avoid
> having a whole new I/O path:
> https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> The copy_struct_from_iter() thing was proposed as a way to allow future
> extensions here:
> https://lore.kernel.org/linux-btrfs/20191022020215.csdwgi3ky27rfidf@yavin.dot.cyphar.com/
> 
> Please let me know if you have any suggestions for how to improve this.

Just put the size of the encoded part first and be done with that.
Magical effect of the iovec sizes is a bloody bad idea.

And on top of #work.iov_iter something like

bool iov_iter_check_zeroes(struct iov_iter *i, size_t bytes)
{
	bool failed = false;
        iterate_and_advance(i, bytes, base, len, off,
			failed = (check_zeroed_user(base, len) != 1),
			failed = (memchr_inv(base, 0, len) != NULL),
			)
	if (unlikely(failed))
		iov_iter_revert(i, bytes);
	return !failed;
}

would do "is that chunk all-zeroes?" just fine.  It's that simple...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 20:58         ` Al Viro
@ 2021-06-18 21:10           ` Linus Torvalds
  2021-06-18 21:32             ` Al Viro
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2021-06-18 21:10 UTC (permalink / raw)
  To: Al Viro; +Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 1:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Jun 18, 2021 at 01:32:26PM -0700, Omar Sandoval wrote:
>
> > RWF_ENCODED is intended to be used like this:
> >
> >       struct encoded_iov encoded_iov = {
> >               /* compression metadata */ ...
> >       };
> >       char compressed_data[] = ...;
> >       struct iovec iov[] = {
> >               { &encoded_iov, sizeof(encoded_iov) },
> >               { compressed_data, sizeof(compressed_data) },
> >       };
> >       pwritev2(fd, iov, 2, -1, RWF_ENCODED);
> >
> > Basically, we squirrel away the compression metadata in the first
> > element of the iovec array, and we use iov[0].iov_len so that we can
> > support future extensions of struct encoded_iov in the style of
> > copy_struct_from_user().
>
> Yecchhh...

Al, this has been true since the beginning, and was the whole point of the set.

> Just put the size of the encoded part first and be done with that.
> Magical effect of the iovec sizes is a bloody bad idea.

That makes everything uglier and more complicated, honestly. Then
you'd have to do it in _two_ operations ("get the size, then get the
rest"), *AND* you'd have to worry about all the corner-cases (ie
people putting the structure in pieces across multiple iov entries).

So it would be slower, more complex, and much more likely to have bugs.

So no. Not acceptable. The "in the first iov" is simple, efficient,
and avoids all the problems.

The size *is* encoded already - in the iov itself. Encoding it
anywhere else is much worse.

The only issue I have is that the issue itself is kind of ugly -
regardless of any iov issues. And the "encryption" side of it doesn't
actually seem to be relevant or solvable using this model anyway, so
that side is questionable.

The alternative would be to have an ioctl rather than make this be
about the IO operations (and then that encoded data would be
explicitly separate).

Which I suggested originally, but apparently people who want to use
this had some real reasons not to.

But encoding the structure without having the rule of "first iov only"
is entirely unacceptable to me. See above. It's objectively much much
worse.

             Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 21:10           ` Linus Torvalds
@ 2021-06-18 21:32             ` Al Viro
  2021-06-18 21:40               ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Al Viro @ 2021-06-18 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 02:10:36PM -0700, Linus Torvalds wrote:

> > Just put the size of the encoded part first and be done with that.
> > Magical effect of the iovec sizes is a bloody bad idea.
> 
> That makes everything uglier and more complicated, honestly. Then
> you'd have to do it in _two_ operations ("get the size, then get the
> rest"), *AND* you'd have to worry about all the corner-cases (ie
> people putting the structure in pieces across multiple iov entries.

Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
What I'm proposing is to have the size as a field in 'encoded' and
do this
	if (!copy_from_iter_full(&encoded, sizeof(encoded), &i))
		return -EFAULT;
	if (encoded.size > sizeof(encoded)) {
		// newer than what we expect
		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded)))
			return -EINVAL;
	} else if (encoded.size < sizeof(encoded)) {
		// older than what we expect
		iov_iter_revert(&i, sizeof(encoded) - encoded.size);
		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
	}

I don't think it would be more complex, but that's a matter of taste;
I *really* doubt it would be any slower or have higher odds of bugs,
regardless of the corner cases.

And it certainly would be much smaller on the lib/iov_iter.c side -
implementation of iov_iter_check_zeroes() would be simply this:

bool iov_iter_check_zeroes(struct iov_iter *i, size_t size)
{
	bool failed = false;

	iterate_and_advance(i, bytes, base, len, off,
		failed = (check_zeroed_user(base, len) != 1),
		failed = (memchr_inv(base, 0, len) != NULL))
	if (unlikely(failed))
		iov_iter_revert(i, bytes);
	return !failed;
}

And that's it, no need to do anything special for xarray, etc.
This + EXPORT_SYMBOL + extern in uio.h + snippet above in the
user...

I could buy an argument that for userland the need to add
	encoded.size = sizeof(encoded);
or equivalent when initializing that thing would make life too complex,
but on the kernel side I'd say that Omar's variant is considerably more
complex than the above...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 21:32             ` Al Viro
@ 2021-06-18 21:40               ` Linus Torvalds
  2021-06-18 22:10                 ` Omar Sandoval
  2021-06-18 22:14                 ` Al Viro
  0 siblings, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2021-06-18 21:40 UTC (permalink / raw)
  To: Al Viro; +Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 2:32 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
> What I'm proposing is to have the size as a field in 'encoded' and
> do this

Hmm. Making it part of the structure does make it easier (also for the
sending userspace side, that doesn't now have to create yet another
iov or copy the structure or whatever).

Except your code doesn't actually handle the "smaller than expected"
case correctly, since by the time it even checks for that, it will
possibly already have failed. So you actually had a bug there - you
can't use the "xyz_full()" version and get it right.

That's fixable.

So I guess I'd be ok with that version.

             Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 21:40               ` Linus Torvalds
@ 2021-06-18 22:10                 ` Omar Sandoval
  2021-06-18 22:32                   ` Al Viro
  2021-06-18 22:14                 ` Al Viro
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-18 22:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 02:40:51PM -0700, Linus Torvalds wrote:
> On Fri, Jun 18, 2021 at 2:32 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
> > What I'm proposing is to have the size as a field in 'encoded' and
> > do this
> 
> Hmm. Making it part of the structure does make it easier (also for the
> sending userspace side, that doesn't now have to create yet another
> iov or copy the structure or whatever).
> 
> Except your code doesn't actually handle the "smaller than expected"
> case correctly, since by the time it even checks for that, it will
> possibly already have failed. So you actually had a bug there - you
> can't use the "xyz_full()" version and get it right.
> 
> That's fixable.

Right, we either need to read the size first and then the rest:

	size_t copy_size;
        if (!copy_from_iter_full(&encoded.size, sizeof(encoded.size),
				 &i))
                return -EFAULT;
	if (encoded.size > PAGE_SIZE)
		return -E2BIG;
	if (encoded.size < ENCODED_IOV_SIZE_VER0)
		return -EINVAL;
	if (!copy_from_iter_full(&encoded.size + 1,
				 min(sizeof(encoded), encoded.size) - sizeof(encoded.size),
				 &i))
                return -EFAULT;
        if (encoded.size > sizeof(encoded)) {
                // newer than what we expect
                if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded)))
                        return -EINVAL;
        } else if (encoded.size < sizeof(encoded)) {
                // older than what we expect
                memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
        }

Or do the same reverting thing that Al did, but with copy_from_iter()
instead of copy_from_iter_full() and being careful with the copied count
(which I'm not 100% sure I got correct here):

	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
	if (copied < offsetofend(struct encoded_iov, size))
		return -EFAULT;
	if (encoded.size > PAGE_SIZE)
		return -E2BIG;
	if (encoded.size < ENCODED_IOV_SIZE_VER0)
		return -EINVAL;
	if (encoded.size > sizeof(encoded)) {
		if (copied < sizeof(encoded))
			return -EFAULT;
		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded)))
			return -EINVAL;
	} else if (encoded.size < sizeof(encoded)) {
		// older than what we expect
		if (copied < encoded.size)
			return -EFAULT;
		iov_iter_revert(&i, copied - encoded.size);
		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
	}    

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 21:40               ` Linus Torvalds
  2021-06-18 22:10                 ` Omar Sandoval
@ 2021-06-18 22:14                 ` Al Viro
  1 sibling, 0 replies; 50+ messages in thread
From: Al Viro @ 2021-06-18 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 02:40:51PM -0700, Linus Torvalds wrote:
> On Fri, Jun 18, 2021 at 2:32 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
> > What I'm proposing is to have the size as a field in 'encoded' and
> > do this
> 
> Hmm. Making it part of the structure does make it easier (also for the
> sending userspace side, that doesn't now have to create yet another
> iov or copy the structure or whatever).
> 
> Except your code doesn't actually handle the "smaller than expected"
> case correctly, since by the time it even checks for that, it will
> possibly already have failed. So you actually had a bug there - you
> can't use the "xyz_full()" version and get it right.

Right you are - should be something along the lines of

#define MIN_ENCODED_SIZE minimal size, e.g. offsetof of the next field after .size

	size = copy_from_iter(&encoded, sizeof(encoded), &i);
	if (unlikely(size < sizeof(encoded))) {
		// the total length is less than expected
		// must be at least encoded.size, though, and it would better
		// cover the .size field itself.
	    	if (size < MIN_ENCODED_SIZE || size < encoded.size)
			sod off
	}
	if (sizeof(encoded) < encoded.size) {
		// newer than expected
		same as in previous variant
	} else if (size > encoded.size) {
		// older than expected
		iov_iter_revert(&i, size - encoded.size);
		memset(....) as in previous variant
	}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 22:10                 ` Omar Sandoval
@ 2021-06-18 22:32                   ` Al Viro
  2021-06-19  0:43                     ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Al Viro @ 2021-06-18 22:32 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Linus Torvalds, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 03:10:03PM -0700, Omar Sandoval wrote:

> Or do the same reverting thing that Al did, but with copy_from_iter()
> instead of copy_from_iter_full() and being careful with the copied count
> (which I'm not 100% sure I got correct here):
> 
> 	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> 	if (copied < offsetofend(struct encoded_iov, size))
> 		return -EFAULT;
> 	if (encoded.size > PAGE_SIZE)
> 		return -E2BIG;
> 	if (encoded.size < ENCODED_IOV_SIZE_VER0)
> 		return -EINVAL;
> 	if (encoded.size > sizeof(encoded)) {
> 		if (copied < sizeof(encoded)
> 			return -EFAULT;
> 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> 			return -EINVAL;
> 	} else if (encoded.size < sizeof(encoded)) {
> 		// older than what we expect
> 		if (copied < encoded.size)
> 			return -EFAULT;
> 		iov_iter_revert(&i, copied - encoded.size);
> 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> 	}    

simpler than that, actually -

	copied = copy_from_iter(&encoded, sizeof(encoded), &i);
	if (unlikely(copied < sizeof(encoded))) {
		if (copied < offsetofend(struct encoded_iov, size) ||
		    copied < encoded.size)
			return iov_iter_count(i) ? -EFAULT : -EINVAL;
	}
	if (encoded.size > sizeof(encoded)) {
		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded)))
			return -EINVAL;
	} else if (encoded.size < sizeof(encoded)) {
		// copied can't be less than encoded.size here - otherwise
		// we'd have copied < sizeof(encoded) and the check above
		// would've buggered off
		iov_iter_revert(&i, copied - encoded.size);
		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
	}

should do it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-18 22:32                   ` Al Viro
@ 2021-06-19  0:43                     ` Omar Sandoval
  2021-06-21 18:46                       ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-19  0:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 10:32:54PM +0000, Al Viro wrote:
> On Fri, Jun 18, 2021 at 03:10:03PM -0700, Omar Sandoval wrote:
> 
> > Or do the same reverting thing that Al did, but with copy_from_iter()
> > instead of copy_from_iter_full() and being careful with the copied count
> > (which I'm not 100% sure I got correct here):
> > 
> > 	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> > 	if (copied < offsetofend(struct encoded_iov, size))
> > 		return -EFAULT;
> > 	if (encoded.size > PAGE_SIZE)
> > 		return -E2BIG;
> > 	if (encoded.size < ENCODED_IOV_SIZE_VER0)
> > 		return -EINVAL;
> > 	if (encoded.size > sizeof(encoded)) {
> > 		if (copied < sizeof(encoded)
> > 			return -EFAULT;
> > 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> > 			return -EINVAL;
> > 	} else if (encoded.size < sizeof(encoded)) {
> > 		// older than what we expect
> > 		if (copied < encoded.size)
> > 			return -EFAULT;
> > 		iov_iter_revert(&i, copied - encoded.size);
> > 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> > 	}    
> 
> simpler than that, actually -
> 
> 	copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> 	if (unlikely(copied < sizeof(encoded))) {
> 		if (copied < offsetofend(struct encoded_iov, size) ||
> 		    copied < encoded.size)
> 			return iov_iter_count(i) ? -EFAULT : -EINVAL;
> 	}
> 	if (encoded.size > sizeof(encoded)) {
> 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> 			return -EINVAL;
> 	} else if (encoded.size < sizeof(encoded)) {
> 		// copied can't be less than encoded.size here - otherwise
> 		// we'd have copied < sizeof(encoded) and the check above
> 		// would've buggered off
> 		iov_iter_revert(&i, copied - encoded.size);
> 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> 	}
> 
> should do it.

Thanks, Al, I'll send an updated version with this approach next week.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-19  0:43                     ` Omar Sandoval
@ 2021-06-21 18:46                       ` Omar Sandoval
  2021-06-21 19:33                         ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-21 18:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Fri, Jun 18, 2021 at 05:43:21PM -0700, Omar Sandoval wrote:
> On Fri, Jun 18, 2021 at 10:32:54PM +0000, Al Viro wrote:
> > On Fri, Jun 18, 2021 at 03:10:03PM -0700, Omar Sandoval wrote:
> > 
> > > Or do the same reverting thing that Al did, but with copy_from_iter()
> > > instead of copy_from_iter_full() and being careful with the copied count
> > > (which I'm not 100% sure I got correct here):
> > > 
> > > 	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> > > 	if (copied < offsetofend(struct encoded_iov, size))
> > > 		return -EFAULT;
> > > 	if (encoded.size > PAGE_SIZE)
> > > 		return -E2BIG;
> > > 	if (encoded.size < ENCODED_IOV_SIZE_VER0)
> > > 		return -EINVAL;
> > > 	if (encoded.size > sizeof(encoded)) {
> > > 		if (copied < sizeof(encoded)
> > > 			return -EFAULT;
> > > 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> > > 			return -EINVAL;
> > > 	} else if (encoded.size < sizeof(encoded)) {
> > > 		// older than what we expect
> > > 		if (copied < encoded.size)
> > > 			return -EFAULT;
> > > 		iov_iter_revert(&i, copied - encoded.size);
> > > 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> > > 	}    
> > 
> > simpler than that, actually -
> > 
> > 	copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> > 	if (unlikely(copied < sizeof(encoded))) {
> > 		if (copied < offsetofend(struct encoded_iov, size) ||
> > 		    copied < encoded.size)
> > 			return iov_iter_count(i) ? -EFAULT : -EINVAL;
> > 	}
> > 	if (encoded.size > sizeof(encoded)) {
> > 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> > 			return -EINVAL;
> > 	} else if (encoded.size < sizeof(encoded)) {
> > 		// copied can't be less than encoded.size here - otherwise
> > 		// we'd have copied < sizeof(encoded) and the check above
> > 		// would've buggered off
> > 		iov_iter_revert(&i, copied - encoded.size);
> > 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> > 	}
> > 
> > should do it.
> 
> Thanks, Al, I'll send an updated version with this approach next week.

Okay, so this works for the write side of RWF_ENCODED, but it causes
problems for the read side. That currently works like so:

	struct encoded_iov encoded_iov;
	char compressed_data[...];
	struct iovec iov[] = {
		{ &encoded_iov, sizeof(encoded_iov) },
		{ compressed_data, sizeof(compressed_data) },
	};
	preadv2(fd, iov, 2, -1, RWF_ENCODED);

The kernel fills in the encoded_iov with the compression metadata and
the remaining buffers with the compressed data. The kernel needs to know
how much of the iovec is for the encoded_iov. The backwards
compatibility is similar to the write side: if the kernel size is less
than the userspace size, then we can fill in extra zeroes. If the kernel
size is greater than the userspace size and all of the extra metadata is
zero, then we can omit it. If the extra metadata is non-zero, then we
return an error.

How do we get the userspace size with the encoded_iov.size approach?
We'd have to read the size from the iov_iter before writing to the rest
of the iov_iter. Is it okay to mix the iov_iter as a source and
destination like this? From what I can tell, it's not intended to be
used like this.
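
To make those rules concrete, the copy-back with the current iov[0]
convention works out to roughly the following (sketch only, not the
actual patch code):

	/* iov[0].iov_len is how much metadata userspace asked for */
	size_t usize = iov_iter_single_seg_count(to);

	if (usize >= sizeof(encoded)) {
		/* userspace struct is newer: copy ours, zero-fill the rest */
		if (copy_to_iter(&encoded, sizeof(encoded), to) != sizeof(encoded) ||
		    iov_iter_zero(usize - sizeof(encoded), to) !=
		    usize - sizeof(encoded))
			return -EFAULT;
	} else {
		/* userspace struct is older: fields it doesn't know about
		 * must be zero, or we'd silently drop metadata */
		if (memchr_inv((char *)&encoded + usize, 0,
			       sizeof(encoded) - usize))
			return -EINVAL;
		if (copy_to_iter(&encoded, usize, to) != usize)
			return -EFAULT;
	}
	/* the compressed data then fills the remaining iovec entries */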

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-21 18:46                       ` Omar Sandoval
@ 2021-06-21 19:33                         ` Linus Torvalds
  2021-06-21 20:46                           ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2021-06-21 19:33 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Al Viro, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> How do we get the userspace size with the encoded_iov.size approach?
> We'd have to read the size from the iov_iter before writing to the rest
> of the iov_iter. Is it okay to mix the iov_iter as a source and
> destination like this? From what I can tell, it's not intended to be
> used like this.

I guess it could work that way, but yes, it's ugly as hell. And I
really don't want a readv() system call - that should write to the
result buffer - to first have to read from it.

So I think the original "just make it be the first iov entry" is the
better approach, even if Al hates it.

Although I still get the feeling that using an ioctl is the *really*
correct way to go. That was my first reaction to the series
originally, and I still don't see why we'd have encoded data in a
regular read/write path.

What was the argument against ioctl's, again?

To me, this isn't all that different from the fsverity things we
added, where filesystem people were happy to try to work out some
common model and add FS_IOC_*_VERITY* ioctls.

               Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-21 19:33                         ` Linus Torvalds
@ 2021-06-21 20:46                           ` Omar Sandoval
  2021-06-21 20:53                             ` Omar Sandoval
  2021-06-21 20:55                             ` Omar Sandoval
  0 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-21 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, linux-btrfs, Linux API, Kernel Team

On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > How do we get the userspace size with the encoded_iov.size approach?
> > We'd have to read the size from the iov_iter before writing to the rest
> > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > destination like this? From what I can tell, it's not intended to be
> > used like this.
> 
> I guess it could work that way, but yes, it's ugly as hell. And I
> really don't want a readv() system call - that should write to the
> result buffer - to first have to read from it.
> 
> So I think the original "just make it be the first iov entry" is the
> better approach, even if Al hates it.
> 
> Although I still get the feeling that using an ioctl is the *really*
> correct way to go. That was my first reaction to the series
> originally, and I still don't see why we'd have encoded data in a
> regular read/write path.
> 
> What was the argument against ioctl's, again?

The suggestion came from Dave Chinner here:
https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/

His objection to an ioctl was two-fold:

1. This interface looks really similar to normal read/write, so we
   should try to use the normal read/write interface for it. Perhaps
   this trouble with iov_iter has refuted that.
2. The last time we had Btrfs-specific ioctls that eventually became
   generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
   painful. Part of the problem with clone/dedupe was that the Btrfs
   ioctls were underspecified. I think I've done a better job of
   documenting all of the semantics and corner cases for the encoded I/O
   interface (and if not, I can address this). The other part of the
   problem is that there were various sanity checks in the normal
   read/write paths that were missed or drifted out of sync in the
   ioctls. That requires some vigilance going forward. Maybe starting
   this off as a generic (not Btrfs-specific) ioctl right off the bat
   will help.

If we do go the ioctl route, then we also have to decide how much of
preadv2/pwritev2 it should emulate. Should it use the fd offset, or
should that be an ioctl argument? Some of the RWF_ flags would be useful
for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
support those? These bring us back to Dave's first point.
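
For concreteness, an ioctl would need an argument struct along these
lines, where each field duplicates something preadv2/pwritev2 already
give us for free (hypothetical, not a proposed ABI):

	struct encoded_io_args {
		const struct iovec __user *iov;
		unsigned long vlen;
		__s64 offset;	/* or -1 to use and update the fd offset? */
		__u32 flags;	/* RWF_DSYNC/RWF_SYNC/RWF_APPEND analogues */
	};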

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-21 20:46                           ` Omar Sandoval
@ 2021-06-21 20:53                             ` Omar Sandoval
  2021-06-21 20:55                             ` Omar Sandoval
  1 sibling, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-21 20:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, linux-btrfs, Linux API, Kernel Team



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-21 20:46                           ` Omar Sandoval
  2021-06-21 20:53                             ` Omar Sandoval
@ 2021-06-21 20:55                             ` Omar Sandoval
  2021-06-22 22:06                               ` Dave Chinner
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-21 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, linux-btrfs, Linux API, Kernel Team,
	Dave Chinner

On Mon, Jun 21, 2021 at 01:46:04PM -0700, Omar Sandoval wrote:
> On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> > On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> > >
> > > How do we get the userspace size with the encoded_iov.size approach?
> > > We'd have to read the size from the iov_iter before writing to the rest
> > > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > > destination like this? From what I can tell, it's not intended to be
> > > used like this.
> > 
> > I guess it could work that way, but yes, it's ugly as hell. And I
> > really don't want a readv() system call - that should write to the
> > result buffer - to first have to read from it.
> > 
> > So I think the original "just make it be the first iov entry" is the
> > better approach, even if Al hates it.
> > 
> > Although I still get the feeling that using an ioctl is the *really*
> > correct way to go. That was my first reaction to the series
> > originally, and I still don't see why we'd have encoded data in a
> > regular read/write path.
> > 
> > What was the argument against ioctl's, again?
> 
> The suggestion came from Dave Chinner here:
> https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> 
> His objection to an ioctl was two-fold:
> 
> 1. This interfaces looks really similar to normal read/write, so we
>    should try to use the normal read/write interface for it. Perhaps
>    this trouble with iov_iter has refuted that.
> 2. The last time we had Btrfs-specific ioctls that eventually became
>    generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
>    painful. Part of the problem with clone/dedupe was that the Btrfs
>    ioctls were underspecified. I think I've done a better job of
>    documenting all of the semantics and corner cases for the encoded I/O
>    interface (and if not, I can address this). The other part of the
>    problem is that there were various sanity checks in the normal
>    read/write paths that were missed or drifted out of sync in the
>    ioctls. That requires some vigilance going forward. Maybe starting
>    this off as a generic (not Btrfs-specific) ioctl right off the bat
>    will help.
> 
> If we do go the ioctl route, then we also have to decide how much of
> preadv2/pwritev2 it should emulate. Should it use the fd offset, or
> should that be an ioctl argument? Some of the RWF_ flags would be useful
> for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
> support those? These bring us back to Dave's first point.

Oops, I dropped Dave from the Cc list at some point. Adding him back
now.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-21 20:55                             ` Omar Sandoval
@ 2021-06-22 22:06                               ` Dave Chinner
  2021-06-23 17:49                                 ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2021-06-22 22:06 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, linux-btrfs, Linux API,
	Kernel Team, Dave Chinner

On Mon, Jun 21, 2021 at 01:55:03PM -0700, Omar Sandoval wrote:
> On Mon, Jun 21, 2021 at 01:46:04PM -0700, Omar Sandoval wrote:
> > On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> > > On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > >
> > > > How do we get the userspace size with the encoded_iov.size approach?
> > > > We'd have to read the size from the iov_iter before writing to the rest
> > > > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > > > destination like this? From what I can tell, it's not intended to be
> > > > used like this.
> > > 
> > > I guess it could work that way, but yes, it's ugly as hell. And I
> > > really don't want a readv() system call - that should write to the
> > > result buffer - to first have to read from it.
> > > 
> > > So I think the original "just make it be the first iov entry" is the
> > > better approach, even if Al hates it.
> > > 
> > > Although I still get the feeling that using an ioctl is the *really*
> > > correct way to go. That was my first reaction to the series
> > > originally, and I still don't see why we'd have encoded data in a
> > > regular read/write path.
> > > 
> > > What was the argument against ioctl's, again?
> > 
> > The suggestion came from Dave Chinner here:
> > https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> > 
> > His objection to an ioctl was two-fold:
> > 
> > 1. This interfaces looks really similar to normal read/write, so we
> >    should try to use the normal read/write interface for it. Perhaps
> >    this trouble with iov_iter has refuted that.
> > 2. The last time we had Btrfs-specific ioctls that eventually became
> >    generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
> >    painful. Part of the problem with clone/dedupe was that the Btrfs
> >    ioctls were underspecified. I think I've done a better job of
> >    documenting all of the semantics and corner cases for the encoded I/O
> >    interface (and if not, I can address this). The other part of the
> >    problem is that there were various sanity checks in the normal
> >    read/write paths that were missed or drifted out of sync in the
> >    ioctls. That requires some vigilance going forward. Maybe starting
> >    this off as a generic (not Btrfs-specific) ioctl right off the bat
> >    will help.
> > 
> > If we do go the ioctl route, then we also have to decide how much of
> > preadv2/pwritev2 it should emulate. Should it use the fd offset, or
> > should that be an ioctl argument? Some of the RWF_ flags would be useful
> > for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
> > support those? These bring us back to Dave's first point.
> 
> Oops, I dropped Dave from the Cc list at some point. Adding him back
> now.

Fair summary. The only other thing that I'd add is this is an IO
interface that requires issuing physical IO. So if someone wants
high throughput for encoded IO, we really need AIO and/or io_uring
support, and we get that for free if we use readv2/writev2
interfaces.

Yes, it could be an ioctl() interface, but I think that this sort of
functionality is exactly what extensible syscalls like
preadv2/pwritev2 should be used for. It's a slight variant on normal
IO, and that's exactly what the RWF_* flags are intended to be used
for - allowing interesting per-IO variant behaviour without having
to completely re-implement the IO path via custom ioctls every time
we want slightly different functionality...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-22 22:06                               ` Dave Chinner
@ 2021-06-23 17:49                                 ` Omar Sandoval
  2021-06-23 18:28                                   ` Linus Torvalds
  2021-06-23 19:45                                   ` Al Viro
  0 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-23 17:49 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Dave Chinner
  Cc: linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 08:06:39AM +1000, Dave Chinner wrote:
> On Mon, Jun 21, 2021 at 01:55:03PM -0700, Omar Sandoval wrote:
> > On Mon, Jun 21, 2021 at 01:46:04PM -0700, Omar Sandoval wrote:
> > > On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> > > > On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > >
> > > > > How do we get the userspace size with the encoded_iov.size approach?
> > > > > We'd have to read the size from the iov_iter before writing to the rest
> > > > > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > > > > destination like this? From what I can tell, it's not intended to be
> > > > > used like this.
> > > > 
> > > > I guess it could work that way, but yes, it's ugly as hell. And I
> > > > really don't want a readv() system call - that should write to the
> > > > result buffer - to first have to read from it.
> > > > 
> > > > So I think the original "just make it be the first iov entry" is the
> > > > better approach, even if Al hates it.
> > > > 
> > > > Although I still get the feeling that using an ioctl is the *really*
> > > > correct way to go. That was my first reaction to the series
> > > > originally, and I still don't see why we'd have encoded data in a
> > > > regular read/write path.
> > > > 
> > > > What was the argument against ioctl's, again?
> > > 
> > > The suggestion came from Dave Chinner here:
> > > https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> > > 
> > > His objection to an ioctl was two-fold:
> > > 
> > > 1. This interfaces looks really similar to normal read/write, so we
> > >    should try to use the normal read/write interface for it. Perhaps
> > >    this trouble with iov_iter has refuted that.
> > > 2. The last time we had Btrfs-specific ioctls that eventually became
> > >    generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
> > >    painful. Part of the problem with clone/dedupe was that the Btrfs
> > >    ioctls were underspecified. I think I've done a better job of
> > >    documenting all of the semantics and corner cases for the encoded I/O
> > >    interface (and if not, I can address this). The other part of the
> > >    problem is that there were various sanity checks in the normal
> > >    read/write paths that were missed or drifted out of sync in the
> > >    ioctls. That requires some vigilance going forward. Maybe starting
> > >    this off as a generic (not Btrfs-specific) ioctl right off the bat
> > >    will help.
> > > 
> > > If we do go the ioctl route, then we also have to decide how much of
> > > preadv2/pwritev2 it should emulate. Should it use the fd offset, or
> > > should that be an ioctl argument? Some of the RWF_ flags would be useful
> > > for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
> > > support those? These bring us back to Dave's first point.
> > 
> > Oops, I dropped Dave from the Cc list at some point. Adding him back
> > now.
> 
> Fair summary. The only other thing that I'd add is this is an IO
> interface that requires issuing physical IO. So if someone wants
> high throughput for encoded IO, we really need AIO and/or io_uring
> support, and we get that for free if we use readv2/writev2
> interfaces.
> 
> Yes, it could be an ioctl() interface, but I think that this sort of
> functionality is exactly what extensible syscalls like
> preadv2/pwritev2 should be used for. It's a slight variant on normal
> IO, and that's exactly what the RWF_* flags are intended to be used
> for - allowing interesting per-IO variant behaviour without having
> to completely re-implemnt the IO path via custom ioctls every time
> we want slightly different functionality...

Al, Linus, what do you think? Is there a path forward for this series as
is? I'd be happy to have this functionality merged in any form, but I do
think that this approach with preadv2/pwritev2 using iov_len is decent
relative to the alternatives.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 17:49                                 ` Omar Sandoval
@ 2021-06-23 18:28                                   ` Linus Torvalds
  2021-06-23 19:33                                     ` Omar Sandoval
  2021-06-23 19:45                                   ` Al Viro
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2021-06-23 18:28 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-btrfs, Linux API,
	Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 10:49 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> Al, Linus, what do you think? Is there a path forward for this series as
> is?

So the "read from user space in order to write" is a no-go for me. It
completely violates what a "read()" system call should do. It also
entirely violates what an iovec can and should do.

And honestly, if Al hates the "first iov entry" model, I'm not sure I
want to merge that version - I personally find it fine, but Al is
effectively the iov-iter maintainer.

I do worry a bit about the "first iov entry" simply because it might
work for "writev2()" when given virtual user space addresses - but I
think it's conceptually broken for things like direct-IO which might
do things by physical address, and what is a contiguous user space
virtual address is not necessarily a contiguous physical address.

Yes, the filesystem can - and does - hide that path by basically not
doing direct-IO on the first entry at all, and just treating it very
specially in the front end of the IO access, but that only reinforces
the whole "this is not at all like read/write".

Similar issues might crop up in other situations, ie splice etc, where
it's not at all obvious that the iov_iter boundaries would be
maintained as it moves through the system.

So while I personally find the "first iov entry" model fairly
reasonable, I think Dave is being disingenuous when he says that it
looks like a normal read/write. It very much does not. The above is
quite fundamental.

>  I'd be happy to have this functionality merged in any form, but I do
> think that this approach with preadv2/pwritev2 using iov_len is decent
> relative to the alternatives.

As mentioned, I find it acceptable. I'm completely unimpressed with
Dave's argument, but ioctl's aren't perfect either, so weak or not,
that argument being bogus doesn't necessarily mean that the iovec
entry model is wrong.

That said, thinking about exactly the fact that I don't think a
translation from iovec to anything else can be truly valid, I find the
iter_is_iovec() case to be the only obviously valid one.

Which gets me back to: how can any of the non-iovec alternatives ever
be valid? You did mention having missed ITER_XARRAY, but my question
is more fundamental than that. How could a non-iter_is_iovec ever be
valid? There are no possible interfaces that can generate such a thing
sanely.

                Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 18:28                                   ` Linus Torvalds
@ 2021-06-23 19:33                                     ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-23 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-btrfs, Linux API,
	Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 11:28:15AM -0700, Linus Torvalds wrote:
> On Wed, Jun 23, 2021 at 10:49 AM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > Al, Linus, what do you think? Is there a path forward for this series as
> > is?
> 
> So the "read from user space in order to write" is a no-go for me. It
> completely violates what a "read()" system call should do. It also
> entirely violates what an iovec can and should do.
> 
> And honestly, if Al hates the "first iov entry" model, I'm not sure I
> want to merge that version - I personally find it fine, but Al is
> effectively the iov-iter maintainer.
> 
> I do worry a bit about the "first iov entry" simply because it might
> work for "writev2()" when given virtual user space addresses - but I
> think it's conceptually broken for things like direct-IO which might
> do things by physical address, and what is a contiguous user space
> virtual address is not necessarily a contiguous physical address.
> 
> Yes, the filesystem can - and does - hide that path by basically not
> doing direct-IO on the first entry at all, and just treat is very
> specially in the front end of the IO access, but that only reinforces
> the whole "this is not at all like read/write".
> 
> Similar issues might crop up in other situations, ie splice etc, where
> it's not at all obvious that the iov_iter boundaries would be
> maintained as it moves through the system.
> 
> So while I personally find the "first iov entry" model fairly
> reasonable, I think Dave is being disingenuous when he says that it
> looks like a normal read/write. It very much does not. The above is
> quite fundamental.
> 
> >  I'd be happy to have this functionality merged in any form, but I do
> > think that this approach with preadv2/pwritev2 using iov_len is decent
> > relative to the alternatives.
> 
> As mentioned, I find it acceptable. I'm completely unimpressed with
> Dave's argument, but ioctl's aren't perfect either, so weak or not,
> that argument being bogus doesn't necessarily mean that the iovec
> entry model is wrong.
> 
> That said, thinking about exactly the fact that I don't think a
> translation from iovec to anything else can be truly valid, I find the
> iter_is_iovec() case to be the only obviously valid one.
> 
> Which gets me back to: how can any of the non-iovec alternatives ever
> be valid? You did mention having missed ITER_XARRAY, but my question
> is more fundamental than that. How could a non-iter_is_iovec ever be
> valid? There are no possible interfaces that can generate such a thing
> sanely.

I only implemented the bvec and kvec cases for completeness, since
copy_struct_from_iter() would appear to be a generic helper. At least
for RWF_ENCODED, a bvec seems pretty bogus, but it doesn't seem too
far-flung to imagine an in-kernel user of RWF_ENCODED that uses a kvec.

One other option that we haven't considered is ditching the
copy_struct_from_user() semantics and going the simpler route of adding
some reserved space to the end of struct encoded_iov:

struct encoded_iov {
	__aligned_u64 len;
	__aligned_u64 unencoded_len;
	__aligned_u64 unencoded_offset;
	__u32 compression;
	__u32 encryption;
	__u8 reserved[32];
};

Then we can do an unconditional copy_from_user_full(sizeof(struct
encoded_iov)) and check the reserved space in the typical fashion.
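
A minimal sketch of what that might look like on the kernel side,
assuming the header still arrives via the iov_iter ("from" below; the
error codes are illustrative, not from the posted patches):

	struct encoded_iov encoded;

	if (!copy_from_iter_full(&encoded, sizeof(encoded), from))
		return -EFAULT;
	/* Reject anything in the reserved space we don't understand. */
	if (memchr_inv(encoded.reserved, 0, sizeof(encoded.reserved)))
		return -EINVAL;
	/* The rest of the iter is the encoded data itself. */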

(And in the unlikely case that we use up all of that space with
extensions, I suppose we could have an RWF_ENCODED2 with a matching
struct encoded_iov2.)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 17:49                                 ` Omar Sandoval
  2021-06-23 18:28                                   ` Linus Torvalds
@ 2021-06-23 19:45                                   ` Al Viro
  2021-06-23 20:46                                     ` Omar Sandoval
  1 sibling, 1 reply; 50+ messages in thread
From: Al Viro @ 2021-06-23 19:45 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Linus Torvalds, Dave Chinner, linux-fsdevel, linux-btrfs,
	Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 10:49:51AM -0700, Omar Sandoval wrote:

> > Fair summary. The only other thing that I'd add is this is an IO
> > interface that requires issuing physical IO. So if someone wants
> > high throughput for encoded IO, we really need AIO and/or io_uring
> > support, and we get that for free if we use readv2/writev2
> > interfaces.
> > 
> > Yes, it could be an ioctl() interface, but I think that this sort of
> > functionality is exactly what extensible syscalls like
> > preadv2/pwritev2 should be used for. It's a slight variant on normal
> > IO, and that's exactly what the RWF_* flags are intended to be used
> > for - allowing interesting per-IO variant behaviour without having
> > to completely re-implement the IO path via custom ioctls every time
> > we want slightly different functionality...
> 
> Al, Linus, what do you think? Is there a path forward for this series as
> is? I'd be happy to have this functionality merged in any form, but I do
> think that this approach with preadv2/pwritev2 using iov_len is decent
> relative to the alternatives.

IMO we might be better off with explicit ioctl - this magical mystery shite
with special meaning of the first iovec length is, IMO, more than enough
to make it a bad fit for read/write family.

It's *not* just a "slightly different functionality" - it's very different
calling conventions.  And the deeper one needs to dig into the interface
details to parse what's going on, the less it differs from ioctl() mess.

Said that, why do you need a variable-length header on the read side,
in the first place?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 19:45                                   ` Al Viro
@ 2021-06-23 20:46                                     ` Omar Sandoval
  2021-06-23 21:39                                       ` Al Viro
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-23 20:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-fsdevel, linux-btrfs,
	Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 07:45:59PM +0000, Al Viro wrote:
> On Wed, Jun 23, 2021 at 10:49:51AM -0700, Omar Sandoval wrote:
> 
> > > Fair summary. The only other thing that I'd add is this is an IO
> > > interface that requires issuing physical IO. So if someone wants
> > > high throughput for encoded IO, we really need AIO and/or io_uring
> > > support, and we get that for free if we use readv2/writev2
> > > interfaces.
> > > 
> > > Yes, it could be an ioctl() interface, but I think that this sort of
> > > functionality is exactly what extensible syscalls like
> > > preadv2/pwritev2 should be used for. It's a slight variant on normal
> > > IO, and that's exactly what the RWF_* flags are intended to be used
> > > for - allowing interesting per-IO variant behaviour without having
> > > to completely re-implement the IO path via custom ioctls every time
> > > we want slightly different functionality...
> > 
> > Al, Linus, what do you think? Is there a path forward for this series as
> > is? I'd be happy to have this functionality merged in any form, but I do
> > think that this approach with preadv2/pwritev2 using iov_len is decent
> > relative to the alternatives.
> 
> IMO we might be better off with explicit ioctl - this magical mystery shite
> with special meaning of the first iovec length is, IMO, more than enough
> to make it a bad fit for read/write family.
> 
> It's *not* just a "slightly different functionality" - it's very different
> calling conventions.  And the deeper one needs to dig into the interface
> details to parse what's going on, the less it differs from ioctl() mess.
> 
> Said that, why do you need a variable-length header on the read side,
> in the first place?

Suppose we add a new field representing a new type of encoding to the
end of encoded_iov. On the write side, the caller might want to specify
that the data is encoded in that new way, of course. But on the read
side, if the data is encoded in that new way, then the kernel will want
to return that. The kernel needs to know if the user's structure
includes the new field (otherwise when it copies the full struct out, it
will write into what the user thinks is the data instead).

As I mentioned in my reply to Linus, maybe we can stick with
preadv2/pwritev2, but make the struct encoded_iov structure a fixed size
with some reserved space for future expansion. That makes this a lot
less special: just copy a fixed size structure, then read/write the
rest. And then we don't need to reinvent the rest of the
preadv2/pwritev2 path for an ioctl.

Between a fixed size structure and an ioctl, what would you prefer?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 20:46                                     ` Omar Sandoval
@ 2021-06-23 21:39                                       ` Al Viro
  2021-06-23 21:58                                         ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Al Viro @ 2021-06-23 21:39 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Linus Torvalds, Dave Chinner, linux-fsdevel, linux-btrfs,
	Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 01:46:50PM -0700, Omar Sandoval wrote:

> Suppose we add a new field representing a new type of encoding to the
> end of encoded_iov. On the write side, the caller might want to specify
> that the data is encoded in that new way, of course. But on the read
> side, if the data is encoded in that new way, then the kernel will want
> to return that. The kernel needs to know if the user's structure
> includes the new field (otherwise when it copies the full struct out, it
> will write into what the user thinks is the data instead).

Er...  What's the problem with simply copying that extended structure out,
followed by the data?

IOW, why can't the caller pick the header out of the whole thing and
deal with it in whatever way it likes?  Why should kernel need to do
anything special here?

IDGI...  Userland had always been able to deal with that kind of stuff;
you read e.g. gzipped data into buffer, you decode the header, you figure
out how long it is and how far out does the payload begin, etc.

How is that different?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 21:39                                       ` Al Viro
@ 2021-06-23 21:58                                         ` Omar Sandoval
  2021-06-23 22:26                                           ` Al Viro
  2021-06-24  2:00                                           ` Matthew Wilcox
  0 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-23 21:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-fsdevel, linux-btrfs,
	Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 09:39:48PM +0000, Al Viro wrote:
> On Wed, Jun 23, 2021 at 01:46:50PM -0700, Omar Sandoval wrote:
> 
> > Suppose we add a new field representing a new type of encoding to the
> > end of encoded_iov. On the write side, the caller might want to specify
> > that the data is encoded in that new way, of course. But on the read
> > side, if the data is encoded in that new way, then the kernel will want
> > to return that. The kernel needs to know if the user's structure
> > includes the new field (otherwise when it copies the full struct out, it
> > will write into what the user thinks is the data instead).
> 
> Er...  What's the problem with simply copying that extended structure out,
> followed by the data?
> 
> IOW, why can't the caller pick the header out of the whole thing and
> deal with it in whatever way it likes?  Why should kernel need to do
> anything special here?
> 
> IDGI...  Userland had always been able to deal with that kind of stuff;
> you read e.g. gzipped data into buffer, you decode the header, you figure
> out how long it is and how far out does the payload begin, etc.
> 
> How is that different?

Ah, I was stuck on thinking about this calling convention:

	struct encoded_iov encoded_iov;
	char compressed_data[...];
	struct iovec iov[] = {
		{ &encoded_iov, sizeof(encoded_iov) },
		{ compressed_data, sizeof(compressed_data) },
	};
	preadv2(fd, iov, 2, -1, RWF_ENCODED);

But what you described would look more like:

	// Needs to be large enough for maximum returned header + data.
	char buffer[...];
	struct iovec iov[] = {
		{ buffer, sizeof(buffer) },
	};
	preadv2(fd, iov, 1, -1, RWF_ENCODED);
	// We should probably align the buffer.
	struct encoded_iov *encoded_iov = (void *)buffer;
	char *data = buffer + encoded_iov->size;

That's a little uglier, but it should work, and allows for arbitrary
extensions. So, among these three alternatives (fixed size structure
with reserved space, variable size structure like above, or ioctl),
which would you prefer?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 21:58                                         ` Omar Sandoval
@ 2021-06-23 22:26                                           ` Al Viro
  2021-06-24  2:00                                           ` Matthew Wilcox
  1 sibling, 0 replies; 50+ messages in thread
From: Al Viro @ 2021-06-23 22:26 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Linus Torvalds, Dave Chinner, linux-fsdevel, linux-btrfs,
	Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 02:58:32PM -0700, Omar Sandoval wrote:

> Ah, I was stuck on thinking about this calling convention:
> 
> 	struct encoded_iov encoded_iov;
> 	char compressed_data[...];
> 	struct iovec iov[] = {
> 		{ &encoded_iov, sizeof(encoded_iov) },
> 		{ compressed_data, sizeof(compressed_data) },
> 	};
> 	preadv2(fd, iov, 2, -1, RWF_ENCODED);
> 
> But what you described would look more like:
> 
> 	// Needs to be large enough for maximum returned header + data.
> 	char buffer[...];
> 	struct iovec iov[] = {
> 		{ buffer, sizeof(buffer) },
> 	};
> 	preadv2(fd, iov, 1, -1, RWF_ENCODED);
> 	// We should probably align the buffer.
> 	struct encoded_iov *encoded_iov = (void *)buffer;
> 	char *data = buffer + encoded_iov->size;
> 
> That's a little uglier, but it should work, and allows for arbitrary
> extensions. So, among these three alternatives (fixed size structure
> with reserved space, variable size structure like above, or ioctl),
> which would you prefer?

Variable-sized structure would seem to be the easiest from the kernel
POV and the interface is the easiest to describe - "you read the
encoded data preceded by the header"...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-23 21:58                                         ` Omar Sandoval
  2021-06-23 22:26                                           ` Al Viro
@ 2021-06-24  2:00                                           ` Matthew Wilcox
  2021-06-24  6:14                                             ` Omar Sandoval
  2021-06-24  6:41                                             ` Christoph Hellwig
  1 sibling, 2 replies; 50+ messages in thread
From: Matthew Wilcox @ 2021-06-24  2:00 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 02:58:32PM -0700, Omar Sandoval wrote:
> But what you described would look more like:
> 
> 	// Needs to be large enough for maximum returned header + data.
> 	char buffer[...];
> 	struct iovec iov[] = {
> 		{ buffer, sizeof(buffer) },
> 	};
> 	preadv2(fd, iov, 1, -1, RWF_ENCODED);
> 	// We should probably align the buffer.
> 	struct encoded_iov *encoded_iov = (void *)buffer;
> 	char *data = buffer + encoded_iov->size;
> 
> That's a little uglier, but it should work, and allows for arbitrary
> extensions. So, among these three alternatives (fixed size structure
> with reserved space, variable size structure like above, or ioctl),
> which would you prefer?

Does that work for O_DIRECT and the required 512-byte alignment?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24  2:00                                           ` Matthew Wilcox
@ 2021-06-24  6:14                                             ` Omar Sandoval
  2021-06-24 17:52                                               ` Linus Torvalds
  2021-06-24  6:41                                             ` Christoph Hellwig
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-24  6:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Thu, Jun 24, 2021 at 03:00:39AM +0100, Matthew Wilcox wrote:
> On Wed, Jun 23, 2021 at 02:58:32PM -0700, Omar Sandoval wrote:
> > But what you described would look more like:
> > 
> > 	// Needs to be large enough for maximum returned header + data.
> > 	char buffer[...];
> > 	struct iovec iov[] = {
> > 		{ buffer, sizeof(buffer) },
> > 	};
> > 	preadv2(fd, iov, 1, -1, RWF_ENCODED);
> > 	// We should probably align the buffer.
> > 	struct encoded_iov *encoded_iov = (void *)buffer;
> > 	char *data = buffer + encoded_iov->size;
> > 
> > That's a little uglier, but it should work, and allows for arbitrary
> > extensions. So, among these three alternatives (fixed size structure
> > with reserved space, variable size structure like above, or ioctl),
> > which would you prefer?
> 
> Does that work for O_DIRECT and the required 512-byte alignment?

I suppose the kernel could pad the encoded_iov structure with zeroes to
the next sector boundary, since zeroes are effectively noops for
encoded_iov. (As an aside, RWF_ENCODED is always "direct I/O" in the
sense that it bypasses the page cache, but not necessarily in the sense
that it does DMA to/from the user buffers. The Btrfs implementation
doesn't do the latter yet.)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24  2:00                                           ` Matthew Wilcox
  2021-06-24  6:14                                             ` Omar Sandoval
@ 2021-06-24  6:41                                             ` Christoph Hellwig
  2021-06-24  7:50                                               ` Omar Sandoval
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2021-06-24  6:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Omar Sandoval, Al Viro, Linus Torvalds, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

I'm also really worried with overloading the regular r/w path and
iov_iter with ever more special cases.  We already have various
performance problems in the path, and adding more special cases ain't
gonna help.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24  6:41                                             ` Christoph Hellwig
@ 2021-06-24  7:50                                               ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-06-24  7:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Al Viro, Linus Torvalds, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Thu, Jun 24, 2021 at 07:41:12AM +0100, Christoph Hellwig wrote:
> I'm also really worried with overloading the regular r/w path and
> iov_iter with ever more special cases.  We already have various
> performance problems in the path, and adding more special cases ain't
> gonna help.

The changes to the normal path are:

* An extra check for RWF_ENCODED and FMODE_ENCODED_IO in kiocb_set_rw_flags().
* Splitting some of the checks in generic_write_checks() into a new
  function.
* Checks for the IOCB_ENCODED flag in the filesystem's
  read_iter/write_iter.

At least for Btrfs, the rest happens in a completely separate code path.
So, there are a couple of extra checks, but it's not as drastic as it
might first appear.
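
For reference, the kiocb_set_rw_flags() check boils down to something
like this (a sketch -- the exact error code and flag plumbing are
approximations, not copied from the patches):

	/* Inside kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags): */
	if (flags & RWF_ENCODED) {
		/* FMODE_ENCODED_IO is presumably set at open via O_ALLOW_ENCODED. */
		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
			return -EPERM;
		kiocb_flags |= IOCB_ENCODED;
	}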

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24  6:14                                             ` Omar Sandoval
@ 2021-06-24 17:52                                               ` Linus Torvalds
  2021-06-24 18:28                                                 ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2021-06-24 17:52 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Matthew Wilcox, Al Viro, Dave Chinner, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Wed, Jun 23, 2021 at 11:15 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Thu, Jun 24, 2021 at 03:00:39AM +0100, Matthew Wilcox wrote:
> >
> > Does that work for O_DIRECT and the required 512-byte alignment?
>
> I suppose the kernel could pad the encoded_iov structure with zeroes to
> the next sector boundary, since zeroes are effectively noops for
> encoded_iov.

Ugh.

I really think the whole "embed the control structure in the stream"
is wrong. The alignment issue is just another sign of that.

Separating it out is the right thing to do. At least the "first iov
entry" thing did separate the control structure from the actual data.
I detest the whole "embed the two together".

            Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24 17:52                                               ` Linus Torvalds
@ 2021-06-24 18:28                                                 ` Omar Sandoval
  2021-06-24 21:07                                                   ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-24 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Al Viro, Dave Chinner, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Thu, Jun 24, 2021 at 10:52:17AM -0700, Linus Torvalds wrote:
> On Wed, Jun 23, 2021 at 11:15 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Thu, Jun 24, 2021 at 03:00:39AM +0100, Matthew Wilcox wrote:
> > >
> > > Does that work for O_DIRECT and the required 512-byte alignment?
> >
> > I suppose the kernel could pad the encoded_iov structure with zeroes to
> > the next sector boundary, since zeroes are effectively noops for
> > encoded_iov.
> 
> Ugh.
> 
> I really think the whole "embed the control structure in the stream"
> is wrong. The alignment issue is just another sign of that.
> 
> Separating it out is the right thing to do. At least the "first iov
> entry" thing did separate the control structure from the actual data.
> I detest the whole "embed the two together".

I'll suggest the fixed-size struct encoded_iov again, then. If we're
willing to give up some of the flexibility of a variable size, then
userspace can always put the fixed-size structure in its own iovec or
include it inline with the data, depending on what's more convenient and
whether it's using O_DIRECT. A fixed size is much easier for both the
kernel and userspace to deal with. Do we really need to support
unlimited extensions to encoded_iov, or can we stick 32-64 bytes of
reserved space at the end of the structure and call it a day?
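
For the O_DIRECT case, that would look something like the following
from userspace (a sketch, assuming the kernel consumes the header
before issuing any I/O, so only the data buffer needs alignment):

	struct encoded_iov encoded_iov;	/* fixed size, no alignment needed */
	void *data;

	/* 1 MiB data buffer aligned for O_DIRECT. */
	if (posix_memalign(&data, 512, 1 << 20))
		err(1, "posix_memalign");

	struct iovec iov[] = {
		{ &encoded_iov, sizeof(encoded_iov) },
		{ data, 1 << 20 },
	};
	ssize_t ret = preadv2(fd, iov, 2, -1, RWF_ENCODED);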

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24 18:28                                                 ` Omar Sandoval
@ 2021-06-24 21:07                                                   ` Linus Torvalds
  2021-06-24 22:41                                                     ` Martin K. Petersen
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2021-06-24 21:07 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Matthew Wilcox, Al Viro, Dave Chinner, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Thu, Jun 24, 2021 at 11:28 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> I'll suggest the fixed-size struct encoded_iov again, then. If we're
> willing to give up some of the flexibility of a variable size, then
> userspace can always put the fixed-size structure in its own iovec or
> include it inline with the data, depending on what's more convenient and
> whether it's using O_DIRECT.

I really would prefer to have the separate pointer to it.

Fixed size doesn't help. It's still "mixed in" unless you have a
clearly separate pointer. Sure, user space *could* use a separate iov
entry if it wants to, but then it becomes a user choice rather than
part of the design.

That separate data structure would be the only way to do it for a
ioctl() interface, but in the readv/writev world the whole separate
"first iov entry" does that too.

I also worry that this "raw compressed data" thing isn't the only
thing people will want to do. I could easily see some kind of
"end-to-end CRC read/write" where the user passes in not just the
data, but also checksums for it to validate it (maybe because you're
doing a file copy and had the original checksums, but also maybe
because user space simply has a known good copy and doesn't want
errors re-introduced due to memory corruption).

And I continue to think that this whole issue isn't all that different
from the FSVERITY thing.

Of course, the real take-away is that "preadv2/pwritev2()" is a
horrible interface. It should have been more extensible, rather than
the lazy "just add another flag argument".

I think we finally may have gotten a real extensible interface right
with openat2(), and that "open_how" thing, but maybe I'm being naive
and it will turn out that that wasn't so great either.

Maybe we'll some day end up with a "preadv3()" that has an extensible
"struct io_how" argument.

Interfaces are hard.

                Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24 21:07                                                   ` Linus Torvalds
@ 2021-06-24 22:41                                                     ` Martin K. Petersen
  2021-06-25  3:38                                                       ` Matthew Wilcox
  0 siblings, 1 reply; 50+ messages in thread
From: Martin K. Petersen @ 2021-06-24 22:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Omar Sandoval, Matthew Wilcox, Al Viro, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner


Linus,

> I also worry that this "raw compressed data" thing isn't the only
> thing people will want to do. I could easily see some kind of
> "end-to-end CRC read/write" where the user passes in not just the
> data, but also checksums for it to validate it (maybe because you're
> doing a file copy and had the original checksums, but also maybe
> because user space simply has a known good copy and doesn't want
> errors re-introduced due to memory corruption).

We already support passing CRCs down to be validated by the hardware for
both NVMe and SCSI. This currently only works from the block layer
down. When enabled, the checksums are generated by the block layer for
writes and the data is validated against the checksums sent by the
storage on reads.

Over the years various attempts at adding support for passing the
checksum buffers in from userland have failed for exactly the reasons
outlined in this thread (Joel, Darrick, Bob). Would love to have a
generic way of passing this kind of information...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-24 22:41                                                     ` Martin K. Petersen
@ 2021-06-25  3:38                                                       ` Matthew Wilcox
  2021-06-25 16:16                                                         ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2021-06-25  3:38 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Linus Torvalds, Omar Sandoval, Al Viro, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Thu, Jun 24, 2021 at 06:41:52PM -0400, Martin K. Petersen wrote:
> 
> Linus,
> 
> > I also worry that this "raw compressed data" thing isn't the only
> > thing people will want to do. I could easily see some kind of
> > "end-to-end CRC read/write" where the user passes in not just the
> > data, but also checksums for it to validate it (maybe because you're
> > doing a file copy and had the original checksums, but also maybe
> > because user space simply has a known good copy and doesn't want
> > errors re-introduced due to memory corruption).
> 
> We already support passing CRCs down to be validated by the hardware for
> both NVMe and SCSI. This currently only works from the block layer
> down. When enabled, the checksums are generated by the block layer for
> writes and the data is validated against the checksums sent by the
> storage on reads.
> 
> Over the years various attempts at adding support for passing the
> checksum buffers in from userland have failed for exactly the reasons
> outlined in this thread (Joel, Darrick, Bob). Would love to have a
> generic way of passing this kind of information...

Does it make any kind of sense to talk about doing this for buffered I/O,
given that we can't generate them for (eg) mmaped files?  Or does this
only make sense to pass in for O_DIRECT accesses?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-25  3:38                                                       ` Matthew Wilcox
@ 2021-06-25 16:16                                                         ` Linus Torvalds
  2021-06-25 21:07                                                           ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2021-06-25 16:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Martin K. Petersen, Omar Sandoval, Al Viro, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Does it make any kind of sense to talk about doing this for buffered I/O,
> given that we can't generate them for (eg) mmaped files?

Sure we can.

Or rather, some people might very well like to do it even for mutable
data. In fact, _especially_ for mutable data.

You might want to do things like "write out the state I verified just
a moment ago", and if it has changed since then, you *want* the result
to be invalid because the checksums no longer match - in case somebody
else changed the data you used for the state calculation and
verification in the meantime. It's very much why you'd want a separate
checksum in the first place.

Yeah, yeah,  you can - and people do - just do things like this with a
separate checksum. But if you know that the filesystem has internal
checksumming support _anyway_, you might want to use it, and basically
say "use this checksum, if the data doesn't match when I read it back
I want to get an IO error".

(The "data doesn't match" _could_ be just due to DRAM corruption etc,
of course. Some people care about things like that. You want
"verified" filesystem contents - it might not be about security, it
might simply be about "I have validated this data and if it's not the
same data any more it's useless and I need to re-generate it").

Am I a big believer in this model? No. Portability concerns (across
OS'es, across filesystems, even just across backups on the same exact
system) means that even if we did this, very few people would use it.

People who want this end up using an external checksum instead and do
it outside of and separately from the actual IO, because then they can
do it on existing systems.

So my argument is not "we want this". My argument is purely that some
buffered filesystem IO case isn't actually any different from the
traditional "I want access to the low-level sector hardware checksum
data". The use cases are basically exactly the same.

Of course, basically nobody does that hw sector checksum either, for
all the same reasons, even if it's been around for decades.

So my "checksum metadata interface" is not something I'm a big
believer in, but I really don't think it's really all _that_ different
from the whole "compressed format interface" that this whole patch
series is about. They are pretty much the same thing in many ways.

                Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-25 16:16                                                         ` Linus Torvalds
@ 2021-06-25 21:07                                                           ` Omar Sandoval
  2021-07-07 17:59                                                             ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-06-25 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Martin K. Petersen, Al Viro, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Fri, Jun 25, 2021 at 09:16:15AM -0700, Linus Torvalds wrote:
> On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Does it make any kind of sense to talk about doing this for buffered I/O,
> > given that we can't generate them for (eg) mmaped files?
> 
> Sure we can.
> 
> Or rather, some people might very well like to do it even for mutable
> data. In fact, _especially_ for mutable data.
> 
> You might want to do things like "write out the state I verified just
> a moment ago", and if it has changed since then, you *want* the result
> to be invalid because the checksums no longer match - in case somebody
> else changed the data you used for the state calculation and
> verification in the meantime. It's very much why you'd want a separate
> checksum in the first place.
> 
> Yeah, yeah,  you can - and people do - just do things like this with a
> separate checksum. But if you know that the filesystem has internal
> checksumming support _anyway_, you might want to use it, and basically
> say "use this checksum, if the data doesn't match when I read it back
> I want to get an IO error".
> 
> (The "data doesn't match" _could_ be just due to DRAM corruption etc,
> of course. Some people care about things like that. You want
> "verified" filesystem contents - it might not be about security, it
> might simply be about "I have validated this data and if it's not the
> same data any more it's useless and I need to re-generate it").
> 
> Am I a big believer in this model? No. Portability concerns (across
> OS'es, across filesystems, even just across backups on the same exact
> system) means that even if we did this, very few people would use it.
> 
> People who want this end up using an external checksum instead and do
> it outside of and separately from the actual IO, because then they can
> do it on existing systems.
> 
> So my argument is not "we want this". My argument is purely that some
> buffered filesystem IO case isn't actually any different from the
> traditional "I want access to the low-level sector hardware checksum
> data". The use cases are basically exactly the same.
> 
> Of course, basically nobody does that hw sector checksum either, for
> all the same reasons, even if it's been around for decades.
> 
> So my "checksum metadata interface" is not something I'm a big
> believer in, but I really don't think it's really all _that_ different
> from the whole "compressed format interface" that this whole patch
> series is about. They are pretty much the same thing in many ways.

I see the similarity in the sense that we basically want to pass some
extra metadata down with the read or write. So then do we want to add
preadv3/pwritev3 for encoded I/O now so that checksums can use it in the
future? The encoding metadata could go in this "struct io_how", either
directly or in a separate structure with a pointer in "struct io_how".
It could get messy with compat syscalls.
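
Purely as a strawman (none of this exists; the names just mirror
openat2()'s struct open_how), that could look like:

	/* Hypothetical extensible per-IO argument for preadv3()/pwritev3(). */
	struct io_how {
		__u64 flags;		/* RWF_* */
		__u64 encoded;		/* user pointer to struct encoded_iov, or 0 */
		__u64 encoded_size;	/* size of the caller's struct encoded_iov */
	};

	ssize_t preadv3(int fd, const struct iovec *iov, int iovcnt,
			off_t offset, struct io_how *how, size_t how_size);

The nested user pointer is where most of the compat awkwardness
mentioned above would come from.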

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-06-25 21:07                                                           ` Omar Sandoval
@ 2021-07-07 17:59                                                             ` Omar Sandoval
  2021-07-19 15:44                                                               ` Josef Bacik
  0 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-07-07 17:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Martin K. Petersen, Al Viro, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

On Fri, Jun 25, 2021 at 02:07:59PM -0700, Omar Sandoval wrote:
> On Fri, Jun 25, 2021 at 09:16:15AM -0700, Linus Torvalds wrote:
> > On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > Does it make any kind of sense to talk about doing this for buffered I/O,
> > > given that we can't generate them for (eg) mmaped files?
> > 
> > Sure we can.
> > 
> > Or rather, some people might very well like to do it even for mutable
> > data. In fact, _especially_ for mutable data.
> > 
> > You might want to do things like "write out the state I verified just
> > a moment ago", and if it has changed since then, you *want* the result
> > to be invalid because the checksums no longer match - in case somebody
> > else changed the data you used for the state calculation and
> > verification in the meantime. It's very much why you'd want a separate
> > checksum in the first place.
> > 
> > Yeah, yeah,  you can - and people do - just do things like this with a
> > separate checksum. But if you know that the filesystem has internal
> > checksumming support _anyway_, you might want to use it, and basically
> > say "use this checksum, if the data doesn't match when I read it back
> > I want to get an IO error".
> > 
> > (The "data doesn't match" _could_ be just due to DRAM corruption etc,
> > of course. Some people care about things like that. You want
> > "verified" filesystem contents - it might not be about security, it
> > might simply be about "I have validated this data and if it's not the
> > same data any more it's useless and I need to re-generate it").
> > 
> > Am I a big believer in this model? No. Portability concerns (across
> > OS'es, across filesystems, even just across backups on the same exact
> > system) means that even if we did this, very few people would use it.
> > 
> > People who want this end up using an external checksum instead and do
> > it outside of and separately from the actual IO, because then they can
> > do it on existing systems.
> > 
> > So my argument is not "we want this". My argument is purely that some
> > buffered filesystem IO case isn't actually any different from the
> > traditional "I want access to the low-level sector hardware checksum
> > data". The use cases are basically exactly the same.
> > 
> > Of course, basically nobody does that hw sector checksum either, for
> > all the same reasons, even if it's been around for decades.
> > 
> > So my "checksum metadata interface" is not something I'm a big
> > believer in, but I really don't think it's really all _that_ different
> > from the whole "compressed format interface" that this whole patch
> > series is about. They are pretty much the same thing in many ways.
> 
> I see the similarity in the sense that we basically want to pass some
> extra metadata down with the read or write. So then do we want to add
> preadv3/pwritev3 for encoded I/O now so that checksums can use it in the
> future? The encoding metadata could go in this "struct io_how", either
> directly or in a separate structure with a pointer in "struct io_how".
> It could get messy with compat syscalls.

Ping. What's the path forward here? At this point, it seems like an
ioctl is the path of least resistance.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter()
  2021-07-07 17:59                                                             ` Omar Sandoval
@ 2021-07-19 15:44                                                               ` Josef Bacik
  0 siblings, 0 replies; 50+ messages in thread
From: Josef Bacik @ 2021-07-19 15:44 UTC (permalink / raw)
  To: Omar Sandoval, Linus Torvalds
  Cc: Matthew Wilcox, Martin K. Petersen, Al Viro, Dave Chinner,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team, Dave Chinner

On 7/7/21 1:59 PM, Omar Sandoval wrote:
> On Fri, Jun 25, 2021 at 02:07:59PM -0700, Omar Sandoval wrote:
>> On Fri, Jun 25, 2021 at 09:16:15AM -0700, Linus Torvalds wrote:
>>> On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> Does it make any kind of sense to talk about doing this for buffered I/O,
>>>> given that we can't generate them for (eg) mmaped files?
>>>
>>> Sure we can.
>>>
>>> Or rather, some people might very well like to do it even for mutable
>>> data. In fact, _especially_ for mutable data.
>>>
>>> You might want to do things like "write out the state I verified just
>>> a moment ago", and if it has changed since then, you *want* the result
>>> to be invalid because the checksums no longer match - in case somebody
>>> else changed the data you used for the state calculation and
>>> verification in the meantime. It's very much why you'd want a separate
>>> checksum in the first place.
>>>
>>> Yeah, yeah,  you can - and people do - just do things like this with a
>>> separate checksum. But if you know that the filesystem has internal
>>> checksumming support _anyway_, you might want to use it, and basically
>>> say "use this checksum, if the data doesn't match when I read it back
>>> I want to get an IO error".
>>>
>>> (The "data doesn't match" _could_ be just due to DRAM corruption etc,
>>> of course. Some people care about things like that. You want
>>> "verified" filesystem contents - it might not be about security, it
>>> might simply be about "I have validated this data and if it's not the
>>> same data any more it's useless and I need to re-generate it").
>>>
>>> Am I a big believer in this model? No. Portability concerns (across
>>> OS'es, across filesystems, even just across backups on the same exact
>>> system) means that even if we did this, very few people would use it.
>>>
>>> People who want this end up using an external checksum instead and do
>>> it outside of and separately from the actual IO, because then they can
>>> do it on existing systems.
>>>
>>> So my argument is not "we want this". My argument is purely that some
>>> buffered filesystem IO case isn't actually any different from the
>>> traditional "I want access to the low-level sector hardware checksum
>>> data". The use cases are basically exactly the same.
>>>
>>> Of course, basically nobody does that hw sector checksum either, for
>>> all the same reasons, even if it's been around for decades.
>>>
>>> So my "checksum metadata interface" is not something I'm a big
>>> believer in, but I really don't think it's really all _that_ different
>>> from the whole "compressed format interface" that this whole patch
>>> series is about. They are pretty much the same thing in many ways.
>>
>> I see the similarity in the sense that we basically want to pass some
>> extra metadata down with the read or write. So then do we want to add
>> preadv3/pwritev3 for encoded I/O now so that checksums can use it in the
>> future? The encoding metadata could go in this "struct io_how", either
>> directly or in a separate structure with a pointer in "struct io_how".
>> It could get messy with compat syscalls.
> 
> Ping. What's the path forward here? At this point, it seems like an
> ioctl is the path of least resistance.
> 

At this point we've been deadlocked on this for too long.  Put it in a btrfs 
IOCTL, if somebody wants to extend it generically in the future then godspeed, 
we can tie into that interface after the fact.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2021-07-19 16:11 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-17 23:51 [PATCH RESEND x3 v9 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 1/9] iov_iter: add copy_struct_from_iter() Omar Sandoval
2021-06-18 18:50   ` Linus Torvalds
2021-06-18 19:42     ` Al Viro
2021-06-18 19:49       ` Al Viro
2021-06-18 20:33         ` Omar Sandoval
2021-06-18 20:32       ` Omar Sandoval
2021-06-18 20:58         ` Al Viro
2021-06-18 21:10           ` Linus Torvalds
2021-06-18 21:32             ` Al Viro
2021-06-18 21:40               ` Linus Torvalds
2021-06-18 22:10                 ` Omar Sandoval
2021-06-18 22:32                   ` Al Viro
2021-06-19  0:43                     ` Omar Sandoval
2021-06-21 18:46                       ` Omar Sandoval
2021-06-21 19:33                         ` Linus Torvalds
2021-06-21 20:46                           ` Omar Sandoval
2021-06-21 20:53                             ` Omar Sandoval
2021-06-21 20:55                             ` Omar Sandoval
2021-06-22 22:06                               ` Dave Chinner
2021-06-23 17:49                                 ` Omar Sandoval
2021-06-23 18:28                                   ` Linus Torvalds
2021-06-23 19:33                                     ` Omar Sandoval
2021-06-23 19:45                                   ` Al Viro
2021-06-23 20:46                                     ` Omar Sandoval
2021-06-23 21:39                                       ` Al Viro
2021-06-23 21:58                                         ` Omar Sandoval
2021-06-23 22:26                                           ` Al Viro
2021-06-24  2:00                                           ` Matthew Wilcox
2021-06-24  6:14                                             ` Omar Sandoval
2021-06-24 17:52                                               ` Linus Torvalds
2021-06-24 18:28                                                 ` Omar Sandoval
2021-06-24 21:07                                                   ` Linus Torvalds
2021-06-24 22:41                                                     ` Martin K. Petersen
2021-06-25  3:38                                                       ` Matthew Wilcox
2021-06-25 16:16                                                         ` Linus Torvalds
2021-06-25 21:07                                                           ` Omar Sandoval
2021-07-07 17:59                                                             ` Omar Sandoval
2021-07-19 15:44                                                               ` Josef Bacik
2021-06-24  6:41                                             ` Christoph Hellwig
2021-06-24  7:50                                               ` Omar Sandoval
2021-06-18 22:14                 ` Al Viro
2021-06-17 23:51 ` [PATCH RESEND x3 v9 2/9] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 3/9] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 4/9] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 5/9] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 6/9] btrfs: support different disk extent size for delalloc Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 7/9] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 8/9] btrfs: implement RWF_ENCODED reads Omar Sandoval
2021-06-17 23:51 ` [PATCH RESEND x3 v9 9/9] btrfs: implement RWF_ENCODED writes Omar Sandoval

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).