* [PATCH man-pages] Document encoded I/O
  2019-10-15 18:42 [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Omar Sandoval
@ 2019-10-15 18:42 ` Omar Sandoval
  2019-10-20 23:05   ` [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Dave Chinner
  2019-10-21  6:18   ` [PATCH man-pages] Document encoded I/O Amir Goldstein
  2019-10-15 18:42 ` [RFC PATCH v2 1/5] fs: add O_ENCODED open flag Omar Sandoval
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-15 18:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, Jann Horn, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

This adds a new page, rwf_encoded(7), providing an overview of encoded
I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
reference it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 man2/fcntl.2       |  10 +-
 man2/open.2        |  13 ++
 man2/readv.2       |  46 +++++++
 man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 365 insertions(+), 1 deletion(-)
 create mode 100644 man7/rwf_encoded.7

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index fce4f4c2b..76fe9cc6f 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -222,8 +222,9 @@ On Linux, this command can change only the
 .BR O_ASYNC ,
 .BR O_DIRECT ,
 .BR O_NOATIME ,
+.BR O_NONBLOCK ,
 and
-.B O_NONBLOCK
+.B O_ENCODED
 flags.
 It is not possible to change the
 .BR O_DSYNC
@@ -1803,6 +1804,13 @@ Attempted to clear the
 flag on a file that has the append-only attribute set.
 .TP
 .B EPERM
+Attempted to set the
+.B O_ENCODED
+flag and the calling process did not have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
+.B EPERM
 .I cmd
 was
 .BR F_ADD_SEALS ,
diff --git a/man2/open.2 b/man2/open.2
index b0f485b41..cdd3c549c 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -421,6 +421,14 @@ was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.B O_ENCODED
+Open the file with encoded I/O permissions;
+see
+.BR rwf_encoded (7).
+The caller must have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller was not privileged.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B O_ENCODED
+flag was specified, but the caller was not privileged.
+.TP
 .B EROFS
 .I pathname
 refers to a file on a read-only filesystem and write access was
diff --git a/man2/readv.2 b/man2/readv.2
index af27aa63e..aa60b980a 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -265,6 +265,11 @@ the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.6)"
+Read or write encoded (e.g., compressed) data.
+See
+.BR rwf_encoded (7).
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -284,6 +289,13 @@ than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B
+RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the number of encoded bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
 The errors are as given for
@@ -314,6 +326,40 @@ is less than zero or greater than the permitted maximum.
 .TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the file was not opened with the
+.B O_ENCODED
+flag.
+.PP
+.BR preadv2 ()
+can fail for the following reasons:
+.TP
+.B EFBIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and buffers in
+.I iov
+were not big enough to return the encoded data.
+.PP
+.BR pwritev2 ()
+can fail for the following reasons:
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
 .SH VERSIONS
 .BR preadv ()
 and
diff --git a/man7/rwf_encoded.7 b/man7/rwf_encoded.7
new file mode 100644
index 000000000..90f5292e2
--- /dev/null
+++ b/man7/rwf_encoded.7
@@ -0,0 +1,297 @@
+.\" Copyright (c) 2019 by Omar Sandoval <osandov@fb.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\"
+.TH RWF_ENCODED  7 2019-10-14 "Linux" "Linux Programmer's Manual"
+.SH NAME
+rwf_encoded \- overview of encoded I/O
+.SH DESCRIPTION
+Several filesystems (e.g., Btrfs) support transparent encoding
+(e.g., compression, encryption) of data on disk:
+written data is encoded by the kernel before it is written to disk,
+and read data is decoded before being returned to the user.
+In some cases, it is useful to skip this encoding step.
+For example, the user may want to read the compressed contents of a file
+or write pre-compressed data directly to a file.
+This is referred to as "encoded I/O".
+.SS Encoded I/O API
+Encoded I/O is specified with the
+.B RWF_ENCODED
+flag to
+.BR preadv2 (2)
+and
+.BR pwritev2 (2).
+If
+.B RWF_ENCODED
+is specified, then
+.I iov[0].iov_base
+points to an
+.I
+encoded_iov
+structure, defined in
+.I <linux/fs.h>
+as:
+.PP
+.in +4n
+.EX
+struct encoded_iov {
+    __u64 len;
+    __u64 unencoded_len;
+    __u64 unencoded_offset;
+    __u32 compression;
+    __u32 encryption;
+
+};
+.EE
+.in
+.PP
+.I iov[0].iov_len
+must be set to
+.IR "sizeof(struct\ encoded_iov)" .
+The remaining buffers contain the encoded data.
+.PP
+.I compression
+and
+.I encryption
+are the encoding fields.
+.I compression
+is one of
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero),
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+or
+.BR ENCODED_IOV_COMPRESSION_ZSTD .
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.PP
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+.I unencoded_offset
+is the offset into the unencoded data where the data in the file begins
+(strictly less than
+.IR unencoded_len ).
+.I len
+is the length of the data in the file.
+.PP
+In most cases,
+.I len
+is equal to
+.I unencoded_len
+and
+.I unencoded_offset
+is zero.
+However, it may be necessary to refer to a subset of the unencoded data,
+usually because a read occurred in the middle of an encoded extent,
+because part of an extent was overwritten or deallocated in some
+way (e.g., with
+.BR write (2),
+.BR truncate (2),
+or
+.BR fallocate (2))
+or because part of an extent was added to the file (e.g., with
+.BR ioctl_ficlonerange (2)
+or
+.BR ioctl_fideduperange (2)).
+For example, if
+.I len
+is 300,
+.I unencoded_len
+is 1000,
+and
+.I unencoded_offset
+is 600,
+then the encoded data is 1000 bytes long when decoded,
+of which only the 300 bytes starting at offset 600 are used;
+the first 600 and last 100 bytes should be ignored.
+.PP
+Additionally,
+.I len
+may be greater than
+.I unencoded_len
+\-
+.IR unencoded_offset ;
+in this case, the data in the file is longer than the unencoded data,
+and the difference is zero-filled.
+.PP
+If the unencoded data is actually longer than
+.IR unencoded_len ,
+then it is truncated;
+if it is shorter, then it is extended with zeroes.
+.PP
+For
+.BR pwritev2 (),
+the metadata should be specified in
+.IR iov[0] ,
+and the encoded data should be passed in the remaining buffers.
+This returns the number of encoded bytes written (that is, the sum of
+.I iov[n].iov_len
+for 1 <=
+.I n
+<
+.IR iovcnt ;
+partial writes will not occur).
+If the
+.I offset
+argument to
+.BR pwritev2 ()
+is \-1, then the file offset is incremented by
+.IR len .
+At least one encoding field must be non-zero.
+Note that the encoded data is not validated when it is written;
+if it is not valid (e.g., it cannot be decompressed),
+then a subsequent read may result in an error.
+.PP
+For
+.BR preadv2 (),
+the metadata is returned in
+.IR iov[0] ,
+and the encoded data is returned in the remaining buffers.
+This returns the number of encoded bytes read.
+Note that a return value of zero does not indicate end of file;
+one should refer to
+.I len
+(for example, a hole in the file has a non-zero
+.I len
+but a zero return value).
+A
+.I len
+of zero indicates end of file.
+If the
+.I offset
+argument to
+.BR preadv2 ()
+is \-1, then the file offset is incremented by
+.IR len .
+If the provided buffers are not large enough to return an entire encoded
+extent,
+then this returns \-1 and sets
+.I errno
+to
+.BR EFBIG .
+This will only return one encoded extent per call.
+This can also read data which is not encoded;
+all encoding fields will be zero in that case.
+.SS Security
+Encoded I/O creates the potential for some security issues:
+.IP * 3
+Encoded writes allow writing arbitrary data which the kernel will decode on
+a subsequent read. Decompression algorithms are complex and may have bugs
+which can be exploited by malicious data.
+.IP *
+Encoded reads may return data which is not logically present in the file
+(see the discussion of
+.I len
+vs.
+.I unencoded_len
+above).
+It may not be intended for this data to be readable.
+.PP
+Therefore, encoded I/O requires privilege.
+Namely, the
+.B RWF_ENCODED
+flag may only be used when the file was opened with the
+.B O_ENCODED
+flag to
+.BR open (2),
+which requires the
+.B CAP_SYS_ADMIN
+capability.
+.B O_ENCODED
+may be set and cleared with
+.BR fcntl (2).
+Note that it is not cleared on
+.BR fork (2)
+or
+.BR execve (2);
+one may wish to use
+.B O_CLOEXEC
+with
+.BR O_ENCODED .
+.SS Filesystem support
+Encoded I/O is supported on the following filesystems:
+.TP
+Btrfs (since Linux 5.6)
+.IP
+Btrfs supports encoded reads and writes of compressed data.
+The data is encoded as follows:
+.RS
+.IP * 3
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+then the encoded data is a single zlib stream.
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+then the encoded data is compressed page by page with LZO1X
+and wrapped in the format described in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZSTD ,
+then the encoded data is a single zstd frame compressed with the
+.I windowLog
+compression parameter set to no more than 17.
+.RE
+.IP
+Additionally, there are some restrictions on
+.BR pwritev2 ():
+.RS
+.IP * 3
+.I offset
+(or the current file offset if
+.I offset
+is \-1) must be aligned to the sector size of the filesystem.
+.IP *
+.I len
+must be aligned to the sector size of the filesystem
+unless the data ends at or beyond the current end of the file.
+.IP *
+.I unencoded_len
+and the length of the encoded data must each be no more than 128 KiB.
+This limit may increase in the future.
+.IP *
+The length of the encoded data rounded up to the nearest sector must be
+less than
+.I unencoded_len
+rounded up to the nearest sector.
+.IP *
+Referring to a subset of unencoded data is not yet implemented; i.e.,
+.I len
+must equal
+.I unencoded_len
+and
+.I unencoded_offset
+must be zero.
+.IP *
+Writing compressed inline extents is not yet implemented.
+.RE
-- 
2.23.0



* [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data
  2019-10-15 18:42 ` [PATCH man-pages] Document encoded I/O Omar Sandoval
@ 2019-10-15 18:42 Omar Sandoval
  2019-10-15 18:42 ` [PATCH man-pages] Document encoded I/O Omar Sandoval
                   ` (5 more replies)
  1 sibling, 6 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-15 18:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, Jann Horn, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Hello,

This series adds an API for reading compressed data on a filesystem
without decompressing it as well as support for writing compressed data
directly to the filesystem. It is based on my previous series which
added a Btrfs-specific ioctl [1], but it is now an extension to
preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
man page patch describing the API in detail. Test cases and example
programs are available [3].

The use case that I have in mind is Btrfs send/receive: currently, when
sending data from one compressed filesystem to another, the sending side
decompresses the data and the receiving side recompresses it before
writing it out. This is wasteful and can be avoided if we can just send
and write compressed extents. The send part will be implemented in a
separate series, as this API can stand alone.

Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
Patch 4 implements encoded reads for Btrfs, and patch 5 implements
encoded writes.

Changes from v1 [4]:

- Encoded reads are now also implemented.
- The encoded_iov structure now includes metadata for referring to a
  subset of decoded data. This is required to handle certain cases where
  a compressed extent is truncated, hole punched, or otherwise sliced up
  and Btrfs chooses to reflect this in metadata instead of decompressing
  the whole extent and rewriting the pieces. We call these "bookend
  extents" in Btrfs, but any filesystem supporting transparent encoding
  is likely to have a similar concept.
- The behavior of the filesystem when the decompressed data is longer
  than or shorter than expected is more strictly defined (truncate and
  zero extend, respectively).
- As pointed out by Jann Horn [5], the capability check done at
  read/write time in v1 was incorrect; v2 adds an explicit open flag
  (which can be changed with fcntl()). As this can be trivially combined
  with O_CLOEXEC, I did not add any sort of automatic clearing on exec.
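
To make the subset and truncate/zero-extend semantics above concrete,
here is a minimal userspace sketch. It is not taken from the series: the
names encoded_meta and encoded_logical_view() are hypothetical, the
struct only mirrors the fields of struct encoded_iov from patch 2, and
the logic follows my reading of the man page text (normalize the decoded
data to unencoded_len, then take len bytes starting at unencoded_offset,
zero-filling whatever is missing).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the fields of struct encoded_iov. */
struct encoded_meta {
	uint64_t len;              /* length of the data in the file */
	uint64_t unencoded_len;    /* length of the decoded extent */
	uint64_t unencoded_offset; /* where the file data starts in it */
	uint32_t compression;
	uint32_t encryption;
};

/*
 * Fill out[0..len) with the bytes that are logically present in the file,
 * given the already-decoded bytes of the extent.
 * Returns 0 on success or -1 if the metadata is inconsistent.
 */
static int encoded_logical_view(const struct encoded_meta *e,
				const unsigned char *decoded,
				size_t decoded_len, unsigned char *out)
{
	uint64_t avail, n;

	assert(e && out && (decoded || decoded_len == 0));

	/* unencoded_offset must be strictly less than unencoded_len. */
	if (e->unencoded_offset >= e->unencoded_len)
		return -1;

	/* Anything not backed by decoded data reads back as zeroes. */
	memset(out, 0, e->len);

	/* Decoded data beyond unencoded_len is truncated. */
	avail = decoded_len < e->unencoded_len ? decoded_len : e->unencoded_len;
	if (avail <= e->unencoded_offset)
		return 0;

	n = avail - e->unencoded_offset;
	if (n > e->len)
		n = e->len;
	memcpy(out, decoded + e->unencoded_offset, n);
	return 0;
}
```

With len = 300, unencoded_len = 1000, and unencoded_offset = 600, this
copies bytes 600..899 of the decoded extent and ignores the rest, which
is the bookend extent case described above.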

I wanted to get the ball rolling on reviewing the interface, so the
Btrfs implementation has a couple of smaller todos:

- Encoded reads do not yet implement repair for disk/checksum failures.
- Encoded writes do not yet support inline extents or bookend extents.
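
For callers who want to sanity-check a write before issuing it, the
Btrfs restrictions listed in the man page patch can be sketched as a
pure function. This is only an illustration under assumptions: the
function name, the parameter list, and the idea of passing in the
sector size and current file size are mine, and the inline extent
restriction is not modeled because it cannot be decided from these
values alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Current limit on unencoded_len and on the encoded data length. */
#define SKETCH_MAX_EXTENT (128 * 1024)

static uint64_t round_up_sector(uint64_t x, uint64_t sectorsize)
{
	return (x + sectorsize - 1) / sectorsize * sectorsize;
}

/*
 * Hypothetical pre-check for an encoded pwritev2() on Btrfs.
 * offset:      file offset of the write
 * i_size:      current size of the file
 * encoded_len: total length of the encoded data (iov[1..])
 */
static bool btrfs_encoded_write_ok(uint64_t offset, uint64_t i_size,
				   uint64_t sectorsize, uint64_t len,
				   uint64_t unencoded_len,
				   uint64_t unencoded_offset,
				   uint64_t encoded_len)
{
	assert(sectorsize != 0);

	/* The offset must be sector aligned. */
	if (offset % sectorsize)
		return false;
	/* len must be aligned unless the data ends at or beyond EOF. */
	if (len % sectorsize && offset + len < i_size)
		return false;
	/* Both lengths are capped at 128 KiB for now. */
	if (unencoded_len > SKETCH_MAX_EXTENT || encoded_len > SKETCH_MAX_EXTENT)
		return false;
	/* The encoded data must actually save at least one sector. */
	if (round_up_sector(encoded_len, sectorsize) >=
	    round_up_sector(unencoded_len, sectorsize))
		return false;
	/* Bookend extents (subsets of the unencoded data) are not supported. */
	if (len != unencoded_len || unencoded_offset != 0)
		return false;
	return true;
}
```

A real caller would still have to handle errors from pwritev2() itself;
passing this check does not guarantee that the kernel accepts the write.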

This is based on v5.4-rc3.

Please share any comments on the API or implementation. Thanks!

1: https://lore.kernel.org/linux-fsdevel/cover.1567623877.git.osandov@fb.com/
2: https://lore.kernel.org/linux-fsdevel/20190906212710.GI7452@vader/
3: https://github.com/osandov/xfstests/tree/rwf-encoded
4: https://lore.kernel.org/linux-btrfs/cover.1568875700.git.osandov@fb.com/
5: https://lore.kernel.org/linux-btrfs/CAG48ez2GKv15Uj6Wzv0sG5v2bXyrSaCtRTw5Ok_ovja_CiO_fQ@mail.gmail.com/

Omar Sandoval (5):
  fs: add O_ENCODED open flag
  fs: add RWF_ENCODED for reading/writing compressed data
  btrfs: generalize btrfs_lookup_bio_sums_dio()
  btrfs: implement RWF_ENCODED reads
  btrfs: implement RWF_ENCODED writes

 fs/btrfs/compression.c           |   6 +-
 fs/btrfs/compression.h           |   5 +-
 fs/btrfs/ctree.h                 |   9 +-
 fs/btrfs/file-item.c             |  18 +-
 fs/btrfs/file.c                  |  52 ++-
 fs/btrfs/inode.c                 | 663 ++++++++++++++++++++++++++++++-
 fs/fcntl.c                       |  10 +-
 fs/namei.c                       |   4 +
 include/linux/fcntl.h            |   2 +-
 include/linux/fs.h               |  14 +
 include/uapi/asm-generic/fcntl.h |   4 +
 include/uapi/linux/fs.h          |  26 +-
 mm/filemap.c                     |  82 +++-
 13 files changed, 851 insertions(+), 44 deletions(-)

-- 
2.23.0



* [RFC PATCH v2 1/5] fs: add O_ENCODED open flag
  2019-10-15 18:42 [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Omar Sandoval
  2019-10-15 18:42 ` [PATCH man-pages] Document encoded I/O Omar Sandoval
@ 2019-10-15 18:42 ` Omar Sandoval
  2019-10-19  4:50   ` Aleksa Sarai
  2019-10-15 18:42 ` [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-15 18:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, Jann Horn, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

The upcoming RWF_ENCODED operation introduces some security concerns:

1. Compressed writes will pass arbitrary data to decompression
   algorithms in the kernel.
2. Compressed reads can leak truncated/hole punched data.

Therefore, we need to require privilege for RWF_ENCODED. It's not
possible to do the permissions checks at the time of the read or write
because, e.g., io_uring submits IO from a worker thread. So, add an open
flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
fcntl(). The flag is not cleared in any way on fork or exec; it should
probably be used with O_CLOEXEC in most cases.

Note that the usual issue that unknown open flags are ignored doesn't
really matter for O_ENCODED; if the kernel doesn't support O_ENCODED,
then it doesn't support RWF_ENCODED, either.
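
As an illustration of the fcntl() rule, the check can be modeled in
userspace as a pure function. This is a sketch, not the kernel code: the
function name is hypothetical, O_ENCODED_SKETCH stands in for the
asm-generic value proposed in this patch, and the capability is passed
in as a boolean rather than queried with capable().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* asm-generic value proposed by this patch; not in released headers. */
#define O_ENCODED_SKETCH 040000000

static_assert((O_ENCODED_SKETCH & (O_ENCODED_SKETCH - 1)) == 0,
	      "the flag is expected to be a single bit");

/*
 * Model of the F_SETFL check: only a transition from "not set" to "set"
 * requires CAP_SYS_ADMIN. Keeping or clearing the flag is always allowed.
 * Returns 0 if the change is permitted and -EPERM otherwise.
 */
static int setfl_encoded_check(unsigned long old_flags,
			       unsigned long new_flags,
			       bool has_cap_sys_admin)
{
	bool setting = (new_flags & O_ENCODED_SKETCH) &&
		       !(old_flags & O_ENCODED_SKETCH);

	if (setting && !has_cap_sys_admin)
		return -EPERM;
	return 0;
}
```

The practical consequence is that a privileged process can open a file
with O_ENCODED, drop privileges, and keep using the descriptor, and an
unprivileged holder can clear the flag but not set it again.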

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/fcntl.c                       | 10 ++++++++--
 fs/namei.c                       |  4 ++++
 include/linux/fcntl.h            |  2 +-
 include/uapi/asm-generic/fcntl.h |  4 ++++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..45ebc6df078e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -30,7 +30,8 @@
 #include <asm/siginfo.h>
 #include <linux/uaccess.h>
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+		    O_ENCODED)
 
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
@@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 
+	/* O_ENCODED can only be set by superuser */
+	if ((arg & O_ENCODED) && !(filp->f_flags & O_ENCODED) &&
+	    !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	/* required for strict SunOS emulation */
 	if (O_NONBLOCK != O_NDELAY)
 	       if (arg & O_NDELAY)
@@ -1031,7 +1037,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..ae86b125888a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2978,6 +2978,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
 	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
 		return -EPERM;
 
+	/* O_ENCODED can only be set by superuser */
+	if ((flag & O_ENCODED) && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	return 0;
 }
 
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..5fac02479639 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ENCODED)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..8c5cbd5942e3 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,10 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+#ifndef O_ENCODED
+#define O_ENCODED	040000000
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
-- 
2.23.0



* [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-15 18:42 [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Omar Sandoval
  2019-10-15 18:42 ` [PATCH man-pages] Document encoded I/O Omar Sandoval
  2019-10-15 18:42 ` [RFC PATCH v2 1/5] fs: add O_ENCODED open flag Omar Sandoval
@ 2019-10-15 18:42 ` Omar Sandoval
  2019-10-16  9:50   ` Nikolay Borisov
                     ` (2 more replies)
  2019-10-15 18:42 ` [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio() Omar Sandoval
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-15 18:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, Jann Horn, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Btrfs supports transparent compression: data written by the user can be
compressed when written to disk and decompressed when read back.
However, we'd like to add an interface to write pre-compressed data
directly to the filesystem, and the matching interface to read
compressed data without decompressing it. This adds support for
so-called "encoded I/O" via preadv2() and pwritev2().

A new RWF_ENCODED flag indicates that a read or write is "encoded". If
this flag is set, iov[0].iov_base points to a struct encoded_iov which
is used for metadata: namely, the compression algorithm, unencoded
(i.e., decompressed) length, and what subrange of the unencoded data
should be used (needed for truncated or hole-punched extents and when
reading in the middle of an extent). For reads, the filesystem returns
this information; for writes, the caller provides it to the filesystem.
iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
used to extend the interface in the future. The remaining iovecs contain
the encoded extent.

Filesystems must indicate that they support encoded I/O by setting
FMODE_ENCODED_IO in ->file_open().
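
A minimal userspace sketch of the iovec layout and of the metadata
checks described above follows. It is an illustration only: the SK_*
constants, struct encoded_iov_sketch, and both helper names are
hypothetical stand-ins for the definitions this patch adds to
include/uapi/linux/fs.h, and the checks mirror only the metadata part of
import_encoded_write() (the O_ENCODED and copy-from-user checks are
left out).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

enum {
	SK_COMPRESSION_NONE,
	SK_COMPRESSION_ZLIB,
	SK_COMPRESSION_LZO,
	SK_COMPRESSION_ZSTD,
	SK_COMPRESSION_TYPES = SK_COMPRESSION_ZSTD,
};

enum {
	SK_ENCRYPTION_NONE,
	SK_ENCRYPTION_TYPES = SK_ENCRYPTION_NONE,
};

/* Same field layout as the struct encoded_iov proposed in this patch. */
struct encoded_iov_sketch {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* iov[0] carries the metadata; the remaining iovecs carry encoded data. */
static void encoded_write_iov(struct iovec iov[2],
			      struct encoded_iov_sketch *hdr,
			      void *data, size_t data_len)
{
	assert(iov && hdr);
	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}

/* Returns 0 if the metadata would pass the generic checks, else -EINVAL. */
static int encoded_write_hdr_check(const struct iovec *iov0,
				   const struct encoded_iov_sketch *hdr)
{
	if (iov0->iov_len != sizeof(*hdr))
		return -EINVAL;
	/* At least one encoding field must be non-zero for a write. */
	if (hdr->compression == SK_COMPRESSION_NONE &&
	    hdr->encryption == SK_ENCRYPTION_NONE)
		return -EINVAL;
	/* Unknown encodings are rejected. */
	if (hdr->compression > SK_COMPRESSION_TYPES ||
	    hdr->encryption > SK_ENCRYPTION_TYPES)
		return -EINVAL;
	if (hdr->unencoded_offset >= hdr->unencoded_len)
		return -EINVAL;
	return 0;
}
```

Because iov[0].iov_len has to match the structure size exactly, a later
kernel could tell an extended structure apart from this one by its size.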

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/fs.h      | 14 +++++++
 include/uapi/linux/fs.h | 26 ++++++++++++-
 mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
 3 files changed, 108 insertions(+), 14 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0d909d35763..54681f21e05e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File supports encoded IO */
+#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
@@ -314,6 +317,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_ENCODED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+struct encoded_iov;
+extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
+extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
+extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
+				struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
 				loff_t *count, unsigned int remap_flags);
@@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 			return -EOPNOTSUPP;
 		ki->ki_flags |= IOCB_NOWAIT;
 	}
+	if (flags & RWF_ENCODED) {
+		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
+			return -EOPNOTSUPP;
+		ki->ki_flags |= IOCB_ENCODED;
+	}
 	if (flags & RWF_HIPRI)
 		ki->ki_flags |= IOCB_HIPRI;
 	if (flags & RWF_DSYNC)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..ed92a8a257cb 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -284,6 +284,27 @@ struct fsxattr {
 
 typedef int __bitwise __kernel_rwf_t;
 
+enum {
+	ENCODED_IOV_COMPRESSION_NONE,
+	ENCODED_IOV_COMPRESSION_ZLIB,
+	ENCODED_IOV_COMPRESSION_LZO,
+	ENCODED_IOV_COMPRESSION_ZSTD,
+	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
+};
+
+enum {
+	ENCODED_IOV_ENCRYPTION_NONE,
+	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
+};
+
+struct encoded_iov {
+	__u64 len;
+	__u64 unencoded_len;
+	__u64 unencoded_offset;
+	__u32 compression;
+	__u32 encryption;
+};
+
 /* high priority request, poll if possible */
 #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
 
@@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* encoded (e.g., compressed or encrypted) IO */
+#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ENCODED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 1146fcfa3215..d2e6d9caf353 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
 	return 0;
 }
 
-/*
- * Performs necessary checks before doing a write
- *
- * Can adjust writing position or amount of bytes to write.
- * Returns appropriate error code that caller should return or
- * zero in case that write should be allowed.
- */
-inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	loff_t count;
-	int ret;
 
 	if (IS_SWAPFILE(inode))
 		return -ETXTBSY;
 
-	if (!iov_iter_count(from))
+	if (!*count)
 		return 0;
 
 	/* FIXME: this is for backwards compatibility with 2.4 */
@@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EINVAL;
 
-	count = iov_iter_count(from);
-	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
+}
+
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Can adjust writing position or amount of bytes to write.
+ * Returns a negative errno or the new number of bytes to write.
+ */
+inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	loff_t count = iov_iter_count(from);
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
 	if (ret)
 		return ret;
 
@@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int generic_encoded_write_checks(struct kiocb *iocb,
+				 struct encoded_iov *encoded)
+{
+	loff_t count = encoded->unencoded_len;
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
+	if (ret)
+		return ret;
+
+	if (count != encoded->unencoded_len) {
+		/*
+		 * The write got truncated by generic_write_checks_common(). We
+		 * can't do a partial encoded write.
+		 */
+		return -EFBIG;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(generic_encoded_write_checks);
+
+ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
+{
+	if (!(iocb->ki_filp->f_flags & O_ENCODED))
+		return -EPERM;
+	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
+		return -EINVAL;
+	return iov_iter_count(iter) - sizeof(struct encoded_iov);
+}
+EXPORT_SYMBOL(check_encoded_read);
+
+int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
+			 struct iov_iter *from)
+{
+	if (!(iocb->ki_filp->f_flags & O_ENCODED))
+		return -EPERM;
+	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
+		return -EINVAL;
+	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
+		return -EFAULT;
+	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
+	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
+		return -EINVAL;
+	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
+	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
+		return -EINVAL;
+	if (encoded->unencoded_offset >= encoded->unencoded_len)
+		return -EINVAL;
+	return 0;
+}
+EXPORT_SYMBOL(import_encoded_write);
+
 /*
  * Performs necessary checks before doing a clone.
  *
-- 
2.23.0



* [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio()
  2019-10-15 18:42 [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (2 preceding siblings ...)
  2019-10-15 18:42 ` [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
@ 2019-10-15 18:42 ` Omar Sandoval
  2019-10-16  9:22   ` Nikolay Borisov
  2019-10-15 18:42 ` [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads Omar Sandoval
  2019-10-15 18:42 ` [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes Omar Sandoval
  5 siblings, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-15 18:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, Jann Horn, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

This isn't actually dio-specific; it just looks up the csums starting at
the given offset instead of using the page index. Rename it to
btrfs_lookup_bio_sums_at_offset() and add the dst parameter. We might
even want to expose __btrfs_lookup_bio_sums() as the public API instead
of having two trivial wrappers, but I'll leave that for another day.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h     |  5 +++--
 fs/btrfs/file-item.c | 18 +++++++++---------
 fs/btrfs/inode.c     |  4 ++--
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19d669d12ca1..71552b2ca340 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2791,8 +2791,9 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		    struct btrfs_fs_info *fs_info, u64 bytenr, u64 len);
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 				   u8 *dst);
-blk_status_t btrfs_lookup_bio_sums_dio(struct inode *inode, struct bio *bio,
-			      u64 logical_offset);
+blk_status_t btrfs_lookup_bio_sums_at_offset(struct inode *inode,
+					     struct bio *bio, u64 offset,
+					     u8 *dst);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1a599f50837b..d98f06fc2978 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -148,8 +148,9 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 logical_offset, u8 *dst, int dio)
+static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode,
+					    struct bio *bio,
+					    bool at_offset, u64 offset, u8 *dst)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio_vec bvec;
@@ -159,7 +160,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_path *path;
 	u8 *csum;
-	u64 offset = 0;
 	u64 item_start_offset = 0;
 	u64 item_last_offset = 0;
 	u64 disk_bytenr;
@@ -205,15 +205,13 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 	}
 
 	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
-	if (dio)
-		offset = logical_offset;
 
 	bio_for_each_segment(bvec, bio, iter) {
 		page_bytes_left = bvec.bv_len;
 		if (count)
 			goto next;
 
-		if (!dio)
+		if (!at_offset)
 			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
 		count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
 					       csum, nblocks);
@@ -291,12 +289,14 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 				   u8 *dst)
 {
-	return __btrfs_lookup_bio_sums(inode, bio, 0, dst, 0);
+	return __btrfs_lookup_bio_sums(inode, bio, false, 0, dst);
 }
 
-blk_status_t btrfs_lookup_bio_sums_dio(struct inode *inode, struct bio *bio, u64 offset)
+blk_status_t btrfs_lookup_bio_sums_at_offset(struct inode *inode,
+					     struct bio *bio, u64 offset,
+					     u8 *dst)
 {
-	return __btrfs_lookup_bio_sums(inode, bio, offset, NULL, 1);
+	return __btrfs_lookup_bio_sums(inode, bio, true, offset, dst);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0f2754eaa05b..8bce46122ef7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8319,8 +8319,8 @@ static inline blk_status_t btrfs_lookup_and_bind_dio_csum(struct inode *inode,
 	 * contention.
 	 */
 	if (dip->logical_offset == file_offset) {
-		ret = btrfs_lookup_bio_sums_dio(inode, dip->orig_bio,
-						file_offset);
+		ret = btrfs_lookup_bio_sums_at_offset(inode, dip->orig_bio,
+						      file_offset, NULL);
 		if (ret)
 			return ret;
 	}
-- 
2.23.0



* [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads
  2019-10-15 18:42 [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (3 preceding siblings ...)
  2019-10-15 18:42 ` [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio() Omar Sandoval
@ 2019-10-15 18:42 ` Omar Sandoval
  2019-10-16 11:10   ` Nikolay Borisov
  2019-10-15 18:42 ` [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes Omar Sandoval
  5 siblings, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-15 18:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, Jann Horn, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we indicate the size of the extent
   starting from the read position; we don't need to copy zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.

This initial implementation simplifies a few things that can be improved
in the future:

- We hold the inode lock during the operation.
- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- Cases 3 and 4 do not implement repair yet.
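
The four cases above can be modelled in user space roughly as follows. This is
a simplified sketch: struct extent_view, classify() and
bytes_until_extent_end() are hypothetical names, and the booleans stand in for
the extent map checks the patch performs (inline block start, hole or
preallocated flag, compressed flag).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum read_case { READ_INLINE, READ_HOLE, READ_REGULAR, READ_COMPRESSED };

/* Hypothetical, flattened view of the extent covering the read position. */
struct extent_view {
	bool is_inline;
	bool is_hole;
	bool is_prealloc;
	bool is_compressed;
	uint64_t start;	/* file offset where the extent begins */
	uint64_t len;	/* length of the extent in the file */
};

/* Same order of checks as the patch: inline, hole/prealloc, compressed, else regular. */
static enum read_case classify(const struct extent_view *em)
{
	if (em->is_inline)
		return READ_INLINE;
	if (em->is_hole || em->is_prealloc)
		return READ_HOLE;
	if (em->is_compressed)
		return READ_COMPRESSED;
	return READ_REGULAR;
}

/*
 * Length reported to the caller: from pos to the end of the extent, clamped
 * to EOF. The caller must guarantee em->start <= pos < i_size.
 */
static uint64_t bytes_until_extent_end(const struct extent_view *em,
				       uint64_t pos, uint64_t i_size)
{
	uint64_t end = em->start + em->len;

	if (end > i_size)
		end = i_size;
	return end - pos;
}
```

For a hole or preallocated extent only that length is returned and no data is
copied, which is what case 2 describes.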

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h |   2 +
 fs/btrfs/file.c  |  12 +-
 fs/btrfs/inode.c | 462 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 475 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 71552b2ca340..3b2aa1c7218c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2906,6 +2906,8 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+
 extern const struct dentry_operations btrfs_dentry_operations;
 
 /* ioctl.c */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 27e5b269e729..51740cee39fc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -390,6 +390,16 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EOPNOTSUPP;
+		return btrfs_encoded_read(iocb, iter);
+	}
+	return generic_file_read_iter(iocb, iter);
+}
+
 /* simple helper to fault in pages and copy.  This should go away
  * and be replaced with calls into generic code.
  */
@@ -3457,7 +3467,7 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8bce46122ef7..174d0738d2c9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10593,6 +10593,468 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
 	}
 }
 
+static int encoded_iov_compression_from_btrfs(struct encoded_iov *encoded,
+					      unsigned int compress_type)
+{
+	switch (compress_type) {
+	case BTRFS_COMPRESS_NONE:
+		encoded->compression = ENCODED_IOV_COMPRESSION_NONE;
+		break;
+	case BTRFS_COMPRESS_ZLIB:
+		encoded->compression = ENCODED_IOV_COMPRESSION_ZLIB;
+		break;
+	case BTRFS_COMPRESS_LZO:
+		encoded->compression = ENCODED_IOV_COMPRESSION_LZO;
+		break;
+	case BTRFS_COMPRESS_ZSTD:
+		encoded->compression = ENCODED_IOV_COMPRESSION_ZSTD;
+		break;
+	default:
+		return -EIO;
+	}
+	return 0;
+}
+
+static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
+					 struct iov_iter *iter, u64 start,
+					 u64 lockend,
+					 struct extent_state **cached_state,
+					 u64 extent_start, size_t count,
+					 struct encoded_iov *encoded,
+					 bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *item;
+	u64 ram_bytes;
+	unsigned long ptr;
+	void *tmp;
+	ssize_t ret;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
+				       btrfs_ino(BTRFS_I(inode)), extent_start,
+				       0);
+	if (ret) {
+		if (ret > 0) {
+			/* The extent item disappeared? */
+			ret = -EIO;
+		}
+		goto out;
+	}
+	leaf = path->nodes[0];
+	item = btrfs_item_ptr(leaf, path->slots[0],
+			      struct btrfs_file_extent_item);
+
+	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
+	ptr = btrfs_file_extent_inline_start(item);
+
+	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
+			iocb->ki_pos);
+	ret = encoded_iov_compression_from_btrfs(encoded,
+				 btrfs_file_extent_compression(leaf, item));
+	if (ret)
+		goto out;
+	if (encoded->compression) {
+		size_t inline_size;
+
+		inline_size = btrfs_file_extent_inline_item_len(leaf,
+						btrfs_item_nr(path->slots[0]));
+		if (inline_size > count) {
+			ret = -EFBIG;
+			goto out;
+		}
+		count = inline_size;
+		encoded->unencoded_len = ram_bytes;
+		encoded->unencoded_offset = iocb->ki_pos - extent_start;
+	} else {
+		encoded->len = encoded->unencoded_len = count =
+			min_t(u64, count, encoded->len);
+		ptr += iocb->ki_pos - extent_start;
+	}
+
+	tmp = kmalloc(count, GFP_NOFS);
+	if (!tmp) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	read_extent_buffer(leaf, tmp, ptr, count);
+	btrfs_free_path(path);
+	path = NULL;
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock(inode);
+	*unlocked = true;
+	if (copy_to_iter(encoded, sizeof(*encoded), iter) == sizeof(*encoded) &&
+	    copy_to_iter(tmp, count, iter) == count)
+		ret = count;
+	else
+		ret = -EFAULT;
+	kfree(tmp);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+struct btrfs_encoded_read_private {
+	struct inode *inode;
+	wait_queue_head_t wait;
+	atomic_t pending;
+	bool uptodate;
+	bool skip_csum;
+};
+
+static bool btrfs_encoded_read_check_csums(struct btrfs_io_bio *io_bio)
+{
+	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
+	struct inode *inode = priv->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u32 sectorsize = fs_info->sectorsize;
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+	u64 offset = 0;
+
+	if (priv->skip_csum)
+		return true;
+	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
+		unsigned int i, nr_sectors, pgoff;
+
+		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
+		pgoff = bvec->bv_offset;
+		for (i = 0; i < nr_sectors; i++) {
+			int csum_pos;
+
+			csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset);
+			if (__readpage_endio_check(inode, io_bio, csum_pos,
+						   bvec->bv_page, pgoff,
+						   io_bio->logical + offset,
+						   sectorsize))
+				return false;
+			offset += sectorsize;
+			pgoff += sectorsize;
+		}
+	}
+	return true;
+}
+
+static void btrfs_encoded_read_endio(struct bio *bio)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+
+	if (bio->bi_status || !btrfs_encoded_read_check_csums(io_bio))
+		priv->uptodate = false;
+	if (!atomic_dec_return(&priv->pending))
+		wake_up(&priv->wait);
+	btrfs_io_bio_free_csum(io_bio);
+	bio_put(bio);
+}
+
+static bool btrfs_submit_encoded_read(struct bio *bio)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct inode *inode = priv->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	blk_status_t status;
+
+	atomic_inc(&priv->pending);
+
+	if (!priv->skip_csum) {
+		status = btrfs_lookup_bio_sums_at_offset(inode, bio,
+							 btrfs_io_bio(bio)->logical,
+							 NULL);
+		if (status)
+			goto out;
+	}
+
+	status = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
+	if (status)
+		goto out;
+
+	status = btrfs_map_bio(fs_info, bio, 0, 0);
+out:
+	if (status) {
+		bio->bi_status = status;
+		bio_endio(bio);
+		return false;
+	}
+	return true;
+}
+
+static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
+					  struct iov_iter *iter,
+					  u64 start, u64 lockend,
+					  struct extent_state **cached_state,
+					  struct block_device *bdev,
+					  u64 offset, u64 disk_io_size,
+					  size_t count,
+					  const struct encoded_iov *encoded,
+					  bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_encoded_read_private priv = {
+		.inode = inode,
+		.wait = __WAIT_QUEUE_HEAD_INITIALIZER(priv.wait),
+		.pending = ATOMIC_INIT(1),
+		.uptodate = true,
+		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
+	};
+	struct page **pages;
+	unsigned long nr_pages, i;
+	struct bio *bio = NULL;
+	u64 cur;
+	size_t page_offset;
+	ssize_t ret;
+
+	nr_pages = (disk_io_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	i = 0;
+	cur = 0;
+	while (cur < disk_io_size) {
+		size_t bytes = min_t(u64, disk_io_size - cur,
+				     PAGE_SIZE);
+
+		if (!bio) {
+			bio = btrfs_bio_alloc(offset + cur);
+			bio_set_dev(bio, bdev);
+			bio->bi_end_io = btrfs_encoded_read_endio;
+			bio->bi_private = &priv;
+			bio->bi_opf = REQ_OP_READ;
+			btrfs_io_bio(bio)->logical = start + cur;
+		}
+
+		if (bio_add_page(bio, pages[i], bytes, 0) < bytes) {
+			bool success;
+
+			success = btrfs_submit_encoded_read(bio);
+			bio = NULL;
+			if (!success)
+				break;
+			continue;
+		}
+		i++;
+		cur += bytes;
+	}
+
+	if (bio)
+		btrfs_submit_encoded_read(bio);
+	if (atomic_dec_return(&priv.pending))
+		wait_event(priv.wait, !atomic_read(&priv.pending));
+	if (!priv.uptodate) {
+		ret = -EIO;
+		goto out;
+	}
+
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock(inode);
+	*unlocked = true;
+
+	if (copy_to_iter(encoded, sizeof(*encoded), iter) != sizeof(*encoded)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (encoded->compression) {
+		i = 0;
+		page_offset = 0;
+	} else {
+		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
+		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
+	}
+	cur = 0;
+	while (cur < count) {
+		size_t bytes = min_t(size_t, count - cur,
+				     PAGE_SIZE - page_offset);
+
+		if (copy_page_to_iter(pages[i], page_offset, bytes,
+				      iter) != bytes) {
+			ret = -EFAULT;
+			goto out;
+		}
+		i++;
+		cur += bytes;
+		page_offset = 0;
+	}
+	ret = count;
+out:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			put_page(pages[i]);
+	}
+	kfree(pages);
+	return ret;
+}
+
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	ssize_t ret;
+	size_t count;
+	struct block_device *em_bdev;
+	u64 start, lockend, offset, disk_io_size;
+	struct extent_state *cached_state = NULL;
+	struct extent_map *em;
+	struct encoded_iov encoded = {};
+	bool unlocked = false;
+
+	ret = check_encoded_read(iocb, iter);
+	if (ret < 0)
+		return ret;
+	if (ret == 0) {
+empty:
+		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
+		    sizeof(encoded))
+			return 0;
+		else
+			return -EFAULT;
+	}
+	count = ret;
+
+	file_accessed(iocb->ki_filp);
+
+	inode_lock(inode);
+
+	if (iocb->ki_pos >= inode->i_size) {
+		inode_unlock(inode);
+		goto empty;
+	}
+	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
+	/*
+	 * We don't know how long the extent containing iocb->ki_pos is, but if
+	 * it's compressed we know that it won't be longer than this.
+	 */
+	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start,
+					       lockend - start + 1);
+		if (ret)
+			goto out_unlock_inode;
+		lock_extent_bits(io_tree, start, lockend, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     lockend - start + 1);
+		if (!ordered)
+			break;
+		btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		cond_resched();
+	}
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
+			      lockend - start + 1, 0);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_unlock_extent;
+	}
+	em_bdev = em->bdev;
+
+	if (em->block_start == EXTENT_MAP_INLINE) {
+		u64 extent_start = em->start;
+
+		/*
+		 * For inline extents we get everything we need out of the
+		 * extent item.
+		 */
+		free_extent_map(em);
+		em = NULL;
+		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
+						&cached_state, extent_start,
+						count, &encoded, &unlocked);
+		goto out;
+	}
+
+	/*
+	 * We only want to return up to EOF even if the extent extends beyond
+	 * that.
+	 */
+	encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) -
+		       iocb->ki_pos);
+	if (em->block_start == EXTENT_MAP_HOLE ||
+	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
+		offset = EXTENT_MAP_HOLE;
+	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+		offset = em->block_start;
+		/*
+		 * Bail if the buffer isn't large enough to return the whole
+		 * compressed extent.
+		 */
+		if (em->block_len > count) {
+			ret = -EFBIG;
+			goto out_em;
+		}
+		disk_io_size = count = em->block_len;
+		encoded.unencoded_len = em->ram_bytes;
+		encoded.unencoded_offset = iocb->ki_pos - em->orig_start;
+		ret = encoded_iov_compression_from_btrfs(&encoded,
+							 em->compress_type);
+		if (ret)
+			goto out_em;
+	} else {
+		offset = em->block_start + (start - em->start);
+		if (encoded.len > count)
+			encoded.len = count;
+		/*
+		 * Don't read beyond what we locked. This also limits the page
+		 * allocations that we'll do.
+		 */
+		disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start;
+		encoded.len = encoded.unencoded_len = count =
+			start + disk_io_size - iocb->ki_pos;
+		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
+	}
+	free_extent_map(em);
+	em = NULL;
+
+	if (offset == EXTENT_MAP_HOLE) {
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		inode_unlock(inode);
+		unlocked = true;
+		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
+		    sizeof(encoded))
+			ret = 0;
+		else
+			ret = -EFAULT;
+	} else {
+		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
+						 &cached_state, em_bdev, offset,
+						 disk_io_size, count, &encoded,
+						 &unlocked);
+	}
+
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded.len;
+out_em:
+	free_extent_map(em);
+out_unlock_extent:
+	if (!unlocked)
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+out_unlock_inode:
+	if (!unlocked)
+		inode_unlock(inode);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
-- 
2.23.0



* [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
  2019-10-15 18:42 [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Omar Sandoval
                   ` (4 preceding siblings ...)
  2019-10-15 18:42 ` [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads Omar Sandoval
@ 2019-10-15 18:42 ` Omar Sandoval
  2019-10-16 10:44   ` Nikolay Borisov
  5 siblings, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-15 18:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, Jann Horn, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback.

Now that read and write are implemented, this also sets the
FMODE_ENCODED_IO flag in btrfs_file_open().
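
Before any of that happens, the write path rejects requests it cannot
represent. A user-space sketch of those size and alignment checks, as they
appear in the diff below, might look like this. struct write_req and
encoded_write_ok() are hypothetical names, the 128 KiB limit is an assumed
value for both BTRFS_MAX_COMPRESSED and BTRFS_MAX_UNCOMPRESSED, and the
compression-type check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed limit for both the compressed and the uncompressed extent size. */
#define MAX_EXTENT_BYTES (128u * 1024u)

static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) / a * a;
}

/* Hypothetical summary of one encoded write request. */
struct write_req {
	uint64_t pos;			/* file offset of the write */
	uint64_t len;			/* bytes of file data covered */
	uint64_t unencoded_len;		/* decompressed length */
	uint64_t unencoded_offset;	/* offset into the decompressed data */
	uint64_t compressed_bytes;	/* compressed payload supplied */
};

static bool encoded_write_ok(const struct write_req *r, uint64_t i_size,
			     uint64_t sectorsize)
{
	uint64_t disk_bytes, file_bytes;

	/* No bookend extents: the write must cover the whole decompressed data. */
	if (r->unencoded_len != r->len || r->unencoded_offset != 0)
		return false;
	if (r->unencoded_len > MAX_EXTENT_BYTES ||
	    r->compressed_bytes > MAX_EXTENT_BYTES || r->compressed_bytes == 0)
		return false;
	/* On-disk data is padded with zeroes up to a sector boundary. */
	disk_bytes = align_up(r->compressed_bytes, sectorsize);
	if (r->pos % sectorsize != 0)
		return false;
	/* Only a write that ends at or extends EOF may have an unaligned length. */
	if (r->pos + r->len >= i_size)
		file_bytes = align_up(r->len, sectorsize);
	else if (r->len % sectorsize != 0)
		return false;
	else
		file_bytes = r->len;
	/* Compression must actually save at least one sector. */
	return disk_bytes < file_bytes;
}
```

The last check is also what implicitly rules out data small enough for an
inline extent, as the comment in the patch notes.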

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/compression.c |   6 +-
 fs/btrfs/compression.h |   5 +-
 fs/btrfs/ctree.h       |   2 +
 fs/btrfs/file.c        |  40 +++++++--
 fs/btrfs/inode.c       | 197 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 237 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b05b361e2062..6632dd8d2e4d 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
 			bio->bi_status == BLK_STS_OK);
 	cb->compressed_pages[0]->mapping = NULL;
 
-	end_compressed_writeback(inode, cb);
+	if (cb->writeback)
+		end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
 
 	/*
@@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				 unsigned long compressed_len,
 				 struct page **compressed_pages,
 				 unsigned long nr_pages,
-				 unsigned int write_flags)
+				 unsigned int write_flags, bool writeback)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio *bio = NULL;
@@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	cb->mirror_num = 0;
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
+	cb->writeback = writeback;
 	cb->orig_bio = NULL;
 	cb->nr_pages = nr_pages;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 4cb8be9ff88b..d4176384ec15 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -47,6 +47,9 @@ struct compressed_bio {
 	/* the compression algorithm for this bio */
 	int compress_type;
 
+	/* Whether this is a write for writeback. */
+	bool writeback;
+
 	/* number of compressed pages in the array */
 	unsigned long nr_pages;
 
@@ -93,7 +96,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				  unsigned long compressed_len,
 				  struct page **compressed_pages,
 				  unsigned long nr_pages,
-				  unsigned int write_flags);
+				  unsigned int write_flags, bool writeback);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3b2aa1c7218c..9e1719e82cc8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2907,6 +2907,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			    struct encoded_iov *encoded);
 
 extern const struct dentry_operations btrfs_dentry_operations;
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 51740cee39fc..8de6ac9b4b9c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1893,8 +1893,7 @@ static void update_time_for_write(struct inode *inode)
 		inode_inc_iversion(inode);
 }
 
-static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -1904,14 +1903,22 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	u64 end_pos;
 	ssize_t num_written = 0;
 	const bool sync = iocb->ki_flags & IOCB_DSYNC;
+	struct encoded_iov encoded;
 	ssize_t err;
 	loff_t pos;
 	size_t count;
 	loff_t oldsize;
 	int clean_page = 0;
 
-	if (!(iocb->ki_flags & IOCB_DIRECT) &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		err = import_encoded_write(iocb, &encoded, from);
+		if (err)
+			return err;
+	}
+
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+	    (!(iocb->ki_flags & IOCB_DIRECT) ||
+	     (iocb->ki_flags & IOCB_ENCODED)))
 		return -EOPNOTSUPP;
 
 	if (!inode_trylock(inode)) {
@@ -1920,14 +1927,27 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		inode_lock(inode);
 	}
 
-	err = generic_write_checks(iocb, from);
-	if (err <= 0) {
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		err = generic_encoded_write_checks(iocb, &encoded);
+		if (err) {
+			inode_unlock(inode);
+			return err;
+		}
+		count = encoded.len;
+	} else {
+		err = generic_write_checks(iocb, from);
+		if (err < 0) {
+			inode_unlock(inode);
+			return err;
+		}
+		count = iov_iter_count(from);
+	}
+	if (count == 0) {
 		inode_unlock(inode);
 		return err;
 	}
 
 	pos = iocb->ki_pos;
-	count = iov_iter_count(from);
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		/*
 		 * We will allocate space in case nodatacow is not set,
@@ -1986,7 +2006,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		num_written = btrfs_encoded_write(iocb, from, &encoded);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
@@ -3461,7 +3483,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_ENCODED_IO;
 	return generic_file_open(inode, filp);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 174d0738d2c9..bcc5a2bed22b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.objectid,
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
-				    async_chunk->write_flags)) {
+				    async_chunk->write_flags, true)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -11055,6 +11055,201 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
+ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			    struct encoded_iov *encoded)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_state *cached_state = NULL;
+	int compression;
+	size_t orig_count;
+	u64 disk_num_bytes, num_bytes;
+	u64 start, end;
+	unsigned long nr_pages, i;
+	struct page **pages;
+	struct btrfs_key ins;
+	struct extent_map *em;
+	ssize_t ret;
+
+	switch (encoded->compression) {
+	case ENCODED_IOV_COMPRESSION_ZLIB:
+		compression = BTRFS_COMPRESS_ZLIB;
+		break;
+	case ENCODED_IOV_COMPRESSION_LZO:
+		compression = BTRFS_COMPRESS_LZO;
+		break;
+	case ENCODED_IOV_COMPRESSION_ZSTD:
+		compression = BTRFS_COMPRESS_ZSTD;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	disk_num_bytes = orig_count = iov_iter_count(from);
+
+	/* For now, it's too hard to support bookend extents. */
+	if (encoded->unencoded_len != encoded->len ||
+	    encoded->unencoded_offset != 0)
+		return -EINVAL;
+
+	/* The extent size must be sane. */
+	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
+	    disk_num_bytes > BTRFS_MAX_COMPRESSED || disk_num_bytes == 0)
+		return -EINVAL;
+
+	/*
+	 * The compressed data on disk must be sector-aligned. For convenience,
+	 * we extend it with zeroes if it isn't.
+	 */
+	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
+
+	/*
+	 * The extent in the file must also be sector-aligned. However, we allow
+	 * a write which ends at or extends i_size to have an unaligned length;
+	 * we round up the extent size and set i_size to the given length.
+	 */
+	start = iocb->ki_pos;
+	if (!IS_ALIGNED(start, fs_info->sectorsize))
+		return -EINVAL;
+	if (start + encoded->len >= inode->i_size) {
+		num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
+	} else {
+		num_bytes = encoded->len;
+		if (!IS_ALIGNED(num_bytes, fs_info->sectorsize))
+			return -EINVAL;
+	}
+
+	/*
+	 * It's valid to have compressed data which is larger than or the same
+	 * size as the decompressed data. However, for buffered I/O, we fall
+	 * back to writing the decompressed data if compression didn't shrink
+	 * it. So, for now, let's not allow creating such extents.
+	 *
+	 * Note that for now this also implicitly prevents writing data that
+	 * would fit in an inline extent.
+	 */
+	if (disk_num_bytes >= num_bytes)
+		return -EINVAL;
+
+	end = start + num_bytes - 1;
+
+	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_USER);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_HIGHUSER);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		if (copy_from_iter(kaddr, bytes, from) != bytes) {
+			kunmap(pages[i]);
+			ret = -EFAULT;
+			goto out_pages;
+		}
+		if (bytes < PAGE_SIZE)
+			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     end - start + 1);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, end, &cached_state);
+		cond_resched();
+	}
+
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
+					   num_bytes);
+	if (ret)
+		goto out_unlock;
+
+	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+
+	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
+			  ins.offset, ins.offset, num_bytes, compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserve;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
+						num_bytes, ins.offset,
+						BTRFS_ORDERED_COMPRESSED,
+						compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserve;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + encoded->len > inode->i_size)
+		i_size_write(inode, start + encoded->len);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
+
+	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
+					  ins.offset, pages, nr_pages, 0,
+					  false)) {
+		struct page *page = pages[0];
+
+		page->mapping = inode->i_mapping;
+		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
+		page->mapping = NULL;
+		ret = -EIO;
+		goto out_pages;
+	}
+	iocb->ki_pos += encoded->len;
+	return orig_count;
+
+out_free_reserve:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
+				     true);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			put_page(pages[i]);
+	}
+	kvfree(pages);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
-- 
2.23.0



* Re: [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio()
  2019-10-15 18:42 ` [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio() Omar Sandoval
@ 2019-10-16  9:22   ` Nikolay Borisov
  2019-10-18 22:19     ` Omar Sandoval
  0 siblings, 1 reply; 42+ messages in thread
From: Nikolay Borisov @ 2019-10-16  9:22 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs, linux-fsdevel
  Cc: kernel-team, Dave Chinner, Jann Horn, linux-api



On 15.10.19 at 21:42, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This isn't actually dio-specific; it just looks up the csums starting at
> the given offset instead of using the page index. Rename it to
> btrfs_lookup_bio_sums_at_offset() and add the dst parameter. We might
> even want to expose __btrfs_lookup_bio_sums() as the public API instead
> of having two trivial wrappers, but I'll leave that for another day.

IMO exposing btrfs_lookup_bio_sums and adding proper kernel doc for its
parameters is the correct way forward. Consider doing this if the
general direction of this patchset is accepted and before sending the
final revision.




* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-15 18:42 ` [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
@ 2019-10-16  9:50   ` Nikolay Borisov
  2019-10-18 22:19     ` Omar Sandoval
  2019-10-19  5:01   ` Aleksa Sarai
  2019-10-21 18:28   ` Darrick J. Wong
  2 siblings, 1 reply; 42+ messages in thread
From: Nikolay Borisov @ 2019-10-16  9:50 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs, linux-fsdevel
  Cc: kernel-team, Dave Chinner, Jann Horn, linux-api



On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
> 
> A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data

In the future, when encryption is also supported, what should be the
mechanism to enforce the ordering of encoding operations, i.e. first compress
then encrypt => uncoded_len should be the resulting size after the
encrypt operation? To me (not being a cryptographer) this seems the
sensible thing to do, since compression will be effective that way.
However, what if, for whatever reason, a different filesystem wants to
support this interface but chooses to do it the other way around, i.e.
encrypt, then compress?

> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future. The remaining iovecs contain
> the encoded extent.
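
For illustration, the iovec layout described above could be assembled in
userspace roughly as follows. This is a minimal sketch, not code from the
patch: struct encoded_iov and the compression constants are re-declared
locally to mirror the definitions proposed in this RFC, and
fill_encoded_iovecs() is a hypothetical helper name. The pwritev2() call
itself is left out because it needs a kernel carrying this series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local mirror of the proposed uapi definitions (assumption: layout as in this RFC). */
enum {
	ENCODED_IOV_COMPRESSION_NONE,
	ENCODED_IOV_COMPRESSION_ZLIB,
	ENCODED_IOV_COMPRESSION_LZO,
	ENCODED_IOV_COMPRESSION_ZSTD,
};

struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/*
 * Hypothetical helper: describe one whole compressed extent in hdr and lay
 * out the iovecs as the commit message describes. iov[0] carries the
 * metadata, iov[1] carries the encoded data.
 */
void fill_encoded_iovecs(struct iovec iov[2], struct encoded_iov *hdr,
			 void *data, size_t data_len,
			 uint64_t unencoded_len, uint32_t compression)
{
	memset(hdr, 0, sizeof(*hdr));
	/*
	 * Whole extent: the file range covers every decompressed byte. The
	 * btrfs write path in patch 5 only accepts len == unencoded_len with
	 * unencoded_offset == 0.
	 */
	hdr->len = unencoded_len;
	hdr->unencoded_len = unencoded_len;
	hdr->unencoded_offset = 0;
	hdr->compression = compression;
	hdr->encryption = 0; /* no encryption types are defined in this RFC */

	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr); /* must equal sizeof(struct encoded_iov) */
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}
```

With the series applied, the two iovecs would then be passed to pwritev2()
with RWF_ENCODED set, on a file descriptor that has O_ENCODED.
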
> 
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  include/linux/fs.h      | 14 +++++++
>  include/uapi/linux/fs.h | 26 ++++++++++++-
>  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
>  3 files changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..54681f21e05e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
>  
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> +
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
>   * that indicates that they should check the contents of the iovec are
> @@ -314,6 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC		(1 << 5)
>  #define IOCB_WRITE		(1 << 6)
>  #define IOCB_NOWAIT		(1 << 7)
> +#define IOCB_ENCODED		(1 << 8)
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
>  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
>  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> +				struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>  				struct file *file_out, loff_t pos_out,
>  				loff_t *count, unsigned int remap_flags);
> @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>  			return -EOPNOTSUPP;
>  		ki->ki_flags |= IOCB_NOWAIT;
>  	}
> +	if (flags & RWF_ENCODED) {
> +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> +			return -EOPNOTSUPP;
> +		ki->ki_flags |= IOCB_ENCODED;
> +	}
>  	if (flags & RWF_HIPRI)
>  		ki->ki_flags |= IOCB_HIPRI;
>  	if (flags & RWF_DSYNC)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..ed92a8a257cb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -284,6 +284,27 @@ struct fsxattr {
>  
>  typedef int __bitwise __kernel_rwf_t;
>  
> +enum {
> +	ENCODED_IOV_COMPRESSION_NONE,
> +	ENCODED_IOV_COMPRESSION_ZLIB,
> +	ENCODED_IOV_COMPRESSION_LZO,
> +	ENCODED_IOV_COMPRESSION_ZSTD,
> +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> +};
> +
> +enum {
> +	ENCODED_IOV_ENCRYPTION_NONE,
> +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> +};
> +
> +struct encoded_iov {
> +	__u64 len;
> +	__u64 unencoded_len;
> +	__u64 unencoded_offset;
> +	__u32 compression;
> +	__u32 encryption;
> +};
> +
>  /* high priority request, poll if possible */
>  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
>  
> @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
>  
> +/* encoded (e.g., compressed or encrypted) IO */

nit: s/or/and\/or/, or are the two mutually exclusive?

> +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -			 RWF_APPEND)
> +			 RWF_APPEND | RWF_ENCODED)
>  
>  #endif /* _UAPI_LINUX_FS_H */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1146fcfa3215..d2e6d9caf353 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
>  	return 0;
>  }
>  
> -/*
> - * Performs necessary checks before doing a write
> - *
> - * Can adjust writing position or amount of bytes to write.
> - * Returns appropriate error code that caller should return or
> - * zero in case that write should be allowed.
> - */
> -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
> -	loff_t count;
> -	int ret;
>  
>  	if (IS_SWAPFILE(inode))
>  		return -ETXTBSY;
>  
> -	if (!iov_iter_count(from))
> +	if (!*count)
>  		return 0;
>  
>  	/* FIXME: this is for backwards compatibility with 2.4 */
> @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
>  		return -EINVAL;
>  
> -	count = iov_iter_count(from);
> -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> +}
> +
> +/*
> + * Performs necessary checks before doing a write
> + *
> + * Can adjust writing position or amount of bytes to write.
> + * Returns a negative errno or the new number of bytes to write.
> + */
> +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	loff_t count = iov_iter_count(from);
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
>  	if (ret)
>  		return ret;
>  
> @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  }
>  EXPORT_SYMBOL(generic_write_checks);
>  
> +int generic_encoded_write_checks(struct kiocb *iocb,
> +				 struct encoded_iov *encoded)
> +{
> +	loff_t count = encoded->unencoded_len;
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);

That's a bit confusing. You will only ever write encoded len bytes, yet
you check the unencoded len. Presumably that's to ensure the data can be
read back successfully? Still, it feels a bit odd. IMO this warrants a
comment.

When you use this function in patch 5, all the checks are performed
against unencoded_len, yet you do:

count = encoded.len;
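
One way to read the quoted check: the limit helper may shorten the count,
and the comment in the patch says a partial encoded write is not possible,
so any shortening is turned into -EFBIG. The following is a simplified
userspace model of that behaviour only. The function names are
hypothetical, and a single "limit" parameter stands in for the file size
limits the kernel helper actually consults.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified stand-in for the limit clamp: shortens *count to fit below limit. */
int model_check_limits(uint64_t pos, uint64_t *count, uint64_t limit)
{
	if (pos >= limit)
		return -EFBIG;
	if (*count > limit - pos)
		*count = limit - pos;
	return 0;
}

/* Model of the encoded check: a clamped count means the extent would be cut short. */
int model_encoded_write_checks(uint64_t pos, uint64_t unencoded_len,
			       uint64_t limit)
{
	uint64_t count = unencoded_len;
	int ret = model_check_limits(pos, &count, limit);

	if (ret)
		return ret;
	if (count != unencoded_len)
		return -EFBIG; /* all or nothing: no partial encoded write */
	return 0;
}
```
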

> +	if (ret)
> +		return ret;
> +
> +	if (count != encoded->unencoded_len) {
> +		/*
> +		 * The write got truncated by generic_write_checks_common(). We
> +		 * can't do a partial encoded write.
> +		 */
> +		return -EFBIG;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(generic_encoded_write_checks);
> +
> +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> +		return -EINVAL;
> +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> +}
> +EXPORT_SYMBOL(check_encoded_read);
> +
> +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,

nit: This might be just me, but 'import' doesn't sound right. How about
parse_encoded_write?


> +			 struct iov_iter *from)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> +		return -EINVAL;
> +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> +		return -EFAULT;
> +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> +		return -EINVAL;
> +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> +		return -EINVAL;
> +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> +		return -EINVAL;
> +	return 0;
> +}
> +EXPORT_SYMBOL(import_encoded_write);
> +
>  /*
>   * Performs necessary checks before doing a clone.
>   *
> 
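
The metadata checks in the quoted import_encoded_write() can be exercised
outside the kernel. Below is a hedged userspace port of just those checks.
validate_encoded_iov() is a hypothetical name, the struct and constants are
local mirrors of the definitions in this RFC, and the O_ENCODED permission
check and the copy from the iterator are deliberately left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the proposed uapi definitions (assumption: as in this RFC). */
enum {
	ENCODED_IOV_COMPRESSION_NONE,
	ENCODED_IOV_COMPRESSION_ZLIB,
	ENCODED_IOV_COMPRESSION_LZO,
	ENCODED_IOV_COMPRESSION_ZSTD,
	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
};

enum {
	ENCODED_IOV_ENCRYPTION_NONE,
	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
};

struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/*
 * Hypothetical userspace port of the metadata checks. first_seg_len is the
 * length of iov[0] as supplied by the caller.
 */
int validate_encoded_iov(const struct encoded_iov *e, size_t first_seg_len)
{
	/* The first segment must be exactly the metadata struct. */
	if (first_seg_len != sizeof(*e))
		return -EINVAL;
	/* An "encoded" write with neither compression nor encryption is rejected. */
	if (e->compression == ENCODED_IOV_COMPRESSION_NONE &&
	    e->encryption == ENCODED_IOV_ENCRYPTION_NONE)
		return -EINVAL;
	/* Unknown algorithm identifiers are rejected. */
	if (e->compression > ENCODED_IOV_COMPRESSION_TYPES ||
	    e->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
		return -EINVAL;
	/* The used subrange must start inside the unencoded data. */
	if (e->unencoded_offset >= e->unencoded_len)
		return -EINVAL;
	return 0;
}
```
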


* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
  2019-10-15 18:42 ` [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes Omar Sandoval
@ 2019-10-16 10:44   ` Nikolay Borisov
  2019-10-18 22:55     ` Omar Sandoval
  0 siblings, 1 reply; 42+ messages in thread
From: Nikolay Borisov @ 2019-10-16 10:44 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs, linux-fsdevel
  Cc: kernel-team, Dave Chinner, Jann Horn, linux-api



On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The implementation resembles direct I/O: we have to flush any ordered
> extents, invalidate the page cache, and do the io tree/delalloc/extent
> map/ordered extent dance. From there, we can reuse the compression code
> with a minor modification to distinguish the write from writeback.
> 
> Now that read and write are implemented, this also sets the
> FMODE_ENCODED_IO flag in btrfs_file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/compression.c |   6 +-
>  fs/btrfs/compression.h |   5 +-
>  fs/btrfs/ctree.h       |   2 +
>  fs/btrfs/file.c        |  40 +++++++--
>  fs/btrfs/inode.c       | 197 ++++++++++++++++++++++++++++++++++++++++-
>  5 files changed, 237 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index b05b361e2062..6632dd8d2e4d 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
>  			bio->bi_status == BLK_STS_OK);
>  	cb->compressed_pages[0]->mapping = NULL;
>  
> -	end_compressed_writeback(inode, cb);
> +	if (cb->writeback)
> +		end_compressed_writeback(inode, cb);
>  	/* note, our inode could be gone now */
>  
>  	/*
> @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  				 unsigned long compressed_len,
>  				 struct page **compressed_pages,
>  				 unsigned long nr_pages,
> -				 unsigned int write_flags)
> +				 unsigned int write_flags, bool writeback)

I don't see this function being called with true in this patch set,
meaning it essentially eliminates the end_compressed_writeback call in
end_compressed_bio_write? Am I missing anything?

>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct bio *bio = NULL;
> @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  	cb->mirror_num = 0;
>  	cb->compressed_pages = compressed_pages;
>  	cb->compressed_len = compressed_len;
> +	cb->writeback = writeback;
>  	cb->orig_bio = NULL;
>  	cb->nr_pages = nr_pages;
>  
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 4cb8be9ff88b..d4176384ec15 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -47,6 +47,9 @@ struct compressed_bio {
>  	/* the compression algorithm for this bio */
>  	int compress_type;
>  
> +	/* Whether this is a write for writeback. */
> +	bool writeback;
> +
>  	/* number of compressed pages in the array */
>  	unsigned long nr_pages;
>  
> @@ -93,7 +96,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  				  unsigned long compressed_len,
>  				  struct page **compressed_pages,
>  				  unsigned long nr_pages,
> -				  unsigned int write_flags);
> +				  unsigned int write_flags, bool writeback);
>  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  				 int mirror_num, unsigned long bio_flags);
>  
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 3b2aa1c7218c..9e1719e82cc8 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2907,6 +2907,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>  					  u64 end, int uptodate);
>  ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> +			    struct encoded_iov *encoded);
>  
>  extern const struct dentry_operations btrfs_dentry_operations;
>  
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 51740cee39fc..8de6ac9b4b9c 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1893,8 +1893,7 @@ static void update_time_for_write(struct inode *inode)
>  		inode_inc_iversion(inode);
>  }
>  
> -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> -				    struct iov_iter *from)
> +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
> @@ -1904,14 +1903,22 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	u64 end_pos;
>  	ssize_t num_written = 0;
>  	const bool sync = iocb->ki_flags & IOCB_DSYNC;
> +	struct encoded_iov encoded;
>  	ssize_t err;
>  	loff_t pos;
>  	size_t count;
>  	loff_t oldsize;
>  	int clean_page = 0;
>  
> -	if (!(iocb->ki_flags & IOCB_DIRECT) &&
> -	    (iocb->ki_flags & IOCB_NOWAIT))
> +	if (iocb->ki_flags & IOCB_ENCODED) {
> +		err = import_encoded_write(iocb, &encoded, from);
> +		if (err)
> +			return err;
> +	}
> +
> +	if ((iocb->ki_flags & IOCB_NOWAIT) &&
> +	    (!(iocb->ki_flags & IOCB_DIRECT) ||
> +	     (iocb->ki_flags & IOCB_ENCODED)))
>  		return -EOPNOTSUPP;
>  
>  	if (!inode_trylock(inode)) {
> @@ -1920,14 +1927,27 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  		inode_lock(inode);
>  	}
>  
> -	err = generic_write_checks(iocb, from);
> -	if (err <= 0) {
> +	if (iocb->ki_flags & IOCB_ENCODED) {
> +		err = generic_encoded_write_checks(iocb, &encoded);
> +		if (err) {
> +			inode_unlock(inode);
> +			return err;
> +		}
> +		count = encoded.len;
> +	} else {
> +		err = generic_write_checks(iocb, from);
> +		if (err < 0) {
> +			inode_unlock(inode);
> +			return err;
> +		}
> +		count = iov_iter_count(from);
> +	}
> +	if (count == 0) {
>  		inode_unlock(inode);
>  		return err;
>  	}
>  
>  	pos = iocb->ki_pos;
> -	count = iov_iter_count(from);
>  	if (iocb->ki_flags & IOCB_NOWAIT) {
>  		/*
>  		 * We will allocate space in case nodatacow is not set,
> @@ -1986,7 +2006,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	if (sync)
>  		atomic_inc(&BTRFS_I(inode)->sync_writers);
>  
> -	if (iocb->ki_flags & IOCB_DIRECT) {
> +	if (iocb->ki_flags & IOCB_ENCODED) {
> +		num_written = btrfs_encoded_write(iocb, from, &encoded);
> +	} else if (iocb->ki_flags & IOCB_DIRECT) {
>  		num_written = __btrfs_direct_write(iocb, from);
>  	} else {
>  		num_written = btrfs_buffered_write(iocb, from);
> @@ -3461,7 +3483,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
>  
>  static int btrfs_file_open(struct inode *inode, struct file *filp)
>  {
> -	filp->f_mode |= FMODE_NOWAIT;
> +	filp->f_mode |= FMODE_NOWAIT | FMODE_ENCODED_IO;
>  	return generic_file_open(inode, filp);
>  }
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 174d0738d2c9..bcc5a2bed22b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>  				    ins.objectid,
>  				    ins.offset, async_extent->pages,
>  				    async_extent->nr_pages,
> -				    async_chunk->write_flags)) {
> +				    async_chunk->write_flags, true)) {
>  			struct page *p = async_extent->pages[0];
>  			const u64 start = async_extent->start;
>  			const u64 end = start + async_extent->ram_size - 1;
> @@ -11055,6 +11055,201 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
>  	return ret;
>  }
>  
> +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> +			    struct encoded_iov *encoded)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct extent_changeset *data_reserved = NULL;
> +	struct extent_state *cached_state = NULL;
> +	int compression;
> +	size_t orig_count;
> +	u64 disk_num_bytes, num_bytes;
> +	u64 start, end;
> +	unsigned long nr_pages, i;
> +	struct page **pages;
> +	struct btrfs_key ins;
> +	struct extent_map *em;
> +	ssize_t ret;
> +
> +	switch (encoded->compression) {
> +	case ENCODED_IOV_COMPRESSION_ZLIB:
> +		compression = BTRFS_COMPRESS_ZLIB;
> +		break;
> +	case ENCODED_IOV_COMPRESSION_LZO:
> +		compression = BTRFS_COMPRESS_LZO;
> +		break;
> +	case ENCODED_IOV_COMPRESSION_ZSTD:
> +		compression = BTRFS_COMPRESS_ZSTD;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	disk_num_bytes = orig_count = iov_iter_count(from);
> +
> +	/* For now, it's too hard to support bookend extents. */
> +	if (encoded->unencoded_len != encoded->len ||
> +	    encoded->unencoded_offset != 0)
> +		return -EINVAL;
> +
> +	/* The extent size must be sane. */
> +	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
> +	    disk_num_bytes > BTRFS_MAX_COMPRESSED || disk_num_bytes == 0)
> +		return -EINVAL;
> +
> +	/*
> +	 * The compressed data on disk must be sector-aligned. For convenience,
> +	 * we extend it with zeroes if it isn't.
> +	 */
> +	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
> +
> +	/*
> +	 * The extent in the file must also be sector-aligned. However, we allow
> +	 * a write which ends at or extends i_size to have an unaligned length;
> +	 * we round up the extent size and set i_size to the given length.
> +	 */
> +	start = iocb->ki_pos;
> +	if (!IS_ALIGNED(start, fs_info->sectorsize))
> +		return -EINVAL;
> +	if (start + encoded->len >= inode->i_size) {
> +		num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
> +	} else {
> +		num_bytes = encoded->len;
> +		if (!IS_ALIGNED(num_bytes, fs_info->sectorsize))
> +			return -EINVAL;
> +	}
> +
> +	/*
> +	 * It's valid to have compressed data which is larger than or the same
> +	 * size as the decompressed data. However, for buffered I/O, we fall
> +	 * back to writing the decompressed data if compression didn't shrink
> +	 * it. So, for now, let's not allow creating such extents.
> +	 *
> +	 * Note that for now this also implicitly prevents writing data that
> +	 * would fit in an inline extent.
> +	 */
> +	if (disk_num_bytes >= num_bytes)
> +		return -EINVAL;
> +
> +	end = start + num_bytes - 1;
> +
> +	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;

nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)
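
The size arithmetic quoted above, together with the DIV_ROUND_UP form
suggested in the nit, can be modeled in userspace as follows. This is a
sketch under stated assumptions: the MODEL_* macros are local stand-ins for
the kernel's ALIGN(), DIV_ROUND_UP() and PAGE_SIZE (4 KiB pages are
assumed), model_encoded_write_pages() is a hypothetical name, and the
BTRFS_MAX_COMPRESSED / BTRFS_MAX_UNCOMPRESSED bounds are left out.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE (UINT64_C(1) << MODEL_PAGE_SHIFT)
/* Round x up to a multiple of a; a must be a power of two. */
#define MODEL_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))
#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Returns the number of pages needed to hold the compressed data, or a
 * negative errno if the sizes would be rejected by the quoted checks.
 */
int64_t model_encoded_write_pages(uint64_t pos, uint64_t len,
				  uint64_t disk_bytes, uint64_t i_size,
				  uint64_t sectorsize)
{
	uint64_t num_bytes;

	if (disk_bytes == 0)
		return -EINVAL;
	/* Compressed data on disk is padded up to a sector boundary. */
	disk_bytes = MODEL_ALIGN(disk_bytes, sectorsize);

	/* The file offset must be sector-aligned. */
	if (pos % sectorsize)
		return -EINVAL;
	if (pos + len >= i_size) {
		/* Writes ending at or past i_size may have an unaligned length. */
		num_bytes = MODEL_ALIGN(len, sectorsize);
	} else {
		num_bytes = len;
		if (num_bytes % sectorsize)
			return -EINVAL;
	}

	/* Extents that did not shrink under compression are rejected. */
	if (disk_bytes >= num_bytes)
		return -EINVAL;

	/* Same value as (disk_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT. */
	return (int64_t)MODEL_DIV_ROUND_UP(disk_bytes, MODEL_PAGE_SIZE);
}
```

For any page size that is a power of two, the shift form in the patch and
the DIV_ROUND_UP form give the same result; the macro mainly states the
intent more directly.
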

> +	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_USER);

This could be a simple GFP_KERNEL allocation.

> +	if (!pages)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_pages; i++) {
> +		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
> +		char *kaddr;
> +
> +		pages[i] = alloc_page(GFP_HIGHUSER);

Why GFP_HIGHUSER? You are reading from userspace, not writing to it. A
plain NOFS allocation should suffice (of course using the newer
memalloc_nofs_save API)?


> +		if (!pages[i]) {
> +			ret = -ENOMEM;
> +			goto out_pages;
> +		}
> +		kaddr = kmap(pages[i]);
> +		if (copy_from_iter(kaddr, bytes, from) != bytes) {
> +			kunmap(pages[i]);
> +			ret = -EFAULT;
> +			goto out_pages;
> +		}
> +		if (bytes < PAGE_SIZE)
> +			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
> +		kunmap(pages[i]);
> +	}
> +
> +	for (;;) {
> +		struct btrfs_ordered_extent *ordered;
> +
> +		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
> +		if (ret)
> +			goto out_pages;
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +						    start >> PAGE_SHIFT,
> +						    end >> PAGE_SHIFT);
> +		if (ret)
> +			goto out_pages;
> +		lock_extent_bits(io_tree, start, end, &cached_state);
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +						     end - start + 1);
> +		if (!ordered &&
> +		    !filemap_range_has_page(inode->i_mapping, start, end))
> +			break;
> +		if (ordered)
> +			btrfs_put_ordered_extent(ordered);
> +		unlock_extent_cached(io_tree, start, end, &cached_state);
> +		cond_resched();
> +	}
> +
> +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> +					   num_bytes);
> +	if (ret)
> +		goto out_unlock;
> +
> +	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> +				   disk_num_bytes, 0, 0, &ins, 1, 1);
> +	if (ret)
> +		goto out_delalloc_release;
> +
> +	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> +			  ins.offset, ins.offset, num_bytes, compression,
> +			  BTRFS_ORDERED_COMPRESSED);
> +	if (IS_ERR(em)) {
> +		ret = PTR_ERR(em);
> +		goto out_free_reserve;
> +	}
> +	free_extent_map(em);
> +
> +	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> +						num_bytes, ins.offset,
> +						BTRFS_ORDERED_COMPRESSED,
> +						compression);
> +	if (ret) {
> +		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> +		goto out_free_reserve;
> +	}
> +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +
> +	if (start + encoded->len > inode->i_size)
> +		i_size_write(inode, start + encoded->len);

Don't we want the inode size to be updated once data hits disk and
btrfs_finish_ordered_io is called?

> +
> +	unlock_extent_cached(io_tree, start, end, &cached_state);
> +
> +	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> +
> +	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> +					  ins.offset, pages, nr_pages, 0,
> +					  false)) {
> +		struct page *page = pages[0];
> +
> +		page->mapping = inode->i_mapping;
> +		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> +		page->mapping = NULL;
> +		ret = -EIO;
> +		goto out_pages;
> +	}
> +	iocb->ki_pos += encoded->len;
> +	return orig_count;
> +
> +out_free_reserve:
> +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> +out_delalloc_release:
> +	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> +				     true);
> +out_unlock:
> +	unlock_extent_cached(io_tree, start, end, &cached_state);
> +out_pages:
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pages[i])
> +			put_page(pages[i]);
> +	}
> +	kvfree(pages);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * Add an entry indicating a block group or device which is pinned by a
> 


* Re: [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads
  2019-10-15 18:42 ` [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads Omar Sandoval
@ 2019-10-16 11:10   ` Nikolay Borisov
  2019-10-18 22:23     ` Omar Sandoval
  0 siblings, 1 reply; 42+ messages in thread
From: Nikolay Borisov @ 2019-10-16 11:10 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs, linux-fsdevel
  Cc: kernel-team, Dave Chinner, Jann Horn, linux-api



On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There are 4 main cases:
> 
> 1. Inline extents: we copy the data straight out of the extent buffer.
> 2. Hole/preallocated extents: we indicate the size of the extent
>    starting from the read position; we don't need to copy zeroes.
> 3. Regular, uncompressed extents: we read the sectors we need directly
>    from disk.
> 4. Regular, compressed extents: we read the entire compressed extent
>    from disk and indicate what subset of the decompressed extent is in
>    the file.
> 
> This initial implementation simplifies a few things that can be improved
> in the future:
> 
> - We hold the inode lock during the operation.
> - Cases 1, 3, and 4 allocate temporary memory to read into before
>   copying out to userspace.
> - Cases 3 and 4 do not implement repair yet.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/ctree.h |   2 +
>  fs/btrfs/file.c  |  12 +-
>  fs/btrfs/inode.c | 462 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 475 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 71552b2ca340..3b2aa1c7218c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2906,6 +2906,8 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
>  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>  					  u64 end, int uptodate);
> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> +
>  extern const struct dentry_operations btrfs_dentry_operations;
>  
>  /* ioctl.c */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 27e5b269e729..51740cee39fc 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -390,6 +390,16 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
>  	return 0;
>  }
>  
> +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	if (iocb->ki_flags & IOCB_ENCODED) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EOPNOTSUPP;
> +		return btrfs_encoded_read(iocb, iter);
> +	}
> +	return generic_file_read_iter(iocb, iter);
> +}
> +
>  /* simple helper to fault in pages and copy.  This should go away
>   * and be replaced with calls into generic code.
>   */
> @@ -3457,7 +3467,7 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
>  
>  const struct file_operations btrfs_file_operations = {
>  	.llseek		= btrfs_file_llseek,
> -	.read_iter      = generic_file_read_iter,
> +	.read_iter      = btrfs_file_read_iter,
>  	.splice_read	= generic_file_splice_read,
>  	.write_iter	= btrfs_file_write_iter,
>  	.mmap		= btrfs_file_mmap,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8bce46122ef7..174d0738d2c9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10593,6 +10593,468 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
>  	}
>  }
>  
> +static int encoded_iov_compression_from_btrfs(struct encoded_iov *encoded,
> +					      unsigned int compress_type)
> +{
> +	switch (compress_type) {
> +	case BTRFS_COMPRESS_NONE:
> +		encoded->compression = ENCODED_IOV_COMPRESSION_NONE;
> +		break;
> +	case BTRFS_COMPRESS_ZLIB:
> +		encoded->compression = ENCODED_IOV_COMPRESSION_ZLIB;
> +		break;
> +	case BTRFS_COMPRESS_LZO:
> +		encoded->compression = ENCODED_IOV_COMPRESSION_LZO;
> +		break;
> +	case BTRFS_COMPRESS_ZSTD:
> +		encoded->compression = ENCODED_IOV_COMPRESSION_ZSTD;
> +		break;
> +	default:
> +		return -EIO;
> +	}
> +	return 0;
> +}
> +
> +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
> +					 struct iov_iter *iter, u64 start,
> +					 u64 lockend,
> +					 struct extent_state **cached_state,
> +					 u64 extent_start, size_t count,
> +					 struct encoded_iov *encoded,
> +					 bool *unlocked)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_path *path;
> +	struct extent_buffer *leaf;
> +	struct btrfs_file_extent_item *item;
> +	u64 ram_bytes;
> +	unsigned long ptr;
> +	void *tmp;
> +	ssize_t ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> +				       0);
> +	if (ret) {
> +		if (ret > 0) {
> +			/* The extent item disappeared? */
> +			ret = -EIO;
> +		}
> +		goto out;
> +	}
> +	leaf = path->nodes[0];
> +	item = btrfs_item_ptr(leaf, path->slots[0],
> +			      struct btrfs_file_extent_item);
> +
> +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> +	ptr = btrfs_file_extent_inline_start(item);
> +
> +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> +			iocb->ki_pos);
> +	ret = encoded_iov_compression_from_btrfs(encoded,
> +				 btrfs_file_extent_compression(leaf, item));
> +	if (ret)
> +		goto out;
> +	if (encoded->compression) {
> +		size_t inline_size;
> +
> +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> +						btrfs_item_nr(path->slots[0]));
> +		if (inline_size > count) {
> +			ret = -EFBIG;
> +			goto out;
> +		}
> +		count = inline_size;
> +		encoded->unencoded_len = ram_bytes;
> +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> +	} else {
> +		encoded->len = encoded->unencoded_len = count =
> +			min_t(u64, count, encoded->len);
> +		ptr += iocb->ki_pos - extent_start;
> +	}
> +
> +	tmp = kmalloc(count, GFP_NOFS);
> +	if (!tmp) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	read_extent_buffer(leaf, tmp, ptr, count);
> +	btrfs_free_path(path);
> +	path = NULL;
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock(inode);
> +	*unlocked = true;
> +	if (copy_to_iter(encoded, sizeof(*encoded), iter) == sizeof(*encoded) &&
> +	    copy_to_iter(tmp, count, iter) == count)
> +		ret = count;
> +	else
> +		ret = -EFAULT;
> +	kfree(tmp);
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +struct btrfs_encoded_read_private {
> +	struct inode *inode;
> +	wait_queue_head_t wait;
> +	atomic_t pending;
> +	bool uptodate;
> +	bool skip_csum;
> +};
> +
> +static bool btrfs_encoded_read_check_csums(struct btrfs_io_bio *io_bio)
> +{
> +	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
> +	struct inode *inode = priv->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	u32 sectorsize = fs_info->sectorsize;
> +	struct bio_vec *bvec;
> +	struct bvec_iter_all iter_all;
> +	u64 offset = 0;
> +
> +	if (priv->skip_csum)
> +		return true;
> +	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
> +		unsigned int i, nr_sectors, pgoff;
> +
> +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> +		pgoff = bvec->bv_offset;
> +		for (i = 0; i < nr_sectors; i++) {
> +			int csum_pos;
> +
> +			csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset);
> +			if (__readpage_endio_check(inode, io_bio, csum_pos,
> +						   bvec->bv_page, pgoff,
> +						   io_bio->logical + offset,
> +						   sectorsize))
> +				return false;
> +			offset += sectorsize;
> +			pgoff += sectorsize;
> +		}
> +	}
> +	return true;
> +}
> +
> +static void btrfs_encoded_read_endio(struct bio *bio)
> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +
> +	if (bio->bi_status || !btrfs_encoded_read_check_csums(io_bio))
> +		priv->uptodate = false;
> +	if (!atomic_dec_return(&priv->pending))
> +		wake_up(&priv->wait);
> +	btrfs_io_bio_free_csum(io_bio);
> +	bio_put(bio);
> +}
> +
> +static bool btrfs_submit_encoded_read(struct bio *bio)
> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct inode *inode = priv->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	blk_status_t status;
> +
> +	atomic_inc(&priv->pending);
> +
> +	if (!priv->skip_csum) {
> +		status = btrfs_lookup_bio_sums_at_offset(inode, bio,
> +							 btrfs_io_bio(bio)->logical,
> +							 NULL);
> +		if (status)
> +			goto out;
> +	}
> +
> +	status = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> +	if (status)
> +		goto out;
> +
> +	status = btrfs_map_bio(fs_info, bio, 0, 0);
> +out:
> +	if (status) {
> +		bio->bi_status = status;
> +		bio_endio(bio);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> +					  struct iov_iter *iter,
> +					  u64 start, u64 lockend,
> +					  struct extent_state **cached_state,
> +					  struct block_device *bdev,
> +					  u64 offset, u64 disk_io_size,
> +					  size_t count,
> +					  const struct encoded_iov *encoded,
> +					  bool *unlocked)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_encoded_read_private priv = {
> +		.inode = inode,
> +		.wait = __WAIT_QUEUE_HEAD_INITIALIZER(priv.wait),
> +		.pending = ATOMIC_INIT(1),
> +		.uptodate = true,
> +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> +	};
> +	struct page **pages;
> +	unsigned long nr_pages, i;
> +	struct bio *bio = NULL;
> +	u64 cur;
> +	size_t page_offset;
> +	ssize_t ret;
> +
> +	nr_pages = (disk_io_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> +	if (!pages)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_pages; i++) {
> +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> +		if (!pages[i]) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +	}
> +
> +	i = 0;
> +	cur = 0;
> +	while (cur < disk_io_size) {
> +		size_t bytes = min_t(u64, disk_io_size - cur,
> +				     PAGE_SIZE);
> +
> +		if (!bio) {
> +			bio = btrfs_bio_alloc(offset + cur);
> +			bio_set_dev(bio, bdev);
> +			bio->bi_end_io = btrfs_encoded_read_endio;
> +			bio->bi_private = &priv;
> +			bio->bi_opf = REQ_OP_READ;
> +			btrfs_io_bio(bio)->logical = start + cur;
> +		}
> +
> +		if (bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> +			bool success;
> +
> +			success = btrfs_submit_encoded_read(bio);
> +			bio = NULL;
> +			if (!success)
> +				break;
> +			continue;
> +		}
> +		i++;
> +		cur += bytes;
> +	}
> +
> +	if (bio)
> +		btrfs_submit_encoded_read(bio);
> +	if (atomic_dec_return(&priv.pending))
> +		wait_event(priv.wait, !atomic_read(&priv.pending));
> +	if (!priv.uptodate) {
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock(inode);
> +	*unlocked = true;
> +
> +	if (copy_to_iter(encoded, sizeof(*encoded), iter) != sizeof(*encoded)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +	if (encoded->compression) {
> +		i = 0;
> +		page_offset = 0;
> +	} else {
> +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> +	}
> +	cur = 0;
> +	while (cur < count) {
> +		size_t bytes = min_t(size_t, count - cur,
> +				     PAGE_SIZE - page_offset);
> +
> +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> +				      iter) != bytes) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +		i++;
> +		cur += bytes;
> +		page_offset = 0;
> +	}
> +	ret = count;
> +out:
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pages[i])
> +			put_page(pages[i]);
> +	}
> +	kfree(pages);
> +	return ret;
> +}
> +
> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	ssize_t ret;
> +	size_t count;
> +	struct block_device *em_bdev;
> +	u64 start, lockend, offset, disk_io_size;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_map *em;
> +	struct encoded_iov encoded = {};
> +	bool unlocked = false;
> +
> +	ret = check_encoded_read(iocb, iter);
> +	if (ret < 0)
> +		return ret;
> +	if (ret == 0) {
> +empty:
> +		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
> +		    sizeof(encoded))
> +			return 0;
> +		else
> +			return -EFAULT;

nit: Just put the label at the end of the function since it's a simple
error handler.

> +	}
> +	count = ret;
> +
> +	file_accessed(iocb->ki_filp);
> +
> +	inode_lock(inode);
> +
> +	if (iocb->ki_pos >= inode->i_size) {
> +		inode_unlock(inode);
> +		goto empty;

That way you won't have to jump backwards here ...
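
A minimal sketch of the suggested shape, using a hypothetical toy function
(toy_read() and toy_copy_header() are made up for illustration and are not
part of the patch). Both the zero-length case and the past-EOF case jump
forward to one handler at the end of the function:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for copy_to_iter(): pretends to copy a header. */
static size_t toy_copy_header(int fail)
{
	return fail ? 0 : sizeof(long);
}

/*
 * Toy read path: returns the byte count, 0 for an empty read, or -1 on a
 * simulated fault. The "empty" handler sits at the end of the function, so
 * every goto jumps forward.
 */
static long toy_read(size_t count, long pos, long i_size, int fail_copy)
{
	assert(pos >= 0 && i_size >= 0);

	if (count == 0)
		goto empty;
	if (pos >= i_size)
		goto empty;	/* the real code would drop the inode lock first */

	return (long)count;	/* the normal read path would go here */

empty:
	if (toy_copy_header(fail_copy) == sizeof(long))
		return 0;
	return -1;		/* stands in for -EFAULT */
}
```

The real function would still need to unlock the inode before taking the
past-EOF jump, as the quoted code does.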

> +	}
> +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> +	/*
> +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> +	 * it's compressed we know that it won't be longer than this.
> +	 */
> +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> +
> +	for (;;) {
> +		struct btrfs_ordered_extent *ordered;
> +
> +		ret = btrfs_wait_ordered_range(inode, start,
> +					       lockend - start + 1);
> +		if (ret)
> +			goto out_unlock_inode;
> +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +						     lockend - start + 1);
> +		if (!ordered)
> +			break;
> +		btrfs_put_ordered_extent(ordered);
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +		cond_resched();
> +	}
> +
> +	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
> +			      lockend - start + 1, 0);
> +	if (IS_ERR(em)) {
> +		ret = PTR_ERR(em);
> +		goto out_unlock_extent;
> +	}
> +	em_bdev = em->bdev;
> +
> +	if (em->block_start == EXTENT_MAP_INLINE) {
> +		u64 extent_start = em->start;
> +
> +		/*
> +		 * For inline extents we get everything we need out of the
> +		 * extent item.
> +		 */
> +		free_extent_map(em);
> +		em = NULL;
> +		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
> +						&cached_state, extent_start,
> +						count, &encoded, &unlocked);
> +		goto out;
> +	}
> +
> +	/*
> +	 * We only want to return up to EOF even if the extent extends beyond
> +	 * that.
> +	 */
> +	encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) -
> +		       iocb->ki_pos);
> +	if (em->block_start == EXTENT_MAP_HOLE ||
> +	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
> +		offset = EXTENT_MAP_HOLE;
> +	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> +		offset = em->block_start;
> +		/*
> +		 * Bail if the buffer isn't large enough to return the whole
> +		 * compressed extent.
> +		 */
> +		if (em->block_len > count) {
> +			ret = -EFBIG;
> +			goto out_em;
> +		}
> +		disk_io_size = count = em->block_len;
> +		encoded.unencoded_len = em->ram_bytes;
> +		encoded.unencoded_offset = iocb->ki_pos - em->orig_start;
> +		ret = encoded_iov_compression_from_btrfs(&encoded,
> +							 em->compress_type);
> +		if (ret)
> +			goto out_em;
> +	} else {
> +		offset = em->block_start + (start - em->start);
> +		if (encoded.len > count)
> +			encoded.len = count;
> +		/*
> +		 * Don't read beyond what we locked. This also limits the page
> +		 * allocations that we'll do.
> +		 */
> +		disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start;
> +		encoded.len = encoded.unencoded_len = count =
> +			start + disk_io_size - iocb->ki_pos;
> +		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
> +	}
> +	free_extent_map(em);
> +	em = NULL;
> +
> +	if (offset == EXTENT_MAP_HOLE) {
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +		inode_unlock(inode);
> +		unlocked = true;
> +		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
> +		    sizeof(encoded))
> +			ret = 0;
> +		else
> +			ret = -EFAULT;
> +	} else {
> +		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
> +						 &cached_state, em_bdev, offset,
> +						 disk_io_size, count, &encoded,
> +						 &unlocked);
> +	}
> +
> +out:
> +	if (ret >= 0)
> +		iocb->ki_pos += encoded.len;
> +out_em:
> +	free_extent_map(em);
> +out_unlock_extent:
> +	if (!unlocked)
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +out_unlock_inode:
> +	if (!unlocked)
> +		inode_unlock(inode);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * Add an entry indicating a block group or device which is pinned by a
> 
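
For reference, the size arithmetic in the uncompressed branch of
btrfs_encoded_read() above can be checked in userspace. This is only a
sketch: plan_uncompressed_read() and struct read_plan are hypothetical
names, and SKETCH_MAX_UNCOMPRESSED assumes BTRFS_MAX_UNCOMPRESSED is
128 KiB.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_UNCOMPRESSED (128 * 1024ULL)	/* assumed value */

struct read_plan {
	uint64_t start;		/* sector-aligned start of the locked range */
	uint64_t disk_io_size;	/* bytes to read from disk, sector-aligned */
	uint64_t count;		/* bytes to copy out to the caller */
	uint64_t nr_pages;	/* temporary pages needed for the read */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* sectorsize and page_size must be powers of two. */
static struct read_plan plan_uncompressed_read(uint64_t pos, uint64_t buf_len,
					       uint64_t extent_end,
					       uint64_t i_size,
					       uint64_t sectorsize,
					       uint64_t page_size)
{
	struct read_plan p;
	uint64_t lockend, len;

	assert(sectorsize && !(sectorsize & (sectorsize - 1)));
	assert(page_size && !(page_size & (page_size - 1)));
	assert(pos < i_size && pos < extent_end);

	p.start = pos & ~(sectorsize - 1);		/* ALIGN_DOWN */
	lockend = p.start + SKETCH_MAX_UNCOMPRESSED - 1;

	len = min_u64(extent_end, i_size) - pos;	/* stop at EOF/extent end */
	len = min_u64(len, buf_len);			/* and at the buffer size */

	/* Don't read beyond the locked range. */
	p.disk_io_size = min_u64(lockend + 1, pos + len) - p.start;
	p.count = p.start + p.disk_io_size - pos;
	p.disk_io_size = (p.disk_io_size + sectorsize - 1) & ~(sectorsize - 1);
	p.nr_pages = (p.disk_io_size + page_size - 1) / page_size;
	return p;
}
```

With 4 KiB sectors and pages, a 10000-byte read at offset 5000 reads three
sectors from disk (12288 bytes starting at 4096) and copies 10000 bytes out.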


* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-16  9:50   ` Nikolay Borisov
@ 2019-10-18 22:19     ` Omar Sandoval
  0 siblings, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-18 22:19 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api

On Wed, Oct 16, 2019 at 12:50:48PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> 
> In the future, when encryption is also supported, what should be the
> mechanism to enforce the ordering of encoding operations, i.e. first
> compress then encrypt => uncoded_len should be the resulting size after
> the encrypt operation? To me (not being a cryptographer) this seems the
> sensible thing to do since compression will be effective that way.
> However, what if, for whatever reason, a different filesystem wants to
> support this interface but chooses to do it the other way around ->
> encrypt, then compress?

Compress-then-encrypt is the only sane way to do it (because properly
encrypted data is indistinguishable from random data, which doesn't
compress very well). When we add encryption support, we can add a note
to the man page.

If someone _really_ wants encrypt-then-compress, they can add another
encoding field, compression_after_encryption.

> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
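
As a rough illustration of the layout described above, here is how a caller
might build the iovec array for an encoded read. This is a sketch, not a
tested user of the interface: struct encoded_iov is copied locally from the
patch because it is not in any released header, and setup_encoded_read() and
SKETCH_RWF_ENCODED are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local copy of the struct proposed in this series. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

#define SKETCH_RWF_ENCODED 0x00000020	/* flag value proposed in the patch */

/*
 * iov[0] carries only the metadata header; iov[1] receives the encoded
 * extent. iov[0].iov_len must be exactly sizeof(struct encoded_iov).
 */
static void setup_encoded_read(struct iovec iov[2], struct encoded_iov *hdr,
			       void *buf, size_t buf_len)
{
	assert(hdr && buf);

	memset(hdr, 0, sizeof(*hdr));
	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = buf;
	iov[1].iov_len = buf_len;
}
```

On a kernel carrying this series, the array would then be passed to
preadv2() with RWF_ENCODED on a file descriptor that has O_ENCODED set. For
a write, the caller fills in the header fields instead of zeroing them.
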
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  include/linux/fs.h      | 14 +++++++
> >  include/uapi/linux/fs.h | 26 ++++++++++++-
> >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC		(1 << 5)
> >  #define IOCB_WRITE		(1 << 6)
> >  #define IOCB_NOWAIT		(1 << 7)
> > +#define IOCB_ENCODED		(1 << 8)
> >  
> >  struct kiocb {
> >  	struct file		*ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +				struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >  				struct file *file_out, loff_t pos_out,
> >  				loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> >  			return -EOPNOTSUPP;
> >  		ki->ki_flags |= IOCB_NOWAIT;
> >  	}
> > +	if (flags & RWF_ENCODED) {
> > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +			return -EOPNOTSUPP;
> > +		ki->ki_flags |= IOCB_ENCODED;
> > +	}
> >  	if (flags & RWF_HIPRI)
> >  		ki->ki_flags |= IOCB_HIPRI;
> >  	if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +	ENCODED_IOV_COMPRESSION_NONE,
> > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > +	ENCODED_IOV_COMPRESSION_LZO,
> > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +	ENCODED_IOV_ENCRYPTION_NONE,
> > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> > +};
> > +
> >  /* high priority request, poll if possible */
> >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> >  
> > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> >  /* per-IO O_APPEND */
> >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> >  
> > +/* encoded (e.g., compressed or encrypted) IO */
> 
> nit: s/or/and\/or/ or both are exclusive?

Changed, thanks.

> > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > +
> >  /* mask of flags supported by the kernel */
> >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > -			 RWF_APPEND)
> > +			 RWF_APPEND | RWF_ENCODED)
> >  
> >  #endif /* _UAPI_LINUX_FS_H */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1146fcfa3215..d2e6d9caf353 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> >  	return 0;
> >  }
> >  
> > -/*
> > - * Performs necessary checks before doing a write
> > - *
> > - * Can adjust writing position or amount of bytes to write.
> > - * Returns appropriate error code that caller should return or
> > - * zero in case that write should be allowed.
> > - */
> > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file->f_mapping->host;
> > -	loff_t count;
> > -	int ret;
> >  
> >  	if (IS_SWAPFILE(inode))
> >  		return -ETXTBSY;
> >  
> > -	if (!iov_iter_count(from))
> > +	if (!*count)
> >  		return 0;
> >  
> >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> >  		return -EINVAL;
> >  
> > -	count = iov_iter_count(from);
> > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > +}
> > +
> > +/*
> > + * Performs necessary checks before doing a write
> > + *
> > + * Can adjust writing position or amount of bytes to write.
> > + * Returns a negative errno or the new number of bytes to write.
> > + */
> > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	loff_t count = iov_iter_count(from);
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  }
> >  EXPORT_SYMBOL(generic_write_checks);
> >  
> > +int generic_encoded_write_checks(struct kiocb *iocb,
> > +				 struct encoded_iov *encoded)
> > +{
> > +	loff_t count = encoded->unencoded_len;
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> 
> That's a bit confusing. You will only ever write encoded len bytes, yet
> you check the unencoded len. Presumably that's to ensure the data can be
> read back successfully? Still it feels a bit odd. IMO this warrants a
> comment.
> 
> When you use this function in patch 5 all the checks are performed
> against unencoded_len yet you do:
> 
> count = encoded.len;

Oops, this is supposed to check encoded->len. I forgot to update it when
I made the file length and unencoded length distinct fields. Good catch!

This needs to check the file length rather than the encoded length
because the checks in generic_write_check_limits() are concerned with
the file size (RLIMIT_FSIZE and s_maxbytes).
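
Roughly, the intended check looks like the sketch below. It is a simplified
userspace model, not the fixed patch: sketch_check_limits() is a
hypothetical stand-in for generic_write_check_limits() that clamps against
a single limit, and file_len plays the role of encoded->len.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the limit check: clamps *count to the limit. */
static int sketch_check_limits(uint64_t pos, uint64_t limit, uint64_t *count)
{
	assert(count);

	if (pos >= limit)
		return -EFBIG;
	if (*count > limit - pos)
		*count = limit - pos;
	return 0;
}

/*
 * The limits apply to the length the write occupies in the file, so that is
 * what gets checked. If the check had to shorten the write, fail instead:
 * an encoded extent cannot be written partially.
 */
static int sketch_encoded_write_checks(uint64_t pos, uint64_t file_len,
				       uint64_t limit)
{
	uint64_t count = file_len;
	int ret;

	ret = sketch_check_limits(pos, limit, &count);
	if (ret)
		return ret;
	if (count != file_len)
		return -EFBIG;
	return 0;
}
```
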

> > +	if (ret)
> > +		return ret;
> > +
> > +	if (count != encoded->unencoded_len) {
> > +		/*
> > +		 * The write got truncated by generic_write_checks_common(). We
> > +		 * can't do a partial encoded write.
> > +		 */
> > +		return -EFBIG;
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > +
> > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > +		return -EINVAL;
> > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > +}
> > +EXPORT_SYMBOL(check_encoded_read);
> > +
> > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> 
> nit: This might be just me but 'import' doesn't sound right, how about
> parse_encoded_write ?

The naming is borrowed from import_iovec(). IMO that's more descriptive
since we're not really parsing anything.

> > +			 struct iov_iter *from)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > +		return -EINVAL;
> > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > +		return -EFAULT;
> > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > +		return -EINVAL;
> > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > +		return -EINVAL;
> > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > +		return -EINVAL;
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(import_encoded_write);
> > +
> >  /*
> >   * Performs necessary checks before doing a clone.
> >   *
> > 


* Re: [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio()
  2019-10-16  9:22   ` Nikolay Borisov
@ 2019-10-18 22:19     ` Omar Sandoval
  0 siblings, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-18 22:19 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api

On Wed, Oct 16, 2019 at 12:22:33PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > This isn't actually dio-specific; it just looks up the csums starting at
> > the given offset instead of using the page index. Rename it to
> > btrfs_lookup_bio_sums_at_offset() and add the dst parameter. We might
> > even want to expose __btrfs_lookup_bio_sums() as the public API instead
> > of having two trivial wrappers, but I'll leave that for another day.
> 
> IMO exposing btrfs_lookup_bio_sums and adding proper kernel doc for its
> parameters is the correct way forward. Consider doing this if the
> general direction of this patchset is accepted and before sending the
> final revision.

Ok, if I'm not the only one that thinks it's a good idea, I'll go ahead
with that.


* Re: [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads
  2019-10-16 11:10   ` Nikolay Borisov
@ 2019-10-18 22:23     ` Omar Sandoval
  0 siblings, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-18 22:23 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api

On Wed, Oct 16, 2019 at 02:10:10PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > There are 4 main cases:
> > 
> > 1. Inline extents: we copy the data straight out of the extent buffer.
> > 2. Hole/preallocated extents: we indicate the size of the extent
> >    starting from the read position; we don't need to copy zeroes.
> > 3. Regular, uncompressed extents: we read the sectors we need directly
> >    from disk.
> > 4. Regular, compressed extents: we read the entire compressed extent
> >    from disk and indicate what subset of the decompressed extent is in
> >    the file.
> > 
> > This initial implementation simplifies a few things that can be improved
> > in the future:
> > 
> > - We hold the inode lock during the operation.
> > - Cases 1, 3, and 4 allocate temporary memory to read into before
> >   copying out to userspace.
> > - Cases 3 and 4 do not implement repair yet.
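
The four cases and the temporary-memory note can be summarized as a small
dispatch. This is a sketch with hypothetical names (struct sketch_extent,
classify_extent(), needs_temp_buffer()); the order of the tests follows
btrfs_encoded_read() in the patch, where the inline check comes first and a
compressed inline extent is handled inside the inline path.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_read_case {
	SKETCH_READ_INLINE,		/* case 1 */
	SKETCH_READ_HOLE,		/* case 2: hole or preallocated */
	SKETCH_READ_UNCOMPRESSED,	/* case 3 */
	SKETCH_READ_COMPRESSED,		/* case 4 */
};

/* Hypothetical, flattened view of the extent map flags that matter here. */
struct sketch_extent {
	bool is_inline;
	bool is_hole;
	bool is_prealloc;
	bool is_compressed;
};

static enum sketch_read_case classify_extent(const struct sketch_extent *em)
{
	assert(em);

	if (em->is_inline)
		return SKETCH_READ_INLINE;
	if (em->is_hole || em->is_prealloc)
		return SKETCH_READ_HOLE;
	if (em->is_compressed)
		return SKETCH_READ_COMPRESSED;
	return SKETCH_READ_UNCOMPRESSED;
}

/* Cases 1, 3 and 4 read into temporary memory; holes copy no data. */
static bool needs_temp_buffer(enum sketch_read_case c)
{
	return c != SKETCH_READ_HOLE;
}
```
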
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/btrfs/ctree.h |   2 +
> >  fs/btrfs/file.c  |  12 +-
> >  fs/btrfs/inode.c | 462 +++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 475 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 71552b2ca340..3b2aa1c7218c 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2906,6 +2906,8 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
> >  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> >  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> >  					  u64 end, int uptodate);
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> > +
> >  extern const struct dentry_operations btrfs_dentry_operations;
> >  
> >  /* ioctl.c */
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 27e5b269e729..51740cee39fc 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -390,6 +390,16 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
> >  	return 0;
> >  }
> >  
> > +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EOPNOTSUPP;
> > +		return btrfs_encoded_read(iocb, iter);
> > +	}
> > +	return generic_file_read_iter(iocb, iter);
> > +}
> > +
> >  /* simple helper to fault in pages and copy.  This should go away
> >   * and be replaced with calls into generic code.
> >   */
> > @@ -3457,7 +3467,7 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
> >  
> >  const struct file_operations btrfs_file_operations = {
> >  	.llseek		= btrfs_file_llseek,
> > -	.read_iter      = generic_file_read_iter,
> > +	.read_iter      = btrfs_file_read_iter,
> >  	.splice_read	= generic_file_splice_read,
> >  	.write_iter	= btrfs_file_write_iter,
> >  	.mmap		= btrfs_file_mmap,
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8bce46122ef7..174d0738d2c9 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -10593,6 +10593,468 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
> >  	}
> >  }
> >  
> > +static int encoded_iov_compression_from_btrfs(struct encoded_iov *encoded,
> > +					      unsigned int compress_type)
> > +{
> > +	switch (compress_type) {
> > +	case BTRFS_COMPRESS_NONE:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_NONE;
> > +		break;
> > +	case BTRFS_COMPRESS_ZLIB:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_ZLIB;
> > +		break;
> > +	case BTRFS_COMPRESS_LZO:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_LZO;
> > +		break;
> > +	case BTRFS_COMPRESS_ZSTD:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_ZSTD;
> > +		break;
> > +	default:
> > +		return -EIO;
> > +	}
> > +	return 0;
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
> > +					 struct iov_iter *iter, u64 start,
> > +					 u64 lockend,
> > +					 struct extent_state **cached_state,
> > +					 u64 extent_start, size_t count,
> > +					 struct encoded_iov *encoded,
> > +					 bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct btrfs_path *path;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_file_extent_item *item;
> > +	u64 ram_bytes;
> > +	unsigned long ptr;
> > +	void *tmp;
> > +	ssize_t ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> > +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> > +				       0);
> > +	if (ret) {
> > +		if (ret > 0) {
> > +			/* The extent item disappeared? */
> > +			ret = -EIO;
> > +		}
> > +		goto out;
> > +	}
> > +	leaf = path->nodes[0];
> > +	item = btrfs_item_ptr(leaf, path->slots[0],
> > +			      struct btrfs_file_extent_item);
> > +
> > +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> > +	ptr = btrfs_file_extent_inline_start(item);
> > +
> > +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> > +			iocb->ki_pos);
> > +	ret = encoded_iov_compression_from_btrfs(encoded,
> > +				 btrfs_file_extent_compression(leaf, item));
> > +	if (ret)
> > +		goto out;
> > +	if (encoded->compression) {
> > +		size_t inline_size;
> > +
> > +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> > +						btrfs_item_nr(path->slots[0]));
> > +		if (inline_size > count) {
> > +			ret = -EFBIG;
> > +			goto out;
> > +		}
> > +		count = inline_size;
> > +		encoded->unencoded_len = ram_bytes;
> > +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> > +	} else {
> > +		encoded->len = encoded->unencoded_len = count =
> > +			min_t(u64, count, encoded->len);
> > +		ptr += iocb->ki_pos - extent_start;
> > +	}
> > +
> > +	tmp = kmalloc(count, GFP_NOFS);
> > +	if (!tmp) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	read_extent_buffer(leaf, tmp, ptr, count);
> > +	btrfs_free_path(path);
> > +	path = NULL;
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock(inode);
> > +	*unlocked = true;
> > +	if (copy_to_iter(encoded, sizeof(*encoded), iter) == sizeof(*encoded) &&
> > +	    copy_to_iter(tmp, count, iter) == count)
> > +		ret = count;
> > +	else
> > +		ret = -EFAULT;
> > +	kfree(tmp);
> > +
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> > +struct btrfs_encoded_read_private {
> > +	struct inode *inode;
> > +	wait_queue_head_t wait;
> > +	atomic_t pending;
> > +	bool uptodate;
> > +	bool skip_csum;
> > +};
> > +
> > +static bool btrfs_encoded_read_check_csums(struct btrfs_io_bio *io_bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
> > +	struct inode *inode = priv->inode;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	u32 sectorsize = fs_info->sectorsize;
> > +	struct bio_vec *bvec;
> > +	struct bvec_iter_all iter_all;
> > +	u64 offset = 0;
> > +
> > +	if (priv->skip_csum)
> > +		return true;
> > +	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
> > +		unsigned int i, nr_sectors, pgoff;
> > +
> > +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> > +		pgoff = bvec->bv_offset;
> > +		for (i = 0; i < nr_sectors; i++) {
> > +			int csum_pos;
> > +
> > +			csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset);
> > +			if (__readpage_endio_check(inode, io_bio, csum_pos,
> > +						   bvec->bv_page, pgoff,
> > +						   io_bio->logical + offset,
> > +						   sectorsize))
> > +				return false;
> > +			offset += sectorsize;
> > +			pgoff += sectorsize;
> > +		}
> > +	}
> > +	return true;
> > +}
> > +
> > +static void btrfs_encoded_read_endio(struct bio *bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > +
> > +	if (bio->bi_status || !btrfs_encoded_read_check_csums(io_bio))
> > +		priv->uptodate = false;
> > +	if (!atomic_dec_return(&priv->pending))
> > +		wake_up(&priv->wait);
> > +	btrfs_io_bio_free_csum(io_bio);
> > +	bio_put(bio);
> > +}
> > +
> > +static bool btrfs_submit_encoded_read(struct bio *bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct inode *inode = priv->inode;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	blk_status_t status;
> > +
> > +	atomic_inc(&priv->pending);
> > +
> > +	if (!priv->skip_csum) {
> > +		status = btrfs_lookup_bio_sums_at_offset(inode, bio,
> > +							 btrfs_io_bio(bio)->logical,
> > +							 NULL);
> > +		if (status)
> > +			goto out;
> > +	}
> > +
> > +	status = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> > +	if (status)
> > +		goto out;
> > +
> > +	status = btrfs_map_bio(fs_info, bio, 0, 0);
> > +out:
> > +	if (status) {
> > +		bio->bi_status = status;
> > +		bio_endio(bio);
> > +		return false;
> > +	}
> > +	return true;
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> > +					  struct iov_iter *iter,
> > +					  u64 start, u64 lockend,
> > +					  struct extent_state **cached_state,
> > +					  struct block_device *bdev,
> > +					  u64 offset, u64 disk_io_size,
> > +					  size_t count,
> > +					  const struct encoded_iov *encoded,
> > +					  bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct btrfs_encoded_read_private priv = {
> > +		.inode = inode,
> > +		.wait = __WAIT_QUEUE_HEAD_INITIALIZER(priv.wait),
> > +		.pending = ATOMIC_INIT(1),
> > +		.uptodate = true,
> > +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> > +	};
> > +	struct page **pages;
> > +	unsigned long nr_pages, i;
> > +	struct bio *bio = NULL;
> > +	u64 cur;
> > +	size_t page_offset;
> > +	ssize_t ret;
> > +
> > +	nr_pages = (disk_io_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	i = 0;
> > +	cur = 0;
> > +	while (cur < disk_io_size) {
> > +		size_t bytes = min_t(u64, disk_io_size - cur,
> > +				     PAGE_SIZE);
> > +
> > +		if (!bio) {
> > +			bio = btrfs_bio_alloc(offset + cur);
> > +			bio_set_dev(bio, bdev);
> > +			bio->bi_end_io = btrfs_encoded_read_endio;
> > +			bio->bi_private = &priv;
> > +			bio->bi_opf = REQ_OP_READ;
> > +			btrfs_io_bio(bio)->logical = start + cur;
> > +		}
> > +
> > +		if (bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> > +			bool success;
> > +
> > +			success = btrfs_submit_encoded_read(bio);
> > +			bio = NULL;
> > +			if (!success)
> > +				break;
> > +			continue;
> > +		}
> > +		i++;
> > +		cur += bytes;
> > +	}
> > +
> > +	if (bio)
> > +		btrfs_submit_encoded_read(bio);
> > +	if (atomic_dec_return(&priv.pending))
> > +		wait_event(priv.wait, !atomic_read(&priv.pending));
> > +	if (!priv.uptodate) {
> > +		ret = -EIO;
> > +		goto out;
> > +	}
> > +
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock(inode);
> > +	*unlocked = true;
> > +
> > +	if (copy_to_iter(encoded, sizeof(*encoded), iter) != sizeof(*encoded)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +	if (encoded->compression) {
> > +		i = 0;
> > +		page_offset = 0;
> > +	} else {
> > +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> > +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> > +	}
> > +	cur = 0;
> > +	while (cur < count) {
> > +		size_t bytes = min_t(size_t, count - cur,
> > +				     PAGE_SIZE - page_offset);
> > +
> > +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> > +				      iter) != bytes) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +		i++;
> > +		cur += bytes;
> > +		page_offset = 0;
> > +	}
> > +	ret = count;
> > +out:
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pages[i])
> > +			put_page(pages[i]);
> > +	}
> > +	kfree(pages);
> > +	return ret;
> > +}
> > +
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	ssize_t ret;
> > +	size_t count;
> > +	struct block_device *em_bdev;
> > +	u64 start, lockend, offset, disk_io_size;
> > +	struct extent_state *cached_state = NULL;
> > +	struct extent_map *em;
> > +	struct encoded_iov encoded = {};
> > +	bool unlocked = false;
> > +
> > +	ret = check_encoded_read(iocb, iter);
> > +	if (ret < 0)
> > +		return ret;
> > +	if (ret == 0) {
> > +empty:
> > +		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
> > +		    sizeof(encoded))
> > +			return 0;
> > +		else
> > +			return -EFAULT;
> 
> nit: Just put the label at the end of the function since it's a simple
> error handler.

It's not really an error handler, it's a corner case. It doesn't seem
any nicer to have the corner case far away at the end of the function
and goto it from two places.

> > +	}
> > +	count = ret;
> > +
> > +	file_accessed(iocb->ki_filp);
> > +
> > +	inode_lock(inode);
> > +
> > +	if (iocb->ki_pos >= inode->i_size) {
> > +		inode_unlock(inode);
> > +		goto empty;
> 
> That way you won't have to jump backwards here ...
> 
> > +	}
> > +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> > +	/*
> > +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> > +	 * it's compressed we know that it won't be longer than this.
> > +	 */
> > +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> > +
> > +	for (;;) {
> > +		struct btrfs_ordered_extent *ordered;
> > +
> > +		ret = btrfs_wait_ordered_range(inode, start,
> > +					       lockend - start + 1);
> > +		if (ret)
> > +			goto out_unlock_inode;
> > +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > +						     lockend - start + 1);
> > +		if (!ordered)
> > +			break;
> > +		btrfs_put_ordered_extent(ordered);
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +		cond_resched();
> > +	}
> > +
> > +	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
> > +			      lockend - start + 1, 0);
> > +	if (IS_ERR(em)) {
> > +		ret = PTR_ERR(em);
> > +		goto out_unlock_extent;
> > +	}
> > +	em_bdev = em->bdev;
> > +
> > +	if (em->block_start == EXTENT_MAP_INLINE) {
> > +		u64 extent_start = em->start;
> > +
> > +		/*
> > +		 * For inline extents we get everything we need out of the
> > +		 * extent item.
> > +		 */
> > +		free_extent_map(em);
> > +		em = NULL;
> > +		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
> > +						&cached_state, extent_start,
> > +						count, &encoded, &unlocked);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * We only want to return up to EOF even if the extent extends beyond
> > +	 * that.
> > +	 */
> > +	encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) -
> > +		       iocb->ki_pos);
> > +	if (em->block_start == EXTENT_MAP_HOLE ||
> > +	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
> > +		offset = EXTENT_MAP_HOLE;
> > +	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > +		offset = em->block_start;
> > +		/*
> > +		 * Bail if the buffer isn't large enough to return the whole
> > +		 * compressed extent.
> > +		 */
> > +		if (em->block_len > count) {
> > +			ret = -EFBIG;
> > +			goto out_em;
> > +		}
> > +		disk_io_size = count = em->block_len;
> > +		encoded.unencoded_len = em->ram_bytes;
> > +		encoded.unencoded_offset = iocb->ki_pos - em->orig_start;
> > +		ret = encoded_iov_compression_from_btrfs(&encoded,
> > +							 em->compress_type);
> > +		if (ret)
> > +			goto out_em;
> > +	} else {
> > +		offset = em->block_start + (start - em->start);
> > +		if (encoded.len > count)
> > +			encoded.len = count;
> > +		/*
> > +		 * Don't read beyond what we locked. This also limits the page
> > +		 * allocations that we'll do.
> > +		 */
> > +		disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start;
> > +		encoded.len = encoded.unencoded_len = count =
> > +			start + disk_io_size - iocb->ki_pos;
> > +		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
> > +	}
> > +	free_extent_map(em);
> > +	em = NULL;
> > +
> > +	if (offset == EXTENT_MAP_HOLE) {
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +		inode_unlock(inode);
> > +		unlocked = true;
> > +		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
> > +		    sizeof(encoded))
> > +			ret = 0;
> > +		else
> > +			ret = -EFAULT;
> > +	} else {
> > +		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
> > +						 &cached_state, em_bdev, offset,
> > +						 disk_io_size, count, &encoded,
> > +						 &unlocked);
> > +	}
> > +
> > +out:
> > +	if (ret >= 0)
> > +		iocb->ki_pos += encoded.len;
> > +out_em:
> > +	free_extent_map(em);
> > +out_unlock_extent:
> > +	if (!unlocked)
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +out_unlock_inode:
> > +	if (!unlocked)
> > +		inode_unlock(inode);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_SWAP
> >  /*
> >   * Add an entry indicating a block group or device which is pinned by a
> > 

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
  2019-10-16 10:44   ` Nikolay Borisov
@ 2019-10-18 22:55     ` Omar Sandoval
  2019-10-18 23:33       ` Omar Sandoval
  2019-10-21 13:14       ` David Sterba
  0 siblings, 2 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-18 22:55 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api

On Wed, Oct 16, 2019 at 01:44:56PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > The implementation resembles direct I/O: we have to flush any ordered
> > extents, invalidate the page cache, and do the io tree/delalloc/extent
> > map/ordered extent dance. From there, we can reuse the compression code
> > with a minor modification to distinguish the write from writeback.
> > 
> > Now that read and write are implemented, this also sets the
> > FMODE_ENCODED_IO flag in btrfs_file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/btrfs/compression.c |   6 +-
> >  fs/btrfs/compression.h |   5 +-
> >  fs/btrfs/ctree.h       |   2 +
> >  fs/btrfs/file.c        |  40 +++++++--
> >  fs/btrfs/inode.c       | 197 ++++++++++++++++++++++++++++++++++++++++-
> >  5 files changed, 237 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> > index b05b361e2062..6632dd8d2e4d 100644
> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
> >  			bio->bi_status == BLK_STS_OK);
> >  	cb->compressed_pages[0]->mapping = NULL;
> >  
> > -	end_compressed_writeback(inode, cb);
> > +	if (cb->writeback)
> > +		end_compressed_writeback(inode, cb);
> >  	/* note, our inode could be gone now */
> >  
> >  	/*
> > @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  				 unsigned long compressed_len,
> >  				 struct page **compressed_pages,
> >  				 unsigned long nr_pages,
> > -				 unsigned int write_flags)
> > +				 unsigned int write_flags, bool writeback)
> 
> I don't see this function being called with true in this patch set,
> meaning it essentially eliminates end_compressed_writeback call in
> end_compressed_bio_write? Am I missing anything?

I'll point it out below.

> >  {
> >  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  	struct bio *bio = NULL;
> > @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  	cb->mirror_num = 0;
> >  	cb->compressed_pages = compressed_pages;
> >  	cb->compressed_len = compressed_len;
> > +	cb->writeback = writeback;
> >  	cb->orig_bio = NULL;
> >  	cb->nr_pages = nr_pages;
> >  
> > diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> > index 4cb8be9ff88b..d4176384ec15 100644
> > --- a/fs/btrfs/compression.h
> > +++ b/fs/btrfs/compression.h
> > @@ -47,6 +47,9 @@ struct compressed_bio {
> >  	/* the compression algorithm for this bio */
> >  	int compress_type;
> >  
> > +	/* Whether this is a write for writeback. */
> > +	bool writeback;
> > +
> >  	/* number of compressed pages in the array */
> >  	unsigned long nr_pages;
> >  
> > @@ -93,7 +96,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  				  unsigned long compressed_len,
> >  				  struct page **compressed_pages,
> >  				  unsigned long nr_pages,
> > -				  unsigned int write_flags);
> > +				  unsigned int write_flags, bool writeback);
> >  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
> >  				 int mirror_num, unsigned long bio_flags);
> >  
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 3b2aa1c7218c..9e1719e82cc8 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2907,6 +2907,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> >  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> >  					  u64 end, int uptodate);
> >  ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> > +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> > +			    struct encoded_iov *encoded);
> >  
> >  extern const struct dentry_operations btrfs_dentry_operations;
> >  
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 51740cee39fc..8de6ac9b4b9c 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1893,8 +1893,7 @@ static void update_time_for_write(struct inode *inode)
> >  		inode_inc_iversion(inode);
> >  }
> >  
> > -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > -				    struct iov_iter *from)
> > +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file_inode(file);
> > @@ -1904,14 +1903,22 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  	u64 end_pos;
> >  	ssize_t num_written = 0;
> >  	const bool sync = iocb->ki_flags & IOCB_DSYNC;
> > +	struct encoded_iov encoded;
> >  	ssize_t err;
> >  	loff_t pos;
> >  	size_t count;
> >  	loff_t oldsize;
> >  	int clean_page = 0;
> >  
> > -	if (!(iocb->ki_flags & IOCB_DIRECT) &&
> > -	    (iocb->ki_flags & IOCB_NOWAIT))
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		err = import_encoded_write(iocb, &encoded, from);
> > +		if (err)
> > +			return err;
> > +	}
> > +
> > +	if ((iocb->ki_flags & IOCB_NOWAIT) &&
> > +	    (!(iocb->ki_flags & IOCB_DIRECT) ||
> > +	     (iocb->ki_flags & IOCB_ENCODED)))
> >  		return -EOPNOTSUPP;
> >  
> >  	if (!inode_trylock(inode)) {
> > @@ -1920,14 +1927,27 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  		inode_lock(inode);
> >  	}
> >  
> > -	err = generic_write_checks(iocb, from);
> > -	if (err <= 0) {
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		err = generic_encoded_write_checks(iocb, &encoded);
> > +		if (err) {
> > +			inode_unlock(inode);
> > +			return err;
> > +		}
> > +		count = encoded.len;
> > +	} else {
> > +		err = generic_write_checks(iocb, from);
> > +		if (err < 0) {
> > +			inode_unlock(inode);
> > +			return err;
> > +		}
> > +		count = iov_iter_count(from);
> > +	}
> > +	if (count == 0) {
> >  		inode_unlock(inode);
> >  		return err;
> >  	}
> >  
> >  	pos = iocb->ki_pos;
> > -	count = iov_iter_count(from);
> >  	if (iocb->ki_flags & IOCB_NOWAIT) {
> >  		/*
> >  		 * We will allocate space in case nodatacow is not set,
> > @@ -1986,7 +2006,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  	if (sync)
> >  		atomic_inc(&BTRFS_I(inode)->sync_writers);
> >  
> > -	if (iocb->ki_flags & IOCB_DIRECT) {
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		num_written = btrfs_encoded_write(iocb, from, &encoded);
> > +	} else if (iocb->ki_flags & IOCB_DIRECT) {
> >  		num_written = __btrfs_direct_write(iocb, from);
> >  	} else {
> >  		num_written = btrfs_buffered_write(iocb, from);
> > @@ -3461,7 +3483,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
> >  
> >  static int btrfs_file_open(struct inode *inode, struct file *filp)
> >  {
> > -	filp->f_mode |= FMODE_NOWAIT;
> > +	filp->f_mode |= FMODE_NOWAIT | FMODE_ENCODED_IO;
> >  	return generic_file_open(inode, filp);
> >  }
> >  
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 174d0738d2c9..bcc5a2bed22b 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
> >  				    ins.objectid,
> >  				    ins.offset, async_extent->pages,
> >  				    async_extent->nr_pages,
> > -				    async_chunk->write_flags)) {
> > +				    async_chunk->write_flags, true)) {

This is the btrfs_submit_compressed_write() call; it's just so long that
the diff context doesn't include the name :)

> >  			struct page *p = async_extent->pages[0];
> >  			const u64 start = async_extent->start;
> >  			const u64 end = start + async_extent->ram_size - 1;
> > @@ -11055,6 +11055,201 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> >  	return ret;
> >  }
> >  
> > +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> > +			    struct encoded_iov *encoded)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct extent_changeset *data_reserved = NULL;
> > +	struct extent_state *cached_state = NULL;
> > +	int compression;
> > +	size_t orig_count;
> > +	u64 disk_num_bytes, num_bytes;
> > +	u64 start, end;
> > +	unsigned long nr_pages, i;
> > +	struct page **pages;
> > +	struct btrfs_key ins;
> > +	struct extent_map *em;
> > +	ssize_t ret;
> > +
> > +	switch (encoded->compression) {
> > +	case ENCODED_IOV_COMPRESSION_ZLIB:
> > +		compression = BTRFS_COMPRESS_ZLIB;
> > +		break;
> > +	case ENCODED_IOV_COMPRESSION_LZO:
> > +		compression = BTRFS_COMPRESS_LZO;
> > +		break;
> > +	case ENCODED_IOV_COMPRESSION_ZSTD:
> > +		compression = BTRFS_COMPRESS_ZSTD;
> > +		break;
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	disk_num_bytes = orig_count = iov_iter_count(from);
> > +
> > +	/* For now, it's too hard to support bookend extents. */
> > +	if (encoded->unencoded_len != encoded->len ||
> > +	    encoded->unencoded_offset != 0)
> > +		return -EINVAL;
> > +
> > +	/* The extent size must be sane. */
> > +	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
> > +	    disk_num_bytes > BTRFS_MAX_COMPRESSED || disk_num_bytes == 0)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * The compressed data on disk must be sector-aligned. For convenience,
> > +	 * we extend it with zeroes if it isn't.
> > +	 */
> > +	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
> > +
> > +	/*
> > +	 * The extent in the file must also be sector-aligned. However, we allow
> > +	 * a write which ends at or extends i_size to have an unaligned length;
> > +	 * we round up the extent size and set i_size to the given length.
> > +	 */
> > +	start = iocb->ki_pos;
> > +	if (!IS_ALIGNED(start, fs_info->sectorsize))
> > +		return -EINVAL;
> > +	if (start + encoded->len >= inode->i_size) {
> > +		num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
> > +	} else {
> > +		num_bytes = encoded->len;
> > +		if (!IS_ALIGNED(num_bytes, fs_info->sectorsize))
> > +			return -EINVAL;
> > +	}
> > +
> > +	/*
> > +	 * It's valid to have compressed data which is larger than or the same
> > +	 * size as the decompressed data. However, for buffered I/O, we fall
> > +	 * back to writing the decompressed data if compression didn't shrink
> > +	 * it. So, for now, let's not allow creating such extents.
> > +	 *
> > +	 * Note that for now this also implicitly prevents writing data that
> > +	 * would fit in an inline extent.
> > +	 */
> > +	if (disk_num_bytes >= num_bytes)
> > +		return -EINVAL;
> > +
> > +	end = start + num_bytes - 1;
> > +
> > +	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 
> nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)

disk_num_bytes is a u64, so DIV_ROUND_UP() would expand to a 64-bit
division. The compiler is probably smart enough to optimize it to a
shift, but I didn't want to rely on that: if it doesn't, the 64-bit
division breaks the build on 32-bit.
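
As a rough userspace illustration of the shift-based round-up (this is
not kernel code; pages_for_bytes() and the 4 KiB page size are
assumptions made up for this sketch), the computation needs only an add
and a shift:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's PAGE_SHIFT; 12 means 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/*
 * Number of pages needed to hold `bytes` bytes, rounded up.
 * Only an addition and a shift are used, so no 64-bit division is
 * required even when `bytes` is a 64-bit value on a 32-bit target.
 */
static uint64_t pages_for_bytes(uint64_t bytes)
{
	/* Guard against wrap-around in the addition below. */
	assert(bytes <= UINT64_MAX - (SKETCH_PAGE_SIZE - 1));
	return (bytes + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

With these assumed values, pages_for_bytes(4097) evaluates to 2, since
one byte spills into a second page.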

> > +	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_USER);
> 
> This could be a simple GFP_KERNEL  allocation

I mixed up GFP_USER and GFP_KERNEL_ACCOUNT. I think we want
GFP_KERNEL_ACCOUNT rather than GFP_KERNEL here because the allocation is
triggered by a userspace request. (Obviously we're not very consistent
about that in Btrfs, but for new code we might as well be more careful
about it.)

> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
> > +		char *kaddr;
> > +
> > +		pages[i] = alloc_page(GFP_HIGHUSER);
> 
> Why GFP_HIGHUSER? You are reading from userspace,  not writing to it. A
> plain, NOFS allocation should suffice (of course using the newer
> memalloc_nofs_save api)?

The __GFP_HIGHMEM bit is just to give the allocator more to work with
since we only ever access the pages with kmap() (we do the same thing
elsewhere in Btrfs). It doesn't need to be NOFS, but this should
probably be GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM.

> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out_pages;
> > +		}
> > +		kaddr = kmap(pages[i]);
> > +		if (copy_from_iter(kaddr, bytes, from) != bytes) {
> > +			kunmap(pages[i]);
> > +			ret = -EFAULT;
> > +			goto out_pages;
> > +		}
> > +		if (bytes < PAGE_SIZE)
> > +			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
> > +		kunmap(pages[i]);
> > +	}
> > +
> > +	for (;;) {
> > +		struct btrfs_ordered_extent *ordered;
> > +
> > +		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
> > +		if (ret)
> > +			goto out_pages;
> > +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> > +						    start >> PAGE_SHIFT,
> > +						    end >> PAGE_SHIFT);
> > +		if (ret)
> > +			goto out_pages;
> > +		lock_extent_bits(io_tree, start, end, &cached_state);
> > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > +						     end - start + 1);
> > +		if (!ordered &&
> > +		    !filemap_range_has_page(inode->i_mapping, start, end))
> > +			break;
> > +		if (ordered)
> > +			btrfs_put_ordered_extent(ordered);
> > +		unlock_extent_cached(io_tree, start, end, &cached_state);
> > +		cond_resched();
> > +	}
> > +
> > +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> > +					   num_bytes);
> > +	if (ret)
> > +		goto out_unlock;
> > +
> > +	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> > +				   disk_num_bytes, 0, 0, &ins, 1, 1);
> > +	if (ret)
> > +		goto out_delalloc_release;
> > +
> > +	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> > +			  ins.offset, ins.offset, num_bytes, compression,
> > +			  BTRFS_ORDERED_COMPRESSED);
> > +	if (IS_ERR(em)) {
> > +		ret = PTR_ERR(em);
> > +		goto out_free_reserve;
> > +	}
> > +	free_extent_map(em);
> > +
> > +	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> > +						num_bytes, ins.offset,
> > +						BTRFS_ORDERED_COMPRESSED,
> > +						compression);
> > +	if (ret) {
> > +		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> > +		goto out_free_reserve;
> > +	}
> > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > +
> > +	if (start + encoded->len > inode->i_size)
> > +		i_size_write(inode, start + encoded->len);
> 
> Don't we want the inode size to be updated once data hits disk and
> btrfs_finish_ordered_io is called?

Yup, you're right, this is too early.

> > +
> > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > +
> > +	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> > +
> > +	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> > +					  ins.offset, pages, nr_pages, 0,
> > +					  false)) {
> > +		struct page *page = pages[0];
> > +
> > +		page->mapping = inode->i_mapping;
> > +		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> > +		page->mapping = NULL;
> > +		ret = -EIO;
> > +		goto out_pages;
> > +	}

I also need to wait for the I/O to finish here.

> > +	iocb->ki_pos += encoded->len;
> > +	return orig_count;
> > +
> > +out_free_reserve:
> > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > +	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> > +out_delalloc_release:
> > +	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> > +				     true);
> > +out_unlock:
> > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > +out_pages:
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pages[i])
> > +			put_page(pages[i]);
> > +	}
> > +	kvfree(pages);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_SWAP
> >  /*
> >   * Add an entry indicating a block group or device which is pinned by a
> > 
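
For readers following the argument checks in the quoted
btrfs_encoded_write(), here is a minimal userspace sketch of the same
size and alignment rules. It is an illustration only: the struct, the
function names, and the constants (4 KiB sectors, 128 KiB limits) are
assumptions made for this sketch, and the compression-type switch is
left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for this sketch only; the kernel defines its own. */
#define SKETCH_SECTORSIZE       UINT64_C(4096)
#define SKETCH_MAX_UNCOMPRESSED UINT64_C(131072)
#define SKETCH_MAX_COMPRESSED   UINT64_C(131072)

/* Hypothetical mirror of the values the quoted checks look at. */
struct sketch_encoded_write {
	uint64_t pos;              /* file offset of the write */
	uint64_t len;              /* length of the data in the file */
	uint64_t unencoded_len;    /* length of the decompressed data */
	uint64_t unencoded_offset; /* offset into the decompressed data */
	uint64_t compressed_bytes; /* size of the compressed payload */
	uint64_t i_size;           /* current file size */
};

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t sketch_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Returns true if the write would pass the quoted -EINVAL checks. */
static bool sketch_encoded_write_ok(const struct sketch_encoded_write *w)
{
	uint64_t disk_num_bytes, num_bytes;

	/* No bookend extents: the write must cover the whole extent. */
	if (w->unencoded_len != w->len || w->unencoded_offset != 0)
		return false;

	/* The extent size must be sane. */
	if (w->unencoded_len > SKETCH_MAX_UNCOMPRESSED ||
	    w->compressed_bytes > SKETCH_MAX_COMPRESSED ||
	    w->compressed_bytes == 0)
		return false;

	/* Compressed data on disk is padded up to a whole sector. */
	disk_num_bytes = sketch_align_up(w->compressed_bytes,
					 SKETCH_SECTORSIZE);

	/* The start of the extent in the file must be sector-aligned. */
	if (w->pos % SKETCH_SECTORSIZE != 0)
		return false;

	/* Only a write ending at or past i_size may be unaligned. */
	if (w->pos + w->len >= w->i_size) {
		num_bytes = sketch_align_up(w->len, SKETCH_SECTORSIZE);
	} else {
		num_bytes = w->len;
		if (num_bytes % SKETCH_SECTORSIZE != 0)
			return false;
	}

	/* Compression must actually shrink the data. */
	return disk_num_bytes < num_bytes;
}
```

Under these assumptions, a 3000-byte payload that decompresses to 8192
bytes at offset 0 is accepted, while a 5000-byte payload for the same
extent is rejected because it rounds up to the full 8192 bytes on disk.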

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
  2019-10-18 22:55     ` Omar Sandoval
@ 2019-10-18 23:33       ` Omar Sandoval
  2019-10-21 13:14       ` David Sterba
  1 sibling, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-18 23:33 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api

On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> On Wed, Oct 16, 2019 at 01:44:56PM +0300, Nikolay Borisov wrote:
> > 
> > 
> > On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > The implementation resembles direct I/O: we have to flush any ordered
> > > extents, invalidate the page cache, and do the io tree/delalloc/extent
> > > map/ordered extent dance. From there, we can reuse the compression code
> > > with a minor modification to distinguish the write from writeback.
> > > 
> > > Now that read and write are implemented, this also sets the
> > > FMODE_ENCODED_IO flag in btrfs_file_open().
> > > 
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  fs/btrfs/compression.c |   6 +-
> > >  fs/btrfs/compression.h |   5 +-
> > >  fs/btrfs/ctree.h       |   2 +
> > >  fs/btrfs/file.c        |  40 +++++++--
> > >  fs/btrfs/inode.c       | 197 ++++++++++++++++++++++++++++++++++++++++-
> > >  5 files changed, 237 insertions(+), 13 deletions(-)
> > > 

[snip]

> > > +	for (;;) {
> > > +		struct btrfs_ordered_extent *ordered;
> > > +
> > > +		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
> > > +		if (ret)
> > > +			goto out_pages;
> > > +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > +						    start >> PAGE_SHIFT,
> > > +						    end >> PAGE_SHIFT);
> > > +		if (ret)
> > > +			goto out_pages;
> > > +		lock_extent_bits(io_tree, start, end, &cached_state);
> > > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > > +						     end - start + 1);
> > > +		if (!ordered &&
> > > +		    !filemap_range_has_page(inode->i_mapping, start, end))
> > > +			break;
> > > +		if (ordered)
> > > +			btrfs_put_ordered_extent(ordered);
> > > +		unlock_extent_cached(io_tree, start, end, &cached_state);
> > > +		cond_resched();
> > > +	}
> > > +
> > > +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> > > +					   num_bytes);
> > > +	if (ret)
> > > +		goto out_unlock;
> > > +
> > > +	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> > > +				   disk_num_bytes, 0, 0, &ins, 1, 1);
> > > +	if (ret)
> > > +		goto out_delalloc_release;
> > > +
> > > +	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> > > +			  ins.offset, ins.offset, num_bytes, compression,
> > > +			  BTRFS_ORDERED_COMPRESSED);
> > > +	if (IS_ERR(em)) {
> > > +		ret = PTR_ERR(em);
> > > +		goto out_free_reserve;
> > > +	}
> > > +	free_extent_map(em);
> > > +
> > > +	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> > > +						num_bytes, ins.offset,
> > > +						BTRFS_ORDERED_COMPRESSED,
> > > +						compression);
> > > +	if (ret) {
> > > +		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> > > +		goto out_free_reserve;
> > > +	}
> > > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > > +
> > > +	if (start + encoded->len > inode->i_size)
> > > +		i_size_write(inode, start + encoded->len);
> > 
> > Don't we want the inode size to be updated once data hits disk and
> > btrfs_finish_ordered_io is called?
> 
> Yup, you're right, this is too early.

Actually, no, this part is fine. Compare to the call to i_size_write()
in btrfs_get_blocks_direct_write(): we lock the extent in the io_tree,
create the ordered extent, update i_size, then unlock the extent. Anyone
else who comes in is going to find the ordered extent and wait on that.

> > > +
> > > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > > +
> > > +	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> > > +
> > > +	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> > > +					  ins.offset, pages, nr_pages, 0,
> > > +					  false)) {
> > > +		struct page *page = pages[0];
> > > +
> > > +		page->mapping = inode->i_mapping;
> > > +		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> > > +		page->mapping = NULL;
> > > +		ret = -EIO;
> > > +		goto out_pages;
> > > +	}
> 
> I also need to wait for the I/O to finish here.
> 
> > > +	iocb->ki_pos += encoded->len;
> > > +	return orig_count;
> > > +
> > > +out_free_reserve:
> > > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > > +	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> > > +out_delalloc_release:
> > > +	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> > > +				     true);
> > > +out_unlock:
> > > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > > +out_pages:
> > > +	for (i = 0; i < nr_pages; i++) {
> > > +		if (pages[i])
> > > +			put_page(pages[i]);
> > > +	}
> > > +	kvfree(pages);
> > > +	return ret;
> > > +}
> > > +
> > >  #ifdef CONFIG_SWAP
> > >  /*
> > >   * Add an entry indicating a block group or device which is pinned by a
> > > 

* Re: [RFC PATCH v2 1/5] fs: add O_ENCODED open flag
  2019-10-15 18:42 ` [RFC PATCH v2 1/5] fs: add O_ENCODED open flag Omar Sandoval
@ 2019-10-19  4:50   ` Aleksa Sarai
  2019-10-23  4:46     ` Aleksa Sarai
  2019-10-30 22:55     ` Omar Sandoval
  0 siblings, 2 replies; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-19  4:50 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team

On 2019-10-15, Omar Sandoval <osandov@osandov.com> wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The upcoming RWF_ENCODED operation introduces some security concerns:
> 
> 1. Compressed writes will pass arbitrary data to decompression
>    algorithms in the kernel.
> 2. Compressed reads can leak truncated/hole punched data.
> 
> Therefore, we need to require privilege for RWF_ENCODED. It's not
> possible to do the permissions checks at the time of the read or write
> because, e.g., io_uring submits IO from a worker thread. So, add an open
> flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> fcntl(). The flag is not cleared in any way on fork or exec; it should
> probably be used with O_CLOEXEC in most cases.
> 
> Note that the usual issue that unknown open flags are ignored doesn't
> really matter for O_ENCODED; if the kernel doesn't support O_ENCODED,
> then it doesn't support RWF_ENCODED, either.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/fcntl.c                       | 10 ++++++++--
>  fs/namei.c                       |  4 ++++
>  include/linux/fcntl.h            |  2 +-
>  include/uapi/asm-generic/fcntl.h |  4 ++++
>  4 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 3d40771e8e7c..45ebc6df078e 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -30,7 +30,8 @@
>  #include <asm/siginfo.h>
>  #include <linux/uaccess.h>
>  
> -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> +		    O_ENCODED)
>  
>  static int setfl(int fd, struct file * filp, unsigned long arg)
>  {
> @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
>  		if (!inode_owner_or_capable(inode))
>  			return -EPERM;
>  
> +	/* O_ENCODED can only be set by superuser */
> +	if ((arg & O_ENCODED) && !(filp->f_flags & O_ENCODED) &&
> +	    !capable(CAP_SYS_ADMIN))
> +		return -EPERM;

I have a feeling the error should probably be an EACCES and not EPERM.

> +
>  	/* required for strict SunOS emulation */
>  	if (O_NONBLOCK != O_NDELAY)
>  	       if (arg & O_NDELAY)
> @@ -1031,7 +1037,7 @@ static int __init fcntl_init(void)
>  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>  	 * is defined as O_NONBLOCK on some platforms and not on others.
>  	 */
> -	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> +	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
>  		HWEIGHT32(
>  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
>  			__FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/namei.c b/fs/namei.c
> index 671c3c1a3425..ae86b125888a 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2978,6 +2978,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
>  	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
>  		return -EPERM;
>  
> +	/* O_ENCODED can only be set by superuser */
> +	if ((flag & O_ENCODED) && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;

I would suggest that this check be put into build_open_flags() rather
than putting it this late in open(). Also, same nit about the error
return as above.

> +
>  	return 0;
>  }
>  
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index d019df946cb2..5fac02479639 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -9,7 +9,7 @@
>  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
>  	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
>  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ENCODED)
>  
>  #ifndef force_o_largefile
>  #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 9dc0bf0c5a6e..8c5cbd5942e3 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -97,6 +97,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>  
> +#ifndef O_ENCODED
> +#define O_ENCODED	040000000
> +#endif

You should also define this for all of the architectures which don't use
the generic O_* flag values. On alpha, O_PATH is equal to the value you
picked (just be careful on sparc -- 0x4000000 is the next free bit, but
it's used by FMODE_NONOTIFY.)

> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-15 18:42 ` [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
  2019-10-16  9:50   ` Nikolay Borisov
@ 2019-10-19  5:01   ` Aleksa Sarai
  2019-10-21 18:28   ` Darrick J. Wong
  2 siblings, 0 replies; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-19  5:01 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team, christian.brauner

[-- Attachment #1: Type: text/plain, Size: 10424 bytes --]

On 2019-10-15, Omar Sandoval <osandov@osandov.com> wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
> 
> A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data
> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future. The remaining iovecs contain
> the encoded extent.
> 
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  include/linux/fs.h      | 14 +++++++
>  include/uapi/linux/fs.h | 26 ++++++++++++-
>  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
>  3 files changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..54681f21e05e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
>  
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> +
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
>   * that indicates that they should check the contents of the iovec are
> @@ -314,6 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC		(1 << 5)
>  #define IOCB_WRITE		(1 << 6)
>  #define IOCB_NOWAIT		(1 << 7)
> +#define IOCB_ENCODED		(1 << 8)
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
>  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
>  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> +				struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>  				struct file *file_out, loff_t pos_out,
>  				loff_t *count, unsigned int remap_flags);
> @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>  			return -EOPNOTSUPP;
>  		ki->ki_flags |= IOCB_NOWAIT;
>  	}
> +	if (flags & RWF_ENCODED) {
> +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> +			return -EOPNOTSUPP;
> +		ki->ki_flags |= IOCB_ENCODED;
> +	}
>  	if (flags & RWF_HIPRI)
>  		ki->ki_flags |= IOCB_HIPRI;
>  	if (flags & RWF_DSYNC)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..ed92a8a257cb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -284,6 +284,27 @@ struct fsxattr {
>  
>  typedef int __bitwise __kernel_rwf_t;
>  
> +enum {
> +	ENCODED_IOV_COMPRESSION_NONE,
> +	ENCODED_IOV_COMPRESSION_ZLIB,
> +	ENCODED_IOV_COMPRESSION_LZO,
> +	ENCODED_IOV_COMPRESSION_ZSTD,
> +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> +};
> +
> +enum {
> +	ENCODED_IOV_ENCRYPTION_NONE,
> +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> +};
> +
> +struct encoded_iov {
> +	__u64 len;
> +	__u64 unencoded_len;
> +	__u64 unencoded_offset;
> +	__u32 compression;
> +	__u32 encryption;
> +};
> +
>  /* high priority request, poll if possible */
>  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
>  
> @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
>  
> +/* encoded (e.g., compressed or encrypted) IO */
> +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -			 RWF_APPEND)
> +			 RWF_APPEND | RWF_ENCODED)
>  
>  #endif /* _UAPI_LINUX_FS_H */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1146fcfa3215..d2e6d9caf353 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
>  	return 0;
>  }
>  
> -/*
> - * Performs necessary checks before doing a write
> - *
> - * Can adjust writing position or amount of bytes to write.
> - * Returns appropriate error code that caller should return or
> - * zero in case that write should be allowed.
> - */
> -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
> -	loff_t count;
> -	int ret;
>  
>  	if (IS_SWAPFILE(inode))
>  		return -ETXTBSY;
>  
> -	if (!iov_iter_count(from))
> +	if (!*count)
>  		return 0;
>  
>  	/* FIXME: this is for backwards compatibility with 2.4 */
> @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
>  		return -EINVAL;
>  
> -	count = iov_iter_count(from);
> -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> +}
> +
> +/*
> + * Performs necessary checks before doing a write
> + *
> + * Can adjust writing position or amount of bytes to write.
> + * Returns a negative errno or the new number of bytes to write.
> + */
> +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	loff_t count = iov_iter_count(from);
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
>  	if (ret)
>  		return ret;
>  
> @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  }
>  EXPORT_SYMBOL(generic_write_checks);
>  
> +int generic_encoded_write_checks(struct kiocb *iocb,
> +				 struct encoded_iov *encoded)
> +{
> +	loff_t count = encoded->unencoded_len;
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
> +	if (ret)
> +		return ret;
> +
> +	if (count != encoded->unencoded_len) {
> +		/*
> +		 * The write got truncated by generic_write_checks_common(). We
> +		 * can't do a partial encoded write.
> +		 */
> +		return -EFBIG;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(generic_encoded_write_checks);
> +
> +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> +		return -EINVAL;

I'm not sure precisely what the right way of doing this is within the
iov_iter world (maybe we should write some new helpers for that
use case), but if you want to make forwards-compatibility much
smoother, please take a look at the new copy_struct_from_user()
semantics (it allows you to extend userspace-exposed structures without
making it painful for new software on old kernels).

Basically the semantics boil down to the following (ksize is the
kernel's struct size, and usize is the size that userspace used). All
new features must have their zero-value mean "don't use the new
feature".

  1. If ksize == usize, use it verbatim.
  2. If ksize > usize (old userspace), zero-fill the rest of the kernel
     struct -- thus if userspace doesn't know about a new feature, it is
     disabled.
  3. If ksize < usize (old kernel), check whether the trailing
     (usize - ksize) bytes are zero. If they are, then just use the
     ksize prefix of the userspace struct. If they are non-zero then
     give -E2BIG. Thus if userspace is newer than the kernel but isn't
     using a new feature, it won't get an error.

This is how clone3(2) works (and openat2(2) will work), as well as some
older syscalls like perf_event_open(2) and sched_setattr(2). BPF also
has some similar semantics.

I really would like us to have a much more uniform way of defining
extensible APIs in Linux. Returning -EINVAL for all differently-sized
structs means that any new programs will get errors on older kernels
(even if they aren't using any new features).
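
The three cases above can be sketched in plain userspace C. This is only
a model of the semantics, not the kernel helper itself:
copy_struct_sketch(), struct args_v1 and struct args_v2 are hypothetical
names made up for illustration, and the "kernel" here is just another
buffer in the same process.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical v1 and v2 layouts of an extensible argument struct. */
struct args_v1 {
	uint64_t len;
};

struct args_v2 {
	uint64_t len;
	uint64_t extra; /* new field; zero means "feature not used" */
};

/*
 * Model of the extensible-struct copy rules.
 * dst/ksize: the receiving ("kernel") struct and its size.
 * src/usize: the caller's struct and the size the caller passed.
 * Returns 0 on success, or -E2BIG if the caller set bytes that the
 * receiver does not know about.
 */
static int copy_struct_sketch(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	const unsigned char *in = src;
	size_t i;

	/* Case 3 (ksize < usize): unknown trailing bytes must be zero. */
	for (i = size; i < usize; i++) {
		if (in[i] != 0)
			return -E2BIG;
	}

	/* Case 2 (ksize > usize): zero-fill what the caller did not supply. */
	if (ksize > usize)
		memset((unsigned char *)dst + size, 0, ksize - size);

	/* All cases: copy the common prefix. */
	memcpy(dst, src, size);
	return 0;
}
```

With this shape, an old caller passing struct args_v1 to a receiver that
expects struct args_v2 gets extra == 0, and a new caller passing struct
args_v2 to an old receiver only fails if it actually set extra.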

> +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> +}
> +EXPORT_SYMBOL(check_encoded_read);
> +
> +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> +			 struct iov_iter *from)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> +		return -EINVAL;
> +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> +		return -EFAULT;
> +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> +		return -EINVAL;
> +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> +		return -EINVAL;
> +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> +		return -EINVAL;
> +	return 0;
> +}
> +EXPORT_SYMBOL(import_encoded_write);
> +
>  /*
>   * Performs necessary checks before doing a clone.
>   *

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data
  2019-10-15 18:42 ` [PATCH man-pages] Document encoded I/O Omar Sandoval
@ 2019-10-20 23:05   ` Dave Chinner
  2019-10-21 19:04     ` Omar Sandoval
  2019-10-21  6:18   ` [PATCH man-pages] Document encoded I/O Amir Goldstein
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2019-10-20 23:05 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Jann Horn, linux-api, kernel-team

On Tue, Oct 15, 2019 at 11:42:38AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Hello,
> 
> This series adds an API for reading compressed data on a filesystem
> without decompressing it as well as support for writing compressed data
> directly to the filesystem. It is based on my previous series which
> added a Btrfs-specific ioctl [1], but it is now an extension to
> preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
> man page patch describing the API in detail. Test cases and example
> programs are available [3].
> 
> The use case that I have in mind is Btrfs send/receive: currently, when
> sending data from one compressed filesystem to another, the sending side
> decompresses the data and the receiving side recompresses it before
> writing it out. This is wasteful and can be avoided if we can just send
> and write compressed extents. The send part will be implemented in a
> separate series, as this API can stand alone.
> 
> Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
> Patch 4 implements encoded reads for Btrfs, and patch 5 implements
> encoded writes.
> 
> Changes from v1 [4]:
> 
> - Encoded reads are now also implemented.
> - The encoded_iov structure now includes metadata for referring to a
>   subset of decoded data. This is required to handle certain cases where
>   a compressed extent is truncated, hole punched, or otherwise sliced up
>   and Btrfs chooses to reflect this in metadata instead of decompressing
>   the whole extent and rewriting the pieces. We call these "bookend
>   extents" in Btrfs, but any filesystem supporting transparent encoding
>   is likely to have a similar concept.

Where's the in-kernel documentation for this API? You're encoding a
specific set of behaviours into the user API, so this needs a whole
heap of documentation in the generic code to describe how it works,
so that other filesystems implementing it have a well-defined guideline
for what they need to support.

Also, I don't see any test code for this - can you please add
support for RWF_ENCODED to xfs_io and write a suite of unit tests
for fstests that exercise the user API fully?  Given our history of
screwing up new user APIs, this absolutely should not be merged
until there is a full set of generic unit tests written and reviewed
for it and support has been added to fsstress, fsx, and other test
utilities to fuzz and stress the implementation as part of normal
day-to-day filesystem development...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH man-pages] Document encoded I/O
  2019-10-15 18:42 ` [PATCH man-pages] Document encoded I/O Omar Sandoval
  2019-10-20 23:05   ` [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Dave Chinner
@ 2019-10-21  6:18   ` Amir Goldstein
  2019-10-21 18:53     ` Omar Sandoval
  1 sibling, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-10-21  6:18 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, Linux Btrfs, Dave Chinner, Jann Horn, Linux API,
	kernel-team, Theodore Tso

CC: Ted

Whatever happened to the API for reading/writing ext4 encrypted data?
https://marc.info/?l=linux-ext4&m=145030599010416&w=2

Can we learn anything from the ext4 experience to improve
the new proposed API?


On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This adds a new page, rwf_encoded(7), providing an overview of encoded
> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> reference it.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  man2/fcntl.2       |  10 +-
>  man2/open.2        |  13 ++
>  man2/readv.2       |  46 +++++++
>  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 365 insertions(+), 1 deletion(-)
>  create mode 100644 man7/rwf_encoded.7
>
> diff --git a/man2/fcntl.2 b/man2/fcntl.2
> index fce4f4c2b..76fe9cc6f 100644
> --- a/man2/fcntl.2
> +++ b/man2/fcntl.2
> @@ -222,8 +222,9 @@ On Linux, this command can change only the
>  .BR O_ASYNC ,
>  .BR O_DIRECT ,
>  .BR O_NOATIME ,
> +.BR O_NONBLOCK ,
>  and
> -.B O_NONBLOCK
> +.B O_ENCODED
>  flags.
>  It is not possible to change the
>  .BR O_DSYNC
> @@ -1803,6 +1804,13 @@ Attempted to clear the
>  flag on a file that has the append-only attribute set.
>  .TP
>  .B EPERM
> +Attempted to set the
> +.B O_ENCODED
> +flag and the calling process did not have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
> +.B EPERM
>  .I cmd
>  was
>  .BR F_ADD_SEALS ,
> diff --git a/man2/open.2 b/man2/open.2
> index b0f485b41..cdd3c549c 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -421,6 +421,14 @@ was followed by a call to
>  .BR fdatasync (2)).
>  .IR "See NOTES below" .
>  .TP
> +.B O_ENCODED
> +Open the file with encoded I/O permissions;

1. I find the name of the flag confusing.
Yes, most people don't read documentation so carefully (or at all)
so they will assume O_ENCODED will affect read/write or that it
relates to RWF_ENCODED in a similar way that O_SYNC relates
to RWF_SYNC (i.e. logical OR and not logical AND).

I am not good at naming and to prove it I will propose:
O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED

2. While I see no harm in adding O_ flag to open(2) for this
use case, I also don't see a major benefit in adding it.
What if we only allowed setting the flag via fcntl(2) which returns
an error on old kernels?
Since unlike most O_ flags, O_ENCODED does NOT affect file
i/o without additional opt-in flags, it is not standard anyway and
therefore I find that setting it only via fcntl(2) is less error prone.
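
The distinction in point 1 can be sketched as two predicates. This is a
rough model only: the SKETCH_* names are hypothetical, the O_ENCODED and
RWF_ENCODED values are copied from the quoted patches rather than from
any released header, and the real checks live in the kernel, not in
userspace.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

#define SKETCH_O_ENCODED   040000000  /* value from the quoted asm-generic patch */
#define SKETCH_RWF_SYNC    0x00000004 /* same value as RWF_SYNC in <linux/fs.h> */
#define SKETCH_RWF_ENCODED 0x00000020 /* value from the quoted uapi patch */

/* O_SYNC and RWF_SYNC combine with OR: either one makes the write sync. */
static bool write_is_sync(int open_flags, int rw_flags)
{
	return (open_flags & O_SYNC) == O_SYNC ||
	       (rw_flags & SKETCH_RWF_SYNC) != 0;
}

/*
 * O_ENCODED and RWF_ENCODED combine with AND: the open flag is only a
 * permission, and the per-I/O flag is what requests encoded I/O.
 * Returns 1 for encoded I/O, 0 for regular I/O, -EPERM if denied.
 */
static int io_is_encoded(int open_flags, int rw_flags)
{
	if (!(rw_flags & SKETCH_RWF_ENCODED))
		return 0; /* O_ENCODED on its own changes nothing */
	if (!(open_flags & SKETCH_O_ENCODED))
		return -EPERM;
	return 1;
}
```

A reader who expects O_ENCODED to behave like O_SYNC would expect the
first call shape to apply to both flags, which is the confusion I mean.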


> +see
> +.BR rwf_encoded (7).
> +The caller must have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
>  .B O_EXCL
>  Ensure that this call creates the file:
>  if this flag is specified in conjunction with
> @@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller was not privileged.
>  The operation was prevented by a file seal; see
>  .BR fcntl (2).
>  .TP
> +.B EPERM
> +The
> +.B O_ENCODED
> +flag was specified, but the caller was not privileged.
> +.TP
>  .B EROFS
>  .I pathname
>  refers to a file on a read-only filesystem and write access was
> diff --git a/man2/readv.2 b/man2/readv.2
> index af27aa63e..aa60b980a 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -265,6 +265,11 @@ the data is always appended to the end of the file.
>  However, if the
>  .I offset
>  argument is \-1, the current file offset is updated.
> +.TP
> +.BR RWF_ENCODED " (since Linux 5.6)"
> +Read or write encoded (e.g., compressed) data.
> +See
> +.BR rwf_encoded (7).
>  .SH RETURN VALUE
>  On success,
>  .BR readv (),
> @@ -284,6 +289,13 @@ than requested (see
>  and
>  .BR write (2)).
>  .PP
> +If
> +.B
> +RWF_ENCODED
> +was specified in
> +.IR flags ,
> +then the return value is the number of encoded bytes.
> +.PP
>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>  .SH ERRORS
>  The errors are as given for
> @@ -314,6 +326,40 @@ is less than zero or greater than the permitted maximum.
>  .TP
>  .B EOPNOTSUPP
>  An unknown flag is specified in \fIflags\fP.
> +.TP
> +.B EOPNOTSUPP
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the filesystem does not implement encoded I/O.
> +.TP
> +.B EPERM
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the file was not opened with the
> +.B O_ENCODED
> +flag.
> +.PP
> +.BR preadv2 ()
> +can fail for the following reasons:
> +.TP
> +.B EFBIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and buffers in
> +.I iov
> +were not big enough to return the encoded data.

I don't like that EFBIG is returned for read.
While the meaning of EXXX values is often very vague, the meaning of
EFBIG is still quite consistent - it is always a write-related error
when trying to change i_size above fs/system limits.

In the case above, I find E2BIG much more appropriate.
Although its original meaning was "argument list too long", it has
already grown several cases where it generally means "buffer cannot
hold the result", as with msgrcv(2) and listxattr(2).

Thanks,
Amir.

> +.PP
> +.BR pwritev2 ()
> +can fail for the following reasons:
> +.TP
> +.B EINVAL
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the alignment and/or size requirements are not met.
>  .SH VERSIONS
>  .BR preadv ()
>  and
> diff --git a/man7/rwf_encoded.7 b/man7/rwf_encoded.7
> new file mode 100644
> index 000000000..90f5292e2
> --- /dev/null
> +++ b/man7/rwf_encoded.7
> @@ -0,0 +1,297 @@
> +.\" Copyright (c) 2019 by Omar Sandoval <osandov@fb.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\"
> +.TH RWF_ENCODED  7 2019-10-14 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +rwf_encoded \- overview of encoded I/O
> +.SH DESCRIPTION
> +Several filesystems (e.g., Btrfs) support transparent encoding
> +(e.g., compression, encryption) of data on disk:
> +written data is encoded by the kernel before it is written to disk,
> +and read data is decoded before being returned to the user.
> +In some cases, it is useful to skip this encoding step.
> +For example, the user may want to read the compressed contents of a file
> +or write pre-compressed data directly to a file.
> +This is referred to as "encoded I/O".
> +.SS Encoded I/O API
> +Encoded I/O is specified with the
> +.B RWF_ENCODED
> +flag to
> +.BR preadv2 (2)
> +and
> +.BR pwritev2 (2).
> +If
> +.B RWF_ENCODED
> +is specified, then
> +.I iov[0].iov_base
> +points to an
> +.I
> +encoded_iov
> +structure, defined in
> +.I <linux/fs.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct encoded_iov {
> +    __u64 len;
> +    __u64 unencoded_len;
> +    __u64 unencoded_offset;
> +    __u32 compression;
> +    __u32 encryption;
> +
> +};
> +.EE
> +.in
> +.PP
> +.I iov[0].iov_len
> +must be set to
> +.IR "sizeof(struct\ encoded_iov)" .
> +The remaining buffers contain the encoded data.
> +.PP
> +.I compression
> +and
> +.I encryption
> +are the encoding fields.
> +.I compression
> +is one of
> +.B ENCODED_IOV_COMPRESSION_NONE
> +(zero),
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +or
> +.BR ENCODED_IOV_COMPRESSION_ZSTD .
> +.I encryption
> +is currently always
> +.B ENCODED_IOV_ENCRYPTION_NONE
> +(zero).
> +.PP
> +.I unencoded_len
> +is the length of the unencoded (i.e., decrypted and decompressed) data.
> +.I unencoded_offset
> +is the offset into the unencoded data where the data in the file begins
> +(strictly less than
> +.IR unencoded_len ).
> +.I len
> +is the length of the data in the file.
> +.PP
> +In most cases,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +However, it may be necessary to refer to a subset of the unencoded data,
> +usually because a read occurred in the middle of an encoded extent,
> +because part of an extent was overwritten or deallocated in some
> +way (e.g., with
> +.BR write (2),
> +.BR truncate (2),
> +or
> +.BR fallocate (2))
> +or because part of an extent was added to the file (e.g., with
> +.BR ioctl_ficlonerange (2)
> +or
> +.BR ioctl_fideduperange (2)).
> +For example, if
> +.I len
> +is 300,
> +.I unencoded_len
> +is 1000,
> +and
> +.I unencoded_offset
> +is 600,
> +then the encoded data is 1000 bytes long when decoded,
> +of which only the 300 bytes starting at offset 600 are used;
> +the first 600 and last 100 bytes should be ignored.
> +.PP
> +Additionally,
> +.I len
> +may be greater than
> +.I unencoded_len
> +-
> +.IR unencoded_offset ;
> +in this case, the data in the file is longer than the unencoded data,
> +and the difference is zero-filled.
> +.PP
> +If the unencoded data is actually longer than
> +.IR unencoded_len ,
> +then it is truncated;
> +if it is shorter, then it is extended with zeroes.
> +.PP
> +For
> +.BR pwritev2 (),
> +the metadata should be specified in
> +.IR iov[0] ,
> +and the encoded data should be passed in the remaining buffers.
> +This returns the number of encoded bytes written (that is, the sum of
> +.I iov[n].iov_len
> +for 1 <=
> +.I n
> +<
> +.IR iovcnt ;
> +partial writes will not occur).
> +If the
> +.I offset
> +argument to
> +.BR pwritev2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +At least one encoding field must be non-zero.
> +Note that the encoded data is not validated when it is written;
> +if it is not valid (e.g., it cannot be decompressed),
> +then a subsequent read may result in an error.
> +.PP
> +For
> +.BR preadv2 (),
> +the metadata is returned in
> +.IR iov[0] ,
> +and the encoded data is returned in the remaining buffers.
> +This returns the number of encoded bytes read.
> +Note that a return value of zero does not indicate end of file;
> +one should refer to
> +.I len
> +(for example, a hole in the file has a non-zero
> +.I len
> +but a zero return value).
> +A
> +.I len
> +of zero indicates end of file.
> +If the
> +.I offset
> +argument to
> +.BR preadv2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +If the provided buffers are not large enough to return an entire encoded
> +extent,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR EFBIG .
> +This will only return one encoded extent per call.
> +This can also read data which is not encoded;
> +all encoding fields will be zero in that case.
> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by malicious data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.
> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +.B O_ENCODED
> +may be set and cleared with
> +.BR fcntl (2).
> +Note that it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2);
> +one may wish to use
> +.B O_CLOEXEC
> +with
> +.BR O_ENCODED .
> +.SS Filesystem support
> +Encoded I/O is supported on the following filesystems:
> +.TP
> +Btrfs (since Linux 5.6)
> +.IP
> +Btrfs supports encoded reads and writes of compressed data.
> +The data is encoded as follows:
> +.RS
> +.IP * 3
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +then the encoded data is a single zlib stream.
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +then the encoded data is compressed page by page with LZO1X
> +and wrapped in the format described in the Linux kernel source file
> +.IR fs/btrfs/lzo.c .
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZSTD ,
> +then the encoded data is a single zstd frame compressed with the
> +.I windowLog
> +compression parameter set to no more than 17.
> +.RE
> +.IP
> +Additionally, there are some restrictions on
> +.BR pwritev2 ():
> +.RS
> +.IP * 3
> +.I offset
> +(or the current file offset if
> +.I offset
> +is -1) must be aligned to the sector size of the filesystem.
> +.IP *
> +.I len
> +must be aligned to the sector size of the filesystem
> +unless the data ends at or beyond the current end of the file.
> +.IP *
> +.I unencoded_len
> +and the length of the encoded data must each be no more than 128 KiB.
> +This limit may increase in the future.
> +.IP *
> +The length of the encoded data rounded up to the nearest sector must be
> +less than
> +.I unencoded_len
> +rounded up to the nearest sector.
> +.IP *
> +Referring to a subset of unencoded data is not yet implemented; i.e.,
> +.I len
> +must equal
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +must be zero.
> +.IP *
> +Writing compressed inline extents is not yet implemented.
> +.RE
> --
> 2.23.0
>
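
The len / unencoded_len / unencoded_offset arithmetic described in the
quoted rwf_encoded(7) text can be sketched as follows. This is an
illustration under assumptions, not code from the series: struct
encoded_iov_sketch mirrors the struct proposed in the patch, while
struct slice_sketch and encoded_slice() are hypothetical helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirror of the struct proposed in the patch; not from a released header. */
struct encoded_iov_sketch {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* How the metadata maps onto the decoded buffer. */
struct slice_sketch {
	uint64_t skip;      /* decoded bytes to ignore at the front */
	uint64_t take;      /* decoded bytes that are file data */
	uint64_t zero_fill; /* trailing file bytes that read as zeroes */
	uint64_t discard;   /* decoded bytes to ignore at the back */
};

/* Returns 0 on success, -EINVAL if the offset is not inside the data. */
static int encoded_slice(const struct encoded_iov_sketch *e,
			 struct slice_sketch *s)
{
	uint64_t avail;

	/* unencoded_offset must be strictly less than unencoded_len. */
	if (e->unencoded_offset >= e->unencoded_len)
		return -EINVAL;

	avail = e->unencoded_len - e->unencoded_offset;
	s->skip = e->unencoded_offset;
	s->take = e->len < avail ? e->len : avail;
	s->zero_fill = e->len - s->take; /* len may exceed what is available */
	s->discard = avail - s->take;
	return 0;
}
```

For the man page's example (len 300, unencoded_len 1000,
unencoded_offset 600), this yields skip 600, take 300 and discard 100,
matching "the first 600 and last 100 bytes should be ignored".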

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
  2019-10-18 22:55     ` Omar Sandoval
  2019-10-18 23:33       ` Omar Sandoval
@ 2019-10-21 13:14       ` David Sterba
  2019-10-21 18:05         ` Omar Sandoval
  1 sibling, 1 reply; 42+ messages in thread
From: David Sterba @ 2019-10-21 13:14 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Nikolay Borisov, linux-btrfs, linux-fsdevel, kernel-team,
	Dave Chinner, Jann Horn, linux-api

On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> > > +	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > 
> > nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)
> 
> disk_num_bytes is a u64, so that would expand to a 64-bit division. The
> compiler is probably smart enough to optimize it to a shift, but I
> didn't want to rely on that, because that would cause build failures on
> 32-bit.

There are several DIV_ROUND_UP(u64, PAGE_SIZE) in btrfs code, no build
breakages have been reported so far, you can use it.
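
For reference, the two spellings under discussion compute the same value. The
sketch below is a userspace illustration only: SKETCH_PAGE_SHIFT and
SKETCH_PAGE_SIZE stand in for the kernel's PAGE_SHIFT and PAGE_SIZE (4 KiB
pages are assumed), and SKETCH_DIV_ROUND_UP copies the usual
(n + d - 1) / d definition. Since the divisor is a constant power of two, a
compiler would normally turn the division into a shift, which is presumably
why the existing btrfs uses have not caused 32-bit build failures.

```c
#include <stdint.h>

/* Stand-ins for the kernel's PAGE_SHIFT/PAGE_SIZE; 4 KiB pages assumed. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE (1ULL << SKETCH_PAGE_SHIFT)

/* Same shape as the kernel's DIV_ROUND_UP(). */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Open-coded form from the patch: add PAGE_SIZE - 1, then shift. */
static uint64_t nr_pages_shift(uint64_t disk_num_bytes)
{
	return (disk_num_bytes + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Suggested form: 64-bit division by a constant power of two. */
static uint64_t nr_pages_div(uint64_t disk_num_bytes)
{
	return SKETCH_DIV_ROUND_UP(disk_num_bytes, SKETCH_PAGE_SIZE);
}
```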

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
  2019-10-21 13:14       ` David Sterba
@ 2019-10-21 18:05         ` Omar Sandoval
  0 siblings, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-21 18:05 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, linux-btrfs, linux-fsdevel,
	kernel-team, Dave Chinner, Jann Horn, linux-api

On Mon, Oct 21, 2019 at 03:14:52PM +0200, David Sterba wrote:
> On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> > > > +	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > 
> > > nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)
> > 
> > disk_num_bytes is a u64, so that would expand to a 64-bit division. The
> > compiler is probably smart enough to optimize it to a shift, but I
> > didn't want to rely on that, because that would cause build failures on
> > 32-bit.
> 
> There are several DIV_ROUND_UP(u64, PAGE_SIZE) in btrfs code, no build
> breakages have been reported so far, you can use it.

Good to know, I'll fix both places I'm doing this, then.

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-15 18:42 ` [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
  2019-10-16  9:50   ` Nikolay Borisov
  2019-10-19  5:01   ` Aleksa Sarai
@ 2019-10-21 18:28   ` Darrick J. Wong
  2019-10-21 18:38     ` Aleksa Sarai
  2019-10-21 19:07     ` Omar Sandoval
  2 siblings, 2 replies; 42+ messages in thread
From: Darrick J. Wong @ 2019-10-21 18:28 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team

On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
> 
> A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data
> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future. The remaining iovecs contain
> the encoded extent.
> 
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  include/linux/fs.h      | 14 +++++++
>  include/uapi/linux/fs.h | 26 ++++++++++++-
>  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
>  3 files changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..54681f21e05e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
>  
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> +
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
>   * that indicates that they should check the contents of the iovec are
> @@ -314,6 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC		(1 << 5)
>  #define IOCB_WRITE		(1 << 6)
>  #define IOCB_NOWAIT		(1 << 7)
> +#define IOCB_ENCODED		(1 << 8)
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
>  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
>  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> +				struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>  				struct file *file_out, loff_t pos_out,
>  				loff_t *count, unsigned int remap_flags);
> @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>  			return -EOPNOTSUPP;
>  		ki->ki_flags |= IOCB_NOWAIT;
>  	}
> +	if (flags & RWF_ENCODED) {
> +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> +			return -EOPNOTSUPP;
> +		ki->ki_flags |= IOCB_ENCODED;
> +	}
>  	if (flags & RWF_HIPRI)
>  		ki->ki_flags |= IOCB_HIPRI;
>  	if (flags & RWF_DSYNC)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..ed92a8a257cb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -284,6 +284,27 @@ struct fsxattr {
>  
>  typedef int __bitwise __kernel_rwf_t;
>  
> +enum {
> +	ENCODED_IOV_COMPRESSION_NONE,
> +	ENCODED_IOV_COMPRESSION_ZLIB,
> +	ENCODED_IOV_COMPRESSION_LZO,
> +	ENCODED_IOV_COMPRESSION_ZSTD,
> +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> +};
> +
> +enum {
> +	ENCODED_IOV_ENCRYPTION_NONE,
> +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> +};
> +
> +struct encoded_iov {
> +	__u64 len;
> +	__u64 unencoded_len;
> +	__u64 unencoded_offset;
> +	__u32 compression;
> +	__u32 encryption;

Can we add some must-be-zero padding space at the end here for whomever
comes along next wanting to add more encoding info?

(And maybe a manpage and some basic testing, to reiterate Dave...)

--D

> +};
> +
>  /* high priority request, poll if possible */
>  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
>  
> @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
>  
> +/* encoded (e.g., compressed or encrypted) IO */
> +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -			 RWF_APPEND)
> +			 RWF_APPEND | RWF_ENCODED)
>  
>  #endif /* _UAPI_LINUX_FS_H */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1146fcfa3215..d2e6d9caf353 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
>  	return 0;
>  }
>  
> -/*
> - * Performs necessary checks before doing a write
> - *
> - * Can adjust writing position or amount of bytes to write.
> - * Returns appropriate error code that caller should return or
> - * zero in case that write should be allowed.
> - */
> -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
> -	loff_t count;
> -	int ret;
>  
>  	if (IS_SWAPFILE(inode))
>  		return -ETXTBSY;
>  
> -	if (!iov_iter_count(from))
> +	if (!*count)
>  		return 0;
>  
>  	/* FIXME: this is for backwards compatibility with 2.4 */
> @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
>  		return -EINVAL;
>  
> -	count = iov_iter_count(from);
> -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> +}
> +
> +/*
> + * Performs necessary checks before doing a write
> + *
> + * Can adjust writing position or amount of bytes to write.
> + * Returns a negative errno or the new number of bytes to write.
> + */
> +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	loff_t count = iov_iter_count(from);
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
>  	if (ret)
>  		return ret;
>  
> @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  }
>  EXPORT_SYMBOL(generic_write_checks);
>  
> +int generic_encoded_write_checks(struct kiocb *iocb,
> +				 struct encoded_iov *encoded)
> +{
> +	loff_t count = encoded->unencoded_len;
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
> +	if (ret)
> +		return ret;
> +
> +	if (count != encoded->unencoded_len) {
> +		/*
> +		 * The write got truncated by generic_write_checks_common(). We
> +		 * can't do a partial encoded write.
> +		 */
> +		return -EFBIG;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(generic_encoded_write_checks);
> +
> +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> +		return -EINVAL;
> +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> +}
> +EXPORT_SYMBOL(check_encoded_read);
> +
> +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> +			 struct iov_iter *from)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> +		return -EINVAL;
> +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> +		return -EFAULT;
> +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> +		return -EINVAL;
> +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> +		return -EINVAL;
> +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> +		return -EINVAL;
> +	return 0;
> +}
> +EXPORT_SYMBOL(import_encoded_write);
> +
>  /*
>   * Performs necessary checks before doing a clone.
>   *
> -- 
> 2.23.0
> 
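
To make the calling convention from the commit message concrete, here is a
userspace sketch of how a caller might lay out the iovecs and pre-check the
header the way import_encoded_write() does in the quoted patch. It is an
illustration under assumptions, not the kernel code: the struct and enums are
local mirrors of the RFC's definitions, and encoded_header_check() and
encoded_iov_fill() are hypothetical helper names. No pwritev2() call is made
because RWF_ENCODED is only proposed here.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Local mirrors of the RFC's enums (hypothetical names). */
enum {
	SKETCH_COMPRESSION_NONE,
	SKETCH_COMPRESSION_ZLIB,
	SKETCH_COMPRESSION_LZO,
	SKETCH_COMPRESSION_ZSTD,
	SKETCH_COMPRESSION_TYPES = SKETCH_COMPRESSION_ZSTD,
};

enum {
	SKETCH_ENCRYPTION_NONE,
	SKETCH_ENCRYPTION_TYPES = SKETCH_ENCRYPTION_NONE,
};

/* Local mirror of the RFC's struct encoded_iov. */
struct encoded_iov_sketch {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* Same header checks as import_encoded_write() in the quoted patch. */
static int encoded_header_check(const struct encoded_iov_sketch *e)
{
	/* At least one encoding must be in use. */
	if (e->compression == SKETCH_COMPRESSION_NONE &&
	    e->encryption == SKETCH_ENCRYPTION_NONE)
		return -EINVAL;
	/* Unknown encodings are rejected. */
	if (e->compression > SKETCH_COMPRESSION_TYPES ||
	    e->encryption > SKETCH_ENCRYPTION_TYPES)
		return -EINVAL;
	/* The subrange must start inside the unencoded data. */
	if (e->unencoded_offset >= e->unencoded_len)
		return -EINVAL;
	return 0;
}

/* iov[0] carries the metadata header; iov[1] carries the encoded extent. */
static void encoded_iov_fill(struct iovec iov[2],
			     struct encoded_iov_sketch *hdr,
			     void *data, size_t data_len)
{
	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}
```

A caller would fill the header, run the check, and then pass the two iovecs
to pwritev2() with the proposed flag.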

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-21 18:28   ` Darrick J. Wong
@ 2019-10-21 18:38     ` Aleksa Sarai
  2019-10-21 19:00       ` Darrick J. Wong
  2019-10-21 19:07     ` Omar Sandoval
  1 sibling, 1 reply; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-21 18:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team

On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  include/linux/fs.h      | 14 +++++++
> >  include/uapi/linux/fs.h | 26 ++++++++++++-
> >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC		(1 << 5)
> >  #define IOCB_WRITE		(1 << 6)
> >  #define IOCB_NOWAIT		(1 << 7)
> > +#define IOCB_ENCODED		(1 << 8)
> >  
> >  struct kiocb {
> >  	struct file		*ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +				struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >  				struct file *file_out, loff_t pos_out,
> >  				loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> >  			return -EOPNOTSUPP;
> >  		ki->ki_flags |= IOCB_NOWAIT;
> >  	}
> > +	if (flags & RWF_ENCODED) {
> > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +			return -EOPNOTSUPP;
> > +		ki->ki_flags |= IOCB_ENCODED;
> > +	}
> >  	if (flags & RWF_HIPRI)
> >  		ki->ki_flags |= IOCB_HIPRI;
> >  	if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +	ENCODED_IOV_COMPRESSION_NONE,
> > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > +	ENCODED_IOV_COMPRESSION_LZO,
> > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +	ENCODED_IOV_ENCRYPTION_NONE,
> > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> 
> Can we add some must-be-zero padding space at the end here for whomever
> comes along next wanting to add more encoding info?

I would suggest copying the extension design of copy_struct_from_user().
Adding must-be-zero padding is a less ideal solution to the extension
problem than length-based extension.

Also (I might be wrong), shouldn't the __u64s be __aligned_u64 (as
with syscall structure arguments)?
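
As a rough illustration of length-based extension, the following userspace
sketch models the semantics of copy_struct_from_user() with plain memcpy():
a shorter (older) caller struct is zero-extended, and a longer (newer) one is
accepted only if the bytes the receiver does not know about are all zero.
copy_struct_sketch() and the enc_v1_sketch/enc_v2_sketch structs are
hypothetical and exist only for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first version of a header struct. */
struct enc_v1_sketch {
	uint64_t len;
	uint32_t compression;
	uint32_t encryption;
};

/* Hypothetical later version that appends one field. */
struct enc_v2_sketch {
	uint64_t len;
	uint32_t compression;
	uint32_t encryption;
	uint64_t new_field;
};

/*
 * Copy a caller struct of usize bytes into a receiver struct of ksize
 * bytes. Returns 0 on success or -E2BIG if the caller set a field that
 * the receiver does not understand.
 */
static int copy_struct_sketch(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *p = src;
	size_t i;

	/* Newer caller, older receiver: unknown trailing bytes must be zero. */
	for (i = ksize; i < usize; i++)
		if (p[i])
			return -E2BIG;

	memcpy(dst, src, size);

	/* Older caller, newer receiver: missing fields default to zero. */
	if (usize < ksize)
		memset((unsigned char *)dst + size, 0, ksize - size);
	return 0;
}
```

Compared with fixed must-be-zero padding, the caller's size carries the
version, so the struct can keep growing without reserving space up front.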

> (And maybe a manpage and some basic testing, to reiterate Dave...)
> 
> --D
> 
> > +};
> > +
> >  /* high priority request, poll if possible */
> >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> >  
> > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> >  /* per-IO O_APPEND */
> >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> >  
> > +/* encoded (e.g., compressed or encrypted) IO */
> > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > +
> >  /* mask of flags supported by the kernel */
> >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > -			 RWF_APPEND)
> > +			 RWF_APPEND | RWF_ENCODED)
> >  
> >  #endif /* _UAPI_LINUX_FS_H */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1146fcfa3215..d2e6d9caf353 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> >  	return 0;
> >  }
> >  
> > -/*
> > - * Performs necessary checks before doing a write
> > - *
> > - * Can adjust writing position or amount of bytes to write.
> > - * Returns appropriate error code that caller should return or
> > - * zero in case that write should be allowed.
> > - */
> > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file->f_mapping->host;
> > -	loff_t count;
> > -	int ret;
> >  
> >  	if (IS_SWAPFILE(inode))
> >  		return -ETXTBSY;
> >  
> > -	if (!iov_iter_count(from))
> > +	if (!*count)
> >  		return 0;
> >  
> >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> >  		return -EINVAL;
> >  
> > -	count = iov_iter_count(from);
> > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > +}
> > +
> > +/*
> > + * Performs necessary checks before doing a write
> > + *
> > + * Can adjust writing position or amount of bytes to write.
> > + * Returns a negative errno or the new number of bytes to write.
> > + */
> > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	loff_t count = iov_iter_count(from);
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  }
> >  EXPORT_SYMBOL(generic_write_checks);
> >  
> > +int generic_encoded_write_checks(struct kiocb *iocb,
> > +				 struct encoded_iov *encoded)
> > +{
> > +	loff_t count = encoded->unencoded_len;
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (count != encoded->unencoded_len) {
> > +		/*
> > +		 * The write got truncated by generic_write_checks_common(). We
> > +		 * can't do a partial encoded write.
> > +		 */
> > +		return -EFBIG;
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > +
> > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > +		return -EINVAL;
> > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > +}
> > +EXPORT_SYMBOL(check_encoded_read);
> > +
> > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > +			 struct iov_iter *from)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > +		return -EINVAL;
> > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > +		return -EFAULT;
> > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > +		return -EINVAL;
> > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > +		return -EINVAL;
> > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > +		return -EINVAL;
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(import_encoded_write);
> > +
> >  /*
> >   * Performs necessary checks before doing a clone.
> >   *
> > -- 
> > 2.23.0
> > 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

* Re: [PATCH man-pages] Document encoded I/O
  2019-10-21  6:18   ` [PATCH man-pages] Document encoded I/O Amir Goldstein
@ 2019-10-21 18:53     ` Omar Sandoval
  2019-10-22  6:40       ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-21 18:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Linux Btrfs, Dave Chinner, Jann Horn, Linux API,
	kernel-team, Theodore Tso

On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> CC: Ted
> 
> Whatever happened to the read/write ext4 encrypted data API?
> https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> 
> Can we learn anything from the ext4 experience to improve
> the new proposed API?

I wasn't aware of these patches, thanks for pointing them out. Ted, do
you have any thoughts about making this API work for fscrypt?

> On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > reference it.
> >
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  man2/fcntl.2       |  10 +-
> >  man2/open.2        |  13 ++
> >  man2/readv.2       |  46 +++++++
> >  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 365 insertions(+), 1 deletion(-)
> >  create mode 100644 man7/rwf_encoded.7
> >
> > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > index fce4f4c2b..76fe9cc6f 100644
> > --- a/man2/fcntl.2
> > +++ b/man2/fcntl.2
> > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> >  .BR O_ASYNC ,
> >  .BR O_DIRECT ,
> >  .BR O_NOATIME ,
> > +.BR O_NONBLOCK ,
> >  and
> > -.B O_NONBLOCK
> > +.B O_ENCODED
> >  flags.
> >  It is not possible to change the
> >  .BR O_DSYNC
> > @@ -1803,6 +1804,13 @@ Attempted to clear the
> >  flag on a file that has the append-only attribute set.
> >  .TP
> >  .B EPERM
> > +Attempted to set the
> > +.B O_ENCODED
> > +flag and the calling process did not have the
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.TP
> > +.B EPERM
> >  .I cmd
> >  was
> >  .BR F_ADD_SEALS ,
> > diff --git a/man2/open.2 b/man2/open.2
> > index b0f485b41..cdd3c549c 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -421,6 +421,14 @@ was followed by a call to
> >  .BR fdatasync (2)).
> >  .IR "See NOTES below" .
> >  .TP
> > +.B O_ENCODED
> > +Open the file with encoded I/O permissions;
> 
> 1. I find the name of the flag confusing.
> Yes, most people don't read documentation so carefully (or at all)
> so they will assume O_ENCODED will affect read/write or that it
> relates to RWF_ENCODED in a similar way that O_SYNC relates
> to RWF_SYNC (i.e. logical OR and not logical AND).
> 
> I am not good at naming and to prove it I will propose:
> O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED

Agreed, the name is misleading. I can't think of anything better than
O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
something better :)

> 2. While I see no harm in adding O_ flag to open(2) for this
> use case, I also don't see a major benefit in adding it.
> What if we only allowed setting the flag via fcntl(2) which returns
> an error on old kernels?
> Since unlike most O_ flags, O_ENCODED does NOT affect file
> i/o without additional opt-in flags, it is not standard anyway and
> therefore I find that setting it only via fcntl(2) is less error prone.

If I make this fcntl-only, then it probably shouldn't be through
F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
for open(), and also awkward to mix some non-O_ flag with O_ flags for
F_GETFL/F_SETFL). So that leaves a couple of options:

1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
   FD_CLOEXEC. That also silently ignores unknown flags, but as with the
   O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
   overkill to me.

However, both of these options are annoying to implement. Ideally, we
wouldn't have to add another flags field to struct file. But, to reuse
f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
with other O_ flags, and we'd probably want to hide it from F_GETFL. At
that point, it might as well be an O_ flag.

It seems to me that it's more trouble than it's worth to make this not
an O_ flag, but please let me know if you see a nice way to do so.
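
For context on option 1, this is roughly what the F_GETFD/F_SETFD route
would look like from userspace. It is a sketch only: FD_ALLOW_ENCODED_SKETCH
is a made-up bit value standing in for the hypothetical FD_ALLOW_ENCODED,
and fd_set_flags_sketch() is a helper invented for this example. Because
F_SETFD silently ignores bits it does not know, reading the flags back is
the only way for the caller to see what actually took effect.

```c
#include <fcntl.h>

/* Hypothetical bit; no such descriptor flag exists in current kernels. */
#define FD_ALLOW_ENCODED_SKETCH 0x2

/*
 * OR extra into the descriptor flags and return the flags the kernel
 * reports afterwards, or -1 on error.
 */
static int fd_set_flags_sketch(int fd, int extra)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	if (fcntl(fd, F_SETFD, flags | extra) < 0)
		return -1;
	/* Read back: unknown bits may have been dropped without an error. */
	return fcntl(fd, F_GETFD);
}
```

A caller could request FD_CLOEXEC | FD_ALLOW_ENCODED_SKETCH and then test
the returned value for each bit.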

> > +see
> > +.BR rwf_encoded (7).
> > +The caller must have the
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.TP
> >  .B O_EXCL
> >  Ensure that this call creates the file:
> >  if this flag is specified in conjunction with
> > @@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller was not privileged.
> >  The operation was prevented by a file seal; see
> >  .BR fcntl (2).
> >  .TP
> > +.B EPERM
> > +The
> > +.B O_ENCODED
> > +flag was specified, but the caller was not privileged.
> > +.TP
> >  .B EROFS
> >  .I pathname
> >  refers to a file on a read-only filesystem and write access was
> > diff --git a/man2/readv.2 b/man2/readv.2
> > index af27aa63e..aa60b980a 100644
> > --- a/man2/readv.2
> > +++ b/man2/readv.2
> > @@ -265,6 +265,11 @@ the data is always appended to the end of the file.
> >  However, if the
> >  .I offset
> >  argument is \-1, the current file offset is updated.
> > +.TP
> > +.BR RWF_ENCODED " (since Linux 5.6)"
> > +Read or write encoded (e.g., compressed) data.
> > +See
> > +.BR rwf_encoded (7).
> >  .SH RETURN VALUE
> >  On success,
> >  .BR readv (),
> > @@ -284,6 +289,13 @@ than requested (see
> >  and
> >  .BR write (2)).
> >  .PP
> > +If
> > +.B
> > +RWF_ENCODED
> > +was specified in
> > +.IR flags ,
> > +then the return value is the number of encoded bytes.
> > +.PP
> >  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
> >  .SH ERRORS
> >  The errors are as given for
> > @@ -314,6 +326,40 @@ is less than zero or greater than the permitted maximum.
> >  .TP
> >  .B EOPNOTSUPP
> >  An unknown flag is specified in \fIflags\fP.
> > +.TP
> > +.B EOPNOTSUPP
> > +.B RWF_ENCODED
> > +is specified in
> > +.I flags
> > +and the filesystem does not implement encoded I/O.
> > +.TP
> > +.B EPERM
> > +.B RWF_ENCODED
> > +is specified in
> > +.I flags
> > +and the file was not opened with the
> > +.B O_ENCODED
> > +flag.
> > +.PP
> > +.BR preadv2 ()
> > +can fail for the following reasons:
> > +.TP
> > +.B EFBIG
> > +.B RWF_ENCODED
> > +is specified in
> > +.I flags
> > +and buffers in
> > +.I iov
> > +were not big enough to return the encoded data.
> 
> I don't like it that EFBIG is returned for read.
> While EXXX values meaning is often very vague, EFBIG meaning
> is still quite consistent - it is always a write related error when trying
> to change i_size above fs/system limits.
> 
> In the case above, I find E2BIG much more appropriate.
> Although its original meaning was too long arg list, it already grew
> several cases where it generally means "buffer cannot hold the result"
> like the case with msgrcv(2) and listxattr(2).

Yes, E2BIG is a better fit, I'll change it.

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-21 18:38     ` Aleksa Sarai
@ 2019-10-21 19:00       ` Darrick J. Wong
  2019-10-22  1:37         ` Aleksa Sarai
  2019-10-22  2:02         ` Aleksa Sarai
  0 siblings, 2 replies; 42+ messages in thread
From: Darrick J. Wong @ 2019-10-21 19:00 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team

On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > Btrfs supports transparent compression: data written by the user can be
> > > compressed when written to disk and decompressed when read back.
> > > However, we'd like to add an interface to write pre-compressed data
> > > directly to the filesystem, and the matching interface to read
> > > compressed data without decompressing it. This adds support for
> > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > 
> > > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > is used for metadata: namely, the compression algorithm, unencoded
> > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > should be used (needed for truncated or hole-punched extents and when
> > > reading in the middle of an extent). For reads, the filesystem returns
> > > this information; for writes, the caller provides it to the filesystem.
> > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > used to extend the interface in the future. The remaining iovecs contain
> > > the encoded extent.
> > > 
> > > Filesystems must indicate that they support encoded writes by setting
> > > FMODE_ENCODED_IO in ->file_open().
> > > 
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  include/linux/fs.h      | 14 +++++++
> > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index e0d909d35763..54681f21e05e 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > >  /* File does not contribute to nr_files count */
> > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > >  
> > > +/* File supports encoded IO */
> > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > +
> > >  /*
> > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > >   * that indicates that they should check the contents of the iovec are
> > > @@ -314,6 +317,7 @@ enum rw_hint {
> > >  #define IOCB_SYNC		(1 << 5)
> > >  #define IOCB_WRITE		(1 << 6)
> > >  #define IOCB_NOWAIT		(1 << 7)
> > > +#define IOCB_ENCODED		(1 << 8)
> > >  
> > >  struct kiocb {
> > >  	struct file		*ki_filp;
> > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > +struct encoded_iov;
> > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > +				struct iov_iter *);
> > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > >  				struct file *file_out, loff_t pos_out,
> > >  				loff_t *count, unsigned int remap_flags);
> > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > >  			return -EOPNOTSUPP;
> > >  		ki->ki_flags |= IOCB_NOWAIT;
> > >  	}
> > > +	if (flags & RWF_ENCODED) {
> > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > +			return -EOPNOTSUPP;
> > > +		ki->ki_flags |= IOCB_ENCODED;
> > > +	}
> > >  	if (flags & RWF_HIPRI)
> > >  		ki->ki_flags |= IOCB_HIPRI;
> > >  	if (flags & RWF_DSYNC)
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index 379a612f8f1d..ed92a8a257cb 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -284,6 +284,27 @@ struct fsxattr {
> > >  
> > >  typedef int __bitwise __kernel_rwf_t;
> > >  
> > > +enum {
> > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > +};
> > > +
> > > +enum {
> > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > +};
> > > +
> > > +struct encoded_iov {
> > > +	__u64 len;
> > > +	__u64 unencoded_len;
> > > +	__u64 unencoded_offset;
> > > +	__u32 compression;
> > > +	__u32 encryption;
> > 
> > Can we add some must-be-zero padding space at the end here for whomever
> > comes along next wanting to add more encoding info?
> 
> > I would suggest copying the extension design of copy_struct_from_user().
> Adding must-be-zero padding is a less-ideal solution to the extension
> problem than length-based extension.

Come to think of it, you /do/ have to specify iov_len so... yeah, do
that instead; we can always extend the structure later.
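
To spell out the length-based scheme: the rules copy_struct_from_user()
applies can be sketched in plain userspace C roughly as below. This is only
an illustration, not kernel code. copy_struct_sketch() and its parameter
names are made up for the example, and memcpy() stands in for the real copy
from userspace. Note that the patch as posted is stricter than this and
rejects any iov_len other than sizeof(struct encoded_iov).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: mimics the size-negotiation rules of
 * copy_struct_from_user() using ordinary memory. "ksize" is the struct size
 * this code knows about; "usize" is the size the caller passed in.
 */
static int copy_struct_sketch(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *p = src;
	size_t i;

	assert(dst != NULL && src != NULL);

	if (usize > ksize) {
		/* Newer caller: every field we don't know about must be zero. */
		for (i = ksize; i < usize; i++) {
			if (p[i])
				return -E2BIG;
		}
	} else if (usize < ksize) {
		/* Older caller: fields it doesn't know about default to zero. */
		memset((unsigned char *)dst + size, 0, ksize - size);
	}
	memcpy(dst, src, size);
	return 0;
}
```

With rules like these, an old binary keeps working against a newer kernel,
and a newer binary works against an older kernel as long as it leaves the
new fields zeroed.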

> Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
> with syscall structure arguments)?

<shrug> No idea, that's the first I've heard of that type and it doesn't
seem to be used by the fs code.  Why would we care about alignment for
an incore structure?

--D

> 
> > (And maybe a manpage and some basic testing, to reiterate Dave...)
> > 
> > --D
> > 
> > > +};
> > > +
> > >  /* high priority request, poll if possible */
> > >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> > >  
> > > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> > >  /* per-IO O_APPEND */
> > >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> > >  
> > > +/* encoded (e.g., compressed or encrypted) IO */
> > > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > > +
> > >  /* mask of flags supported by the kernel */
> > >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > > -			 RWF_APPEND)
> > > +			 RWF_APPEND | RWF_ENCODED)
> > >  
> > >  #endif /* _UAPI_LINUX_FS_H */
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 1146fcfa3215..d2e6d9caf353 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> > >  	return 0;
> > >  }
> > >  
> > > -/*
> > > - * Performs necessary checks before doing a write
> > > - *
> > > - * Can adjust writing position or amount of bytes to write.
> > > - * Returns appropriate error code that caller should return or
> > > - * zero in case that write should be allowed.
> > > - */
> > > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> > >  {
> > >  	struct file *file = iocb->ki_filp;
> > >  	struct inode *inode = file->f_mapping->host;
> > > -	loff_t count;
> > > -	int ret;
> > >  
> > >  	if (IS_SWAPFILE(inode))
> > >  		return -ETXTBSY;
> > >  
> > > -	if (!iov_iter_count(from))
> > > +	if (!*count)
> > >  		return 0;
> > >  
> > >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> > >  		return -EINVAL;
> > >  
> > > -	count = iov_iter_count(from);
> > > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > > +}
> > > +
> > > +/*
> > > + * Performs necessary checks before doing a write
> > > + *
> > > + * Can adjust writing position or amount of bytes to write.
> > > + * Returns a negative errno or the new number of bytes to write.
> > > + */
> > > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > +{
> > > +	loff_t count = iov_iter_count(from);
> > > +	int ret;
> > > +
> > > +	ret = generic_write_checks_common(iocb, &count);
> > >  	if (ret)
> > >  		return ret;
> > >  
> > > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  }
> > >  EXPORT_SYMBOL(generic_write_checks);
> > >  
> > > +int generic_encoded_write_checks(struct kiocb *iocb,
> > > +				 struct encoded_iov *encoded)
> > > +{
> > > +	loff_t count = encoded->unencoded_len;
> > > +	int ret;
> > > +
> > > +	ret = generic_write_checks_common(iocb, &count);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (count != encoded->unencoded_len) {
> > > +		/*
> > > +		 * The write got truncated by generic_write_checks_common(). We
> > > +		 * can't do a partial encoded write.
> > > +		 */
> > > +		return -EFBIG;
> > > +	}
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > > +
> > > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > > +{
> > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > +		return -EPERM;
> > > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > > +		return -EINVAL;
> > > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > > +}
> > > +EXPORT_SYMBOL(check_encoded_read);
> > > +
> > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > +			 struct iov_iter *from)
> > > +{
> > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > +		return -EPERM;
> > > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > +		return -EINVAL;
> > > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > +		return -EFAULT;
> > > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > > +		return -EINVAL;
> > > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > +		return -EINVAL;
> > > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > > +		return -EINVAL;
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL(import_encoded_write);
> > > +
> > >  /*
> > >   * Performs necessary checks before doing a clone.
> > >   *
> > > -- 
> > > 2.23.0
> > > 
> 
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data
  2019-10-20 23:05   ` [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Dave Chinner
@ 2019-10-21 19:04     ` Omar Sandoval
  0 siblings, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-21 19:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-btrfs, Jann Horn, linux-api, kernel-team

On Mon, Oct 21, 2019 at 10:05:01AM +1100, Dave Chinner wrote:
> On Tue, Oct 15, 2019 at 11:42:38AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Hello,
> > 
> > This series adds an API for reading compressed data on a filesystem
> > without decompressing it as well as support for writing compressed data
> > directly to the filesystem. It is based on my previous series which
> > added a Btrfs-specific ioctl [1], but it is now an extension to
> > preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
> > man page patch describing the API in detail. Test cases and example
> > programs are available [3].
> > 
> > The use case that I have in mind is Btrfs send/receive: currently, when
> > sending data from one compressed filesystem to another, the sending side
> > decompresses the data and the receiving side recompresses it before
> > writing it out. This is wasteful and can be avoided if we can just send
> > and write compressed extents. The send part will be implemented in a
> > separate series, as this API can stand alone.
> > 
> > Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
> > Patch 4 implements encoded reads for Btrfs, and patch 5 implements
> > encoded writes.
> > 
> > Changes from v1 [4]:
> > 
> > - Encoded reads are now also implemented.
> > - The encoded_iov structure now includes metadata for referring to a
> >   subset of decoded data. This is required to handle certain cases where
> >   a compressed extent is truncated, hole punched, or otherwise sliced up
> >   and Btrfs chooses to reflect this in metadata instead of decompressing
> >   the whole extent and rewriting the pieces. We call these "bookend
> >   extents" in Btrfs, but any filesystem supporting transparent encoding
> >   is likely to have a similar concept.
> 
> Where's the in-kernel documentation for this API? You're encoding a
> specific set of behaviours into the user API, so this needs a whole
> heap of documentation in the generic code to describe how it works
> so that other filesystems implementing it have a well-defined guideline
> to what they need to support.

The man page I sent is quite detailed, but sure, I can add the relevant
information to the generic code as well.

> Also, I don't see any test code for this -

It's in the cover letter: https://github.com/osandov/xfstests/tree/rwf-encoded

I haven't sent those patches up because it's tedious to rework and
resend them for each little tweak we make to the API.

> can you please add
> support for RWF_ENCODED to xfs_io and write a suite of unit tests
> for fstests that exercise the user API fully?

Reading requires filesystem-specific decoding, and I wasn't sure if that
would be a good fit for xfs_io. Alternatively, it could dump the raw
buffer to stdout, but whatever interprets it also needs the metadata, so
there'd need to be some sort of protocol between xfs_io and whatever
interprets it. I added a btrfs_read_encoded program in my xfstests
branch above instead. It should be easy enough to move the encoded_write
test program to xfs_io pwrite, though.

> Given our history of
> screwing up new user APIs, this absolutely should not be merged
> until there is a full set of generic unit tests written and reviewed
> for it and support has been added to fsstress, fsx, and other test
> utilities to fuzz and stress the implementation as part of normal
> day-to-day filesystem development...

Sure thing, I'll add support to those tools once the API is less in
flux.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-21 18:28   ` Darrick J. Wong
  2019-10-21 18:38     ` Aleksa Sarai
@ 2019-10-21 19:07     ` Omar Sandoval
  1 sibling, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-21 19:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team

On Mon, Oct 21, 2019 at 11:28:06AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  include/linux/fs.h      | 14 +++++++
> >  include/uapi/linux/fs.h | 26 ++++++++++++-
> >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC		(1 << 5)
> >  #define IOCB_WRITE		(1 << 6)
> >  #define IOCB_NOWAIT		(1 << 7)
> > +#define IOCB_ENCODED		(1 << 8)
> >  
> >  struct kiocb {
> >  	struct file		*ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +				struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >  				struct file *file_out, loff_t pos_out,
> >  				loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> >  			return -EOPNOTSUPP;
> >  		ki->ki_flags |= IOCB_NOWAIT;
> >  	}
> > +	if (flags & RWF_ENCODED) {
> > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +			return -EOPNOTSUPP;
> > +		ki->ki_flags |= IOCB_ENCODED;
> > +	}
> >  	if (flags & RWF_HIPRI)
> >  		ki->ki_flags |= IOCB_HIPRI;
> >  	if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +	ENCODED_IOV_COMPRESSION_NONE,
> > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > +	ENCODED_IOV_COMPRESSION_LZO,
> > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +	ENCODED_IOV_ENCRYPTION_NONE,
> > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> 
> Can we add some must-be-zero padding space at the end here for whomever
> comes along next wanting to add more encoding info?

From the commit message:

iov_len must be set to sizeof(struct encoded_iov), which can be used to
extend the interface in the future.
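
To make the layout concrete, here is a rough userspace sketch of how a
caller might fill in the iovec array for an encoded write under this RFC.
The struct is copied from the patch above rather than taken from an
installed header, fill_encoded_write() and SKETCH_COMPRESSION_ZLIB are
hypothetical names, and no syscall is made. The treatment of len (equal to
unencoded_len when the whole extent is used) is an assumption based on the
commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local copy of the struct from the RFC; it is not in any released header. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/* Three 64-bit and two 32-bit fields: 32 bytes, no implicit padding. */
static_assert(sizeof(struct encoded_iov) == 32, "unexpected struct size");

/* Numeric value of ENCODED_IOV_COMPRESSION_ZLIB in the RFC's enum. */
#define SKETCH_COMPRESSION_ZLIB 1

/*
 * Hypothetical helper: describe one whole zlib-compressed extent.
 * iov[0] carries the metadata, iov[1] the compressed bytes.
 * Assumption: for an extent used in full, len equals unencoded_len and
 * unencoded_offset is 0.
 */
static void fill_encoded_write(struct iovec iov[2], struct encoded_iov *meta,
			       void *data, size_t data_len,
			       uint64_t unencoded_len)
{
	memset(meta, 0, sizeof(*meta));
	meta->len = unencoded_len;
	meta->unencoded_len = unencoded_len;
	meta->unencoded_offset = 0;
	meta->compression = SKETCH_COMPRESSION_ZLIB;
	/* encryption stays 0, i.e. ENCODED_IOV_ENCRYPTION_NONE */

	iov[0].iov_base = meta;
	iov[0].iov_len = sizeof(*meta);	/* must be exactly the struct size */
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}
```

The array would then be handed to pwritev2() with RWF_ENCODED on a file
opened with O_ENCODED. That call is left out here because both flags are
defined only by this series.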

> (And maybe a manpage and some basic testing, to reiterate Dave...)

I sent a man page as part of this thread:

https://lore.kernel.org/linux-btrfs/c7e8f93596fee7bb818dc0edf29f484036be1abb.1571164851.git.osandov@fb.com/

See my reply to Dave: I have tests in my xfstests repo that I haven't
sent yet.

> --D

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-21 19:00       ` Darrick J. Wong
@ 2019-10-22  1:37         ` Aleksa Sarai
  2019-10-30 22:21           ` Omar Sandoval
  2019-10-22  2:02         ` Aleksa Sarai
  1 sibling, 1 reply; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-22  1:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team

[-- Attachment #1: Type: text/plain, Size: 11816 bytes --]

On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > Btrfs supports transparent compression: data written by the user can be
> > > > compressed when written to disk and decompressed when read back.
> > > > However, we'd like to add an interface to write pre-compressed data
> > > > directly to the filesystem, and the matching interface to read
> > > > compressed data without decompressing it. This adds support for
> > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > 
> > > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > should be used (needed for truncated or hole-punched extents and when
> > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > this information; for writes, the caller provides it to the filesystem.
> > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > used to extend the interface in the future. The remaining iovecs contain
> > > > the encoded extent.
> > > > 
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > > 
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > >  include/linux/fs.h      | 14 +++++++
> > > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index e0d909d35763..54681f21e05e 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > > >  /* File does not contribute to nr_files count */
> > > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > > >  
> > > > +/* File supports encoded IO */
> > > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > > +
> > > >  /*
> > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > >   * that indicates that they should check the contents of the iovec are
> > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > >  #define IOCB_SYNC		(1 << 5)
> > > >  #define IOCB_WRITE		(1 << 6)
> > > >  #define IOCB_NOWAIT		(1 << 7)
> > > > +#define IOCB_ENCODED		(1 << 8)
> > > >  
> > > >  struct kiocb {
> > > >  	struct file		*ki_filp;
> > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > +struct encoded_iov;
> > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > +				struct iov_iter *);
> > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > >  				struct file *file_out, loff_t pos_out,
> > > >  				loff_t *count, unsigned int remap_flags);
> > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > >  			return -EOPNOTSUPP;
> > > >  		ki->ki_flags |= IOCB_NOWAIT;
> > > >  	}
> > > > +	if (flags & RWF_ENCODED) {
> > > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > +			return -EOPNOTSUPP;
> > > > +		ki->ki_flags |= IOCB_ENCODED;
> > > > +	}
> > > >  	if (flags & RWF_HIPRI)
> > > >  		ki->ki_flags |= IOCB_HIPRI;
> > > >  	if (flags & RWF_DSYNC)
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > >  
> > > >  typedef int __bitwise __kernel_rwf_t;
> > > >  
> > > > +enum {
> > > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +};
> > > > +
> > > > +enum {
> > > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > > +};
> > > > +
> > > > +struct encoded_iov {
> > > > +	__u64 len;
> > > > +	__u64 unencoded_len;
> > > > +	__u64 unencoded_offset;
> > > > +	__u32 compression;
> > > > +	__u32 encryption;
> > > 
> > > Can we add some must-be-zero padding space at the end here for whomever
> > > comes along next wanting to add more encoding info?
> > 
> > I would suggest copying the extension design of copy_struct_from_user().
> > Adding must-be-zero padding is a less-ideal solution to the extension
> > problem than length-based extension.
> 
> Come to think of it, you /do/ have to specify iov_len so... yeah, do
> that instead; we can always extend the structure later.
> 
> > Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
> > with syscall structure arguments)?
> 
> <shrug> No idea, that's the first I've heard of that type and it doesn't
> seem to be used by the fs code.  Why would we care about alignment for
> an incore structure?

When passing u64s from userspace, it's generally considered a good idea
to use __aligned_u64 -- the main reason is that 32-bit userspace on a
64-bit kernel will use different structure alignment for 64-bit fields.

This means you'd need to implement a bunch of COMPAT_SYSCALL-like
handling for that case. It's much simpler to use __aligned_u64 (and on
the plus side I don't think you need to add any fields to ensure the
padding is zero).
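
A small userspace sketch of the layout problem, for illustration. The two
structs and every name in them are invented for this example (they are not
from the patch), and the typedef only mirrors the shape of the
__aligned_u64 definition in linux/types.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the kernel's __aligned_u64 typedef. */
typedef uint64_t __attribute__((aligned(8))) sketch_aligned_u64;

/* Hypothetical struct: a 32-bit field followed by a plain 64-bit field. */
struct sketch_plain {
	uint32_t flags;
	uint64_t len;		/* offset 4 on i386, offset 8 on x86-64 */
};

/* Same fields, but the 64-bit member is forced to 8-byte alignment. */
struct sketch_aligned {
	uint32_t flags;
	sketch_aligned_u64 len;	/* offset 8 regardless of the ABI */
};

/* These hold for both 32-bit and 64-bit builds. */
static_assert(offsetof(struct sketch_aligned, len) == 8,
	      "aligned member must sit at offset 8");
static_assert(sizeof(struct sketch_aligned) == 16,
	      "aligned struct must be 16 bytes");

/* Offset of the plain member; this one depends on the target ABI. */
static size_t plain_len_offset(void)
{
	return offsetof(struct sketch_plain, len);
}

/* Offset of the aligned member; fixed by the aligned(8) attribute. */
static size_t aligned_len_offset(void)
{
	return offsetof(struct sketch_aligned, len);
}
```

Building this for a 32-bit x86 target and for a 64-bit one should give
different values from plain_len_offset(), while aligned_len_offset() stays
at 8. That mismatch is what compat handling would otherwise have to paper
over.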

> > 
> > > (And maybe a manpage and some basic testing, to reiterate Dave...)
> > > 
> > > --D
> > > 
> > > > +};
> > > > +
> > > >  /* high priority request, poll if possible */
> > > >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> > > >  
> > > > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> > > >  /* per-IO O_APPEND */
> > > >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> > > >  
> > > > +/* encoded (e.g., compressed or encrypted) IO */
> > > > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > > > +
> > > >  /* mask of flags supported by the kernel */
> > > >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > > > -			 RWF_APPEND)
> > > > +			 RWF_APPEND | RWF_ENCODED)
> > > >  
> > > >  #endif /* _UAPI_LINUX_FS_H */
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 1146fcfa3215..d2e6d9caf353 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -/*
> > > > - * Performs necessary checks before doing a write
> > > > - *
> > > > - * Can adjust writing position or amount of bytes to write.
> > > > - * Returns appropriate error code that caller should return or
> > > > - * zero in case that write should be allowed.
> > > > - */
> > > > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> > > >  {
> > > >  	struct file *file = iocb->ki_filp;
> > > >  	struct inode *inode = file->f_mapping->host;
> > > > -	loff_t count;
> > > > -	int ret;
> > > >  
> > > >  	if (IS_SWAPFILE(inode))
> > > >  		return -ETXTBSY;
> > > >  
> > > > -	if (!iov_iter_count(from))
> > > > +	if (!*count)
> > > >  		return 0;
> > > >  
> > > >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > > > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> > > >  		return -EINVAL;
> > > >  
> > > > -	count = iov_iter_count(from);
> > > > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > > > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Performs necessary checks before doing a write
> > > > + *
> > > > + * Can adjust writing position or amount of bytes to write.
> > > > + * Returns a negative errno or the new number of bytes to write.
> > > > + */
> > > > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +{
> > > > +	loff_t count = iov_iter_count(from);
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > >  	if (ret)
> > > >  		return ret;
> > > >  
> > > > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  }
> > > >  EXPORT_SYMBOL(generic_write_checks);
> > > >  
> > > > +int generic_encoded_write_checks(struct kiocb *iocb,
> > > > +				 struct encoded_iov *encoded)
> > > > +{
> > > > +	loff_t count = encoded->unencoded_len;
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (count != encoded->unencoded_len) {
> > > > +		/*
> > > > +		 * The write got truncated by generic_write_checks_common(). We
> > > > +		 * can't do a partial encoded write.
> > > > +		 */
> > > > +		return -EFBIG;
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > > > +
> > > > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > > > +		return -EINVAL;
> > > > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > > > +}
> > > > +EXPORT_SYMBOL(check_encoded_read);
> > > > +
> > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > +			 struct iov_iter *from)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > +		return -EINVAL;
> > > > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > +		return -EFAULT;
> > > > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > > > +		return -EINVAL;
> > > > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > +		return -EINVAL;
> > > > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > > > +		return -EINVAL;
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(import_encoded_write);
> > > > +
> > > >  /*
> > > >   * Performs necessary checks before doing a clone.
> > > >   *

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-21 19:00       ` Darrick J. Wong
  2019-10-22  1:37         ` Aleksa Sarai
@ 2019-10-22  2:02         ` Aleksa Sarai
  2019-10-30 22:26           ` Omar Sandoval
  1 sibling, 1 reply; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-22  2:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team

[-- Attachment #1: Type: text/plain, Size: 11806 bytes --]

On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > Btrfs supports transparent compression: data written by the user can be
> > > > compressed when written to disk and decompressed when read back.
> > > > However, we'd like to add an interface to write pre-compressed data
> > > > directly to the filesystem, and the matching interface to read
> > > > compressed data without decompressing it. This adds support for
> > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > 
> > > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > should be used (needed for truncated or hole-punched extents and when
> > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > this information; for writes, the caller provides it to the filesystem.
> > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > used to extend the interface in the future. The remaining iovecs contain
> > > > the encoded extent.
> > > > 
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > > 
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > >  include/linux/fs.h      | 14 +++++++
> > > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index e0d909d35763..54681f21e05e 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > > >  /* File does not contribute to nr_files count */
> > > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > > >  
> > > > +/* File supports encoded IO */
> > > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > > +
> > > >  /*
> > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > >   * that indicates that they should check the contents of the iovec are
> > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > >  #define IOCB_SYNC		(1 << 5)
> > > >  #define IOCB_WRITE		(1 << 6)
> > > >  #define IOCB_NOWAIT		(1 << 7)
> > > > +#define IOCB_ENCODED		(1 << 8)
> > > >  
> > > >  struct kiocb {
> > > >  	struct file		*ki_filp;
> > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > +struct encoded_iov;
> > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > +				struct iov_iter *);
> > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > >  				struct file *file_out, loff_t pos_out,
> > > >  				loff_t *count, unsigned int remap_flags);
> > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > >  			return -EOPNOTSUPP;
> > > >  		ki->ki_flags |= IOCB_NOWAIT;
> > > >  	}
> > > > +	if (flags & RWF_ENCODED) {
> > > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > +			return -EOPNOTSUPP;
> > > > +		ki->ki_flags |= IOCB_ENCODED;
> > > > +	}
> > > >  	if (flags & RWF_HIPRI)
> > > >  		ki->ki_flags |= IOCB_HIPRI;
> > > >  	if (flags & RWF_DSYNC)
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > >  
> > > >  typedef int __bitwise __kernel_rwf_t;
> > > >  
> > > > +enum {
> > > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +};
> > > > +
> > > > +enum {
> > > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > > +};
> > > > +
> > > > +struct encoded_iov {
> > > > +	__u64 len;
> > > > +	__u64 unencoded_len;
> > > > +	__u64 unencoded_offset;
> > > > +	__u32 compression;
> > > > +	__u32 encryption;
> > > 
> > > Can we add some must-be-zero padding space at the end here for whomever
> > > comes along next wanting to add more encoding info?
> > 
> > I would suggest to copy the extension design of copy_struct_from_user().
> > Adding must-be-zero padding is a less-ideal solution to the extension
> > problem than length-based extension.
> 
> Come to think of it, you /do/ have to specify iov_len so... yeah, do
> that instead; we can always extend the structure later.

Just to clarify -- if we want to make the interface forward-compatible
from the outset (programs built 4 years from now running on 5.5), we
will need to implement this in the original merge. Otherwise userspace
will need to handle backwards-compatibility themselves once new features
are added.

@Omar: If it'd make your life easier, I can send some draft patches
	   which port copy_struct_from_user() to iovec-land.
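
For reference, a minimal userspace sketch of what length-based extension
means here. It models the semantics of copy_struct_from_user(); the helper
name copy_struct_compat() is made up for illustration, and this is not the
proposed kernel code. ksize is the struct size the "kernel" side knows and
usize is the size the caller passed (for encoded I/O that would be
iov[0].iov_len):

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of length-based struct extension.
 * dst/ksize: the struct as this "kernel" knows it.
 * src/usize: the struct as the caller built it.
 */
static int copy_struct_compat(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *p = src;
	size_t i;

	/* Newer caller, older kernel: unknown trailing bytes must be zero. */
	for (i = ksize; i < usize; i++) {
		if (p[i])
			return -E2BIG;
	}

	/* Older caller, newer kernel: fields the caller omitted read as 0. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this shape, a program built against a larger struct keeps working on an
older kernel as long as it leaves the new fields zeroed, and an older program
keeps working on a newer kernel without any explicit padding field.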

> > Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
> > with syscall structure arguments)?
> 
> <shrug> No idea, that's the first I've heard of that type and it doesn't
> seem to be used by the fs code.  Why would we care about alignment for
> an incore structure?
> 
> --D
> 
> > 
> > > (And maybe a manpage and some basic testing, to reiterate Dave...)
> > > 
> > > --D
> > > 
> > > > +};
> > > > +
> > > >  /* high priority request, poll if possible */
> > > >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> > > >  
> > > > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> > > >  /* per-IO O_APPEND */
> > > >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> > > >  
> > > > +/* encoded (e.g., compressed or encrypted) IO */
> > > > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > > > +
> > > >  /* mask of flags supported by the kernel */
> > > >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > > > -			 RWF_APPEND)
> > > > +			 RWF_APPEND | RWF_ENCODED)
> > > >  
> > > >  #endif /* _UAPI_LINUX_FS_H */
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 1146fcfa3215..d2e6d9caf353 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -/*
> > > > - * Performs necessary checks before doing a write
> > > > - *
> > > > - * Can adjust writing position or amount of bytes to write.
> > > > - * Returns appropriate error code that caller should return or
> > > > - * zero in case that write should be allowed.
> > > > - */
> > > > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> > > >  {
> > > >  	struct file *file = iocb->ki_filp;
> > > >  	struct inode *inode = file->f_mapping->host;
> > > > -	loff_t count;
> > > > -	int ret;
> > > >  
> > > >  	if (IS_SWAPFILE(inode))
> > > >  		return -ETXTBSY;
> > > >  
> > > > -	if (!iov_iter_count(from))
> > > > +	if (!*count)
> > > >  		return 0;
> > > >  
> > > >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > > > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> > > >  		return -EINVAL;
> > > >  
> > > > -	count = iov_iter_count(from);
> > > > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > > > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Performs necessary checks before doing a write
> > > > + *
> > > > + * Can adjust writing position or amount of bytes to write.
> > > > + * Returns a negative errno or the new number of bytes to write.
> > > > + */
> > > > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +{
> > > > +	loff_t count = iov_iter_count(from);
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > >  	if (ret)
> > > >  		return ret;
> > > >  
> > > > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  }
> > > >  EXPORT_SYMBOL(generic_write_checks);
> > > >  
> > > > +int generic_encoded_write_checks(struct kiocb *iocb,
> > > > +				 struct encoded_iov *encoded)
> > > > +{
> > > > +	loff_t count = encoded->unencoded_len;
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (count != encoded->unencoded_len) {
> > > > +		/*
> > > > +		 * The write got truncated by generic_write_checks_common(). We
> > > > +		 * can't do a partial encoded write.
> > > > +		 */
> > > > +		return -EFBIG;
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > > > +
> > > > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > > > +		return -EINVAL;
> > > > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > > > +}
> > > > +EXPORT_SYMBOL(check_encoded_read);
> > > > +
> > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > +			 struct iov_iter *from)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > +		return -EINVAL;
> > > > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > +		return -EFAULT;
> > > > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > > > +		return -EINVAL;
> > > > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > +		return -EINVAL;
> > > > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > > > +		return -EINVAL;
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(import_encoded_write);
> > > > +
> > > >  /*
> > > >   * Performs necessary checks before doing a clone.
> > > >   *

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH man-pages] Document encoded I/O
  2019-10-21 18:53     ` Omar Sandoval
@ 2019-10-22  6:40       ` Amir Goldstein
  2019-10-23  4:44         ` Aleksa Sarai
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-10-22  6:40 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, Linux Btrfs, Dave Chinner, Jann Horn, Linux API,
	kernel-team, Theodore Tso

On Mon, Oct 21, 2019 at 9:54 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> > CC: Ted
> >
> > What ever happened to read/write ext4 encrypted data API?
> > https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> >
> > Can we learn anything from the ext4 experience to improve
> > the new proposed API?
>
> I wasn't aware of these patches, thanks for pointing them out. Ted, do
> you have any thoughts about making this API work for fscrypt?
>
> > On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
> > >
> > > From: Omar Sandoval <osandov@fb.com>
> > >
> > > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > > reference it.
> > >
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  man2/fcntl.2       |  10 +-
> > >  man2/open.2        |  13 ++
> > >  man2/readv.2       |  46 +++++++
> > >  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
> > >  4 files changed, 365 insertions(+), 1 deletion(-)
> > >  create mode 100644 man7/rwf_encoded.7
> > >
> > > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > > index fce4f4c2b..76fe9cc6f 100644
> > > --- a/man2/fcntl.2
> > > +++ b/man2/fcntl.2
> > > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> > >  .BR O_ASYNC ,
> > >  .BR O_DIRECT ,
> > >  .BR O_NOATIME ,
> > > +.BR O_NONBLOCK ,
> > >  and
> > > -.B O_NONBLOCK
> > > +.B O_ENCODED
> > >  flags.
> > >  It is not possible to change the
> > >  .BR O_DSYNC
> > > @@ -1803,6 +1804,13 @@ Attempted to clear the
> > >  flag on a file that has the append-only attribute set.
> > >  .TP
> > >  .B EPERM
> > > +Attempted to set the
> > > +.B O_ENCODED
> > > +flag and the calling process did not have the
> > > +.B CAP_SYS_ADMIN
> > > +capability.
> > > +.TP
> > > +.B EPERM
> > >  .I cmd
> > >  was
> > >  .BR F_ADD_SEALS ,
> > > diff --git a/man2/open.2 b/man2/open.2
> > > index b0f485b41..cdd3c549c 100644
> > > --- a/man2/open.2
> > > +++ b/man2/open.2
> > > @@ -421,6 +421,14 @@ was followed by a call to
> > >  .BR fdatasync (2)).
> > >  .IR "See NOTES below" .
> > >  .TP
> > > +.B O_ENCODED
> > > +Open the file with encoded I/O permissions;
> >
> > 1. I find the name of the flag confusing.
> > Yes, most people don't read documentation so carefully (or at all)
> > so they will assume O_ENCODED will affect read/write or that it
> > relates to RWF_ENCODED in a similar way that O_SYNC relates
> > to RWF_SYNC (i.e. logical OR and not logical AND).
> >
> > I am not good at naming and to prove it I will propose:
> > O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED
>
> Agreed, the name is misleading. I can't think of anything better than
> O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
> something better :)
>
> > 2. While I see no harm in adding O_ flag to open(2) for this
> > use case, I also don't see a major benefit in adding it.
> > What if we only allowed setting the flag via fcntl(2) which returns
> > an error on old kernels?
> > Since unlike most O_ flags, O_ENCODED does NOT affect file
> > i/o without additional opt-in flags, it is not standard anyway and
> > therefore I find that setting it only via fcntl(2) is less error prone.
>
> If I make this fcntl-only, then it probably shouldn't be through
> F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
> for open(), and also awkward to mix some non-O_ flag with O_ flags for
> F_GETFL/F_SETFL). So that leaves a couple of options:
>
> 1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
>    FD_CLOEXEC. That also silently ignores unknown flags, but as with the
>    O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
> 2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
>    overkill to me.
>
> However, both of these options are annoying to implement. Ideally, we
> wouldn't have to add another flags field to struct file. But, to reuse
> f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
> with other O_ flags, and we'd probably want to hide it from F_GETFL. At
> that point, it might as well be an O_ flag.
>
> It seems to me that it's more trouble than it's worth to make this not
> an O_ flag, but please let me know if you see a nice way to do so.
>

No, I see why you chose to add the flag to open(2).
I have no objection.

I once had a crazy thought about how to add new open flags
in a non-racy manner without adding a new syscall,
but as you wrote, this is not relevant for O_ALLOW_ENCODED.

Something like:

/*
 * Old kernels silently ignore unsupported open flags.
 * New kernels that get __O_CHECK_NEWFLAGS do
 * the proper checking for unsupported flags AND set the
 * flag __O_HAVE_NEWFLAGS.
 */
#define O_FLAG1 (__O_CHECK_NEWFLAGS|__O_FLAG1)
#define O_HAVE_FLAG1 (__O_HAVE_NEWFLAGS|__O_FLAG1)

fd = open(path, O_FLAG1);
if (fd < 0)
    return -errno;
flags = fcntl(fd, F_GETFL, 0);
if (flags < 0) {
    /* save errno before close() can overwrite it; don't leak fd */
    int err = -errno;

    close(fd);
    return err;
}
if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
    close(fd);
    return -EINVAL;
}

Not pretty, but hidden inside libc, end-users won't need to
be aware of this.

Cheers,
Amir.

* Re: [PATCH man-pages] Document encoded I/O
  2019-10-22  6:40       ` Amir Goldstein
@ 2019-10-23  4:44         ` Aleksa Sarai
  2019-10-23  6:06           ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-23  4:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Omar Sandoval, linux-fsdevel, Linux Btrfs, Dave Chinner,
	Jann Horn, Linux API, kernel-team, Theodore Tso

On 2019-10-22, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Oct 21, 2019 at 9:54 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> > > CC: Ted
> > >
> > > What ever happened to read/write ext4 encrypted data API?
> > > https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> > >
> > > Can we learn anything from the ext4 experience to improve
> > > the new proposed API?
> >
> > I wasn't aware of these patches, thanks for pointing them out. Ted, do
> > you have any thoughts about making this API work for fscrypt?
> >
> > > On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > >
> > > > From: Omar Sandoval <osandov@fb.com>
> > > >
> > > > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > > > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > > > reference it.
> > > >
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > >  man2/fcntl.2       |  10 +-
> > > >  man2/open.2        |  13 ++
> > > >  man2/readv.2       |  46 +++++++
> > > >  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  4 files changed, 365 insertions(+), 1 deletion(-)
> > > >  create mode 100644 man7/rwf_encoded.7
> > > >
> > > > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > > > index fce4f4c2b..76fe9cc6f 100644
> > > > --- a/man2/fcntl.2
> > > > +++ b/man2/fcntl.2
> > > > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> > > >  .BR O_ASYNC ,
> > > >  .BR O_DIRECT ,
> > > >  .BR O_NOATIME ,
> > > > +.BR O_NONBLOCK ,
> > > >  and
> > > > -.B O_NONBLOCK
> > > > +.B O_ENCODED
> > > >  flags.
> > > >  It is not possible to change the
> > > >  .BR O_DSYNC
> > > > @@ -1803,6 +1804,13 @@ Attempted to clear the
> > > >  flag on a file that has the append-only attribute set.
> > > >  .TP
> > > >  .B EPERM
> > > > +Attempted to set the
> > > > +.B O_ENCODED
> > > > +flag and the calling process did not have the
> > > > +.B CAP_SYS_ADMIN
> > > > +capability.
> > > > +.TP
> > > > +.B EPERM
> > > >  .I cmd
> > > >  was
> > > >  .BR F_ADD_SEALS ,
> > > > diff --git a/man2/open.2 b/man2/open.2
> > > > index b0f485b41..cdd3c549c 100644
> > > > --- a/man2/open.2
> > > > +++ b/man2/open.2
> > > > @@ -421,6 +421,14 @@ was followed by a call to
> > > >  .BR fdatasync (2)).
> > > >  .IR "See NOTES below" .
> > > >  .TP
> > > > +.B O_ENCODED
> > > > +Open the file with encoded I/O permissions;
> > >
> > > 1. I find the name of the flag confusing.
> > > Yes, most people don't read documentation so carefully (or at all)
> > > so they will assume O_ENCODED will affect read/write or that it
> > > relates to RWF_ENCODED in a similar way that O_SYNC relates
> > > to RWF_SYNC (i.e. logical OR and not logical AND).
> > >
> > > I am not good at naming and to prove it I will propose:
> > > O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED
> >
> > Agreed, the name is misleading. I can't think of anything better than
> > O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
> > something better :)
> >
> > > 2. While I see no harm in adding O_ flag to open(2) for this
> > > use case, I also don't see a major benefit in adding it.
> > > What if we only allowed setting the flag via fcntl(2) which returns
> > > an error on old kernels?
> > > Since unlike most O_ flags, O_ENCODED does NOT affect file
> > > i/o without additional opt-in flags, it is not standard anyway and
> > > therefore I find that setting it only via fcntl(2) is less error prone.
> >
> > If I make this fcntl-only, then it probably shouldn't be through
> > F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
> > for open(), and also awkward to mix some non-O_ flag with O_ flags for
> > F_GETFL/F_SETFL). So that leaves a couple of options:
> >
> > 1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
> >    FD_CLOEXEC. That also silently ignores unknown flags, but as with the
> >    O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
> > 2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
> >    overkill to me.
> >
> > However, both of these options are annoying to implement. Ideally, we
> > wouldn't have to add another flags field to struct file. But, to reuse
> > f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
> > with other O_ flags, and we'd probably want to hide it from F_GETFL. At
> > that point, it might as well be an O_ flag.
> >
> > It seems to me that it's more trouble than it's worth to make this not
> > an O_ flag, but please let me know if you see a nice way to do so.
> >
> 
> No, I see why you choose to add the flag to open(2).
> I have no objection.
> 
> I once had a crazy thought how to add new open flags
> in a non racy manner without adding a new syscall,
> but as you wrote, this is not relevant for O_ALLOW_ENCODED.
> 
> Something like:
> 
> /*
>  * Old kernels silently ignore unsupported open flags.
>  * New kernels that gets __O_CHECK_NEWFLAGS do
>  * the proper checking for unsupported flags AND set the
>  * flag __O_HAVE_NEWFLAGS.
>  */
> #define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
> #define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1
> 
> fd = open(path, O_FLAG1);
> if (fd < 0)
>     return -errno;
> flags = fcntl(fd, F_GETFL, 0);
> if (flags < 0)
>     return flags;
> if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
>     close(fd);
>     return -EINVAL;
> }

You don't need to add __O_HAVE_NEWFLAGS to do this -- this already works
today for userspace to check whether a flag works properly
(specifically, __O_FLAG1 will only be set if __O_FLAG1 is supported --
otherwise it gets cleared during build_open_flags).
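
As a rough sketch of that check, assuming a kernel that clears unknown flags
at open time (the helper name open_flag_sticks() is made up for illustration;
path and flag are whatever the caller wants to probe):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if `flag` is still reported by F_GETFL after open(),
 * 0 if it was dropped, and -1 if open() or fcntl() failed.
 */
static int open_flag_sticks(const char *path, int flag)
{
	int fd, flags;

	fd = open(path, O_RDONLY | flag);
	if (fd < 0)
		return -1;

	flags = fcntl(fd, F_GETFL);
	close(fd);
	if (flags < 0)
		return -1;

	return (flags & flag) == flag;
}
```

This only makes sense for flags that F_GETFL reports back; O_CLOEXEC, for
example, is a descriptor flag and never shows up there.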

The problem with adding new flags is that an *old* program running on a
*new* kernel could pass a garbage flag (__O_CHECK_NEWFLAGS for instance)
that causes an error only on the new kernel.

The only real solution to this (and several other problems) is
openat2(). As for O_ALLOW_ENCODED -- the current semantics (-EPERM if it
is set without CAP_SYS_ADMIN) *will* cause backwards compatibility
issues for programs that have garbage flags set...

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [RFC PATCH v2 1/5] fs: add O_ENCODED open flag
  2019-10-19  4:50   ` Aleksa Sarai
@ 2019-10-23  4:46     ` Aleksa Sarai
  2019-10-30 22:55     ` Omar Sandoval
  1 sibling, 0 replies; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-23  4:46 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team

On 2019-10-19, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-10-15, Omar Sandoval <osandov@osandov.com> wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > The upcoming RWF_ENCODED operation introduces some security concerns:
> > 
> > 1. Compressed writes will pass arbitrary data to decompression
> >    algorithms in the kernel.
> > 2. Compressed reads can leak truncated/hole punched data.
> > 
> > Therefore, we need to require privilege for RWF_ENCODED. It's not
> > possible to do the permissions checks at the time of the read or write
> > because, e.g., io_uring submits IO from a worker thread. So, add an open
> > flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> > fcntl(). The flag is not cleared in any way on fork or exec; it should
> > probably be used with O_CLOEXEC in most cases.
> > 
> > Note that the usual issue that unknown open flags are ignored doesn't
> > really matter for O_ENCODED; if the kernel doesn't support O_ENCODED,
> > then it doesn't support RWF_ENCODED, either.

I also disagree with this statement -- if an old userspace program sets
O_ENCODED it will now get an -EPERM if it doesn't have CAP_SYS_ADMIN.
That is a break in backwards compatibility.

> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/fcntl.c                       | 10 ++++++++--
> >  fs/namei.c                       |  4 ++++
> >  include/linux/fcntl.h            |  2 +-
> >  include/uapi/asm-generic/fcntl.h |  4 ++++
> >  4 files changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 3d40771e8e7c..45ebc6df078e 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -30,7 +30,8 @@
> >  #include <asm/siginfo.h>
> >  #include <linux/uaccess.h>
> >  
> > -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> > +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> > +		    O_ENCODED)
> >  
> >  static int setfl(int fd, struct file * filp, unsigned long arg)
> >  {
> > @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
> >  		if (!inode_owner_or_capable(inode))
> >  			return -EPERM;
> >  
> > +	/* O_ENCODED can only be set by superuser */
> > +	if ((arg & O_ENCODED) && !(filp->f_flags & O_ENCODED) &&
> > +	    !capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> 
> I have a feeling the error should probably be an EACCES and not EPERM.
> 
> > +
> >  	/* required for strict SunOS emulation */
> >  	if (O_NONBLOCK != O_NDELAY)
> >  	       if (arg & O_NDELAY)
> > @@ -1031,7 +1037,7 @@ static int __init fcntl_init(void)
> >  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> >  	 * is defined as O_NONBLOCK on some platforms and not on others.
> >  	 */
> > -	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > +	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
> >  		HWEIGHT32(
> >  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> >  			__FMODE_EXEC | __FMODE_NONOTIFY));
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 671c3c1a3425..ae86b125888a 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2978,6 +2978,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
> >  	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
> >  		return -EPERM;
> >  
> > +	/* O_ENCODED can only be set by superuser */
> > +	if ((flag & O_ENCODED) && !capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> 
> I would suggest that this check be put into build_open_flags() rather
> than putting it this late in open(). Also, same nit about the error
> return as above.
> 
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> > index d019df946cb2..5fac02479639 100644
> > --- a/include/linux/fcntl.h
> > +++ b/include/linux/fcntl.h
> > @@ -9,7 +9,7 @@
> >  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> >  	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> >  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> > -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> > +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ENCODED)
> >  
> >  #ifndef force_o_largefile
> >  #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > index 9dc0bf0c5a6e..8c5cbd5942e3 100644
> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -97,6 +97,10 @@
> >  #define O_NDELAY	O_NONBLOCK
> >  #endif
> >  
> > +#ifndef O_ENCODED
> > +#define O_ENCODED	040000000
> > +#endif
> 
> You should also define this for all of the architectures which don't use
> the generic O_* flag values. On alpha, O_PATH is equal to the value you
> picked (just be careful on sparc -- 0x4000000 is the next free bit, but
> it's used by FMODE_NONOTIFY.)
> 
> > +
> >  #define F_DUPFD		0	/* dup */
> >  #define F_GETFD		1	/* get close_on_exec */
> >  #define F_SETFD		2	/* set/clear close_on_exec */
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>




-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH man-pages] Document encoded I/O
  2019-10-23  4:44         ` Aleksa Sarai
@ 2019-10-23  6:06           ` Amir Goldstein
  2019-10-23 12:12             ` Aleksa Sarai
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-10-23  6:06 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Omar Sandoval, linux-fsdevel, Linux Btrfs, Dave Chinner,
	Jann Horn, Linux API, kernel-team, Theodore Tso

> >
> > No, I see why you choose to add the flag to open(2).
> > I have no objection.
> >
> > I once had a crazy thought how to add new open flags
> > in a non racy manner without adding a new syscall,
> > but as you wrote, this is not relevant for O_ALLOW_ENCODED.
> >
> > Something like:
> >
> > /*
> >  * Old kernels silently ignore unsupported open flags.
> >  * New kernels that gets __O_CHECK_NEWFLAGS do
> >  * the proper checking for unsupported flags AND set the
> >  * flag __O_HAVE_NEWFLAGS.
> >  */
> > #define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
> > #define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1
> >
> > fd = open(path, O_FLAG1);
> > if (fd < 0)
> >     return -errno;
> > flags = fcntl(fd, F_GETFL, 0);
> > if (flags < 0)
> >     return flags;
> > if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
> >     close(fd);
> >     return -EINVAL;
> > }
>
> You don't need to add __O_HAVE_NEWFLAGS to do this -- this already works
> today for userspace to check whether a flag works properly
> (specifically, __O_FLAG1 will only be set if __O_FLAG1 is supported --
> otherwise it gets cleared during build_open_flags).

That's the behavior only of fairly recent kernels, since commit
629e014bb834 ("fs: completely ignore unknown open flags"),
and maybe some stable kernels. Really old kernels don't have that luxury.

>
> The problem with adding new flags is that an *old* program running on a
> *new* kernel could pass a garbage flag (__O_CHECK_NEWFLAGS for instance)
> that causes an error only on the new kernel.
>

That's a theoretical problem. Same as O_PATH|O_TMPFILE.
Show me a real-life program that passes garbage flags to open.

> The only real solution to this (and several other problems) is
> openat2().

No argument about that. Come on, let's get it merged ;-)

> As for O_ALLOW_ENCODED -- the current semantics (-EPERM if it
> is set without CAP_SYS_ADMIN) *will* cause backwards compatibility
> issues for programs that have garbage flags set...
>

Again, that's theoretical.
In practice, O_ALLOW_ENCODED can work with open()/openat().
In fact, even if O_ALLOW_ENCODED gets merged after openat2(),
I don't think it should be forbidden by open()/openat(), right?
So in that sense, O_ALLOW_ENCODED does not depend on openat2().

Thanks,
Amir.

* Re: [PATCH man-pages] Document encoded I/O
  2019-10-23  6:06           ` Amir Goldstein
@ 2019-10-23 12:12             ` Aleksa Sarai
  2019-10-30 22:46               ` Omar Sandoval
  0 siblings, 1 reply; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-23 12:12 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Omar Sandoval, linux-fsdevel, Linux Btrfs, Dave Chinner,
	Jann Horn, Linux API, kernel-team, Theodore Tso

On 2019-10-23, Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > No, I see why you choose to add the flag to open(2).
> > > I have no objection.
> > >
> > > I once had a crazy thought how to add new open flags
> > > in a non racy manner without adding a new syscall,
> > > but as you wrote, this is not relevant for O_ALLOW_ENCODED.
> > >
> > > Something like:
> > >
> > > /*
> > >  * Old kernels silently ignore unsupported open flags.
> > >  * New kernels that gets __O_CHECK_NEWFLAGS do
> > >  * the proper checking for unsupported flags AND set the
> > >  * flag __O_HAVE_NEWFLAGS.
> > >  */
> > > #define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
> > > #define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1
> > >
> > > fd = open(path, O_FLAG1);
> > > if (fd < 0)
> > >     return -errno;
> > > flags = fcntl(fd, F_GETFL, 0);
> > > if (flags < 0)
> > >     return flags;
> > > if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
> > >     close(fd);
> > >     return -EINVAL;
> > > }
> >
> > You don't need to add __O_HAVE_NEWFLAGS to do this -- this already works
> > today for userspace to check whether a flag works properly
> > (specifically, __O_FLAG1 will only be set if __O_FLAG1 is supported --
> > otherwise it gets cleared during build_open_flags).
> 
> That's a behavior of quite recent kernels since
> 629e014bb834 fs: completely ignore unknown open flags
> and maybe some stable kernels. Real old kernels don't have that luxury.

Ah okay -- so the key feature is that __O_CHECK_NEWFLAGS gets
transformed into __O_HAVE_NEWFLAGS (making it so that both the older and
current behaviours are detected). Apologies, I missed that on my first
read-through.

While it is a little bit ugly, it probably wouldn't be a bad idea to
have something like that.
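
For illustration, the F_GETFL probe discussed above might look like the
following in userspace. This is only a sketch: probe_open_flag() is a
made-up helper name, and since __O_CHECK_NEWFLAGS and __O_HAVE_NEWFLAGS
do not exist in any kernel, the caller would pass a real flag (O_APPEND,
say) to try it out.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path with base_flags | flag, then ask the
 * kernel via F_GETFL whether the flag is actually set on the open file.
 * Returns 1 if it stuck, 0 if it was dropped, or -errno on error.
 */
static int probe_open_flag(const char *path, int base_flags, int flag)
{
	int fd, flags, ret;

	fd = open(path, base_flags | flag);
	if (fd < 0)
		return -errno;
	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		ret = -errno;	/* captured before close() can touch errno */
	else
		ret = (flags & flag) == flag;
	close(fd);
	return ret;
}
```

Note that this only detects a flag that was silently dropped; it cannot
help an old program whose garbage flag starts to mean something on a new
kernel.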

> > The problem with adding new flags is that an *old* program running on a
> > *new* kernel could pass a garbage flag (__O_CHECK_NEWFLAGS for instance)
> > that causes an error only on the new kernel.
> 
> That's a theoretical problem. Same as O_PATH|O_TMPFILE.
> Show me a real life program that passes garbage flags to open.

Has "that's a theoretical problem" helped when we faced this issue in
the past? I don't disagree that this is mostly theoretical, but I have a
feeling that this is an argument that won't hold water.

As for an example of semi-garbage flag passing -- systemd passes
O_PATH|O_NOCTTY in several places. Yes, they're known flags (so not
entirely applicable to this discussion) but it's also not a meaningful
combination of flags and yet is permitted.

> > The only real solution to this (and several other problems) is
> > openat2().
> 
> No argument there. Come on, let's get it merged ;-)

Believe me, I'm trying. ;)

> > As for O_ALLOW_ENCODED -- the current semantics (-EPERM if it
> > is set without CAP_SYS_ADMIN) *will* cause backwards compatibility
> > issues for programs that have garbage flags set...
> >
> 
> Again, that's theoretical. In practice, O_ALLOW_ENCODED can work with
> open()/openat(). In fact, even if O_ALLOW_ENCODED gets merged after
> openat2(), I don't think it should be forbidden by open()/openat(),
> right? So in that sense, O_ALLOW_ENCODED does not depend on openat2().

If it's a valid open() flag it'll also be a valid openat2(2) flag. The
only question is whether the garbage-flag problem justifies making it a
no-op for open(2).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-22  1:37         ` Aleksa Sarai
@ 2019-10-30 22:21           ` Omar Sandoval
  0 siblings, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-30 22:21 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Darrick J. Wong, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team

On Tue, Oct 22, 2019 at 12:37:17PM +1100, Aleksa Sarai wrote:
> On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > 
> > > > > Btrfs supports transparent compression: data written by the user can be
> > > > > compressed when written to disk and decompressed when read back.
> > > > > However, we'd like to add an interface to write pre-compressed data
> > > > > directly to the filesystem, and the matching interface to read
> > > > > compressed data without decompressing it. This adds support for
> > > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > > 
> > > > > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > > should be used (needed for truncated or hole-punched extents and when
> > > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > > this information; for writes, the caller provides it to the filesystem.
> > > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > > used to extend the interface in the future. The remaining iovecs contain
> > > > > the encoded extent.
> > > > > 
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > > 
> > > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > > ---
> > > > >  include/linux/fs.h      | 14 +++++++
> > > > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > > > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index e0d909d35763..54681f21e05e 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > > > >  /* File does not contribute to nr_files count */
> > > > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > > > >  
> > > > > +/* File supports encoded IO */
> > > > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > > > +
> > > > >  /*
> > > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > > >   * that indicates that they should check the contents of the iovec are
> > > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > > >  #define IOCB_SYNC		(1 << 5)
> > > > >  #define IOCB_WRITE		(1 << 6)
> > > > >  #define IOCB_NOWAIT		(1 << 7)
> > > > > +#define IOCB_ENCODED		(1 << 8)
> > > > >  
> > > > >  struct kiocb {
> > > > >  	struct file		*ki_filp;
> > > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > > +struct encoded_iov;
> > > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > > +				struct iov_iter *);
> > > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > > >  				struct file *file_out, loff_t pos_out,
> > > > >  				loff_t *count, unsigned int remap_flags);
> > > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > > >  			return -EOPNOTSUPP;
> > > > >  		ki->ki_flags |= IOCB_NOWAIT;
> > > > >  	}
> > > > > +	if (flags & RWF_ENCODED) {
> > > > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > > +			return -EOPNOTSUPP;
> > > > > +		ki->ki_flags |= IOCB_ENCODED;
> > > > > +	}
> > > > >  	if (flags & RWF_HIPRI)
> > > > >  		ki->ki_flags |= IOCB_HIPRI;
> > > > >  	if (flags & RWF_DSYNC)
> > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > > --- a/include/uapi/linux/fs.h
> > > > > +++ b/include/uapi/linux/fs.h
> > > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > > >  
> > > > >  typedef int __bitwise __kernel_rwf_t;
> > > > >  
> > > > > +enum {
> > > > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > > > +};
> > > > > +
> > > > > +enum {
> > > > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > > > +};
> > > > > +
> > > > > +struct encoded_iov {
> > > > > +	__u64 len;
> > > > > +	__u64 unencoded_len;
> > > > > +	__u64 unencoded_offset;
> > > > > +	__u32 compression;
> > > > > +	__u32 encryption;
> > > > 
> > > > Can we add some must-be-zero padding space at the end here for whomever
> > > > comes along next wanting to add more encoding info?
> > > 
> > > I would suggest to copy the extension design of copy_struct_from_user().
> > > Adding must-be-zero padding is a less-ideal solution to the extension
> > > problem than length-based extension.
> > 
> > Come to think of it, you /do/ have to specify iov_len so... yeah, do
> > that instead; we can always extend the structure later.
> > 
> > > Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
> > > with syscall structure arguments)?
> > 
> > <shrug> No idea, that's the first I've heard of that type and it doesn't
> > seem to be used by the fs code.  Why would we care about alignment for
> > an incore structure?
> 
> When passing u64s from userspace, it's generally considered a good idea
> to use __aligned_u64 -- the main reason is that 32-bit userspace on a
> 64-bit kernel will use different structure alignment for 64-bit fields.
> 
> This means you'd need to implement a bunch of COMPAT_SYSCALL-like
> handling for that case. It's much simpler to use __aligned_u64 (and on
> the plus side I don't think you need to add any fields to ensure the
> padding is zero).

I'll use __aligned_u64 for the next submission.
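
To illustrate the alignment point, here is a small userspace sketch. The
aligned_u64 typedef is a stand-in that mimics the kernel's __aligned_u64
(assuming GCC/Clang attribute syntax), and both structs are made up for
the example; neither is the proposed struct encoded_iov.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's __aligned_u64 (GCC/Clang attribute syntax). */
typedef uint64_t aligned_u64 __attribute__((aligned(8)));

/* Hypothetical layout: on i386 `b` sits at offset 4, on x86-64 at offset 8. */
struct plain_example {
	uint32_t a;
	uint64_t b;
};

/* With the aligned type, `b` is at offset 8 regardless of the ABI. */
struct aligned_example {
	uint32_t a;
	aligned_u64 b;
};

_Static_assert(offsetof(struct aligned_example, b) == 8,
	       "b must be 8-byte aligned");
_Static_assert(sizeof(struct aligned_example) == 16,
	       "size must not depend on the ABI");
```

Because the second layout is the same for 32-bit and 64-bit userspace,
the kernel would not need a compat translation for it.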

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-22  2:02         ` Aleksa Sarai
@ 2019-10-30 22:26           ` Omar Sandoval
  2019-10-30 23:11             ` Aleksa Sarai
  0 siblings, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-30 22:26 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Darrick J. Wong, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team

On Tue, Oct 22, 2019 at 01:02:15PM +1100, Aleksa Sarai wrote:
> On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > 
> > > > > Btrfs supports transparent compression: data written by the user can be
> > > > > compressed when written to disk and decompressed when read back.
> > > > > However, we'd like to add an interface to write pre-compressed data
> > > > > directly to the filesystem, and the matching interface to read
> > > > > compressed data without decompressing it. This adds support for
> > > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > > 
> > > > > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > > should be used (needed for truncated or hole-punched extents and when
> > > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > > this information; for writes, the caller provides it to the filesystem.
> > > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > > used to extend the interface in the future. The remaining iovecs contain
> > > > > the encoded extent.
> > > > > 
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > > 
> > > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > > ---
> > > > >  include/linux/fs.h      | 14 +++++++
> > > > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > > > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index e0d909d35763..54681f21e05e 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > > > >  /* File does not contribute to nr_files count */
> > > > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > > > >  
> > > > > +/* File supports encoded IO */
> > > > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > > > +
> > > > >  /*
> > > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > > >   * that indicates that they should check the contents of the iovec are
> > > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > > >  #define IOCB_SYNC		(1 << 5)
> > > > >  #define IOCB_WRITE		(1 << 6)
> > > > >  #define IOCB_NOWAIT		(1 << 7)
> > > > > +#define IOCB_ENCODED		(1 << 8)
> > > > >  
> > > > >  struct kiocb {
> > > > >  	struct file		*ki_filp;
> > > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > > +struct encoded_iov;
> > > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > > +				struct iov_iter *);
> > > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > > >  				struct file *file_out, loff_t pos_out,
> > > > >  				loff_t *count, unsigned int remap_flags);
> > > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > > >  			return -EOPNOTSUPP;
> > > > >  		ki->ki_flags |= IOCB_NOWAIT;
> > > > >  	}
> > > > > +	if (flags & RWF_ENCODED) {
> > > > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > > +			return -EOPNOTSUPP;
> > > > > +		ki->ki_flags |= IOCB_ENCODED;
> > > > > +	}
> > > > >  	if (flags & RWF_HIPRI)
> > > > >  		ki->ki_flags |= IOCB_HIPRI;
> > > > >  	if (flags & RWF_DSYNC)
> > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > > --- a/include/uapi/linux/fs.h
> > > > > +++ b/include/uapi/linux/fs.h
> > > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > > >  
> > > > >  typedef int __bitwise __kernel_rwf_t;
> > > > >  
> > > > > +enum {
> > > > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > > > +};
> > > > > +
> > > > > +enum {
> > > > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > > > +};
> > > > > +
> > > > > +struct encoded_iov {
> > > > > +	__u64 len;
> > > > > +	__u64 unencoded_len;
> > > > > +	__u64 unencoded_offset;
> > > > > +	__u32 compression;
> > > > > +	__u32 encryption;
> > > > 
> > > > Can we add some must-be-zero padding space at the end here for whomever
> > > > comes along next wanting to add more encoding info?
> > > 
> > > I would suggest to copy the extension design of copy_struct_from_user().
> > > Adding must-be-zero padding is a less-ideal solution to the extension
> > > problem than length-based extension.
> > 
> > Come to think of it, you /do/ have to specify iov_len so... yeah, do
> > that instead; we can always extend the structure later.
> 
> Just to clarify -- if we want to make the interface forward-compatible
> from the outset (programs built 4 years from now running on 5.5), we
> will need to implement this in the original merge. Otherwise userspace
> will need to handle backwards-compatibility themselves once new features
> are added.
> 
> @Omar: If it'd make your life easier, I can send some draft patches
> 	   which port copy_struct_from_user() to iovec-land.

You're right, I didn't think about the case of newer programs on older
kernels. I can do that for the next submission. RWF_ENCODED should
probably translate the E2BIG from copy_struct_from_user() to EINVAL,
though, to avoid ambiguity with the case that the buffer wasn't big
enough to return the encoded data.
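
As a rough userspace model of the size-based extension rule being
discussed (this is not kernel code, and both function names are invented
for the sketch): a shorter struct from an older program is zero-extended,
a longer struct from a newer program is accepted only if the bytes the
kernel does not understand are all zero, and the E2BIG from that check
is translated to EINVAL as suggested above.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the extension rule (hypothetical helper):
 * - usize <= ksize: copy what the caller gave and zero-fill the rest;
 * - usize >  ksize: accept only if every extra byte is zero, else -E2BIG.
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t size = usize < ksize ? usize : ksize;
	size_t i;

	for (i = ksize; i < usize; i++) {
		if (p[i])
			return -E2BIG;
	}
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}

/* Report an unknown extension as EINVAL, keeping E2BIG unambiguous. */
static int encoded_copy_model(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	int ret = copy_struct_model(dst, ksize, src, usize);

	return ret == -E2BIG ? -EINVAL : ret;
}
```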

* Re: [PATCH man-pages] Document encoded I/O
  2019-10-23 12:12             ` Aleksa Sarai
@ 2019-10-30 22:46               ` Omar Sandoval
  2019-10-30 22:57                 ` Omar Sandoval
  0 siblings, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-30 22:46 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Amir Goldstein, linux-fsdevel, Linux Btrfs, Dave Chinner,
	Jann Horn, Linux API, kernel-team, Theodore Tso

On Wed, Oct 23, 2019 at 11:12:03PM +1100, Aleksa Sarai wrote:
> On 2019-10-23, Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > No, I see why you choose to add the flag to open(2).
> > > > I have no objection.
> > > >
> > > > I once had a crazy thought how to add new open flags
> > > > in a non racy manner without adding a new syscall,
> > > > but as you wrote, this is not relevant for O_ALLOW_ENCODED.
> > > >
> > > > Something like:
> > > >
> > > > /*
> > > >  * Old kernels silently ignore unsupported open flags.
> > > >  * New kernels that gets __O_CHECK_NEWFLAGS do
> > > >  * the proper checking for unsupported flags AND set the
> > > >  * flag __O_HAVE_NEWFLAGS.
> > > >  */
> > > > #define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
> > > > #define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1
> > > >
> > > > fd = open(path, O_FLAG1);
> > > > if (fd < 0)
> > > >     return -errno;
> > > > flags = fcntl(fd, F_GETFL, 0);
> > > > if (flags < 0)
> > > >     return flags;
> > > > if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
> > > >     close(fd);
> > > >     return -EINVAL;
> > > > }
> > >
> > > You don't need to add __O_HAVE_NEWFLAGS to do this -- this already works
> > > today for userspace to check whether a flag works properly
> > > (specifically, __O_FLAG1 will only be set if __O_FLAG1 is supported --
> > > otherwise it gets cleared during build_open_flags).
> > 
> > That's a behavior of quite recent kernels since
> > 629e014bb834 fs: completely ignore unknown open flags
> > and maybe some stable kernels. Real old kernels don't have that luxury.
> 
> Ah okay -- so the key feature is that __O_CHECK_NEWFLAGS gets
> transformed into __O_HAVE_NEWFLAGS (making it so that both the older and
> current behaviours are detected). Apologies, I missed that on my first
> read-through.
> 
> While it is a little bit ugly, it probably wouldn't be a bad idea to
> have something like that.
> 
> > > The problem with adding new flags is that an *old* program running on a
> > > *new* kernel could pass a garbage flag (__O_CHECK_NEWFLAGS for instance)
> > > that causes an error only on the new kernel.
> > 
> > That's a theoretical problem. Same as O_PATH|O_TMPFILE.
> > Show me a real life program that passes garbage flags to open.
> 
> Has "that's a theoretical problem" helped when we faced this issue in
> the past? I don't disagree that this is mostly theoretical, but I have a
> feeling that this is an argument that won't hold water.
> 
> As for an example of semi-garbage flag passing -- systemd passes
> O_PATH|O_NOCTTY in several places. Yes, they're known flags (so not
> entirely applicable to this discussion) but it's also not a meaningful
> combination of flags and yet is permitted.
> 
> > > The only real solution to this (and several other problems) is
> > > openat2().
> > 
> > No argument there. Come on, let's get it merged ;-)
> 
> Believe me, I'm trying. ;)
> 
> > > As for O_ALLOW_ENCODED -- the current semantics (-EPERM if it
> > > is set without CAP_SYS_ADMIN) *will* cause backwards compatibility
> > > issues for programs that have garbage flags set...
> > >
> > 
> > Again, that's theoretical. In practice, O_ALLOW_ENCODED can work with
> > open()/openat(). In fact, even if O_ALLOW_ENCODED gets merged after
> > openat2(), I don't think it should be forbidden by open()/openat(),
> > right? So in that sense, O_ALLOW_ENCODED does not depend on openat2().
> 
> If it's a valid open() flag it'll also be a valid openat2(2) flag. The
> only question is whether the garbage-flag problem justifies making it a
> no-op for open(2).

Consider O_NOATIME: a (non-root) program passing this flag for files it
didn't own would have been broken by kernel v2.6.8. Or, more recently, a
program accidentally setting O_TMPFILE would suddenly get drastically
different behavior on v3.11. These two flags technically broke backwards
compatibility. I don't think it's worth the trouble to treat
O_ALLOW_ENCODED any differently for open().

* Re: [RFC PATCH v2 1/5] fs: add O_ENCODED open flag
  2019-10-19  4:50   ` Aleksa Sarai
  2019-10-23  4:46     ` Aleksa Sarai
@ 2019-10-30 22:55     ` Omar Sandoval
  2019-10-30 23:17       ` Aleksa Sarai
  1 sibling, 1 reply; 42+ messages in thread
From: Omar Sandoval @ 2019-10-30 22:55 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team

On Sat, Oct 19, 2019 at 03:50:57PM +1100, Aleksa Sarai wrote:
> On 2019-10-15, Omar Sandoval <osandov@osandov.com> wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > The upcoming RWF_ENCODED operation introduces some security concerns:
> > 
> > 1. Compressed writes will pass arbitrary data to decompression
> >    algorithms in the kernel.
> > 2. Compressed reads can leak truncated/hole punched data.
> > 
> > Therefore, we need to require privilege for RWF_ENCODED. It's not
> > possible to do the permissions checks at the time of the read or write
> > because, e.g., io_uring submits IO from a worker thread. So, add an open
> > flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> > fcntl(). The flag is not cleared in any way on fork or exec; it should
> > probably be used with O_CLOEXEC in most cases.
> > 
> > Note that the usual issue that unknown open flags are ignored doesn't
> > really matter for O_ENCODED; if the kernel doesn't support O_ENCODED,
> > then it doesn't support RWF_ENCODED, either.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/fcntl.c                       | 10 ++++++++--
> >  fs/namei.c                       |  4 ++++
> >  include/linux/fcntl.h            |  2 +-
> >  include/uapi/asm-generic/fcntl.h |  4 ++++
> >  4 files changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 3d40771e8e7c..45ebc6df078e 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -30,7 +30,8 @@
> >  #include <asm/siginfo.h>
> >  #include <linux/uaccess.h>
> >  
> > -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> > +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> > +		    O_ENCODED)
> >  
> >  static int setfl(int fd, struct file * filp, unsigned long arg)
> >  {
> > @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
> >  		if (!inode_owner_or_capable(inode))
> >  			return -EPERM;
> >  
> > +	/* O_ENCODED can only be set by superuser */
> > +	if ((arg & O_ENCODED) && !(filp->f_flags & O_ENCODED) &&
> > +	    !capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> 
> I have a feeling the error should probably be an EACCES and not EPERM.

Shrug, I wanted to make this consistent with O_NOATIME, which uses
EPERM. EACCES seems more appropriate for lacking permissions for a
particular path rather than for an operation, but the lines are blurry.

> > +
> >  	/* required for strict SunOS emulation */
> >  	if (O_NONBLOCK != O_NDELAY)
> >  	       if (arg & O_NDELAY)
> > @@ -1031,7 +1037,7 @@ static int __init fcntl_init(void)
> >  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> >  	 * is defined as O_NONBLOCK on some platforms and not on others.
> >  	 */
> > -	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > +	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
> >  		HWEIGHT32(
> >  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> >  			__FMODE_EXEC | __FMODE_NONOTIFY));
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 671c3c1a3425..ae86b125888a 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2978,6 +2978,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
> >  	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
> >  		return -EPERM;
> >  
> > +	/* O_ENCODED can only be set by superuser */
> > +	if ((flag & O_ENCODED) && !capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> 
> I would suggest that this check be put into build_open_flags() rather
> than putting it this late in open(). Also, same nit about the error
> return as above.

This is where we check permissions for O_NOATIME; shouldn't we keep all
of those permission checks in the same place? build_open_flags() only
checks for flag validity.

> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> > index d019df946cb2..5fac02479639 100644
> > --- a/include/linux/fcntl.h
> > +++ b/include/linux/fcntl.h
> > @@ -9,7 +9,7 @@
> >  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> >  	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> >  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> > -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> > +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ENCODED)
> >  
> >  #ifndef force_o_largefile
> >  #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > index 9dc0bf0c5a6e..8c5cbd5942e3 100644
> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -97,6 +97,10 @@
> >  #define O_NDELAY	O_NONBLOCK
> >  #endif
> >  
> > +#ifndef O_ENCODED
> > +#define O_ENCODED	040000000
> > +#endif
> 
> You should also define this for all of the architectures which don't use
> the generic O_* flag values. On alpha, O_PATH is equal to the value you
> picked (just be careful on sparc -- 0x4000000 is the next free bit, but
> it's used by FMODE_NONOTIFY.)

Good catch, I'll fix that. Thanks!
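
The collisions pointed out above can be checked mechanically. A small
sketch, using only the values quoted in this thread; the EX_ names and
flags_collide() are local to the example and are not kernel identifiers.

```c
#include <stdint.h>

/* Values as quoted in this thread; EX_ names are local to the sketch. */
#define EX_O_ENCODED_GENERIC	040000000u	/* proposed generic value, 0x800000 */
#define EX_ALPHA_O_PATH		040000000u	/* alpha's O_PATH, per the review */
#define EX_SPARC_CANDIDATE	0x4000000u	/* "next free bit" on sparc */
#define EX_FMODE_NONOTIFY	0x4000000u	/* already uses that bit, per the review */

/* Nonzero if two flag values share any bit. */
static int flags_collide(uint32_t a, uint32_t b)
{
	return (a & b) != 0;
}
```

With these values, flags_collide() is nonzero for the generic value
against alpha's O_PATH and for the sparc candidate against
FMODE_NONOTIFY, which is why each such architecture needs its own
definition.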

* Re: [PATCH man-pages] Document encoded I/O
  2019-10-30 22:46               ` Omar Sandoval
@ 2019-10-30 22:57                 ` Omar Sandoval
  0 siblings, 0 replies; 42+ messages in thread
From: Omar Sandoval @ 2019-10-30 22:57 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Amir Goldstein, linux-fsdevel, Linux Btrfs, Dave Chinner,
	Jann Horn, Linux API, kernel-team, Theodore Tso

On Wed, Oct 30, 2019 at 03:46:06PM -0700, Omar Sandoval wrote:
> On Wed, Oct 23, 2019 at 11:12:03PM +1100, Aleksa Sarai wrote:
> > On 2019-10-23, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > >
> > > > > No, I see why you choose to add the flag to open(2).
> > > > > I have no objection.
> > > > >
> > > > > I once had a crazy thought how to add new open flags
> > > > > in a non racy manner without adding a new syscall,
> > > > > but as you wrote, this is not relevant for O_ALLOW_ENCODED.
> > > > >
> > > > > Something like:
> > > > >
> > > > > /*
> > > > >  * Old kernels silently ignore unsupported open flags.
> > > > >  * New kernels that gets __O_CHECK_NEWFLAGS do
> > > > >  * the proper checking for unsupported flags AND set the
> > > > >  * flag __O_HAVE_NEWFLAGS.
> > > > >  */
> > > > > #define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
> > > > > #define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1
> > > > >
> > > > > fd = open(path, O_FLAG1);
> > > > > if (fd < 0)
> > > > >     return -errno;
> > > > > flags = fcntl(fd, F_GETFL, 0);
> > > > > if (flags < 0)
> > > > >     return flags;
> > > > > if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
> > > > >     close(fd);
> > > > >     return -EINVAL;
> > > > > }
> > > >
> > > > You don't need to add __O_HAVE_NEWFLAGS to do this -- this already works
> > > > today for userspace to check whether a flag works properly
> > > > (specifically, __O_FLAG1 will only be set if __O_FLAG1 is supported --
> > > > otherwise it gets cleared during build_open_flags).
> > > 
> > > That's a behavior of quite recent kernels since
> > > 629e014bb834 fs: completely ignore unknown open flags
> > > and maybe some stable kernels. Real old kernels don't have that luxury.
> > 
> > Ah okay -- so the key feature is that __O_CHECK_NEWFLAGS gets
> > transformed into __O_HAVE_NEWFLAGS (making it so that both the older and
> > current behaviours are detected). Apologies, I missed that on my first
> > read-through.
> > 
> > While it is a little bit ugly, it probably wouldn't be a bad idea to
> > have something like that.
> > 
> > > > The problem with adding new flags is that an *old* program running on a
> > > > *new* kernel could pass a garbage flag (__O_CHECK_NEWFLAGS for instance)
> > > > that causes an error only on the new kernel.
> > > 
> > > That's a theoretical problem. Same as O_PATH|O_TMPFILE.
> > > Show me a real life program that passes garbage flags to open.
> > 
> > Has "that's a theoretical problem" helped when we faced this issue in
> > the past? I don't disagree that this is mostly theoretical, but I have a
> > feeling that this is an argument that won't hold water.
> > 
> > As for an example of semi-garbage flag passing -- systemd passes
> > O_PATH|O_NOCTTY in several places. Yes, they're known flags (so not
> > entirely applicable to this discussion) but it's also not a meaningful
> > combination of flags and yet is permitted.
> > 
> > > > The only real solution to this (and several other problems) is
> > > > openat2().
> > > 
> > > No argument there. Come on, let's get it merged ;-)
> > 
> > Believe me, I'm trying. ;)
> > 
> > > > As for O_ALLOW_ENCODED -- the current semantics (-EPERM if it
> > > > is set without CAP_SYS_ADMIN) *will* cause backwards compatibility
> > > > issues for programs that have garbage flags set...
> > > >
> > > 
> > > Again, that's theoretical. In practice, O_ALLOW_ENCODED can work with
> > > open()/openat(). In fact, even if O_ALLOW_ENCODED gets merged after
> > > openat2(), I don't think it should be forbidden by open()/openat(),
> > > right? So in that sense, O_ALLOW_ENCODED does not depend on openat2().
> > 
> > If it's a valid open() flag it'll also be a valid openat2(2) flag. The
> > only question is whether the garbage-flag problem justifies making it a
> > no-op for open(2).
> 
> Consider O_NOATIME: a (non-root) program passing this flag for files it
> didn't own would have been broken by kernel v2.6.8. Or, more recently, a
> program accidentally setting O_TMPFILE would suddenly get drastically
> different behavior on v3.11. These two flags technically broke backwards
> compatibility. I don't think it's worth the trouble to treat
> O_ALLOW_ENCODED any differently for open().

Ah, I missed that O_TMPFILE is careful to fail on old kernels. My point
still stands about O_NOATIME, though :)
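
For anyone following along, a minimal userspace sketch of why O_TMPFILE
fails on old kernels: the flag is defined to include the O_DIRECTORY
bits, so a kernel that doesn't know about it sees a writable open of a
directory and returns EISDIR. The helper name is made up for
illustration, and this assumes glibc on Linux.

```c
#define _GNU_SOURCE	/* glibc exposes O_TMPFILE and O_NOATIME only with this */
#include <assert.h>
#include <fcntl.h>

/* O_TMPFILE is a superset of O_DIRECTORY; check that at compile time. */
static_assert((O_TMPFILE & O_DIRECTORY) == O_DIRECTORY,
	      "O_TMPFILE must carry the O_DIRECTORY bits");

/* Hypothetical helper: does this flag set contain every O_DIRECTORY bit? */
static int implies_directory(int flags)
{
	return (flags & O_DIRECTORY) == O_DIRECTORY;
}
```

O_NOATIME has no such piggybacked bit, which is why it could change
behavior for existing callers when it was introduced.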


* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
  2019-10-30 22:26           ` Omar Sandoval
@ 2019-10-30 23:11             ` Aleksa Sarai
  0 siblings, 0 replies; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-30 23:11 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Darrick J. Wong, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team

On 2019-10-30, Omar Sandoval <osandov@osandov.com> wrote:
> On Tue, Oct 22, 2019 at 01:02:15PM +1100, Aleksa Sarai wrote:
> > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > > > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > 
> > > > > > Btrfs supports transparent compression: data written by the user can be
> > > > > > compressed when written to disk and decompressed when read back.
> > > > > > However, we'd like to add an interface to write pre-compressed data
> > > > > > directly to the filesystem, and the matching interface to read
> > > > > > compressed data without decompressing it. This adds support for
> > > > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > > > 
> > > > > > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > > > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > > > should be used (needed for truncated or hole-punched extents and when
> > > > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > > > this information; for writes, the caller provides it to the filesystem.
> > > > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > > > used to extend the interface in the future. The remaining iovecs contain
> > > > > > the encoded extent.
> > > > > > 
> > > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > > FMODE_ENCODED_IO in ->file_open().
> > > > > > 
> > > > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > > > ---
> > > > > >  include/linux/fs.h      | 14 +++++++
> > > > > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > > > > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > > > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index e0d909d35763..54681f21e05e 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > > > > >  /* File does not contribute to nr_files count */
> > > > > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > > > > >  
> > > > > > +/* File supports encoded IO */
> > > > > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > > > > +
> > > > > >  /*
> > > > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > > > >   * that indicates that they should check the contents of the iovec are
> > > > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > > > >  #define IOCB_SYNC		(1 << 5)
> > > > > >  #define IOCB_WRITE		(1 << 6)
> > > > > >  #define IOCB_NOWAIT		(1 << 7)
> > > > > > +#define IOCB_ENCODED		(1 << 8)
> > > > > >  
> > > > > >  struct kiocb {
> > > > > >  	struct file		*ki_filp;
> > > > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > > > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > > > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > > > +struct encoded_iov;
> > > > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > > > +				struct iov_iter *);
> > > > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > > > >  				struct file *file_out, loff_t pos_out,
> > > > > >  				loff_t *count, unsigned int remap_flags);
> > > > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > > > >  			return -EOPNOTSUPP;
> > > > > >  		ki->ki_flags |= IOCB_NOWAIT;
> > > > > >  	}
> > > > > > +	if (flags & RWF_ENCODED) {
> > > > > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > > > +			return -EOPNOTSUPP;
> > > > > > +		ki->ki_flags |= IOCB_ENCODED;
> > > > > > +	}
> > > > > >  	if (flags & RWF_HIPRI)
> > > > > >  		ki->ki_flags |= IOCB_HIPRI;
> > > > > >  	if (flags & RWF_DSYNC)
> > > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > > > --- a/include/uapi/linux/fs.h
> > > > > > +++ b/include/uapi/linux/fs.h
> > > > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > > > >  
> > > > > >  typedef int __bitwise __kernel_rwf_t;
> > > > > >  
> > > > > > +enum {
> > > > > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > > > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > > > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > > > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > > > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > > > > +};
> > > > > > +
> > > > > > +enum {
> > > > > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > > > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > > > > +};
> > > > > > +
> > > > > > +struct encoded_iov {
> > > > > > +	__u64 len;
> > > > > > +	__u64 unencoded_len;
> > > > > > +	__u64 unencoded_offset;
> > > > > > +	__u32 compression;
> > > > > > +	__u32 encryption;
> > > > > 
> > > > > > Can we add some must-be-zero padding space at the end here for whoever
> > > > > > comes along next wanting to add more encoding info?
> > > > 
> > > > I would suggest to copy the extension design of copy_struct_from_user().
> > > > Adding must-be-zero padding is a less-ideal solution to the extension
> > > > problem than length-based extension.
> > > 
> > > Come to think of it, you /do/ have to specify iov_len so... yeah, do
> > > that instead; we can always extend the structure later.
> > 
> > Just to clarify -- if we want to make the interface forward-compatible
> > from the outset (programs built 4 years from now running on 5.5), we
> > will need to implement this in the original merge. Otherwise userspace
> > will need to handle backwards-compatibility themselves once new features
> > are added.
> > 
> > @Omar: If it'd make your life easier, I can send some draft patches
> > 	   which port copy_struct_from_user() to iovec-land.
> 
> You're right, I didn't think about the case of newer programs on older
> kernels. I can do that for the next submission. RWF_ENCODED should
> probably translate the E2BIG from copy_struct_from_user() to EINVAL,
> though, to avoid ambiguity with the case that the buffer wasn't big
> enough to return the encoded data.

Yeah, that seems fair enough. I would've preferred to keep the error
semantics the same everywhere, but adding additional ambiguity to such
error cases isn't a good idea.

It's a bit of a shame we don't have more granular EINVALs to make it
easier to figure out *why* you got an EINVAL (then we wouldn't have had
to abuse E2BIG to indicate to userspace "you're using a new feature on
an old kernel") -- but that's a more generic problem that probably won't
be solved any time soon.
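
To make the length-based extension semantics concrete, here is a rough
userspace model of what copy_struct_from_user() does, plus the
E2BIG-to-EINVAL translation Omar suggested. This is only a sketch: the
struct layout is copied from the RFC patch quoted above (it is not in
any released UAPI header), copy_struct_model() and
import_encoded_iov_model() are hypothetical names, and the real helper
works on user pointers rather than plain memory.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout taken from the RFC patch quoted above. */
struct encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

/*
 * Model of copy_struct_from_user(): dst is the kernel's view of the
 * struct (ksize bytes), src is what the caller passed (usize bytes).
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Newer caller: fields this kernel doesn't know about must be zero. */
	for (i = size; i < usize; i++)
		if (tail[i - size])
			return -E2BIG;

	/* Older caller: fields it doesn't know about default to zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}

/* Hypothetical wrapper: report E2BIG as EINVAL, as discussed above. */
static int import_encoded_iov_model(struct encoded_iov *dst,
				    const void *src, size_t usize)
{
	int ret = copy_struct_model(dst, sizeof(*dst), src, usize);

	return ret == -E2BIG ? -EINVAL : ret;
}
```

With this, a caller built against a larger future struct still works on
an older kernel as long as the extra fields are zero, and a caller built
against a smaller struct gets zeroed defaults for the new fields.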

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [RFC PATCH v2 1/5] fs: add O_ENCODED open flag
  2019-10-30 22:55     ` Omar Sandoval
@ 2019-10-30 23:17       ` Aleksa Sarai
  0 siblings, 0 replies; 42+ messages in thread
From: Aleksa Sarai @ 2019-10-30 23:17 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team

On 2019-10-30, Omar Sandoval <osandov@osandov.com> wrote:
> On Sat, Oct 19, 2019 at 03:50:57PM +1100, Aleksa Sarai wrote:
> > On 2019-10-15, Omar Sandoval <osandov@osandov.com> wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > The upcoming RWF_ENCODED operation introduces some security concerns:
> > > 
> > > 1. Compressed writes will pass arbitrary data to decompression
> > >    algorithms in the kernel.
> > > 2. Compressed reads can leak truncated/hole-punched data.
> > > 
> > > Therefore, we need to require privilege for RWF_ENCODED. It's not
> > > possible to do the permissions checks at the time of the read or write
> > > because, e.g., io_uring submits IO from a worker thread. So, add an open
> > > flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> > > fcntl(). The flag is not cleared in any way on fork or exec; it should
> > > probably be used with O_CLOEXEC in most cases.
> > > 
> > > Note that the usual issue that unknown open flags are ignored doesn't
> > > really matter for O_ENCODED; if the kernel doesn't support O_ENCODED,
> > > then it doesn't support RWF_ENCODED, either.
> > > 
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  fs/fcntl.c                       | 10 ++++++++--
> > >  fs/namei.c                       |  4 ++++
> > >  include/linux/fcntl.h            |  2 +-
> > >  include/uapi/asm-generic/fcntl.h |  4 ++++
> > >  4 files changed, 17 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > index 3d40771e8e7c..45ebc6df078e 100644
> > > --- a/fs/fcntl.c
> > > +++ b/fs/fcntl.c
> > > @@ -30,7 +30,8 @@
> > >  #include <asm/siginfo.h>
> > >  #include <linux/uaccess.h>
> > >  
> > > -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> > > +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> > > +		    O_ENCODED)
> > >  
> > >  static int setfl(int fd, struct file * filp, unsigned long arg)
> > >  {
> > > @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
> > >  		if (!inode_owner_or_capable(inode))
> > >  			return -EPERM;
> > >  
> > > +	/* O_ENCODED can only be set by superuser */
> > > +	if ((arg & O_ENCODED) && !(filp->f_flags & O_ENCODED) &&
> > > +	    !capable(CAP_SYS_ADMIN))
> > > +		return -EPERM;
> > 
> > I have a feeling the error should probably be an EACCES and not EPERM.
> 
> Shrug, I wanted to make this consistent with O_NOATIME, which uses
> EPERM. EACCES seems more appropriate for lacking permissions for a
> particular path rather than for an operation, but the lines are blurry.

Fair enough, though I would argue that O_NOATIME should've also been
EACCES (and there are plenty of examples throughout the kernel where
EPERM was used where EACCES makes more sense). ;)

> > > +
> > >  	/* required for strict SunOS emulation */
> > >  	if (O_NONBLOCK != O_NDELAY)
> > >  	       if (arg & O_NDELAY)
> > > @@ -1031,7 +1037,7 @@ static int __init fcntl_init(void)
> > >  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> > >  	 * is defined as O_NONBLOCK on some platforms and not on others.
> > >  	 */
> > > -	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > > +	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
> > >  		HWEIGHT32(
> > >  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > >  			__FMODE_EXEC | __FMODE_NONOTIFY));
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 671c3c1a3425..ae86b125888a 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -2978,6 +2978,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
> > >  	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
> > >  		return -EPERM;
> > >  
> > > +	/* O_ENCODED can only be set by superuser */
> > > +	if ((flag & O_ENCODED) && !capable(CAP_SYS_ADMIN))
> > > +		return -EPERM;
> > 
> > I would suggest that this check be put into build_open_flags() rather
> > than putting it this late in open(). Also, same nit about the error
> > return as above.
> 
> This is where we check permissions for O_NOATIME, shouldn't we keep all
> of those permission checks in the same place? build_open_flags() only
> checks for flag validity.

Right, but O_NOATIME can't be checked earlier -- you need to have
resolved the inode in order to do the permission check. O_ENCODED only
depends on the capability set, so IMHO checking it earlier in
build_open_flags() is cleaner.

> > > +
> > >  	return 0;
> > >  }
> > >  
> > > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> > > index d019df946cb2..5fac02479639 100644
> > > --- a/include/linux/fcntl.h
> > > +++ b/include/linux/fcntl.h
> > > @@ -9,7 +9,7 @@
> > >  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> > >  	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> > >  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> > > -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> > > +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ENCODED)
> > >  
> > >  #ifndef force_o_largefile
> > >  #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> > > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > > index 9dc0bf0c5a6e..8c5cbd5942e3 100644
> > > --- a/include/uapi/asm-generic/fcntl.h
> > > +++ b/include/uapi/asm-generic/fcntl.h
> > > @@ -97,6 +97,10 @@
> > >  #define O_NDELAY	O_NONBLOCK
> > >  #endif
> > >  
> > > +#ifndef O_ENCODED
> > > +#define O_ENCODED	040000000
> > > +#endif
> > 
> > You should also define this for all of the architectures which don't use
> > the generic O_* flag values. On alpha, O_PATH is equal to the value you
> > picked (just be careful on sparc -- 0x4000000 is the next free bit, but
> > it's used by FMODE_NONOTIFY.)
> 
> Good catch, I'll fix that. Thanks!

Oh, and please add a one-line comment in the sparc header to ensure
nobody accidentally breaks open() on sparc in the future.
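
As a rough illustration of the kind of check I mean, here is a small
userspace sketch that tests a candidate flag value against the two
per-architecture values mentioned in this review. The table is
deliberately incomplete, and the names (O_ENCODED_PROPOSED,
flag_conflict) are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define O_ENCODED_PROPOSED 040000000	/* value from the asm-generic hunk */

struct arch_flag {
	const char *arch;
	const char *name;
	unsigned int value;
};

/* Only the two values mentioned in this review; not a complete table. */
static const struct arch_flag known_flags[] = {
	{ "alpha", "O_PATH",         040000000 },
	{ "sparc", "FMODE_NONOTIFY", 0x4000000 },
};

/* Return the name of the flag that candidate overlaps on arch, or NULL. */
static const char *flag_conflict(const char *arch, unsigned int candidate)
{
	size_t i;

	assert(arch != NULL);
	for (i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++)
		if (!strcmp(known_flags[i].arch, arch) &&
		    (known_flags[i].value & candidate))
			return known_flags[i].name;
	return NULL;
}
```

The generic value collides with O_PATH on alpha, while on sparc the
generic value itself is fine but the tempting next bit, 0x4000000, is
not.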

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Thread overview: 42+ messages
2019-10-15 18:42 [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Omar Sandoval
2019-10-15 18:42 ` [PATCH man-pages] Document encoded I/O Omar Sandoval
2019-10-20 23:05   ` [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data Dave Chinner
2019-10-21 19:04     ` Omar Sandoval
2019-10-21  6:18   ` [PATCH man-pages] Document encoded I/O Amir Goldstein
2019-10-21 18:53     ` Omar Sandoval
2019-10-22  6:40       ` Amir Goldstein
2019-10-23  4:44         ` Aleksa Sarai
2019-10-23  6:06           ` Amir Goldstein
2019-10-23 12:12             ` Aleksa Sarai
2019-10-30 22:46               ` Omar Sandoval
2019-10-30 22:57                 ` Omar Sandoval
2019-10-15 18:42 ` [RFC PATCH v2 1/5] fs: add O_ENCODED open flag Omar Sandoval
2019-10-19  4:50   ` Aleksa Sarai
2019-10-23  4:46     ` Aleksa Sarai
2019-10-30 22:55     ` Omar Sandoval
2019-10-30 23:17       ` Aleksa Sarai
2019-10-15 18:42 ` [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
2019-10-16  9:50   ` Nikolay Borisov
2019-10-18 22:19     ` Omar Sandoval
2019-10-19  5:01   ` Aleksa Sarai
2019-10-21 18:28   ` Darrick J. Wong
2019-10-21 18:38     ` Aleksa Sarai
2019-10-21 19:00       ` Darrick J. Wong
2019-10-22  1:37         ` Aleksa Sarai
2019-10-30 22:21           ` Omar Sandoval
2019-10-22  2:02         ` Aleksa Sarai
2019-10-30 22:26           ` Omar Sandoval
2019-10-30 23:11             ` Aleksa Sarai
2019-10-21 19:07     ` Omar Sandoval
2019-10-15 18:42 ` [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio() Omar Sandoval
2019-10-16  9:22   ` Nikolay Borisov
2019-10-18 22:19     ` Omar Sandoval
2019-10-15 18:42 ` [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads Omar Sandoval
2019-10-16 11:10   ` Nikolay Borisov
2019-10-18 22:23     ` Omar Sandoval
2019-10-15 18:42 ` [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes Omar Sandoval
2019-10-16 10:44   ` Nikolay Borisov
2019-10-18 22:55     ` Omar Sandoval
2019-10-18 23:33       ` Omar Sandoval
2019-10-21 13:14       ` David Sterba
2019-10-21 18:05         ` Omar Sandoval
