linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] fs: interface for directly writing encoded (e.g., compressed) data
@ 2019-09-19  6:53 Omar Sandoval
  2019-09-19  6:53 ` [RFC PATCH 1/3] fs: pass READ/WRITE to kiocb_set_rw_flags() Omar Sandoval
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Omar Sandoval @ 2019-09-19  6:53 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs; +Cc: Dave Chinner, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Hello,

This patch series adds an API for writing compressed data directly to
the filesystem. It is based on my previous series which added a
Btrfs-specific ioctl [1], but this is now an extension to pwritev2() as
suggested by Dave Chinner [2]. I've included a man page patch describing
the API in detail. A test case and example program are available [3].

The use case that I have in mind is Btrfs send/receive: currently, when
sending data from one compressed filesystem to another, the sending side
decompresses the data and the receiving side recompresses it before
writing it out. This is wasteful and can be avoided if we can just send
and write compressed extents. The send part will be implemented in a
separate series, as this API can stand alone.

Patch 1 is a prep patch. Patch 2 adds the interface in the VFS. Patch 3
implements it in Btrfs.

This series is based on Linus' tree as of commit
b41dae061bbd722b9d7fa828f35d22035b218e18.

Please share any comments on the API or implementation. Thanks!

1: https://lore.kernel.org/linux-fsdevel/cover.1567623877.git.osandov@fb.com/
2: https://lore.kernel.org/linux-fsdevel/20190906212710.GI7452@vader/
3: https://github.com/osandov/xfstests/tree/btrfs-compressed-write

Omar Sandoval (3):
  fs: pass READ/WRITE to kiocb_set_rw_flags()
  fs: add RWF_ENCODED for writing compressed data
  btrfs: implement encoded (compressed) writes

 fs/aio.c                |   9 +-
 fs/btrfs/compression.c  |   6 +-
 fs/btrfs/compression.h  |   5 +-
 fs/btrfs/ctree.h        |   4 +
 fs/btrfs/file.c         |  40 +++++++--
 fs/btrfs/inode.c        | 190 +++++++++++++++++++++++++++++++++++++++-
 fs/io_uring.c           |   9 +-
 fs/read_write.c         |   2 +-
 include/linux/fs.h      |  18 +++-
 include/uapi/linux/fs.h |  24 ++++-
 mm/filemap.c            |  75 +++++++++++++---
 11 files changed, 344 insertions(+), 38 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/3] fs: pass READ/WRITE to kiocb_set_rw_flags()
  2019-09-19  6:53 [RFC PATCH 0/3] fs: interface for directly writing encoded (e.g., compressed) data Omar Sandoval
@ 2019-09-19  6:53 ` Omar Sandoval
  2019-09-20 14:38   ` Jan Kara
  2019-09-19  6:53 ` [PATCH] readv.2: Document new RWF_ENCODED flag to pwritev2() Omar Sandoval
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Omar Sandoval @ 2019-09-19  6:53 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs
  Cc: Dave Chinner, linux-api, kernel-team, Jan Kara, Jens Axboe

From: Omar Sandoval <osandov@fb.com>

A following change will want to check whether an IO is a read or write
in kiocb_set_rw_flags(). Additionally, aio and io_uring currently set
the IOCB_WRITE flag on a kiocb right before calling call_write_iter(),
but we can move that into the common code.

Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/aio.c           | 9 ++++-----
 fs/io_uring.c      | 9 ++++-----
 fs/read_write.c    | 2 +-
 include/linux/fs.h | 5 ++++-
 4 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 01e0fb9ae45a..72195e182db2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1442,7 +1442,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
 	iocb_put(iocb);
 }
 
-static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
+static int aio_prep_rw(int rw, struct kiocb *req, const struct iocb *iocb)
 {
 	int ret;
 
@@ -1469,7 +1469,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 	} else
 		req->ki_ioprio = get_current_ioprio();
 
-	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
+	ret = kiocb_set_rw_flags(rw, req, iocb->aio_rw_flags);
 	if (unlikely(ret))
 		return ret;
 
@@ -1525,7 +1525,7 @@ static int aio_read(struct kiocb *req, const struct iocb *iocb,
 	struct file *file;
 	int ret;
 
-	ret = aio_prep_rw(req, iocb);
+	ret = aio_prep_rw(READ, req, iocb);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
@@ -1553,7 +1553,7 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb,
 	struct file *file;
 	int ret;
 
-	ret = aio_prep_rw(req, iocb);
+	ret = aio_prep_rw(WRITE, req, iocb);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
@@ -1579,7 +1579,6 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb,
 			__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
 			__sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
 		}
-		req->ki_flags |= IOCB_WRITE;
 		aio_rw_done(req, call_write_iter(file, req, &iter));
 	}
 	kfree(iovec);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0dadbdbead0f..548525cb1699 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1002,7 +1002,7 @@ static bool io_file_supports_async(struct file *file)
 	return false;
 }
 
-static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
+static int io_prep_rw(int rw, struct io_kiocb *req, const struct sqe_submit *s,
 		      bool force_nonblock)
 {
 	const struct io_uring_sqe *sqe = s->sqe;
@@ -1031,7 +1031,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
 	} else
 		kiocb->ki_ioprio = get_current_ioprio();
 
-	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	ret = kiocb_set_rw_flags(rw, kiocb, READ_ONCE(sqe->rw_flags));
 	if (unlikely(ret))
 		return ret;
 
@@ -1258,7 +1258,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	size_t iov_count;
 	ssize_t read_size, ret;
 
-	ret = io_prep_rw(req, s, force_nonblock);
+	ret = io_prep_rw(READ, req, s, force_nonblock);
 	if (ret)
 		return ret;
 	file = kiocb->ki_filp;
@@ -1319,7 +1319,7 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	size_t iov_count;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, s, force_nonblock);
+	ret = io_prep_rw(WRITE, req, s, force_nonblock);
 	if (ret)
 		return ret;
 
@@ -1363,7 +1363,6 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s,
 			__sb_writers_release(file_inode(file)->i_sb,
 						SB_FREEZE_WRITE);
 		}
-		kiocb->ki_flags |= IOCB_WRITE;
 
 		ret2 = call_write_iter(file, kiocb, &iter);
 		if (!force_nonblock || ret2 != -EAGAIN) {
diff --git a/fs/read_write.c b/fs/read_write.c
index 5bbf587f5bc1..a6548a9d965d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -682,7 +682,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	ret = kiocb_set_rw_flags(&kiocb, flags);
+	ret = kiocb_set_rw_flags(type, &kiocb, flags);
 	if (ret)
 		return ret;
 	kiocb.ki_pos = (ppos ? *ppos : 0);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ffe35d97afcb..75c4b7680385 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3351,8 +3351,11 @@ static inline int iocb_flags(struct file *file)
 	return res;
 }
 
-static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
+static inline int kiocb_set_rw_flags(int rw, struct kiocb *ki, rwf_t flags)
 {
+	if (rw == WRITE)
+		ki->ki_flags |= IOCB_WRITE;
+
 	if (unlikely(flags & ~RWF_SUPPORTED))
 		return -EOPNOTSUPP;
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH] readv.2: Document new RWF_ENCODED flag to pwritev2()
  2019-09-19  6:53 [RFC PATCH 0/3] fs: interface for directly writing encoded (e.g., compressed) data Omar Sandoval
  2019-09-19  6:53 ` [RFC PATCH 1/3] fs: pass READ/WRITE to kiocb_set_rw_flags() Omar Sandoval
@ 2019-09-19  6:53 ` Omar Sandoval
  2019-09-19  6:53 ` [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data Omar Sandoval
  2019-09-19  6:53 ` [RFC PATCH 3/3] btrfs: implement encoded (compressed) writes Omar Sandoval
  3 siblings, 0 replies; 22+ messages in thread
From: Omar Sandoval @ 2019-09-19  6:53 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs; +Cc: Dave Chinner, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 man2/readv.2 | 156 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 156 insertions(+)

diff --git a/man2/readv.2 b/man2/readv.2
index af27aa63e..7d32b3bd2 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -265,6 +265,135 @@ the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.5)"
+Read or write encoded (e.g., compressed) data.
+.I iov[0].iov_base
+points to an
+.I
+encoded_iov
+structure, defined in
+.I <linux/fs.h>
+as:
+.IP
+.in +4n
+.EX
+struct encoded_iov {
+    __u64 unencoded_len;
+    __u32 compression;
+    __u32 encryption;
+};
+.EE
+.in
+.IP
+.I iov[0].iov_len
+must be set to
+.IR "sizeof(struct\ encoded_iov)" .
+.IP
+.I compression
+is one of
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero),
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+or
+.BR ENCODED_IOV_COMPRESSION_ZSTD .
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+If
+.I offset
+is -1, then the current file offset is incremented by the unencoded length.
+.IP
+For
+.BR pwritev2 (),
+the remaining buffers in the
+.I iov
+array should point to the encoded data;
+.I unencoded_len
+should be set to the logical length of the data in the file; and
+.I compression
+and
+.I encryption
+should be set to indicate the encoding.
+The calling process must have the
+.B CAP_SYS_ADMIN
+capability.
+If
+.I compression
+is
+.B ENCODED_IOV_COMPRESSION_NONE
+and
+.I encryption
+is
+.BR ENCODED_IOV_ENCRYPTION_NONE ,
+then
+.I unencoded_len
+is ignored and the write behaves as if the
+.B RWF_ENCODED
+flag was not specified.
+.IP
+As of Linux 5.5, this flag is only implemented for
+.BR pwritev2 (),
+and only on Btrfs with the following requirements:
+.RS
+.IP \(bu 3
+.I offset
+(or the current file offset if
+.I offset
+is -1) must be aligned to the sector size of the filesystem.
+.IP \(bu
+.I unencoded_len
+and the length of the encoded data must each be no more than 128 KiB.
+This limit may increase in the future.
+.IP \(bu
+.I unencoded_len
+must be non-zero.
+.IP \(bu
+.I unencoded_len
+must be aligned to the sector size of the filesystem,
+unless the data ends at or beyond the current end of the file.
+.IP \(bu
+If
+.I compression
+is not
+.BR ENCODED_IOV_COMPRESSION_NONE ,
+then the length of the encoded data rounded up to the nearest sector must be less than
+.IR unencoded_len .
+.IP \(bu
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+then the encoded data must be a single zlib stream.
+.IP \(bu
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+then the encoded data must be be compressed page by page with LZO1X
+and wrapped in the format described in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.IP \(bu
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZSTD ,
+then the encoded data must be a single zstd stream,
+and the
+.I windowLog
+compression parameter must be no more than 17.
+.RE
+.IP
+Note that the encoded data is not validated when it is written.
+If it is not valid (e.g., it cannot be decompressed,
+or its decompressed length does not match
+.IR unencoded_len ),
+then a subsequent read may result in an error or return truncated data.
+.\" TODO: how should this interact with O_DIRECT?
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -284,6 +413,13 @@ than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B
+RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the unencoded number of bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
 The errors are as given for
@@ -312,8 +448,28 @@ The vector count,
 .IR iovcnt ,
 is less than zero or greater than the permitted maximum.
 .TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
+.TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the calling process does not have the
+.B CAP_SYS_ADMIN
+capability.
 .SH VERSIONS
 .BR preadv ()
 and
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-19  6:53 [RFC PATCH 0/3] fs: interface for directly writing encoded (e.g., compressed) data Omar Sandoval
  2019-09-19  6:53 ` [RFC PATCH 1/3] fs: pass READ/WRITE to kiocb_set_rw_flags() Omar Sandoval
  2019-09-19  6:53 ` [PATCH] readv.2: Document new RWF_ENCODED flag to pwritev2() Omar Sandoval
@ 2019-09-19  6:53 ` Omar Sandoval
  2019-09-19 15:44   ` Jann Horn
  2019-09-19  6:53 ` [RFC PATCH 3/3] btrfs: implement encoded (compressed) writes Omar Sandoval
  3 siblings, 1 reply; 22+ messages in thread
From: Omar Sandoval @ 2019-09-19  6:53 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs; +Cc: Dave Chinner, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Btrfs can transparently compress data written by the user. However, we'd
like to add an interface to write pre-compressed data directly to the
filesystem. This adds support for so-called "encoded writes" via
pwritev2().

A new RWF_ENCODED flags indicates that a write is "encoded". If this
flag is set, iov[0].iov_base points to a struct encoded_iov which
contains metadata about the write: namely, the compression algorithm and
the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
must be set to sizeof(struct encoded_iov), which can be used to extend
the interface in the future. The remaining iovecs contain the encoded
extent.

A similar interface for reading encoded data can be added to preadv2()
in the future.

Filesystems must indicate that they support encoded writes by setting
FMODE_ENCODED_IO in ->file_open().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/fs.h      | 13 +++++++
 include/uapi/linux/fs.h | 24 ++++++++++++-
 mm/filemap.c            | 75 ++++++++++++++++++++++++++++++++++-------
 3 files changed, 99 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75c4b7680385..ae3ac0312674 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File supports encoded IO */
+#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
@@ -314,6 +317,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_ENCODED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3046,6 +3050,10 @@ extern int sb_min_blocksize(struct super_block *, int);
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+struct encoded_iov;
+extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
+extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
+				struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
 				loff_t *count, unsigned int remap_flags);
@@ -3364,6 +3372,11 @@ static inline int kiocb_set_rw_flags(int rw, struct kiocb *ki, rwf_t flags)
 			return -EOPNOTSUPP;
 		ki->ki_flags |= IOCB_NOWAIT;
 	}
+	if (flags & RWF_ENCODED) {
+		if (rw != WRITE || !(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
+			return -EOPNOTSUPP;
+		ki->ki_flags |= IOCB_ENCODED;
+	}
 	if (flags & RWF_HIPRI)
 		ki->ki_flags |= IOCB_HIPRI;
 	if (flags & RWF_DSYNC)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index aad225b05be7..b775d9aea978 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -283,6 +283,25 @@ struct fsxattr {
 
 typedef int __bitwise __kernel_rwf_t;
 
+enum {
+	ENCODED_IOV_COMPRESSION_NONE,
+	ENCODED_IOV_COMPRESSION_ZLIB,
+	ENCODED_IOV_COMPRESSION_LZO,
+	ENCODED_IOV_COMPRESSION_ZSTD,
+	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
+};
+
+enum {
+	ENCODED_IOV_ENCRYPTION_NONE,
+	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
+};
+
+struct encoded_iov {
+	__u64 unencoded_len;
+	__u32 compression;
+	__u32 encryption;
+};
+
 /* high priority request, poll if possible */
 #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
 
@@ -298,8 +317,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* encoded (e.g., compressed or encrypted) IO */
+#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ENCODED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 40667c2f3383..3d2555364432 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2974,24 +2974,16 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
 	return 0;
 }
 
-/*
- * Performs necessary checks before doing a write
- *
- * Can adjust writing position or amount of bytes to write.
- * Returns appropriate error code that caller should return or
- * zero in case that write should be allowed.
- */
-inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	loff_t count;
 	int ret;
 
 	if (IS_SWAPFILE(inode))
 		return -ETXTBSY;
 
-	if (!iov_iter_count(from))
+	if (!*count)
 		return 0;
 
 	/* FIXME: this is for backwards compatibility with 2.4 */
@@ -3001,8 +2993,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EINVAL;
 
-	count = iov_iter_count(from);
-	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
+}
+
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Can adjust writing position or amount of bytes to write.
+ * Returns a negative errno or the new number of bytes to write.
+ */
+inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	loff_t count = iov_iter_count(from);
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
 	if (ret)
 		return ret;
 
@@ -3011,6 +3016,52 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int generic_encoded_write_checks(struct kiocb *iocb,
+				 struct encoded_iov *encoded)
+{
+	loff_t count = encoded->unencoded_len;
+	int ret;
+
+	ret = generic_write_checks_common(iocb, &count);
+	if (ret)
+		return ret;
+
+	if (count != encoded->unencoded_len) {
+		/*
+		 * The write got truncated by generic_write_checks(). We can't
+		 * do a partial encoded write.
+		 */
+		return -EFBIG;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(generic_encoded_write_checks);
+
+/*
+ * If no encoding is set, this clears IOCB_ENCODED and the write should be
+ * treated as a normal write.
+ */
+int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
+			 struct iov_iter *from)
+{
+	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
+		return -EINVAL;
+	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
+		return -EFAULT;
+	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
+	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
+		iocb->ki_flags &= ~IOCB_ENCODED;
+		return 0;
+	}
+	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
+	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
+		return -EINVAL;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	return 0;
+}
+EXPORT_SYMBOL(import_encoded_write);
+
 /*
  * Performs necessary checks before doing a clone.
  *
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 3/3] btrfs: implement encoded (compressed) writes
  2019-09-19  6:53 [RFC PATCH 0/3] fs: interface for directly writing encoded (e.g., compressed) data Omar Sandoval
                   ` (2 preceding siblings ...)
  2019-09-19  6:53 ` [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data Omar Sandoval
@ 2019-09-19  6:53 ` Omar Sandoval
  3 siblings, 0 replies; 22+ messages in thread
From: Omar Sandoval @ 2019-09-19  6:53 UTC (permalink / raw)
  To: linux-fsdevel, linux-btrfs; +Cc: Dave Chinner, linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

This adds support to Btrfs for the RWF_ENCODED flag to pwritev2(). The
implementation is similar to direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/compression.c |   6 +-
 fs/btrfs/compression.h |   5 +-
 fs/btrfs/ctree.h       |   4 +
 fs/btrfs/file.c        |  40 +++++++--
 fs/btrfs/inode.c       | 190 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 232 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b05b361e2062..6632dd8d2e4d 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
 			bio->bi_status == BLK_STS_OK);
 	cb->compressed_pages[0]->mapping = NULL;
 
-	end_compressed_writeback(inode, cb);
+	if (cb->writeback)
+		end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
 
 	/*
@@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				 unsigned long compressed_len,
 				 struct page **compressed_pages,
 				 unsigned long nr_pages,
-				 unsigned int write_flags)
+				 unsigned int write_flags, bool writeback)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio *bio = NULL;
@@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	cb->mirror_num = 0;
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
+	cb->writeback = writeback;
 	cb->orig_bio = NULL;
 	cb->nr_pages = nr_pages;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 4cb8be9ff88b..d4176384ec15 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -47,6 +47,9 @@ struct compressed_bio {
 	/* the compression algorithm for this bio */
 	int compress_type;
 
+	/* Whether this is a write for writeback. */
+	bool writeback;
+
 	/* number of compressed pages in the array */
 	unsigned long nr_pages;
 
@@ -93,7 +96,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				  unsigned long compressed_len,
 				  struct page **compressed_pages,
 				  unsigned long nr_pages,
-				  unsigned int write_flags);
+				  unsigned int write_flags, bool writeback);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19d669d12ca1..76962d319316 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2905,6 +2905,10 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
+
+ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			    struct encoded_iov *encoded);
+
 extern const struct dentry_operations btrfs_dentry_operations;
 
 /* ioctl.c */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 8fe4eb7e5045..068b7f2cc243 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1872,8 +1872,7 @@ static void update_time_for_write(struct inode *inode)
 		inode_inc_iversion(inode);
 }
 
-static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -1883,14 +1882,22 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	u64 end_pos;
 	ssize_t num_written = 0;
 	const bool sync = iocb->ki_flags & IOCB_DSYNC;
+	struct encoded_iov encoded;
 	ssize_t err;
 	loff_t pos;
 	size_t count;
 	loff_t oldsize;
 	int clean_page = 0;
 
-	if (!(iocb->ki_flags & IOCB_DIRECT) &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		err = import_encoded_write(iocb, &encoded, from);
+		if (err)
+			return err;
+	}
+
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+	    (!(iocb->ki_flags & IOCB_DIRECT) ||
+	     (iocb->ki_flags & IOCB_ENCODED)))
 		return -EOPNOTSUPP;
 
 	if (!inode_trylock(inode)) {
@@ -1899,14 +1906,27 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		inode_lock(inode);
 	}
 
-	err = generic_write_checks(iocb, from);
-	if (err <= 0) {
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		err = generic_encoded_write_checks(iocb, &encoded);
+		if (err) {
+			inode_unlock(inode);
+			return err;
+		}
+		count = encoded.unencoded_len;
+	} else {
+		err = generic_write_checks(iocb, from);
+		if (err < 0) {
+			inode_unlock(inode);
+			return err;
+		}
+		count = iov_iter_count(from);
+	}
+	if (count == 0) {
 		inode_unlock(inode);
 		return err;
 	}
 
 	pos = iocb->ki_pos;
-	count = iov_iter_count(from);
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		/*
 		 * We will allocate space in case nodatacow is not set,
@@ -1965,7 +1985,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		num_written = btrfs_encoded_write(iocb, from, &encoded);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
@@ -3440,7 +3462,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_ENCODED_IO;
 	return generic_file_open(inode, filp);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a0546401bc0a..90ca8537df8e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.objectid,
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
-				    async_chunk->write_flags)) {
+				    async_chunk->write_flags, true)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -10590,6 +10590,194 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
 	}
 }
 
+ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			    struct encoded_iov *encoded)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_state *cached_state = NULL;
+	int compression;
+	u64 disk_num_bytes, num_bytes;
+	u64 start, end;
+	unsigned long nr_pages, i;
+	struct page **pages;
+	struct btrfs_key ins;
+	struct extent_map *em;
+	ssize_t ret;
+
+	switch (encoded->compression) {
+	case ENCODED_IOV_COMPRESSION_ZLIB:
+		compression = BTRFS_COMPRESS_ZLIB;
+		break;
+	case ENCODED_IOV_COMPRESSION_LZO:
+		compression = BTRFS_COMPRESS_LZO;
+		break;
+	case ENCODED_IOV_COMPRESSION_ZSTD:
+		compression = BTRFS_COMPRESS_ZSTD;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	disk_num_bytes = iov_iter_count(from);
+
+	/* The extent size must be sane. */
+	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
+	    disk_num_bytes > BTRFS_MAX_COMPRESSED ||
+	    disk_num_bytes == 0)
+		return -EINVAL;
+
+	/*
+	 * The compressed data on disk must be sector-aligned. For convenience,
+	 * we extend the compressed data with zeroes if it isn't.
+	 */
+	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
+	/*
+	 * The extent in the file must also be sector-aligned. However, we allow
+	 * a write which ends at or extends i_size to have an unaligned length;
+	 * we round up the extent size and set i_size to the given length.
+	 */
+	start = iocb->ki_pos;
+	if (!IS_ALIGNED(start, fs_info->sectorsize))
+		return -EINVAL;
+	if (start + encoded->unencoded_len >= inode->i_size) {
+		num_bytes = ALIGN(encoded->unencoded_len, fs_info->sectorsize);
+	} else {
+		num_bytes = encoded->unencoded_len;
+		if (!IS_ALIGNED(num_bytes, fs_info->sectorsize))
+			return -EINVAL;
+	}
+	end = start + num_bytes - 1;
+
+	/*
+	 * It's valid for compressed data to be larger than or the same size as
+	 * the decompressed data. However, for buffered I/O, we never write out
+	 * a compressed extent unless it's smaller than the decompressed data,
+	 * so for now, let's not allow creating such extents explicity, either.
+	 */
+	if (disk_num_bytes >= num_bytes)
+		return -EINVAL;
+
+	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_USER);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		size_t bytes;
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_USER);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
+		if (copy_from_iter(kaddr, bytes, from) != bytes) {
+			kunmap(pages[i]);
+			ret = -EFAULT;
+			goto out_pages;
+		}
+		if (bytes < PAGE_SIZE)
+			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     end - start + 1);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
+				     &cached_state);
+		cond_resched();
+		ret = btrfs_wait_ordered_range(inode, start, end);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+	}
+
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
+					   num_bytes);
+	if (ret)
+		goto out_unlock;
+
+	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+
+	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
+			  ins.offset, ins.offset, num_bytes, compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserve;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
+						num_bytes, ins.offset,
+						BTRFS_ORDERED_COMPRESSED,
+						compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserve;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + encoded->unencoded_len > inode->i_size)
+		i_size_write(inode, start + encoded->unencoded_len);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
+
+	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
+					  ins.offset, pages, nr_pages, 0,
+					  false)) {
+		struct page *page = pages[0];
+
+		page->mapping = inode->i_mapping;
+		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
+		page->mapping = NULL;
+		ret = -EIO;
+		goto out_pages;
+	}
+	iocb->ki_pos += encoded->unencoded_len;
+	return encoded->unencoded_len;
+
+out_free_reserve:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
+				     true);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			put_page(pages[i]);
+	}
+	kvfree(pages);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-19  6:53 ` [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data Omar Sandoval
@ 2019-09-19 15:44   ` Jann Horn
  2019-09-20 16:25     ` Jens Axboe
  2019-09-24 17:15     ` Omar Sandoval
  0 siblings, 2 replies; 22+ messages in thread
From: Jann Horn @ 2019-09-19 15:44 UTC (permalink / raw)
  To: Omar Sandoval, Jens Axboe
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Linux API, Kernel Team,
	Andy Lutomirski

On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> Btrfs can transparently compress data written by the user. However, we'd
> like to add an interface to write pre-compressed data directly to the
> filesystem. This adds support for so-called "encoded writes" via
> pwritev2().
>
> A new RWF_ENCODED flags indicates that a write is "encoded". If this
> flag is set, iov[0].iov_base points to a struct encoded_iov which
> contains metadata about the write: namely, the compression algorithm and
> the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> must be set to sizeof(struct encoded_iov), which can be used to extend
> the interface in the future. The remaining iovecs contain the encoded
> extent.
>
> A similar interface for reading encoded data can be added to preadv2()
> in the future.
>
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
[...]
> +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> +                        struct iov_iter *from)
> +{
> +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> +               return -EINVAL;
> +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> +               return -EFAULT;
> +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> +               iocb->ki_flags &= ~IOCB_ENCODED;
> +               return 0;
> +       }
> +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> +               return -EINVAL;
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EPERM;

How does this capable() check interact with io_uring? Without having
looked at this in detail, I suspect that when an encoded write is
requested through io_uring, the capable() check might be executed on
something like a workqueue worker thread, which is probably running
with a full capability set.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/3] fs: pass READ/WRITE to kiocb_set_rw_flags()
  2019-09-19  6:53 ` [RFC PATCH 1/3] fs: pass READ/WRITE to kiocb_set_rw_flags() Omar Sandoval
@ 2019-09-20 14:38   ` Jan Kara
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2019-09-20 14:38 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, linux-api, kernel-team,
	Jan Kara, Jens Axboe

On Wed 18-09-19 23:53:44, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> A following change will want to check whether an IO is a read or write
> in kiocb_set_rw_flags(). Additionally, aio and io_uring currently set
> the IOCB_WRITE flag on a kiocb right before calling call_write_iter(),
> but we can move that into the common code.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
...
> index ffe35d97afcb..75c4b7680385 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3351,8 +3351,11 @@ static inline int iocb_flags(struct file *file)
>  	return res;
>  }
>  
> -static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> +static inline int kiocb_set_rw_flags(int rw, struct kiocb *ki, rwf_t flags)
>  {
> +	if (rw == WRITE)
> +		ki->ki_flags |= IOCB_WRITE;
> +
>  	if (unlikely(flags & ~RWF_SUPPORTED))
>  		return -EOPNOTSUPP;

I'd find it more natural if the destination argument (i.e., kiocb) stayed
to be the first argument of the function. Otherwise the patch looks good to
me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-19 15:44   ` Jann Horn
@ 2019-09-20 16:25     ` Jens Axboe
  2019-09-24 17:15     ` Omar Sandoval
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2019-09-20 16:25 UTC (permalink / raw)
  To: Jann Horn, Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Linux API, Kernel Team,
	Andy Lutomirski

On 9/19/19 9:44 AM, Jann Horn wrote:
> On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
>> Btrfs can transparently compress data written by the user. However, we'd
>> like to add an interface to write pre-compressed data directly to the
>> filesystem. This adds support for so-called "encoded writes" via
>> pwritev2().
>>
>> A new RWF_ENCODED flags indicates that a write is "encoded". If this
>> flag is set, iov[0].iov_base points to a struct encoded_iov which
>> contains metadata about the write: namely, the compression algorithm and
>> the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
>> must be set to sizeof(struct encoded_iov), which can be used to extend
>> the interface in the future. The remaining iovecs contain the encoded
>> extent.
>>
>> A similar interface for reading encoded data can be added to preadv2()
>> in the future.
>>
>> Filesystems must indicate that they support encoded writes by setting
>> FMODE_ENCODED_IO in ->file_open().
> [...]
>> +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
>> +                        struct iov_iter *from)
>> +{
>> +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
>> +               return -EINVAL;
>> +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
>> +               return -EFAULT;
>> +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
>> +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
>> +               iocb->ki_flags &= ~IOCB_ENCODED;
>> +               return 0;
>> +       }
>> +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
>> +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
>> +               return -EINVAL;
>> +       if (!capable(CAP_SYS_ADMIN))
>> +               return -EPERM;
> 
> How does this capable() check interact with io_uring? Without having
> looked at this in detail, I suspect that when an encoded write is
> requested through io_uring, the capable() check might be executed on
> something like a workqueue worker thread, which is probably running
> with a full capability set.

If we can hit -EAGAIN before doing the import in io_uring, then yes,
this will probably bypass the check as it'll only happen from the
worker.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-19 15:44   ` Jann Horn
  2019-09-20 16:25     ` Jens Axboe
@ 2019-09-24 17:15     ` Omar Sandoval
  2019-09-24 19:35       ` Omar Sandoval
  1 sibling, 1 reply; 22+ messages in thread
From: Omar Sandoval @ 2019-09-24 17:15 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, linux-fsdevel, linux-btrfs, Dave Chinner, Linux API,
	Kernel Team, Andy Lutomirski

On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > Btrfs can transparently compress data written by the user. However, we'd
> > like to add an interface to write pre-compressed data directly to the
> > filesystem. This adds support for so-called "encoded writes" via
> > pwritev2().
> >
> > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > contains metadata about the write: namely, the compression algorithm and
> > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > must be set to sizeof(struct encoded_iov), which can be used to extend
> > the interface in the future. The remaining iovecs contain the encoded
> > extent.
> >
> > A similar interface for reading encoded data can be added to preadv2()
> > in the future.
> >
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> [...]
> > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > +                        struct iov_iter *from)
> > +{
> > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > +               return -EINVAL;
> > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > +               return -EFAULT;
> > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > +               return 0;
> > +       }
> > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > +               return -EINVAL;
> > +       if (!capable(CAP_SYS_ADMIN))
> > +               return -EPERM;
> 
> How does this capable() check interact with io_uring? Without having
> looked at this in detail, I suspect that when an encoded write is
> requested through io_uring, the capable() check might be executed on
> something like a workqueue worker thread, which is probably running
> with a full capability set.

I discussed this more with Jens. You're right, per-IO permission checks
aren't going to work. In fully-polled mode, we never get an opportunity
to check capabilities in right context. So, this will probably require a
new open flag.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-24 17:15     ` Omar Sandoval
@ 2019-09-24 19:35       ` Omar Sandoval
  2019-09-24 20:01         ` Jann Horn
  0 siblings, 1 reply; 22+ messages in thread
From: Omar Sandoval @ 2019-09-24 19:35 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, linux-fsdevel, linux-btrfs, Dave Chinner, Linux API,
	Kernel Team, Andy Lutomirski

On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > Btrfs can transparently compress data written by the user. However, we'd
> > > like to add an interface to write pre-compressed data directly to the
> > > filesystem. This adds support for so-called "encoded writes" via
> > > pwritev2().
> > >
> > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > contains metadata about the write: namely, the compression algorithm and
> > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > the interface in the future. The remaining iovecs contain the encoded
> > > extent.
> > >
> > > A similar interface for reading encoded data can be added to preadv2()
> > > in the future.
> > >
> > > Filesystems must indicate that they support encoded writes by setting
> > > FMODE_ENCODED_IO in ->file_open().
> > [...]
> > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > +                        struct iov_iter *from)
> > > +{
> > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > +               return -EINVAL;
> > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > +               return -EFAULT;
> > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > +               return 0;
> > > +       }
> > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > +               return -EINVAL;
> > > +       if (!capable(CAP_SYS_ADMIN))
> > > +               return -EPERM;
> > 
> > How does this capable() check interact with io_uring? Without having
> > looked at this in detail, I suspect that when an encoded write is
> > requested through io_uring, the capable() check might be executed on
> > something like a workqueue worker thread, which is probably running
> > with a full capability set.
> 
> I discussed this more with Jens. You're right, per-IO permission checks
> aren't going to work. In fully-polled mode, we never get an opportunity
> to check capabilities in right context. So, this will probably require a
> new open flag.

Actually, file_ns_capable() accomplishes the same thing without a new
open flag. Changing the capable() check to file_ns_capable() in
init_user_ns should be enough.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-24 19:35       ` Omar Sandoval
@ 2019-09-24 20:01         ` Jann Horn
  2019-09-24 20:22           ` Christian Brauner
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jann Horn @ 2019-09-24 20:01 UTC (permalink / raw)
  To: Omar Sandoval, Aleksa Sarai
  Cc: Jens Axboe, linux-fsdevel, linux-btrfs, Dave Chinner, Linux API,
	Kernel Team, Andy Lutomirski

On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > like to add an interface to write pre-compressed data directly to the
> > > > filesystem. This adds support for so-called "encoded writes" via
> > > > pwritev2().
> > > >
> > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > contains metadata about the write: namely, the compression algorithm and
> > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > the interface in the future. The remaining iovecs contain the encoded
> > > > extent.
> > > >
> > > > A similar interface for reading encoded data can be added to preadv2()
> > > > in the future.
> > > >
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > [...]
> > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > +                        struct iov_iter *from)
> > > > +{
> > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > +               return -EINVAL;
> > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > +               return -EFAULT;
> > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > +               return 0;
> > > > +       }
> > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > +               return -EINVAL;
> > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > +               return -EPERM;
> > >
> > > How does this capable() check interact with io_uring? Without having
> > > looked at this in detail, I suspect that when an encoded write is
> > > requested through io_uring, the capable() check might be executed on
> > > something like a workqueue worker thread, which is probably running
> > > with a full capability set.
> >
> > I discussed this more with Jens. You're right, per-IO permission checks
> > aren't going to work. In fully-polled mode, we never get an opportunity
> > to check capabilities in right context. So, this will probably require a
> > new open flag.
>
> Actually, file_ns_capable() accomplishes the same thing without a new
> open flag. Changing the capable() check to file_ns_capable() in
> init_user_ns should be enough.

+Aleksa for openat2() and open() space

Mmh... but if the file descriptor has been passed through a privilege
boundary, it isn't really clear whether the original opener of the
file intended for this to be possible. For example, if (as a
hypothetical example) the init process opens a service's logfile with
root privileges, then passes the file descriptor to that logfile to
the service on execve(), that doesn't mean that the service should be
able to perform compressed writes into that file, I think.

I think that an open flag (as you already suggested) or an fcntl()
operation would do the job; but AFAIK the open() flag space has run
out, so if you hook it up that way, I think you might have to wait for
Aleksa Sarai to get something like his sys_openat2() suggestion
(https://lore.kernel.org/lkml/20190904201933.10736-12-cyphar@cyphar.com/)
merged?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-24 20:01         ` Jann Horn
@ 2019-09-24 20:22           ` Christian Brauner
  2019-09-24 20:50             ` Matthew Wilcox
  2019-09-24 20:38           ` Omar Sandoval
  2019-09-25  7:11           ` Dave Chinner
  2 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2019-09-24 20:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Omar Sandoval, Aleksa Sarai, Jens Axboe, linux-fsdevel,
	linux-btrfs, Dave Chinner, Linux API, Kernel Team,
	Andy Lutomirski

On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > like to add an interface to write pre-compressed data directly to the
> > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > pwritev2().
> > > > >
> > > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > extent.
> > > > >
> > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > in the future.
> > > > >
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > [...]
> > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > +                        struct iov_iter *from)
> > > > > +{
> > > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > +               return -EINVAL;
> > > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > +               return -EFAULT;
> > > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > +               return 0;
> > > > > +       }
> > > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > +               return -EINVAL;
> > > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > > +               return -EPERM;
> > > >
> > > > How does this capable() check interact with io_uring? Without having
> > > > looked at this in detail, I suspect that when an encoded write is
> > > > requested through io_uring, the capable() check might be executed on
> > > > something like a workqueue worker thread, which is probably running
> > > > with a full capability set.
> > >
> > > I discussed this more with Jens. You're right, per-IO permission checks
> > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > to check capabilities in right context. So, this will probably require a
> > > new open flag.
> >
> > Actually, file_ns_capable() accomplishes the same thing without a new
> > open flag. Changing the capable() check to file_ns_capable() in
> > init_user_ns should be enough.
> 
> +Aleksa for openat2() and open() space
> 
> Mmh... but if the file descriptor has been passed through a privilege
> boundary, it isn't really clear whether the original opener of the
> file intended for this to be possible. For example, if (as a
> hypothetical example) the init process opens a service's logfile with
> root privileges, then passes the file descriptor to that logfile to
> the service on execve(), that doesn't mean that the service should be
> able to perform compressed writes into that file, I think.

I think we should even generalize this: for most new properties a given
file descriptor can carry we would want it to be explicitly enabled such
that passing the fd around amounts to passing that property around. At
least as soon as we consider it to be associated with some privilege
boundary. I don't think we have done this generally. But I would very
much support moving to such a model.

Christian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-24 20:01         ` Jann Horn
  2019-09-24 20:22           ` Christian Brauner
@ 2019-09-24 20:38           ` Omar Sandoval
  2019-09-25  7:11           ` Dave Chinner
  2 siblings, 0 replies; 22+ messages in thread
From: Omar Sandoval @ 2019-09-24 20:38 UTC (permalink / raw)
  To: Jann Horn
  Cc: Aleksa Sarai, Jens Axboe, linux-fsdevel, linux-btrfs,
	Dave Chinner, Linux API, Kernel Team, Andy Lutomirski

On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > like to add an interface to write pre-compressed data directly to the
> > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > pwritev2().
> > > > >
> > > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > extent.
> > > > >
> > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > in the future.
> > > > >
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > [...]
> > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > +                        struct iov_iter *from)
> > > > > +{
> > > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > +               return -EINVAL;
> > > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > +               return -EFAULT;
> > > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > +               return 0;
> > > > > +       }
> > > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > +               return -EINVAL;
> > > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > > +               return -EPERM;
> > > >
> > > > How does this capable() check interact with io_uring? Without having
> > > > looked at this in detail, I suspect that when an encoded write is
> > > > requested through io_uring, the capable() check might be executed on
> > > > something like a workqueue worker thread, which is probably running
> > > > with a full capability set.
> > >
> > > I discussed this more with Jens. You're right, per-IO permission checks
> > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > to check capabilities in right context. So, this will probably require a
> > > new open flag.
> >
> > Actually, file_ns_capable() accomplishes the same thing without a new
> > open flag. Changing the capable() check to file_ns_capable() in
> > init_user_ns should be enough.
> 
> +Aleksa for openat2() and open() space
> 
> Mmh... but if the file descriptor has been passed through a privilege
> boundary, it isn't really clear whether the original opener of the
> file intended for this to be possible. For example, if (as a
> hypothetical example) the init process opens a service's logfile with
> root privileges, then passes the file descriptor to that logfile to
> the service on execve(), that doesn't mean that the service should be
> able to perform compressed writes into that file, I think.

Ahh, you're right.

> I think that an open flag (as you already suggested) or an fcntl()
> operation would do the job; but AFAIK the open() flag space has run
> out, so if you hook it up that way, I think you might have to wait for
> Aleksa Sarai to get something like his sys_openat2() suggestion
> (https://lore.kernel.org/lkml/20190904201933.10736-12-cyphar@cyphar.com/)
> merged?

If I counted correctly, there's still space for a new O_ flag. One of
the problems that Aleksa is solving is that unknown O_ flags are
silently ignored, which isn't an issue for an O_ENCODED flag. If the
kernel doesn't support it, it won't support RWF_ENCODED, either, so
you'll get EOPNOTSUPP from pwritev2(). So, open flag it is...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-24 20:22           ` Christian Brauner
@ 2019-09-24 20:50             ` Matthew Wilcox
  0 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2019-09-24 20:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs, Dave Chinner, Linux API, Kernel Team,
	Andy Lutomirski

On Tue, Sep 24, 2019 at 10:22:29PM +0200, Christian Brauner wrote:
> On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> > Mmh... but if the file descriptor has been passed through a privilege
> > boundary, it isn't really clear whether the original opener of the
> > file intended for this to be possible. For example, if (as a
> > hypothetical example) the init process opens a service's logfile with
> > root privileges, then passes the file descriptor to that logfile to
> > the service on execve(), that doesn't mean that the service should be
> > able to perform compressed writes into that file, I think.
> 
> I think we should even generalize this: for most new properties a given
> file descriptor can carry we would want it to be explicitly enabled such
> that passing the fd around amounts to passing that property around. At
> least as soon as we consider it to be associated with some privilege
> boundary. I don't think we have done this generally. But I would very
> much support moving to such a model.

I think you've got this right.  This needs to be an fcntl() flag, which
is only settable by root.  Now, should it be an O_ flag, modifiable by
F_SETFL, or should it be a new F_ flag?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-24 20:01         ` Jann Horn
  2019-09-24 20:22           ` Christian Brauner
  2019-09-24 20:38           ` Omar Sandoval
@ 2019-09-25  7:11           ` Dave Chinner
  2019-09-25 12:07             ` Colin Walters
  2019-09-26  0:36             ` Omar Sandoval
  2 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2019-09-25  7:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Omar Sandoval, Aleksa Sarai, Jens Axboe, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Andy Lutomirski

On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > like to add an interface to write pre-compressed data directly to the
> > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > pwritev2().
> > > > >
> > > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > extent.
> > > > >
> > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > in the future.
> > > > >
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > [...]
> > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > +                        struct iov_iter *from)
> > > > > +{
> > > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > +               return -EINVAL;
> > > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > +               return -EFAULT;
> > > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > +               return 0;
> > > > > +       }
> > > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > +               return -EINVAL;
> > > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > > +               return -EPERM;
> > > >
> > > > How does this capable() check interact with io_uring? Without having
> > > > looked at this in detail, I suspect that when an encoded write is
> > > > requested through io_uring, the capable() check might be executed on
> > > > something like a workqueue worker thread, which is probably running
> > > > with a full capability set.
> > >
> > > I discussed this more with Jens. You're right, per-IO permission checks
> > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > to check capabilities in right context. So, this will probably require a
> > > new open flag.
> >
> > Actually, file_ns_capable() accomplishes the same thing without a new
> > open flag. Changing the capable() check to file_ns_capable() in
> > init_user_ns should be enough.
> 
> +Aleksa for openat2() and open() space
> 
> Mmh... but if the file descriptor has been passed through a privilege
> boundary, it isn't really clear whether the original opener of the
> file intended for this to be possible. For example, if (as a
> hypothetical example) the init process opens a service's logfile with
> root privileges, then passes the file descriptor to that logfile to
> the service on execve(), that doesn't mean that the service should be
> able to perform compressed writes into that file, I think.

Where's the privilege boundary that is being crossed?

We're talking about user data read/write access here, not some
special security capability. Access to the data has already been
permission checked, so why should the format that the data is
supplied to the kernel in suddenly require new privilege checks?

i.e. writing encoded data to a file requires exactly the same
access permissions as writing cleartext data to the file. The only
extra information here is whether the _filesystem_ supports encoded
data, and that doesn't change regardless of what the open file gets
passed to. Hence the capability is either there or it isn't, it
doesn't transform not matter what privilege boundary the file is
passed across. Similarly, we have permission to access the data
or we don't through the struct file, it doesn't transform either.

Hence I don't see why CAP_SYS_ADMIN or any special permissions are
needed for an application with access permissions to file data to
use these RWF_ENCODED IO interfaces. I am inclined to think the
permission check here is wrong and should be dropped, and then all
these issues go away.

Yes, the app that is going to use this needs root perms because it
accesses all data in the fs (it's a backup app!), but that doesn't
mean you can only use RWF_ENCODED if you have root perms.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-25  7:11           ` Dave Chinner
@ 2019-09-25 12:07             ` Colin Walters
  2019-09-25 14:56               ` [RFC PATCH 2/3] " Chris Mason
                                 ` (2 more replies)
  2019-09-26  0:36             ` Omar Sandoval
  1 sibling, 3 replies; 22+ messages in thread
From: Colin Walters @ 2019-09-25 12:07 UTC (permalink / raw)
  To: Dave Chinner, Jann Horn
  Cc: Omar Sandoval, Aleksa Sarai, Jens Axboe, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Andy Lutomirski



On Wed, Sep 25, 2019, at 3:11 AM, Dave Chinner wrote:
>
> We're talking about user data read/write access here, not some
> special security capability. Access to the data has already been
> permission checked, so why should the format that the data is
> supplied to the kernel in suddenly require new privilege checks?

What happens with BTRFS today if userspace provides invalid compressed data via this interface?  Does that show up as filesystem corruption later?  If the data is verified at write time, wouldn't that be losing most of the speed advantages of providing pre-compressed data?

Ability for a user to cause fsck errors later would be a new thing that would argue for a privilege check I think.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] add RWF_ENCODED for writing compressed data
  2019-09-25 12:07             ` Colin Walters
@ 2019-09-25 14:56               ` Chris Mason
  2019-09-26 12:17                 ` Colin Walters
  2019-09-25 15:08               ` [RFC PATCH 2/3] fs: " Theodore Y. Ts'o
  2019-09-25 22:52               ` Dave Chinner
  2 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2019-09-25 14:56 UTC (permalink / raw)
  To: Colin Walters
  Cc: Dave Chinner, Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team,
	Andy Lutomirski

On 25 Sep 2019, at 8:07, Colin Walters wrote:

> On Wed, Sep 25, 2019, at 3:11 AM, Dave Chinner wrote:
>>
>> We're talking about user data read/write access here, not some
>> special security capability. Access to the data has already been
>> permission checked, so why should the format that the data is
>> supplied to the kernel in suddenly require new privilege checks?
>
> What happens with BTRFS today if userspace provides invalid compressed 
> data via this interface?  Does that show up as filesystem corruption 
> later?  If the data is verified at write time, wouldn't that be losing 
> most of the speed advantages of providing pre-compressed data?

The data is verified while being decompressed, but that's a fairly large 
fuzzing surface (all of zstd, zlib, and lzo).  A lot of people will 
correctly argue that we already have that fuzzing surface today, but I'd 
rather not make a really easy way to stuff arbitrary bytes through the 
kernel decompression code until all the projects involved sign off.

-chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-25 12:07             ` Colin Walters
  2019-09-25 14:56               ` [RFC PATCH 2/3] " Chris Mason
@ 2019-09-25 15:08               ` Theodore Y. Ts'o
  2019-09-25 22:52               ` Dave Chinner
  2 siblings, 0 replies; 22+ messages in thread
From: Theodore Y. Ts'o @ 2019-09-25 15:08 UTC (permalink / raw)
  To: Colin Walters
  Cc: Dave Chinner, Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team,
	Andy Lutomirski

On Wed, Sep 25, 2019 at 08:07:12AM -0400, Colin Walters wrote:
> 
> 
> On Wed, Sep 25, 2019, at 3:11 AM, Dave Chinner wrote:
> >
> > We're talking about user data read/write access here, not some
> > special security capability. Access to the data has already been
> > permission checked, so why should the format that the data is
> > supplied to the kernel in suddenly require new privilege checks?
> 
> What happens with BTRFS today if userspace provides invalid
> compressed data via this interface?  Does that show up as filesystem
> corruption later?  If the data is verified at write time, wouldn't
> that be losing most of the speed advantages of providing
> pre-compressed data?

Not necessarily, most compression algorithms are far more expensive to
compress than to decompress.

If there is a buggy decompressor, it's possible that invalid data
could result in a buffer overrun.  So that's an argument for verifying
the compressed code at write time.  OTOH, the verification could be
just as vulnerability to invalid data as the decompressor, so it
doesn't buy you that much.

> Ability for a user to cause fsck errors later would be a new thing
> that would argue for a privilege check I think.

Well, if it's only invalid data in a user file, there's no reason why
it should cause the kernel declare that the file system is corrupt; it
can just return EIO.

What fsck does is a different question, of course; it might be that
the fsck code isn't going to check compressed user data.  After all,
if all of the files on the file system are compressed, requiring fsck
to check all compressed data blocks is tantamount to requiring it to
read all of the blocks in the file system.  Much better would be some
kind of online scrub operation which validates data files while the
file system is mounted and the system can be in a serving state.

						- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-25 12:07             ` Colin Walters
  2019-09-25 14:56               ` [RFC PATCH 2/3] " Chris Mason
  2019-09-25 15:08               ` [RFC PATCH 2/3] fs: " Theodore Y. Ts'o
@ 2019-09-25 22:52               ` Dave Chinner
  2 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2019-09-25 22:52 UTC (permalink / raw)
  To: Colin Walters
  Cc: Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team,
	Andy Lutomirski

On Wed, Sep 25, 2019 at 08:07:12AM -0400, Colin Walters wrote:
> 
> 
> On Wed, Sep 25, 2019, at 3:11 AM, Dave Chinner wrote:
> >
> > We're talking about user data read/write access here, not some
> > special security capability. Access to the data has already been
> > permission checked, so why should the format that the data is
> > supplied to the kernel in suddenly require new privilege checks?
> 
> What happens with BTRFS today if userspace provides invalid compressed
> data via this interface?

Then the filesystem returns EIO or ENODATA on read because it can't
decompress it.

User can read it back in compressed format (i.e. same way they wrote
it), try to fix it themselves.

> Does that show up as filesystem corruption later?

Nope. Just bad user data.

> If the data is verified at write time, wouldn't that be losing most of the speed advantages of providing pre-compressed data?

It's a direct IO interface. User writes garbage, then they get
garbage back. The user can still retreive the compressed data
directly, so the data is not lost....

> Ability for a user to cause fsck errors later would be a new thing
> that would argue for a privilege check I think.

fsck doesn't validate the correctness of user data - it validates
the filesystem structure is consistent. i.e. user data in unreadable
format != corrupt filesystem structure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
  2019-09-25  7:11           ` Dave Chinner
  2019-09-25 12:07             ` Colin Walters
@ 2019-09-26  0:36             ` Omar Sandoval
  1 sibling, 0 replies; 22+ messages in thread
From: Omar Sandoval @ 2019-09-26  0:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jann Horn, Aleksa Sarai, Jens Axboe, linux-fsdevel, linux-btrfs,
	Linux API, Kernel Team, Andy Lutomirski

On Wed, Sep 25, 2019 at 05:11:29PM +1000, Dave Chinner wrote:
> On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> > On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> > > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > > like to add an interface to write pre-compressed data directly to the
> > > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > > pwritev2().
> > > > > >
> > > > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > > extent.
> > > > > >
> > > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > > in the future.
> > > > > >
> > > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > > FMODE_ENCODED_IO in ->file_open().
> > > > > [...]
> > > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > > +                        struct iov_iter *from)
> > > > > > +{
> > > > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > > +               return -EINVAL;
> > > > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > > +               return -EFAULT;
> > > > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > > +               return 0;
> > > > > > +       }
> > > > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > > +               return -EINVAL;
> > > > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > > > +               return -EPERM;
> > > > >
> > > > > How does this capable() check interact with io_uring? Without having
> > > > > looked at this in detail, I suspect that when an encoded write is
> > > > > requested through io_uring, the capable() check might be executed on
> > > > > something like a workqueue worker thread, which is probably running
> > > > > with a full capability set.
> > > >
> > > > I discussed this more with Jens. You're right, per-IO permission checks
> > > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > > to check capabilities in right context. So, this will probably require a
> > > > new open flag.
> > >
> > > Actually, file_ns_capable() accomplishes the same thing without a new
> > > open flag. Changing the capable() check to file_ns_capable() in
> > > init_user_ns should be enough.
> > 
> > +Aleksa for openat2() and open() space
> > 
> > Mmh... but if the file descriptor has been passed through a privilege
> > boundary, it isn't really clear whether the original opener of the
> > file intended for this to be possible. For example, if (as a
> > hypothetical example) the init process opens a service's logfile with
> > root privileges, then passes the file descriptor to that logfile to
> > the service on execve(), that doesn't mean that the service should be
> > able to perform compressed writes into that file, I think.
> 
> Where's the privilege boundary that is being crossed?
> 
> We're talking about user data read/write access here, not some
> special security capability. Access to the data has already been
> permission checked, so why should the format that the data is
> supplied to the kernel in suddenly require new privilege checks?
> 
> i.e. writing encoded data to a file requires exactly the same
> access permissions as writing cleartext data to the file. The only
> extra information here is whether the _filesystem_ supports encoded
> data, and that doesn't change regardless of what the open file gets
> passed to. Hence the capability is either there or it isn't, it
> doesn't transform not matter what privilege boundary the file is
> passed across. Similarly, we have permission to access the data
> or we don't through the struct file, it doesn't transform either.
> 
> Hence I don't see why CAP_SYS_ADMIN or any special permissions are
> needed for an application with access permissions to file data to
> use these RWF_ENCODED IO interfaces. I am inclined to think the
> permission check here is wrong and should be dropped, and then all
> these issues go away.
> 
> Yes, the app that is going to use this needs root perms because it
> accesses all data in the fs (it's a backup app!), but that doesn't
> mean you can only use RWF_ENCODED if you have root perms.

For RWF_ENCODED writes, the risk here is that we'd be adding a way for
an unprivileged process to feed arbitrary data to zlib/lzo/zstd in the
kernel. From what I could find, this is a new attack surface for
unprivileged processes, and based on the search results for
"$compression_algorithm CVE", there are real bugs here.

For RWF_ENCODED reads, there's another potential issue that occurred to
me. There are a few operations for which we may need to chop up a
compressed extent: hole punch, truncate, reflink, and dedupe. Rather
than recompressing the data, Btrfs keeps the whole extent on disk and
updates the file metadata to refer to a piece of the extent. If we want
to support RWF_ENCODED reads for such extents (and I think we do), then
we need to return the entire original extent along with that metadata.
For an unprivileged reader, there's a security issue that we may be
returning data that the reader wasn't supposed to see. (A privileged
reader can go and read the block device anyways.)

So, in my opinion, both reads and writes should require privilege just
to be on the safe side.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] add RWF_ENCODED for writing compressed data
  2019-09-25 14:56               ` [RFC PATCH 2/3] " Chris Mason
@ 2019-09-26 12:17                 ` Colin Walters
  2019-09-26 17:46                   ` Omar Sandoval
  0 siblings, 1 reply; 22+ messages in thread
From: Colin Walters @ 2019-09-26 12:17 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team,
	Andy Lutomirski



On Wed, Sep 25, 2019, at 10:56 AM, Chris Mason wrote:

> The data is verified while being decompressed, but that's a fairly large 
> fuzzing surface (all of zstd, zlib, and lzo).  A lot of people will 
> correctly argue that we already have that fuzzing surface today, but I'd 
> rather not make a really easy way to stuff arbitrary bytes through the 
> kernel decompression code until all the projects involved sign off.

Right.  So maybe have this start of as a BTRFS ioctl and require
privileges?   I assume that's sufficient for what Omar wants.

(Are there actually any other popular Linux filesystems that do transparent compression anyways?)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] add RWF_ENCODED for writing compressed data
  2019-09-26 12:17                 ` Colin Walters
@ 2019-09-26 17:46                   ` Omar Sandoval
  0 siblings, 0 replies; 22+ messages in thread
From: Omar Sandoval @ 2019-09-26 17:46 UTC (permalink / raw)
  To: Colin Walters
  Cc: Chris Mason, Dave Chinner, Jann Horn, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team,
	Andy Lutomirski

On Thu, Sep 26, 2019 at 08:17:12AM -0400, Colin Walters wrote:
> 
> 
> On Wed, Sep 25, 2019, at 10:56 AM, Chris Mason wrote:
> 
> > The data is verified while being decompressed, but that's a fairly large 
> > fuzzing surface (all of zstd, zlib, and lzo).  A lot of people will 
> > correctly argue that we already have that fuzzing surface today, but I'd 
> > rather not make a really easy way to stuff arbitrary bytes through the 
> > kernel decompression code until all the projects involved sign off.
> 
> Right.  So maybe have this start of as a BTRFS ioctl and require
> privileges?   I assume that's sufficient for what Omar wants.

That was the first version of this series, but Dave requested that I
make it generic [1].

> (Are there actually any other popular Linux filesystems that do transparent compression anyways?)

A scan over the kernel tree shows that a few other filesystems do
compression:

- jffs2
- pstore (if you can call that a filesystem)
- ubifs
- cramfs (read-only)
- erofs (read-only)
- squashfs (read-only)

None of the "mainstream" general-purpose filesystems have support, but
that was also the case for reflink/dedupe before XFS added support.

1: https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-09-26 17:46 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-19  6:53 [RFC PATCH 0/3] fs: interface for directly writing encoded (e.g., compressed) data Omar Sandoval
2019-09-19  6:53 ` [RFC PATCH 1/3] fs: pass READ/WRITE to kiocb_set_rw_flags() Omar Sandoval
2019-09-20 14:38   ` Jan Kara
2019-09-19  6:53 ` [PATCH] readv.2: Document new RWF_ENCODED flag to pwritev2() Omar Sandoval
2019-09-19  6:53 ` [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data Omar Sandoval
2019-09-19 15:44   ` Jann Horn
2019-09-20 16:25     ` Jens Axboe
2019-09-24 17:15     ` Omar Sandoval
2019-09-24 19:35       ` Omar Sandoval
2019-09-24 20:01         ` Jann Horn
2019-09-24 20:22           ` Christian Brauner
2019-09-24 20:50             ` Matthew Wilcox
2019-09-24 20:38           ` Omar Sandoval
2019-09-25  7:11           ` Dave Chinner
2019-09-25 12:07             ` Colin Walters
2019-09-25 14:56               ` [RFC PATCH 2/3] " Chris Mason
2019-09-26 12:17                 ` Colin Walters
2019-09-26 17:46                   ` Omar Sandoval
2019-09-25 15:08               ` [RFC PATCH 2/3] fs: " Theodore Y. Ts'o
2019-09-25 22:52               ` Dave Chinner
2019-09-26  0:36             ` Omar Sandoval
2019-09-19  6:53 ` [RFC PATCH 3/3] btrfs: implement encoded (compressed) writes Omar Sandoval

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).