All of lore.kernel.org
 help / color / mirror / Atom feed
From: Omar Sandoval <osandov@osandov.com>
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>, Jann Horn <jannh@google.com>,
	Amir Goldstein <amir73il@gmail.com>,
	Aleksa Sarai <cyphar@cyphar.com>,
	linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH man-pages v8] Document encoded I/O
Date: Tue, 16 Mar 2021 12:42:56 -0700	[thread overview]
Message-ID: <03cb1c219989868f064782cb0b9c1011af8a0a9e.1615923241.git.osandov@osandov.com> (raw)
In-Reply-To: <cover.1615922644.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

This adds a new page, encoded_io(7), providing an overview of encoded
I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
reference it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 man2/fcntl.2      |   8 +
 man2/open.2       |  13 ++
 man2/readv.2      |  69 +++++++++
 man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 459 insertions(+)
 create mode 100644 man7/encoded_io.7

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index a15f467ef..1081cb70f 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -217,6 +217,7 @@ in
 .I arg
 are ignored.
 On Linux, this command can change only the
+.BR O_ALLOW_ENCODED ,
 .BR O_APPEND ,
 .BR O_ASYNC ,
 .BR O_DIRECT ,
@@ -1815,6 +1816,13 @@ and the soft or hard user pipe limit has been reached; see
 .BR pipe (7).
 .TP
 .B EPERM
+Attempted to set the
+.B O_ALLOW_ENCODED
+flag and the calling process did not have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
+.B EPERM
 Attempted to clear the
 .B O_APPEND
 flag on a file that has the append-only attribute set.
diff --git a/man2/open.2 b/man2/open.2
index 03fff1b65..0456a173b 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -180,6 +180,14 @@ for details.
 .PP
 The full list of file creation flags and file status flags is as follows:
 .TP
+.B O_ALLOW_ENCODED
+Open the file with encoded I/O permissions;
+see
+.BR encoded_io (7).
+The caller must have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
 .B O_APPEND
 The file is opened in append mode.
 Before each
@@ -1232,6 +1240,11 @@ did not match the owner of the file and the caller was not privileged.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B O_ALLOW_ENCODED
+flag was specified, but the caller was not privileged.
+.TP
 .B EROFS
 .I pathname
 refers to a file on a read-only filesystem and write access was
diff --git a/man2/readv.2 b/man2/readv.2
index 472adcf73..c7aec9fa7 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -263,6 +263,11 @@ the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.13)"
+Read or write encoded (e.g., compressed) data.
+See
+.BR encoded_io (7).
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -282,6 +287,12 @@ than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the number of encoded bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set to indicate the error.
 .SH ERRORS
 The errors are as given for
@@ -312,6 +323,64 @@ is less than zero or greater than the permitted maximum.
 .TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the file was not opened with the
+.B O_ALLOW_ENCODED
+flag.
+.PP
+.BR preadv2 ()
+can additionally fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+is not large enough to return the encoding metadata.
+.TP
+.B ENOBUFS
+.B RWF_ENCODED
+is specified in
+.I flags
+and the buffers in
+.I iov
+are not big enough to return the encoded data.
+.PP
+.BR pwritev2 ()
+can additionally fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+contains non-zero fields
+after the kernel's
+.IR "sizeof(struct encoded_iov)" .
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the encoding is unknown or not supported by the filesystem.
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
 .SH VERSIONS
 .BR preadv ()
 and
diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
new file mode 100644
index 000000000..1f5e6a510
--- /dev/null
+++ b/man7/encoded_io.7
@@ -0,0 +1,369 @@
+.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\"
+.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
+.SH NAME
+encoded_io \- overview of encoded I/O
+.SH DESCRIPTION
+Several filesystems (e.g., Btrfs) support transparent encoding
+(e.g., compression, encryption) of data on disk:
+written data is encoded by the kernel before it is written to disk,
+and read data is decoded before being returned to the user.
+In some cases, it is useful to skip this encoding step.
+For example, the user may want to read the compressed contents of a file
+or write pre-compressed data directly to a file.
+This is referred to as "encoded I/O".
+.SS Encoded I/O API
+Encoded I/O is specified with the
+.B RWF_ENCODED
+flag to
+.BR preadv2 (2)
+and
+.BR pwritev2 (2).
+If
+.B RWF_ENCODED
+is specified, then
+.I iov[0].iov_base
+points to an
+.I encoded_iov
+structure, defined in
+.I <linux/fs.h>
+as:
+.PP
+.in +4n
+.EX
+struct encoded_iov {
+    __aligned_u64 len;
+    __aligned_u64 unencoded_len;
+    __aligned_u64 unencoded_offset;
+    __u32 compression;
+    __u32 encryption;
+};
+.EE
+.in
+.PP
+This may be extended in the future, so
+.I iov[0].iov_len
+must be set to
+.I sizeof(struct encoded_iov)
+for forward/backward compatibility.
+The remaining buffers contain the encoded data.
+.PP
+.I compression
+and
+.I encryption
+are the encoding fields.
+.I compression
+is
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero)
+or a filesystem-specific
+.B ENCODED_IOV_COMPRESSION_*
+constant;
+see
+.B "Filesystem support"
+below.
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.PP
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+.I unencoded_offset
+is the offset from the first byte of the unencoded data
+to the first byte of logical data in the file
+(less than or equal to
+.IR unencoded_len ).
+.I len
+is the length of the data in the file
+(less than or equal to
+.I unencoded_len
+-
+.IR unencoded_offset ).
+See
+.B Extent layout
+below for some examples.
+.PP
+If the unencoded data is actually longer than
+.IR unencoded_len ,
+then it is truncated;
+if it is shorter, then it is extended with zeroes.
+.PP
+.BR pwritev2 (2)
+uses the metadata specified in
+.IR iov[0] ,
+writes the encoded data from the remaining buffers,
+and returns the number of encoded bytes written
+(that is, the sum of
+.I iov[n].iov_len
+for 1 <=
+.I n
+<
+.IR iovcnt ;
+partial writes will not occur).
+At least one encoding field must be non-zero.
+Note that the encoded data is not validated when it is written;
+if it is not valid (e.g., it cannot be decompressed),
+then a subsequent read may return an error.
+If the
+.I offset
+argument to
+.BR pwritev2 (2)
+is -1, then the file offset is incremented by
+.IR len .
+If
+.I iov[0].iov_len
+is less than
+.I sizeof(struct encoded_iov)
+in the kernel,
+then any fields unknown to user space are treated as if they were zero;
+if it is greater and any fields unknown to the kernel are non-zero,
+then this returns -1 and sets
+.I errno
+to
+.BR E2BIG .
+.PP
+.BR preadv2 (2)
+populates the metadata in
+.IR iov[0] ,
+the encoded data in the remaining buffers,
+and returns the number of encoded bytes read.
+This will only return one extent per call.
+This can also read data which is not encoded;
+all encoding fields will be zero in that case.
+If the
+.I offset
+argument to
+.BR preadv2 (2)
+is -1, then the file offset is incremented by
+.IR len .
+If
+.I iov[0].iov_len
+is less than
+.I sizeof(struct encoded_iov)
+in the kernel and any fields unknown to user space are non-zero,
+then
+.BR preadv2 (2)
+returns -1 and sets
+.I errno
+to
+.BR E2BIG ;
+if it is greater,
+then any fields unknown to the kernel are returned as zero.
+If the provided buffers are not large enough
+to return an entire encoded extent,
+then
+.BR preadv2 (2)
+returns -1 and sets
+.I errno
+to
+.BR ENOBUFS .
+.PP
+As the filesystem page cache typically contains decoded data,
+encoded I/O bypasses the page cache.
+.SS Extent layout
+By using
+.IR len ,
+.IR unencoded_len ,
+and
+.IR unencoded_offset ,
+it is possible to refer to a subset of an unencoded extent.
+.PP
+In the simplest case,
+.I len
+is equal to
+.I unencoded_len
+and
+.I unencoded_offset
+is zero.
+This means that the entire unencoded extent is used.
+.PP
+However, suppose we read 50 bytes into a file
+which contains a single compressed extent.
+The filesystem must still return the entire compressed extent
+for us to be able to decompress it,
+so
+.I unencoded_len
+would be the length of the entire decompressed extent.
+However, because the read was at offset 50,
+the first 50 bytes should be ignored.
+Therefore,
+.I unencoded_offset
+would be 50,
+and
+.I len
+would accordingly be
+.I unencoded_len
+- 50.
+.PP
+Additionally, suppose we want to create an encrypted file with length 500,
+but the file is encrypted with a block cipher using a block size of 4096.
+The unencoded data would therefore include the appropriate padding,
+and
+.I unencoded_len
+would be 4096.
+However, to represent the logical size of the file,
+.I len
+would be 500
+(and
+.I unencoded_offset
+would be 0).
+.PP
+Similar situations can arise in other cases:
+.IP * 3
+If the filesystem pads data to the filesystem block size before compressing,
+then compressed files with a size unaligned to the filesystem block size
+will end with an extent with
+.I len
+<
+.IR unencoded_len .
+.IP *
+Extents cloned from the middle of a larger encoded extent with
+.B FICLONERANGE
+may have a non-zero
+.I unencoded_offset
+and/or
+.I len
+<
+.IR unencoded_len .
+.IP *
+If the middle of an encoded extent is overwritten,
+the filesystem may create extents with a non-zero
+.I unencoded_offset
+and/or
+.I len
+<
+.I unencoded_len
+for the parts that were not overwritten.
+.SS Security
+Encoded I/O creates the potential for some security issues:
+.IP * 3
+Encoded writes allow writing arbitrary data
+which the kernel will decode on a subsequent read.
+Decompression algorithms are complex
+and may have bugs which can be exploited by maliciously crafted data.
+.IP *
+Encoded reads may return data which is not logically present in the file
+(see the discussion of
+.I len
+vs
+.I unencoded_len
+above).
+It may not be intended for this data to be readable.
+.PP
+Therefore, encoded I/O requires privilege.
+Namely, the
+.B RWF_ENCODED
+flag may only be used if the file description has the
+.B O_ALLOW_ENCODED
+file status flag set,
+and the
+.B O_ALLOW_ENCODED
+flag may only be set by a thread with the
+.B CAP_SYS_ADMIN
+capability.
+The
+.B O_ALLOW_ENCODED
+flag can be set by
+.BR open (2)
+or
+.BR fcntl (2).
+It can also be cleared by
+.BR fcntl (2);
+clearing it does not require
+.B CAP_SYS_ADMIN.
+Note that it is not cleared on
+.BR fork (2)
+or
+.BR execve (2).
+One may wish to use
+.B O_CLOEXEC
+with
+.BR O_ALLOW_ENCODED .
+.SS Filesystem support
+Encoded I/O is supported on the following filesystems:
+.TP
+Btrfs (since Linux 5.13)
+.IP
+Btrfs supports encoded reads and writes of compressed data.
+The data is encoded as follows:
+.RS
+.IP * 3
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
+then the encoded data is a single zlib stream.
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
+then the encoded data is a single zstd frame compressed with the
+.I windowLog
+compression parameter set to no more than 17.
+.IP *
+If
+.I compression
+is one of
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
+or
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
+then the encoded data is compressed page by page
+(using the page size indicated by the name of the constant)
+with LZO1X
+and wrapped in the format documented in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.RE
+.IP
+Additionally, there are some restrictions on
+.BR pwritev2 (2):
+.RS
+.IP * 3
+.I offset
+(or the current file offset if
+.I offset
+is -1) must be aligned to the sector size of the filesystem.
+.IP *
+.I len
+must be aligned to the sector size of the filesystem
+unless the data ends at or beyond the current end of the file.
+.IP *
+.I unencoded_len
+and the length of the encoded data must each be no more than 128 KiB.
+This limit may increase in the future.
+.IP *
+The length of the encoded data must be less than or equal to
+.IR unencoded_len .
+.IP *
+If using LZO, the filesystem's page size must match the compression page size.
+.RE
+.SH SEE ALSO
+.BR open (2),
+.BR preadv2 (2)
-- 
2.30.2


  reply	other threads:[~2021-03-16 19:44 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-16 19:42 [PATCH v8 00/10] fs: interface for directly reading/writing compressed data Omar Sandoval
2021-03-16 19:42 ` Omar Sandoval [this message]
2021-03-16 19:42 ` [PATCH v8 01/10] iov_iter: add copy_struct_from_iter() Omar Sandoval
2021-03-17 17:56   ` Christian Brauner
2021-03-17 18:45     ` Omar Sandoval
2021-03-20 10:04       ` Christian Brauner
2021-03-16 19:42 ` [PATCH v8 02/10] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
2021-03-16 19:42 ` [PATCH v8 03/10] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
2021-03-16 19:43 ` [PATCH v8 04/10] btrfs: fix check_data_csum() error message for direct I/O Omar Sandoval
2021-03-17 11:21   ` Qu Wenruo
2021-03-17 18:33     ` Omar Sandoval
2021-03-17 23:47       ` Qu Wenruo
2021-03-18 20:25         ` David Sterba
2021-03-16 19:43 ` [PATCH v8 05/10] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
2021-03-16 19:43 ` [PATCH v8 06/10] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
2021-03-16 19:43 ` [PATCH v8 07/10] btrfs: support different disk extent size for delalloc Omar Sandoval
2021-03-16 19:43 ` [PATCH v8 08/10] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
2021-03-16 19:43 ` [PATCH v8 09/10] btrfs: implement RWF_ENCODED reads Omar Sandoval
2021-03-16 19:43 ` [PATCH v8 10/10] btrfs: implement RWF_ENCODED writes Omar Sandoval
2021-03-19 18:21 ` [PATCH v8 00/10] fs: interface for directly reading/writing compressed data Josef Bacik
2021-03-19 20:08   ` Linus Torvalds
2021-03-19 20:12     ` Josef Bacik
2021-03-19 20:27     ` Omar Sandoval
2021-03-19 20:43       ` Christian Brauner
2021-03-19 20:55       ` Linus Torvalds
2021-03-19 21:11         ` Omar Sandoval
2021-03-19 21:47           ` Linus Torvalds
2021-03-19 22:46             ` Omar Sandoval
2021-03-20  0:31               ` Linus Torvalds
2021-03-20 20:39                 ` Omar Sandoval

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=03cb1c219989868f064782cb0b9c1011af8a0a9e.1615923241.git.osandov@osandov.com \
    --to=osandov@osandov.com \
    --cc=amir73il@gmail.com \
    --cc=cyphar@cyphar.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jannh@google.com \
    --cc=kernel-team@fb.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.