Re: [PATCH man-pages v5] Document encoded I/O

From: Amir Goldstein <amir73il@gmail.com>
To: Omar Sandoval <osandov@osandov.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>, Jann Horn <jannh@google.com>,
	Aleksa Sarai <cyphar@cyphar.com>,
	Linux API <linux-api@vger.kernel.org>,
	kernel-team@fb.com, Michael Kerrisk <mtk.manpages@gmail.com>,
	linux-man <linux-man@vger.kernel.org>
Subject: Re: [PATCH man-pages v5] Document encoded I/O
Date: Fri, 21 Aug 2020 12:24:48 +0300	[thread overview]
Message-ID: <CAOQ4uxgEpYqQ9MeuS=76tOtjFCrL8urkDoPoHxu+A5s4C2HGRA@mail.gmail.com> (raw)
In-Reply-To: <64cc229872230dc6998a3dbf2264513870a8a6f6.1597994017.git.osandov@osandov.com>

On Fri, Aug 21, 2020 at 10:38 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This adds a new page, encoded_io(7), providing an overview of encoded
> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> reference it.
>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: linux-man <linux-man@vger.kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---

Omar,

Thanks for making the clarifications. Some questions below.

[...]

> +.PP
> +As the filesystem page cache typically contains decoded data,
> +encoded I/O bypasses the page cache.
> +.SS Extent layout
> +By using
> +.IR len ,
> +.IR unencoded_len ,
> +and
> +.IR unencoded_offset ,
> +it is possible to refer to a subset of an unencoded extent.
> +.PP
> +In the simplest case,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +This means that the entire unencoded extent is used.
> +.PP
> +However, suppose we read 50 bytes into a file
> +which contains a single compressed extent.
> +The filesystem must still return the entire compressed extent
> +for us to be able to decompress it,
> +so
> +.I unencoded_len
> +would be the length of the entire decompressed extent.
> +However, because the read was at offset 50,
> +the first 50 bytes should be ignored.
> +Therefore,
> +.I unencoded_offset
> +would be 50,
> +and
> +.I len
> +would accordingly be
> +.IR unencoded_len\ -\ 50 .
> +.PP
> +Additionally, suppose we want to create an encrypted file with length 500,
> +but the file is encrypted with a block cipher using a block size of 4096.
> +The unencoded data would therefore include the appropriate padding,
> +and
> +.I unencoded_len
> +would be 4096.
> +However, to represent the logical size of the file,
> +.I len
> +would be 500
> +(and
> +.I unencoded_offset
> +would be 0).
> +.PP
> +Similar situations can arise in other cases:
> +.IP * 3
> +If the filesystem pads data to the filesystem block size before compressing,
> +then compressed files with a size unaligned to the filesystem block size will
> +end with an extent with
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +Extents cloned from the middle of a larger encoded extent with
> +.B FICLONERANGE
> +may have a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +If the middle of an encoded extent is overwritten,
> +the filesystem may create extents with a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.I unencoded_len
> +for the parts that were not overwritten.

So in this case, would the reader be getting extents "out of unencoded order"?
e.g. unencoded range 0..4096 and then unencoded range 10..20?
Or would reader be reading the encoded full block twice, once for
ragne 0..10 and once for range 20..4096?

> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by maliciously crafted data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.
> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ALLOW_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +.B O_ALLOW_ENCODED
> +may be set and cleared with
> +.BR fcntl (2).
> +Note that it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2);
> +one may wish to use
> +.B O_CLOEXEC
> +with
> +.BR O_ALLOW_ENCODED .
> +.SS Filesystem support
> +Encoded I/O is supported on the following filesystems:
> +.TP
> +Btrfs (since Linux 5.10)
> +.IP
> +Btrfs supports encoded reads and writes of compressed data.
> +The data is encoded as follows:
> +.RS
> +.IP * 3
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +then the encoded data is a single zlib stream.
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +then the encoded data is compressed page by page with LZO1X
> +and wrapped in the format documented in the Linux kernel source file
> +.IR fs/btrfs/lzo.c .

:-/ So maybe call it ENCODED_IOV_COMPRESSION_BTRFS_LZO?

I understand why you want the encoding format not to be opaque, but
I imagine the encoded data is not going to be migrated as is between
different filesystems. So just call it for what it is - a private
filesystem encoding
format. If you have a format that is standard and other filesystems are likely
to use, fine, but let's not make an API that discourages using
"private" encoding, just for the sake of it and make life harder for no good
reason.

All the reader of this man page may be interested to know is which
filesystems are expected to support which encoding types and a general
description of what they mean (as you did).
Making this page wrongly appear as a standard for encoding formats is not
going to play out well...

> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZSTD ,
> +then the encoded data is a single zstd frame compressed with the
> +.I windowLog
> +compression parameter set to no more than 17.

Even that small detail is a bit limiting to filesystems and should
therefore be tagged as a private btrfs encoding IMO.

Thanks,
Amir.