linux-fscrypt.vger.kernel.org archive mirror
* [PATCH v6 0/3] btrfs: support fsverity
@ 2021-06-30 20:01 Boris Burkov
  2021-06-30 20:01 ` [PATCH v6 1/3] btrfs: add ro compat flags to inodes Boris Burkov
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Boris Burkov @ 2021-06-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, linux-fscrypt, kernel-team

This patchset provides support for fsverity in btrfs.

At a high level, we store the verity descriptor and Merkle tree data
in the file system btree with the file's inode as the objectid, and
direct reads/writes to those items to implement the generic fsverity
interface required by fs/verity/.

The first patch is a preparatory patch which adds a notion of
compat_flags to the btrfs_inode and inode_item in order to allow
enabling verity on a file without making the file system unusable for
older kernels.

The second patch is the bulk of the fsverity implementation. It
implements the fsverity interface, storage, caching, etc.

The third patch handles crashes mid-verity enable via orphan items.

I have tested this patch set in the following ways:
- xfstests auto group
- with a separate fix for btrfs fiemap and some light touches to the
  tests themselves: xfstests generic/572,573,574,575.
- new xfstest for btrfs-specific corruptions (e.g. inline extents).
- new xfstest using dmlogwrites and dmsnapshot to exercise orphans.
- new xfstest using pwrite to exercise merkle cache EFBIG cases.
- manual test with sleeps in kernel to force orphan vs. unlink race.
- manual end-to-end test with verity signed rpms.
--
changes for v6:
Patch 2: fix bugs reported by smatch
- fix uninitialized/unused variables (copied, root, trans)
- handle len=0 in write_key_bytes
- 1 << blocksize -> 1ULL << blocksize

changes for v5:
Significant rewrite/re-organization. Most changes in patch 1 and 2:
- rewrote ro_compat flags to use top 32 bits of flags, discovered tree
  checker/flags definitions were broken (see patch 1 for details)
- merged dio and inline/prealloc/hole patches into main verity patch, as
  they were basically empty.
- rewrote rollback to abort much less aggressively
- put orphan/enable verity on inode in one btrfs transaction
- tweaks to returned types to prefer u64 where reasonable
- use kmap_local, memzero_page properly
- use GFP_NOFS for allocating merkle tree cache pages
- many documentation fixes/improvements
- many style fixes
- rebase onto kdave/misc-next as of 6/24

changes for v4:
Patch 2:
- fix build without CONFIG_VERITY
- fix assumption of short writes
- make true_size match the item contents in get_verity_descriptor
- rewrite overflow logic in terms of file position instead of cache index
- round up position by 64k instead of adding 2048 pages
- fix conflation of block index and page index in write_merkle_block
- ensure reserved fields are 0 in the new descriptor item.

changes for v3:
Patch 2: fix bug in overflow logic, fix interface of
get_verity_descriptor, truncate merkle cache items on failure, fix
various code/style issues.
Patch 5: fix extent data leak if verity races with unlink or O_TMPFILE
and removes a legitimate orphan, then system is interrupted such that
the orphan was needed.

changes for v2:
Patch 1: Unchanged.
Patch 2: Return EFBIG if Merkle data past s_maxbytes. Added special
descriptor item for encryption and to handle ERANGE case for
get_verity_descriptor. Improved function comments. Rebased onto subpage
read patches -- modified end_page_read to do verity check before marking
the page uptodate. Changed from full compat to ro_compat; merged sysfs
feature here.
Patch 3: Rebased onto subpage read patches.
Patch 4: Unchanged.
Patch 5: Used to be sysfs feature, now a new patch that handles orphaned
verity data.


Boris Burkov (3):
  btrfs: add ro compat flags to inodes
  btrfs: initial fsverity support
  btrfs: verity metadata orphan items

 fs/btrfs/Makefile               |   1 +
 fs/btrfs/btrfs_inode.h          |  27 +-
 fs/btrfs/ctree.h                |  53 +-
 fs/btrfs/delayed-inode.c        |   9 +-
 fs/btrfs/extent_io.c            |  25 +-
 fs/btrfs/file.c                 |  10 +
 fs/btrfs/inode.c                |  31 +-
 fs/btrfs/ioctl.c                |  21 +-
 fs/btrfs/super.c                |   3 +
 fs/btrfs/sysfs.c                |   6 +
 fs/btrfs/tree-checker.c         |  18 +-
 fs/btrfs/tree-log.c             |   5 +-
 fs/btrfs/verity.c               | 831 ++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  35 ++
 15 files changed, 1029 insertions(+), 47 deletions(-)
 create mode 100644 fs/btrfs/verity.c

-- 
2.31.1



* [PATCH v6 1/3] btrfs: add ro compat flags to inodes
  2021-06-30 20:01 [PATCH v6 0/3] btrfs: support fsverity Boris Burkov
@ 2021-06-30 20:01 ` Boris Burkov
  2021-06-30 20:01 ` [PATCH v6 2/3] btrfs: initial fsverity support Boris Burkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-06-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, linux-fscrypt, kernel-team

Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.

As for the implementation, it unfortunately gets a little complicated.

The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
Dave Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro-compat flags.

It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32-bit two's complement. When tree-checker evaluates the expression:

btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK

The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.
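
To make the arithmetic concrete, here is a minimal userspace sketch (not
part of the patch). The names old_mask, new_mask and the demo_* helpers are
hypothetical; INT32_MIN stands in for (1 << 31) so the example avoids a
formally undefined shift, and the 0xabc low bits are just the placeholder
value used above. Note that in C the ~ is applied while the mask is still
an int, before the conversion to u64, but the outcome is the same: the
upper 32 bits of the inverted mask come out as zero.

```c
#include <stdint.h>

/* Old style: a signed 32-bit mask with bit 31 set (illustrative value). */
static const int32_t old_mask = INT32_MIN | 0xabc;

/* New style: the same bit pattern held in an unsigned type. */
static const uint32_t new_mask = UINT32_C(0x80000abc);

/* What the signed mask looks like once it is widened to 64 bits. */
uint64_t demo_sign_extended_mask(void)
{
	return (uint64_t)old_mask;	/* sign extension fills the top half with 1s */
}

/*
 * Old-style check: ~old_mask is the positive int 0x7ffff543, which widens
 * to 0x000000007ffff543, so bits 32..63 of disk_flags are never reported.
 */
uint64_t demo_unknown_bits_old(uint64_t disk_flags)
{
	return disk_flags & ~old_mask;
}

/*
 * Check with an unsigned mask widened before inversion: the inverted mask
 * is 0xffffffff7ffff543, so stray upper bits are reported.
 */
uint64_t demo_unknown_bits_fixed(uint64_t disk_flags)
{
	return disk_flags & ~(uint64_t)new_mask;
}
```

With these definitions, demo_unknown_bits_old(1ULL << 32) yields 0, so a
stray bit 32 goes unnoticed, while demo_unknown_bits_fixed(1ULL << 32)
returns that bit.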

This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum being corrupted in exactly the matching
way). Further, unless the 1 << 31 flag is used, the cast to u64 of the
32-bit inode flags should preserve their value and not add leading ones
(at least for two's complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.

With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.
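
As an illustration of the split-and-check scheme described above, here is
a small userspace sketch. It is an assumption-laden model, not the kernel
code: the demo_* names, the return codes, and the mask values are
hypothetical (DEMO_INCOMPAT_MASK models bits 0-11 plus bit 31 being known,
DEMO_RO_MASK models no ro-compat flags being defined yet).

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INCOMPAT_MASK	UINT32_C(0x80000fff)	/* known incompat bits */
#define DEMO_RO_MASK		UINT32_C(0)		/* no ro-compat bits known */

/* Pack the two in-memory halves into the 64-bit on-disk flags value. */
uint64_t demo_combine(uint32_t flags, uint32_t ro_flags)
{
	return flags | ((uint64_t)ro_flags << 32);
}

/* Unpack the 64-bit on-disk value into its lower and upper halves. */
void demo_split(uint64_t disk_flags, uint32_t *flags, uint32_t *ro_flags)
{
	*flags = (uint32_t)disk_flags;
	*ro_flags = (uint32_t)(disk_flags >> 32);
}

/*
 * Returns 0 if the flags are acceptable, -1 for an unknown incompat flag,
 * -2 for an unknown ro-compat flag on a writable mount.
 */
int demo_check(uint64_t disk_flags, bool read_only)
{
	uint32_t flags, ro_flags;

	demo_split(disk_flags, &flags, &ro_flags);
	if (flags & ~DEMO_INCOMPAT_MASK)
		return -1;	/* always fatal */
	if (!read_only && (ro_flags & ~DEMO_RO_MASK))
		return -2;	/* fatal only when writes are possible */
	return 0;
}
```

Under these assumptions, an unknown bit in the upper half is rejected on a
writable mount but tolerated on a read-only one, which is the ro-compat
behavior the patch is after.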

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h   | 20 +++++++++++++++++++-
 fs/btrfs/ctree.h         | 30 ++++++++++++++++--------------
 fs/btrfs/delayed-inode.c |  9 +++++++--
 fs/btrfs/inode.c         |  9 +++++++--
 fs/btrfs/ioctl.c         |  7 ++++---
 fs/btrfs/tree-checker.c  | 18 ++++++++++++++----
 fs/btrfs/tree-log.c      |  5 ++++-
 7 files changed, 71 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c652e19ad74e..1093b00130be 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -189,8 +189,10 @@ struct btrfs_inode {
 	 */
 	u64 csum_bytes;
 
-	/* flags field from the on disk inode */
+	/* Backwards incompatible flags, lower half of inode_item::flags  */
 	u32 flags;
+	/* Read-only compatibility flags, upper half of inode_item::flags */
+	u32 ro_flags;
 
 	/*
 	 * Counters to keep track of the number of extent item's we may use due
@@ -348,6 +350,22 @@ struct btrfs_dio_private {
 	u8 csums[];
 };
 
+/*
+ * btrfs_inode_item stores flags in a u64, btrfs_inode stores them in two
+ * separate u32s. These two functions convert between the two representations.
+ */
+static inline u64 btrfs_inode_combine_flags(u32 flags, u32 ro_flags)
+{
+	return (flags | ((u64)ro_flags << 32));
+}
+
+static inline void btrfs_inode_split_flags(u64 inode_item_flags,
+					   u32 *flags, u32 *ro_flags)
+{
+	*flags = (u32)inode_item_flags;
+	*ro_flags = (u32)(inode_item_flags >> 32);
+}
+
 /* Array of bytes with variable length, hexadecimal format 0x1234 */
 #define CSUM_FMT				"0x%*phN"
 #define CSUM_FMT_VALUE(size, bytes)		size, bytes
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d7ef4d7d2c1a..422bcc93977e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1486,20 +1486,20 @@ do {                                                                   \
 /*
  * Inode flags
  */
-#define BTRFS_INODE_NODATASUM		(1 << 0)
-#define BTRFS_INODE_NODATACOW		(1 << 1)
-#define BTRFS_INODE_READONLY		(1 << 2)
-#define BTRFS_INODE_NOCOMPRESS		(1 << 3)
-#define BTRFS_INODE_PREALLOC		(1 << 4)
-#define BTRFS_INODE_SYNC		(1 << 5)
-#define BTRFS_INODE_IMMUTABLE		(1 << 6)
-#define BTRFS_INODE_APPEND		(1 << 7)
-#define BTRFS_INODE_NODUMP		(1 << 8)
-#define BTRFS_INODE_NOATIME		(1 << 9)
-#define BTRFS_INODE_DIRSYNC		(1 << 10)
-#define BTRFS_INODE_COMPRESS		(1 << 11)
-
-#define BTRFS_INODE_ROOT_ITEM_INIT	(1 << 31)
+#define BTRFS_INODE_NODATASUM		(1U << 0)
+#define BTRFS_INODE_NODATACOW		(1U << 1)
+#define BTRFS_INODE_READONLY		(1U << 2)
+#define BTRFS_INODE_NOCOMPRESS		(1U << 3)
+#define BTRFS_INODE_PREALLOC		(1U << 4)
+#define BTRFS_INODE_SYNC		(1U << 5)
+#define BTRFS_INODE_IMMUTABLE		(1U << 6)
+#define BTRFS_INODE_APPEND		(1U << 7)
+#define BTRFS_INODE_NODUMP		(1U << 8)
+#define BTRFS_INODE_NOATIME		(1U << 9)
+#define BTRFS_INODE_DIRSYNC		(1U << 10)
+#define BTRFS_INODE_COMPRESS		(1U << 11)
+
+#define BTRFS_INODE_ROOT_ITEM_INIT	(1U << 31)
 
 #define BTRFS_INODE_FLAG_MASK						\
 	(BTRFS_INODE_NODATASUM |					\
@@ -1516,6 +1516,8 @@ do {                                                                   \
 	 BTRFS_INODE_COMPRESS |						\
 	 BTRFS_INODE_ROOT_ITEM_INIT)
 
+#define BTRFS_INODE_RO_FLAG_MASK					(0)
+
 struct btrfs_map_token {
 	struct extent_buffer *eb;
 	char *kaddr;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 257c1e18abd4..27be3150e537 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1711,6 +1711,8 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
 				  struct btrfs_inode_item *inode_item,
 				  struct inode *inode)
 {
+	u64 flags;
+
 	btrfs_set_stack_inode_uid(inode_item, i_uid_read(inode));
 	btrfs_set_stack_inode_gid(inode_item, i_gid_read(inode));
 	btrfs_set_stack_inode_size(inode_item, BTRFS_I(inode)->disk_i_size);
@@ -1723,7 +1725,9 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
 				       inode_peek_iversion(inode));
 	btrfs_set_stack_inode_transid(inode_item, trans->transid);
 	btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
-	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
+	flags = btrfs_inode_combine_flags(BTRFS_I(inode)->flags,
+					  BTRFS_I(inode)->ro_flags);
+	btrfs_set_stack_inode_flags(inode_item, flags);
 	btrfs_set_stack_inode_block_group(inode_item, 0);
 
 	btrfs_set_stack_timespec_sec(&inode_item->atime,
@@ -1781,7 +1785,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
 				   btrfs_stack_inode_sequence(inode_item));
 	inode->i_rdev = 0;
 	*rdev = btrfs_stack_inode_rdev(inode_item);
-	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
+	btrfs_inode_split_flags(btrfs_stack_inode_flags(inode_item),
+				&BTRFS_I(inode)->flags, &BTRFS_I(inode)->ro_flags);
 
 	inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
 	inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6eb20987351..be27cccea1a9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3639,7 +3639,8 @@ static int btrfs_read_locked_inode(struct inode *inode,
 	rdev = btrfs_inode_rdev(leaf, inode_item);
 
 	BTRFS_I(inode)->index_cnt = (u64)-1;
-	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
+	btrfs_inode_split_flags(btrfs_inode_flags(leaf, inode_item),
+				&BTRFS_I(inode)->flags, &BTRFS_I(inode)->ro_flags);
 
 cache_index:
 	/*
@@ -3770,6 +3771,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 			    struct inode *inode)
 {
 	struct btrfs_map_token token;
+	u64 flags;
 
 	btrfs_init_map_token(&token, leaf);
 
@@ -3805,7 +3807,9 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode));
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
-	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	flags = btrfs_inode_combine_flags(BTRFS_I(inode)->flags,
+					  BTRFS_I(inode)->ro_flags);
+	btrfs_set_token_inode_flags(&token, item, flags);
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
@@ -8904,6 +8908,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->defrag_bytes = 0;
 	ei->disk_i_size = 0;
 	ei->flags = 0;
+	ei->ro_flags = 0;
 	ei->csum_bytes = 0;
 	ei->index_cnt = (u64)-1;
 	ei->dir_index = 0;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..8007364f064d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -103,9 +103,10 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
  * Export internal inode flags to the format expected by the FS_IOC_GETFLAGS
  * ioctl.
  */
-static unsigned int btrfs_inode_flags_to_fsflags(unsigned int flags)
+static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 {
 	unsigned int iflags = 0;
+	u32 flags = binode->flags;
 
 	if (flags & BTRFS_INODE_SYNC)
 		iflags |= FS_SYNC_FL;
@@ -200,7 +201,7 @@ int btrfs_fileattr_get(struct dentry *dentry, struct fileattr *fa)
 {
 	struct btrfs_inode *binode = BTRFS_I(d_inode(dentry));
 
-	fileattr_fill_flags(fa, btrfs_inode_flags_to_fsflags(binode->flags));
+	fileattr_fill_flags(fa, btrfs_inode_flags_to_fsflags(binode));
 	return 0;
 }
 
@@ -224,7 +225,7 @@ int btrfs_fileattr_set(struct user_namespace *mnt_userns,
 		return -EOPNOTSUPP;
 
 	fsflags = btrfs_mask_fsflags_for_type(inode, fa->flags);
-	old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
+	old_fsflags = btrfs_inode_flags_to_fsflags(binode);
 	ret = check_fsflags(old_fsflags, fsflags);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index a8b2e0d2c025..a4a9620957a6 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -24,6 +24,7 @@
 #include "compression.h"
 #include "volumes.h"
 #include "misc.h"
+#include "btrfs_inode.h"
 
 /*
  * Error message should follow the following format:
@@ -999,6 +1000,8 @@ static int check_inode_item(struct extent_buffer *leaf,
 	u32 valid_mask = (S_IFMT | S_ISUID | S_ISGID | S_ISVTX | 0777);
 	u32 mode;
 	int ret;
+	u32 flags;
+	u32 ro_flags;
 
 	ret = check_inode_key(leaf, key, slot);
 	if (unlikely(ret < 0))
@@ -1054,11 +1057,18 @@ static int check_inode_item(struct extent_buffer *leaf,
 			btrfs_inode_nlink(leaf, iitem));
 		return -EUCLEAN;
 	}
-	if (unlikely(btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)) {
+	btrfs_inode_split_flags(btrfs_inode_flags(leaf, iitem),
+				&flags, &ro_flags);
+	if (unlikely(flags & ~BTRFS_INODE_FLAG_MASK)) {
 		inode_item_err(leaf, slot,
-			       "unknown flags detected: 0x%llx",
-			       btrfs_inode_flags(leaf, iitem) &
-			       ~BTRFS_INODE_FLAG_MASK);
+			       "unknown incompat flags detected: 0x%x", flags);
+		return -EUCLEAN;
+	}
+	if (unlikely(!sb_rdonly(fs_info->sb) &&
+		     (ro_flags & ~BTRFS_INODE_RO_FLAG_MASK))) {
+		inode_item_err(leaf, slot,
+			       "unknown ro-compat flags detected on writeable mount: 0x%x",
+			       ro_flags);
 		return -EUCLEAN;
 	}
 	return 0;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index cab451d19547..25d8616692e4 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3913,6 +3913,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 			    u64 logged_isize)
 {
 	struct btrfs_map_token token;
+	u64 flags;
 
 	btrfs_init_map_token(&token, leaf);
 
@@ -3962,7 +3963,9 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode));
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
-	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	flags = btrfs_inode_combine_flags(BTRFS_I(inode)->flags,
+					  BTRFS_I(inode)->ro_flags);
+	btrfs_set_token_inode_flags(&token, item, flags);
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
-- 
2.31.1



* [PATCH v6 2/3] btrfs: initial fsverity support
  2021-06-30 20:01 [PATCH v6 0/3] btrfs: support fsverity Boris Burkov
  2021-06-30 20:01 ` [PATCH v6 1/3] btrfs: add ro compat flags to inodes Boris Burkov
@ 2021-06-30 20:01 ` Boris Burkov
  2021-07-11 14:52   ` Eric Biggers
                     ` (2 more replies)
  2021-06-30 20:01 ` [PATCH v6 3/3] btrfs: verity metadata orphan items Boris Burkov
  2021-07-28 15:24 ` [PATCH v6 0/3] btrfs: support fsverity David Sterba
  3 siblings, 3 replies; 18+ messages in thread
From: Boris Burkov @ 2021-06-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, linux-fscrypt, kernel-team

Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verify pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.
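
One way to picture the "index past EOF" placement is the sketch below. It
is a hypothetical userspace model, not the code from this patch: the
function name, the -1 error return, and the max_bytes parameter are
assumptions. The 64K alignment and the size limit are taken from the
changelog notes ("round up position by 64k", "Return EFBIG if Merkle data
past s_maxbytes").

```c
#include <stdint.h>

#define DEMO_MERKLE_ALIGN	UINT64_C(65536)

/*
 * Returns the file position at which cached Merkle tree pages would start:
 * i_size rounded up to the next 64K boundary. Returns -1 if that position
 * overflows or exceeds max_bytes (the kernel would report EFBIG here).
 */
int64_t demo_merkle_pos(uint64_t i_size, uint64_t max_bytes)
{
	uint64_t rounded = (i_size + DEMO_MERKLE_ALIGN - 1) &
			   ~(DEMO_MERKLE_ALIGN - 1);

	if (rounded < i_size)			/* addition wrapped around */
		return -1;
	if (rounded > max_bytes || rounded > (uint64_t)INT64_MAX)
		return -1;
	return (int64_t)rounded;
}
```

Because a verity file can no longer grow, positions at or beyond this
boundary never collide with file data in the page cache.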

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then roll back to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/Makefile               |   1 +
 fs/btrfs/btrfs_inode.h          |   7 +
 fs/btrfs/ctree.h                |  25 +-
 fs/btrfs/extent_io.c            |  25 +-
 fs/btrfs/file.c                 |  10 +
 fs/btrfs/inode.c                |   6 +
 fs/btrfs/ioctl.c                |  14 +-
 fs/btrfs/super.c                |   3 +
 fs/btrfs/sysfs.c                |   6 +
 fs/btrfs/verity.c               | 758 ++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  35 ++
 12 files changed, 872 insertions(+), 19 deletions(-)
 create mode 100644 fs/btrfs/verity.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index cec88a66bd6c..3dcf9bcc2326 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -36,6 +36,7 @@ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
 btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
+btrfs-$(CONFIG_FS_VERITY) += verity.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1093b00130be..76ee1452c57b 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -51,6 +51,13 @@ enum {
 	 * the file range, inode's io_tree).
 	 */
 	BTRFS_INODE_NO_DELALLOC_FLUSH,
+	/*
+	 * Set when we are working on enabling verity for a file. Computing and
+	 * writing the whole Merkle tree can take a while so we want to prevent
+	 * races where two separate tasks attempt to simultaneously start verity
+	 * on the same file.
+	 */
+	BTRFS_INODE_VERITY_IN_PROGRESS,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 422bcc93977e..63a99319f6f4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -281,7 +281,8 @@ struct btrfs_super_block {
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
-	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
+	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
+	 BTRFS_FEATURE_COMPAT_RO_VERITY)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -1516,7 +1517,9 @@ do {                                                                   \
 	 BTRFS_INODE_COMPRESS |						\
 	 BTRFS_INODE_ROOT_ITEM_INIT)
 
-#define BTRFS_INODE_RO_FLAG_MASK					(0)
+#define BTRFS_INODE_RO_VERITY		(1U << 0)
+
+#define BTRFS_INODE_RO_FLAG_MASK	(BTRFS_INODE_RO_VERITY)
 
 struct btrfs_map_token {
 	struct extent_buffer *eb;
@@ -3783,6 +3786,24 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
 	return signal_pending(current);
 }
 
+/* verity.c */
+#ifdef CONFIG_FS_VERITY
+extern const struct fsverity_operations btrfs_verityops;
+int btrfs_drop_verity_items(struct btrfs_inode *inode);
+BTRFS_SETGET_FUNCS(verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
+		   encryption, 8);
+BTRFS_SETGET_FUNCS(verity_descriptor_size, struct btrfs_verity_descriptor_item, size, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
+			 encryption, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size, struct btrfs_verity_descriptor_item,
+			 size, 64);
+#else
+static inline int btrfs_drop_verity_items(struct btrfs_inode *inode)
+{
+	return 0;
+}
+#endif
+
 /* Sanity test specific functions */
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 void btrfs_test_destroy_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9e81d25dea70..aeaf8fe342dc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -13,6 +13,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/cleancache.h>
+#include <linux/fsverity.h>
 #include "misc.h"
 #include "extent_io.h"
 #include "extent-io-tree.h"
@@ -2245,18 +2246,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-/*
- * helper function to set a given page up to date if all the
- * extents in the tree for that page are up to date
- */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
-{
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_SIZE - 1;
-	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
-		SetPageUptodate(page);
-}
-
 int free_io_failure(struct extent_io_tree *failure_tree,
 		    struct extent_io_tree *io_tree,
 		    struct io_failure_record *rec)
@@ -2688,7 +2677,14 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 	       start + len <= page_offset(page) + PAGE_SIZE);
 
 	if (uptodate) {
-		btrfs_page_set_uptodate(fs_info, page, start, len);
+		if (!PageError(page) && !PageUptodate(page) &&
+		    start < i_size_read(page->mapping->host) &&
+		    fsverity_active(page->mapping->host) &&
+		    !fsverity_verify_page(page)) {
+			btrfs_page_set_error(fs_info, page, start, len);
+		} else {
+			btrfs_page_set_uptodate(fs_info, page, start, len);
+		}
 	} else {
 		btrfs_page_clear_uptodate(fs_info, page, start, len);
 		btrfs_page_set_error(fs_info, page, start, len);
@@ -3097,7 +3093,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		/* Update page status and unlock */
 		end_page_read(page, uptodate, start, len);
 		endio_readpage_release_extent(&processed, BTRFS_I(inode),
-					      start, end, uptodate);
+					      start, end, PageUptodate(page));
 	}
 	/* Release the last extent */
 	endio_readpage_release_extent(&processed, NULL, 0, 0, false);
@@ -3627,7 +3623,6 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		/* the get_extent function already copied into the page */
 		if (test_range_bit(tree, cur, cur_end,
 				   EXTENT_UPTODATE, 1, NULL)) {
-			check_page_uptodate(tree, page);
 			unlock_extent(tree, cur, cur + iosize - 1);
 			end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 28a05ba47060..78503b125261 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -16,6 +16,7 @@
 #include <linux/btrfs.h>
 #include <linux/uio.h>
 #include <linux/iversion.h>
+#include <linux/fsverity.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3605,7 +3606,13 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
+	int ret;
+
 	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+
+	ret = fsverity_file_open(inode, filp);
+	if (ret)
+		return ret;
 	return generic_file_open(inode, filp);
 }
 
@@ -3634,6 +3641,9 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if (fsverity_active(inode))
+		return 0;
+
 	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
 		return 0;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index be27cccea1a9..9f176a840446 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -32,6 +32,7 @@
 #include <linux/sched/mm.h>
 #include <linux/iomap.h>
 #include <asm/unaligned.h>
+#include <linux/fsverity.h>
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -5437,6 +5438,7 @@ void btrfs_evict_inode(struct inode *inode)
 	trace_btrfs_inode_evict(inode);
 
 	if (!root) {
+		fsverity_cleanup_inode(inode);
 		clear_inode(inode);
 		return;
 	}
@@ -5519,6 +5521,7 @@ void btrfs_evict_inode(struct inode *inode)
 	 * to retry these periodically in the future.
 	 */
 	btrfs_remove_delayed_node(BTRFS_I(inode));
+	fsverity_cleanup_inode(inode);
 	clear_inode(inode);
 }
 
@@ -9090,6 +9093,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(path->dentry);
 	u32 blocksize = inode->i_sb->s_blocksize;
 	u32 bi_flags = BTRFS_I(inode)->flags;
+	u32 bi_ro_flags = BTRFS_I(inode)->ro_flags;
 
 	stat->result_mask |= STATX_BTIME;
 	stat->btime.tv_sec = BTRFS_I(inode)->i_otime.tv_sec;
@@ -9102,6 +9106,8 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 		stat->attributes |= STATX_ATTR_IMMUTABLE;
 	if (bi_flags & BTRFS_INODE_NODUMP)
 		stat->attributes |= STATX_ATTR_NODUMP;
+	if (bi_ro_flags & BTRFS_INODE_RO_VERITY)
+		stat->attributes |= STATX_ATTR_VERITY;
 
 	stat->attributes_mask |= (STATX_ATTR_APPEND |
 				  STATX_ATTR_COMPRESSED |
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8007364f064d..c4e8f7df384b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -27,6 +27,7 @@
 #include <linux/uaccess.h>
 #include <linux/iversion.h>
 #include <linux/fileattr.h>
+#include <linux/fsverity.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "export.h"
@@ -107,6 +108,7 @@ static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 {
 	unsigned int iflags = 0;
 	u32 flags = binode->flags;
+	u32 ro_flags = binode->ro_flags;
 
 	if (flags & BTRFS_INODE_SYNC)
 		iflags |= FS_SYNC_FL;
@@ -122,6 +124,8 @@ static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 		iflags |= FS_DIRSYNC_FL;
 	if (flags & BTRFS_INODE_NODATACOW)
 		iflags |= FS_NOCOW_FL;
+	if (ro_flags & BTRFS_INODE_RO_VERITY)
+		iflags |= FS_VERITY_FL;
 
 	if (flags & BTRFS_INODE_NOCOMPRESS)
 		iflags |= FS_NOCOMP_FL;
@@ -149,10 +153,12 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
+	if (binode->ro_flags & BTRFS_INODE_RO_VERITY)
+		new_fl |= S_VERITY;
 
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
-		      new_fl);
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
+		      S_VERITY, new_fl);
 }
 
 /*
@@ -5014,6 +5020,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_subvol_rootref(file, argp);
 	case BTRFS_IOC_INO_LOOKUP_USER:
 		return btrfs_ioctl_ino_lookup_user(file, argp);
+	case FS_IOC_ENABLE_VERITY:
+		return fsverity_ioctl_enable(file, (const void __user *)argp);
+	case FS_IOC_MEASURE_VERITY:
+		return fsverity_ioctl_measure(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d07b18b2b250..e6c5968bd028 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1353,6 +1353,9 @@ static int btrfs_fill_super(struct super_block *sb,
 	sb->s_op = &btrfs_super_ops;
 	sb->s_d_op = &btrfs_dentry_operations;
 	sb->s_export_op = &btrfs_export_ops;
+#ifdef CONFIG_FS_VERITY
+	sb->s_vop = &btrfs_verityops;
+#endif
 	sb->s_xattr = btrfs_xattr_handlers;
 	sb->s_time_gran = 1;
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 9d1d140118ff..e101a0bf392f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -267,6 +267,9 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
 #ifdef CONFIG_BTRFS_DEBUG
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
+#ifdef CONFIG_FS_VERITY
+BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
+#endif
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -284,6 +287,9 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
 #ifdef CONFIG_BTRFS_DEBUG
 	BTRFS_FEAT_ATTR_PTR(zoned),
+#endif
+#ifdef CONFIG_FS_VERITY
+	BTRFS_FEAT_ATTR_PTR(verity),
 #endif
 	NULL
 };
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
new file mode 100644
index 000000000000..f24c1d88f66d
--- /dev/null
+++ b/fs/btrfs/verity.c
@@ -0,0 +1,758 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/rwsem.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/iversion.h>
+#include <linux/fsverity.h>
+#include <linux/sched/mm.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "locking.h"
+
+/*
+ * Implementation of the interface defined in struct fsverity_operations.
+ *
+ * The main question is how and where to store the verity descriptor and the
+ * Merkle tree. We store both in dedicated btree items in the filesystem tree,
+ * together with the rest of the inode metadata. This means we'll need to do
+ * extra work to encrypt them once encryption is supported in btrfs, but btrfs
+ * has a lot of careful code around i_size and it seems better to make a new key
+ * type than try and adjust all of our expectations for i_size.
+ *
+ * Note that this differs from the implementation in ext4 and f2fs, where
+ * this data is stored as if it were in the file, but past EOF. However, btrfs
+ * does not have a widespread mechanism for caching opaque metadata pages, so we
+ * do pretend that the Merkle tree pages themselves are past EOF for the
+ * purposes of caching them (as opposed to creating a virtual inode).
+ *
+ * fs verity items are stored under two different key types on disk.
+ * The descriptor items:
+ * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]
+ *
+ * At offset 0, we store a btrfs_verity_descriptor_item which tracks the
+ * size of the descriptor item and some extra data for encryption.
+ * Starting at offset 1, these hold the generic fs verity descriptor.
+ * The latter are opaque to btrfs, we just read and write them as a blob for
+ * the higher level verity code.  The most common descriptor size is 256 bytes.
+ *
+ * The merkle tree items:
+ * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
+ *
+ * These also start at offset 0, and correspond to the merkle tree bytes.
+ * So when fsverity asks for page 0 of the merkle tree, we pull up one page
+ * starting at offset 0 for this key type.  These are also opaque to btrfs,
+ * we're blindly storing whatever fsverity sends down.
+ */
+
+
+#define MERKLE_START_ALIGN 65536
+/*
+ * Compute the logical file offset where we cache the Merkle tree.
+ *
+ * @inode: the inode of the verity file
+ *
+ * For the purposes of caching the Merkle tree pages, as required by
+ * fs-verity, it is convenient to do size computations in terms of a file
+ * offset, rather than in terms of page indices.
+ *
+ * Use 64K to be sure it's past the last page in the file, even with 64k pages.
+ * That rounding operation itself can overflow loff_t, so we do it in u64 and
+ * check.
+ *
+ * Returns the file offset on success, negative error code on failure.
+ */
+static loff_t merkle_file_pos(const struct inode *inode)
+{
+	loff_t ret;
+	u64 sz = inode->i_size;
+	u64 rounded = round_up(sz, MERKLE_START_ALIGN);
+
+	if (rounded > inode->i_sb->s_maxbytes)
+		return -EFBIG;
+	ret = rounded;
+	return ret;
+}
+
+/*
+ * Drop all the items for this inode with this key_type.
+ *
+ * @inode: The inode to drop items for
+ * @key_type: The type of items to drop (VERITY_DESC_ITEM or
+ *            VERITY_MERKLE_ITEM)
+ *
+ * Before doing a verity enable we clean up any existing verity items.
+ * This is also used to clean up if a verity enable failed half way
+ * through.
+ *
+ * Returns number of dropped items on success, negative error code on failure.
+ */
+static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int count = 0;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (1) {
+		/*
+		 * 1 for the item being dropped
+		 */
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			goto out;
+		}
+
+		/*
+		 * Walk backwards through all the items until we find one
+		 * that isn't from our key type or objectid
+		 */
+		key.objectid = btrfs_ino(inode);
+		key.type = key_type;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+		if (ret > 0) {
+			ret = 0;
+			/* No more keys of this type, we're done */
+			if (path->slots[0] == 0)
+				break;
+			path->slots[0]--;
+		} else if (ret < 0) {
+			btrfs_end_transaction(trans);
+			goto out;
+		}
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		/* No more keys of this type, we're done */
+		if (key.objectid != btrfs_ino(inode) || key.type != key_type)
+			break;
+
+		/*
+		 * This shouldn't be a performance sensitive function because
+		 * it's not used as part of truncate.  If it ever becomes
+		 * perf sensitive, change this to walk forward and bulk delete
+		 * items
+		 */
+		ret = btrfs_del_items(trans, root, path, path->slots[0], 1);
+		if (ret) {
+			btrfs_end_transaction(trans);
+			goto out;
+		}
+		count++;
+		btrfs_release_path(path);
+		btrfs_end_transaction(trans);
+	}
+	ret = count;
+	btrfs_end_transaction(trans);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * Drop all verity items
+ *
+ * @inode: the inode to drop verity items for.
+ *
+ * In most contexts where we are dropping verity items, we want to do it for all
+ * the types of verity items, not a particular one.
+ *
+ * Returns: 0 on success, negative error code on failure.
+ */
+int btrfs_drop_verity_items(struct btrfs_inode *inode)
+{
+	int ret;
+
+	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
+	if (ret < 0)
+		goto out;
+	ret = drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+out:
+	return ret;
+}
+
+
+/*
+ * Insert and write inode items with a given key type and offset.
+ *
+ * @inode: The inode to insert for.
+ * @key_type: The key type to insert.
+ * @offset: The item offset to insert at.
+ * @src: Source data to write.
+ * @len: Length of source data to write.
+ *
+ * Write len bytes from src into items of up to 2K length.
+ * The inserted items will have key <ino, key_type, offset + off> where
+ * off is consecutively increasing from 0 up to the last item ending at
+ * offset + len.
+ *
+ * Returns 0 on success and a negative error code on failure.
+ */
+static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
+			   const char *src, u64 len)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_path *path;
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	unsigned long copy_bytes;
+	unsigned long src_offset = 0;
+	void *data;
+	int ret = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (len > 0) {
+		/*
+		 * 1 for the new item being inserted
+		 */
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+
+		key.objectid = btrfs_ino(inode);
+		key.type = key_type;
+		key.offset = offset;
+
+		/*
+		 * Insert 2K at a time mostly to be friendly for smaller
+		 * leaf size filesystems
+		 */
+		copy_bytes = min_t(u64, len, 2048);
+
+		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
+		if (ret) {
+			btrfs_end_transaction(trans);
+			break;
+		}
+
+		leaf = path->nodes[0];
+
+		data = btrfs_item_ptr(leaf, path->slots[0], void);
+		write_extent_buffer(leaf, src + src_offset,
+				    (unsigned long)data, copy_bytes);
+		offset += copy_bytes;
+		src_offset += copy_bytes;
+		len -= copy_bytes;
+
+		btrfs_release_path(path);
+		btrfs_end_transaction(trans);
+	}
+
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * Read inode items of the given key type and offset from the btree.
+ *
+ * @inode: The inode to read items of.
+ * @key_type: The key type to read.
+ * @offset: The item offset to read from.
+ * @dest: The buffer to read into. This parameter has slightly tricky
+ *        semantics.  If it is NULL, the function will not do any copying
+ *        and will just return the size of all the items up to len bytes.
+ *        If dest_page is passed, then the function will kmap_local the
+ *        page and ignore dest, but it must still be non-NULL to avoid the
+ *        counting-only behavior.
+ * @len: Length in bytes to read.
+ * @dest_page: Copy into this page instead of the dest buffer.
+ *
+ * Helper function to read items from the btree.  This returns the number
+ * of bytes read or < 0 for errors.  We can return short reads if the
+ * items don't exist on disk or aren't big enough to fill the desired length.
+ * Supports reading into a provided buffer (dest) or into the page cache.
+ *
+ * Returns number of bytes read or a negative error code on failure.
+ */
+static int read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
+			  char *dest, u64 len, struct page *dest_page)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 item_end;
+	u64 copy_end;
+	int copied = 0;
+	u32 copy_offset;
+	unsigned long copy_bytes;
+	unsigned long dest_offset = 0;
+	void *data;
+	char *kaddr = dest;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	if (dest_page)
+		path->reada = READA_FORWARD;
+
+	key.objectid = btrfs_ino(inode);
+	key.type = key_type;
+	key.offset = offset;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0) {
+		goto out;
+	} else if (ret > 0) {
+		ret = 0;
+		if (path->slots[0] == 0)
+			goto out;
+		path->slots[0]--;
+	}
+
+	while (len > 0) {
+		leaf = path->nodes[0];
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		if (key.objectid != btrfs_ino(inode) ||
+		    key.type != key_type)
+			break;
+
+		item_end = btrfs_item_size_nr(leaf, path->slots[0]) + key.offset;
+
+		if (copied > 0) {
+			/*
+			 * Once we've copied something, we want all of the items
+			 * to be sequential
+			 */
+			if (key.offset != offset)
+				break;
+		} else {
+			/*
+			 * Our initial offset might be in the middle of an
+			 * item.  Make sure it all makes sense
+			 */
+			if (key.offset > offset)
+				break;
+			if (item_end <= offset)
+				break;
+		}
+
+		/* dest = NULL to just sum all the item lengths */
+		if (!dest)
+			copy_end = item_end;
+		else
+			copy_end = min(offset + len, item_end);
+
+		/* Number of bytes in this item we want to copy */
+		copy_bytes = copy_end - offset;
+
+		/* Offset from the start of item for copying */
+		copy_offset = offset - key.offset;
+
+		if (dest) {
+			if (dest_page)
+				kaddr = kmap_local_page(dest_page);
+
+			data = btrfs_item_ptr(leaf, path->slots[0], void);
+			read_extent_buffer(leaf, kaddr + dest_offset,
+					   (unsigned long)data + copy_offset,
+					   copy_bytes);
+
+			if (dest_page)
+				kunmap_local(kaddr);
+		}
+
+		offset += copy_bytes;
+		dest_offset += copy_bytes;
+		len -= copy_bytes;
+		copied += copy_bytes;
+
+		path->slots[0]++;
+		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+			/*
+			 * We've reached the last slot in this leaf and we need
+			 * to go to the next leaf.
+			 */
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0) {
+				break;
+			} else if (ret > 0) {
+				ret = 0;
+				break;
+			}
+		}
+	}
+out:
+	btrfs_free_path(path);
+	if (!ret)
+		ret = copied;
+	return ret;
+}
+
+/*
+ * Rollback in-progress verity if we encounter an error.
+ *
+ * @inode: the inode verity had an error for
+ *
+ * We try to handle recoverable errors while enabling verity by rolling it
+ * back and just failing the operation, rather than having an fs level error no
+ * matter what. However, any error in rollback is unrecoverable.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+static int rollback_verity(struct btrfs_inode *inode)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	int ret;
+
+	ASSERT(inode_is_locked(&inode->vfs_inode));
+	truncate_inode_pages(inode->vfs_inode.i_mapping,
+			     inode->vfs_inode.i_size);
+	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
+	ret = btrfs_drop_verity_items(inode);
+	if (ret) {
+		btrfs_handle_fs_error(root->fs_info, ret,
+				      "failed to drop verity items in rollback %lu\n",
+				      inode->vfs_inode.i_ino);
+		goto out;
+	}
+	/*
+	 * 1 for updating the inode flag
+	 */
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		btrfs_handle_fs_error(root->fs_info, ret,
+				      "failed to start transaction in verity rollback %lu\n",
+				      inode->vfs_inode.i_ino);
+		goto out;
+	}
+	inode->ro_flags &= ~BTRFS_INODE_RO_VERITY;
+	btrfs_sync_inode_flags_to_i_flags(&inode->vfs_inode);
+	ret = btrfs_update_inode(trans, root, inode);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
+	btrfs_end_transaction(trans);
+out:
+	return ret;
+}
+
+/*
+ * Finalize making the file a valid verity file
+ *
+ * @inode: the inode to be marked as verity
+ * @desc: the contents of the verity descriptor to write (not NULL)
+ * @desc_size: the size of the verity descriptor
+ *
+ * Do the actual work of finalizing verity after successfully writing the Merkle
+ * tree:
+ * - write out the descriptor items
+ * - mark the inode with the verity flag
+ * - mark the ro compat bit
+ * - clear the in progress bit
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+static int finish_verity(struct btrfs_inode *inode,
+			 const void *desc, size_t desc_size)
+{
+	struct btrfs_trans_handle *trans = NULL;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_verity_descriptor_item item;
+	int ret;
+
+	/* Write out the descriptor item */
+	memset(&item, 0, sizeof(item));
+	btrfs_set_stack_verity_descriptor_size(&item, desc_size);
+	ret = write_key_bytes(inode,
+			      BTRFS_VERITY_DESC_ITEM_KEY, 0,
+			      (const char *)&item, sizeof(item));
+	if (ret)
+		goto out;
+	/* Write out the descriptor itself */
+	ret = write_key_bytes(inode,
+			      BTRFS_VERITY_DESC_ITEM_KEY, 1,
+			      desc, desc_size);
+	if (ret)
+		goto out;
+
+	/*
+	 * 1 for updating the inode flag
+	 */
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+	inode->ro_flags |= BTRFS_INODE_RO_VERITY;
+	btrfs_sync_inode_flags_to_i_flags(&inode->vfs_inode);
+	ret = btrfs_update_inode(trans, root, inode);
+	if (ret)
+		goto end_trans;
+	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
+	btrfs_set_fs_compat_ro(root->fs_info, VERITY);
+end_trans:
+	btrfs_end_transaction(trans);
+out:
+	return ret;
+
+}
+
+/*
+ * fsverity op that begins enabling verity.
+ *
+ * @filp: the file to enable verity on
+ *
+ * Begin enabling fsverity for the file. We drop any existing verity items
+ * and set the in progress bit.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+static int btrfs_begin_enable_verity(struct file *filp)
+{
+	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
+	int ret;
+
+	ASSERT(inode_is_locked(file_inode(filp)));
+
+	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	ret = btrfs_drop_verity_items(inode);
+	if (ret)
+		goto out;
+
+	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
+out:
+	return ret;
+}
+
+/*
+ * fsverity op that ends enabling verity.
+ *
+ * @filp: the file we are finishing enabling verity on
+ * @desc: the verity descriptor to write out (NULL in error conditions)
+ * @desc_size: the size of the verity descriptor (variable with signatures)
+ * @merkle_tree_size: the size of the merkle tree in bytes
+ *
+ * If desc is null, then VFS is signaling an error occurred during verity
+ * enable, and we should try to rollback. Otherwise, attempt to finish verity.
+ *
+ * Returns 0 on success, negative error code on error.
+ */
+static int btrfs_end_enable_verity(struct file *filp, const void *desc,
+				   size_t desc_size, u64 merkle_tree_size)
+{
+	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
+	int ret = 0;
+	int rollback_ret;
+
+	ASSERT(inode_is_locked(file_inode(filp)));
+
+	if (desc == NULL)
+		goto rollback;
+
+	ret = finish_verity(inode, desc, desc_size);
+	if (ret)
+		goto rollback;
+	return ret;
+
+
+rollback:
+	rollback_ret = rollback_verity(inode);
+	if (rollback_ret)
+		btrfs_err(inode->root->fs_info,
+			  "failed to rollback verity items: %d", rollback_ret);
+	return ret;
+}
+
+/*
+ * fsverity op that gets the struct fsverity_descriptor.
+ *
+ * @inode: the inode to get the descriptor of
+ * @buf: output buffer for the descriptor contents
+ * @buf_size: size of the output buffer. 0 to query the size.
+ *
+ * fsverity does a two pass setup for reading the descriptor, in the first pass
+ * it calls with buf_size = 0 to query the size of the descriptor,
+ * and then in the second pass it actually reads the descriptor off
+ * disk.
+ *
+ * Returns the size on success or a negative error code on failure.
+ */
+static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
+				       size_t buf_size)
+{
+	u64 true_size;
+	int ret = 0;
+	struct btrfs_verity_descriptor_item item;
+
+	memset(&item, 0, sizeof(item));
+	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
+			     0, (char *)&item, sizeof(item), NULL);
+	if (ret < 0)
+		return ret;
+
+	if (item.reserved[0] != 0 || item.reserved[1] != 0)
+		return -EUCLEAN;
+
+	true_size = btrfs_stack_verity_descriptor_size(&item);
+	if (true_size > INT_MAX)
+		return -EUCLEAN;
+
+	if (!buf_size)
+		return true_size;
+	if (buf_size < true_size)
+		return -ERANGE;
+
+	ret = read_key_bytes(BTRFS_I(inode),
+			     BTRFS_VERITY_DESC_ITEM_KEY, 1,
+			     buf, buf_size, NULL);
+	if (ret < 0)
+		return ret;
+	if (ret != true_size)
+		return -EIO;
+
+	return true_size;
+}
+
+/*
+ * fsverity op that reads and caches a merkle tree page.
+ *
+ * @inode: the inode to read a merkle tree page for
+ * @index: the page index relative to the start of the merkle tree
+ * @num_ra_pages: number of pages to readahead. Optional, we ignore it.
+ *
+ * The Merkle tree is stored in the filesystem btree, but its pages are cached
+ * with a logical position past EOF in the inode's mapping.
+ *
+ * Returns the page we read, or an ERR_PTR on error.
+ */
+static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
+					       pgoff_t index,
+					       unsigned long num_ra_pages)
+{
+	struct page *page;
+	u64 off = (u64)index << PAGE_SHIFT;
+	loff_t merkle_pos = merkle_file_pos(inode);
+	int ret;
+
+	if (merkle_pos < 0)
+		return ERR_PTR(merkle_pos);
+	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
+		return ERR_PTR(-EFBIG);
+	index += merkle_pos >> PAGE_SHIFT;
+again:
+	page = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
+	if (page) {
+		if (PageUptodate(page))
+			return page;
+
+		lock_page(page);
+		/*
+		 * We only insert uptodate pages, so !Uptodate has to be
+		 * an error
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+			put_page(page);
+			return ERR_PTR(-EIO);
+		}
+		unlock_page(page);
+		return page;
+	}
+
+	page = __page_cache_alloc(mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS));
+	if (!page)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Merkle item keys are indexed from byte 0 in the merkle tree.
+	 * they have the form:
+	 *
+	 * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset in bytes ]
+	 */
+	ret = read_key_bytes(BTRFS_I(inode),
+			     BTRFS_VERITY_MERKLE_ITEM_KEY, off,
+			     page_address(page), PAGE_SIZE, page);
+	if (ret < 0) {
+		put_page(page);
+		return ERR_PTR(ret);
+	}
+	if (ret < PAGE_SIZE)
+		memzero_page(page, ret, PAGE_SIZE - ret);
+
+	SetPageUptodate(page);
+	ret = add_to_page_cache_lru(page, inode->i_mapping, index, GFP_NOFS);
+
+	if (!ret) {
+		/* Inserted and ready for fsverity */
+		unlock_page(page);
+	} else {
+		put_page(page);
+		/* Did someone race us into inserting this page? */
+		if (ret == -EEXIST)
+			goto again;
+		page = ERR_PTR(ret);
+	}
+	return page;
+}
+
+/*
+ * fsverity op that writes a Merkle tree block into the btree.
+ *
+ * @inode: inode to write a Merkle tree block for
+ * @buf: Merkle tree data block to write
+ * @index: the index of the block in the Merkle tree
+ * @log_blocksize: log base 2 of the Merkle tree block size
+ *
+ * Note that the block size could be different from the page size, so it is not
+ * safe to assume that index is a page index.
+ *
+ * Returns 0 on success or negative error code on failure
+ */
+static int btrfs_write_merkle_tree_block(struct inode *inode, const void *buf,
+					u64 index, int log_blocksize)
+{
+	u64 off = index << log_blocksize;
+	u64 len = 1ULL << log_blocksize;
+	loff_t merkle_pos = merkle_file_pos(inode);
+
+	if (merkle_pos < 0)
+		return merkle_pos;
+	if (merkle_pos > inode->i_sb->s_maxbytes - off - len)
+		return -EFBIG;
+
+	return write_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY,
+			       off, buf, len);
+}
+
+const struct fsverity_operations btrfs_verityops = {
+	.begin_enable_verity	= btrfs_begin_enable_verity,
+	.end_enable_verity	= btrfs_end_enable_verity,
+	.get_verity_descriptor	= btrfs_get_verity_descriptor,
+	.read_merkle_tree_page	= btrfs_read_merkle_tree_page,
+	.write_merkle_tree_block = btrfs_write_merkle_tree_block,
+};
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 22cd037123fa..d7d3cfead056 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -288,6 +288,7 @@ struct btrfs_ioctl_fs_info_args {
  * first mount when booting older kernel versions.
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
+#define BTRFS_FEATURE_COMPAT_RO_VERITY			(1ULL << 2)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ccdb40fe40dc..871d64fdc887 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -118,6 +118,29 @@
 #define BTRFS_INODE_REF_KEY		12
 #define BTRFS_INODE_EXTREF_KEY		13
 #define BTRFS_XATTR_ITEM_KEY		24
+
+/*
+ * fs verity items are stored under two different key types on disk.
+ * The descriptor items:
+ * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]
+ *
+ * At offset 0, we store a btrfs_verity_descriptor_item which tracks the
+ * size of the descriptor item and some extra data for encryption.
+ * Starting at offset 1, these hold the generic fs verity descriptor.
+ * The latter are opaque to btrfs, we just read and write them as a blob for the
+ * higher level verity code.  The most common descriptor size is 256 bytes.
+ *
+ * The merkle tree items:
+ * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
+ *
+ * These also start at offset 0, and correspond to the merkle tree bytes.
+ * When fsverity asks for page 0 of the merkle tree, we pull up one page
+ * starting at offset 0 for this key type.  These are also opaque to btrfs,
+ * we're blindly storing whatever fsverity sends down.
+ */
+#define BTRFS_VERITY_DESC_ITEM_KEY	36
+#define BTRFS_VERITY_MERKLE_ITEM_KEY	37
+
 #define BTRFS_ORPHAN_ITEM_KEY		48
 /* reserve 2-15 close to the inode for later flexibility */
 
@@ -991,4 +1014,16 @@ struct btrfs_qgroup_limit_item {
 	__le64 rsv_excl;
 } __attribute__ ((__packed__));
 
+struct btrfs_verity_descriptor_item {
+	/* size of the verity descriptor in bytes */
+	__le64 size;
+	/*
+	 * When we implement support for fscrypt, we will need to encrypt the
+	 * Merkle tree for encrypted verity files. These 128 bits are for the
+	 * eventual storage of an fscrypt initialization vector.
+	 */
+	__le64 reserved[2];
+	__u8 encryption;
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */
-- 
2.31.1

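Note on merkle_file_pos() in patch 2/3: the round-up and the EFBIG check
can be tried out in userspace. The following is only a minimal sketch of
that arithmetic, not the kernel code. model_merkle_file_pos() and its
max_bytes parameter (standing in for sb->s_maxbytes) are hypothetical
names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MERKLE_START_ALIGN 65536ULL

/*
 * Userspace model of merkle_file_pos(): round i_size up to the next 64K
 * boundary in unsigned arithmetic, then reject anything past max_bytes
 * before converting back to a signed file offset.
 */
static int64_t model_merkle_file_pos(uint64_t i_size, uint64_t max_bytes)
{
	uint64_t rounded;

	/* Assumption of this model: the limit fits in a signed 64-bit offset. */
	assert(max_bytes <= (uint64_t)INT64_MAX);

	/* Power-of-two round-up, same result as the kernel's round_up(). */
	rounded = (i_size + MERKLE_START_ALIGN - 1) & ~(MERKLE_START_ALIGN - 1);

	if (rounded > max_bytes)
		return -EFBIG;
	return (int64_t)rounded;
}
```

With max_bytes set to INT64_MAX, a 1 byte file maps to 65536, a file of
exactly 65536 bytes stays at 65536, and an i_size of INT64_MAX is
rejected because rounding it up would exceed the limit.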

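Note on write_key_bytes() in patch 2/3: each call splits its input into
items of at most 2048 bytes, keyed by the starting byte offset of each
chunk. A rough sketch of the resulting key layout follows. The model_*
helpers and MODEL_ITEM_MAX are hypothetical and exist only to show the
arithmetic; they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_ITEM_MAX 2048ULL	/* chunk size used by write_key_bytes() */

/* Number of btree items needed to store len bytes. */
static uint64_t model_item_count(uint64_t len)
{
	return (len + MODEL_ITEM_MAX - 1) / MODEL_ITEM_MAX;
}

/* Key offset of the idx-th item for a write that starts at 'offset'. */
static uint64_t model_item_key_offset(uint64_t offset, uint64_t len,
				      uint64_t idx)
{
	assert(idx < model_item_count(len));
	return offset + idx * MODEL_ITEM_MAX;
}

/* Size in bytes of the idx-th item; only the last item may be short. */
static uint64_t model_item_size(uint64_t len, uint64_t idx)
{
	uint64_t start = idx * MODEL_ITEM_MAX;

	assert(start < len);
	return (len - start < MODEL_ITEM_MAX) ? len - start : MODEL_ITEM_MAX;
}
```

For example, a 4096 byte Merkle block written at byte offset 8192
becomes two 2048 byte items with key offsets 8192 and 10240, while a
256 byte descriptor written at offset 1 is a single item with key
offset 1.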

* [PATCH v6 3/3] btrfs: verity metadata orphan items
  2021-06-30 20:01 [PATCH v6 0/3] btrfs: support fsverity Boris Burkov
  2021-06-30 20:01 ` [PATCH v6 1/3] btrfs: add ro compat flags to inodes Boris Burkov
  2021-06-30 20:01 ` [PATCH v6 2/3] btrfs: initial fsverity support Boris Burkov
@ 2021-06-30 20:01 ` Boris Burkov
  2021-07-28 15:24 ` [PATCH v6 0/3] btrfs: support fsverity David Sterba
  3 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-06-30 20:01 UTC (permalink / raw)
  To: linux-btrfs, linux-fscrypt, kernel-team

Writing out the verity data is too large an operation to do in a
single transaction. If we are interrupted before we finish creating
fsverity metadata for a file, or fail to clean up already created
metadata after a failure, we could leak the verity items that we already
committed.

To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
clean up the half-formed verity state.

There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.

A final thing to note is that this is a resurrection of using orphans to
signal an operation besides "delete this inode". The old case was to
signal the need to do a truncate. That case still technically applies
for mounting very old file systems, so we need to take some care to not
clobber it. To that end, we just have to be careful that verity orphan
cleanup is a no-op for non-verity files.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c  | 16 +++++++--
 fs/btrfs/verity.c | 83 ++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 92 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9f176a840446..29d36e361a50 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3432,7 +3432,14 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 
 		/*
 		 * If we have an inode with links, there are a couple of
-		 * possibilities. Old kernels (before v3.12) used to create an
+		 * possibilities:
+		 *
+		 * 1. We were halfway through creating fsverity metadata for the
+		 * file. In that case, the orphan item represents incomplete
+		 * fsverity metadata which must be cleaned up with
+		 * btrfs_drop_verity_items and deleting the orphan item.
+		 *
+		 * 2. Old kernels (before v3.12) used to create an
 		 * orphan item for truncate indicating that there were possibly
 		 * extent items past i_size that needed to be deleted. In v3.12,
 		 * truncate was changed to update i_size in sync with the extent
@@ -3449,9 +3456,14 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * deleted but wasn't. The inode number may have been reused,
 		 * but either way, we can delete the orphan item.
 		 */
+
 		if (ret == -ENOENT || inode->i_nlink) {
-			if (!ret)
+			if (!ret) {
+				ret = btrfs_drop_verity_items(BTRFS_I(inode));
 				iput(inode);
+				if (ret)
+					goto out;
+			}
 			trans = btrfs_start_transaction(root, 1);
 			if (IS_ERR(trans)) {
 				ret = PTR_ERR(trans);
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index f24c1d88f66d..b1739186156b 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -49,6 +49,15 @@
  * So when fsverity asks for page 0 of the merkle tree, we pull up one page
  * starting at offset 0 for this key type.  These are also opaque to btrfs,
  * we're blindly storing whatever fsverity sends down.
+ *
+ * Another important consideration is the fact that the Merkle tree data scales
+ * linearly with the size of the file (with 4k pages/blocks and SHA-256, it's
+ * ~1/127th the size) so for large files, writing the tree can be a lengthy
+ * operation. For that reason, we guard the whole enable verity operation
+ * (between begin_enable_verity and end_enable_verity) with an orphan item.
+ * Again, because the data can be pretty large, it's quite possible that we
+ * could run out of space writing it, so we try our best to handle errors by
+ * stopping and rolling back rather than aborting the victim transaction.
  */
 
 
@@ -406,6 +415,40 @@ static int read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
 	return ret;
 }
 
+/*
+ * Delete an fsverity orphan
+ *
+ * @trans: transaction to do the delete in
+ * @inode: the inode whose orphan item should be deleted
+ *
+ * This helper serves to capture verity orphan specific logic that is repeated
+ * in the couple places we delete verity orphans. Specifically, handling ENOENT
+ * and ignoring inodes with 0 links.
+ *
+ * Returns zero on success or a negative error code on failure.
+ */
+
+static int del_orphan(struct btrfs_trans_handle *trans,
+		      struct btrfs_inode *inode)
+{
+	struct btrfs_root *root = inode->root;
+	int ret;
+
+	/*
+	 * If the inode has no links, it is either already unlinked, or was
+	 * created with O_TMPFILE. In either case, it should have an orphan from
+	 * that other operation. Rather than reference count the orphans, we
+	 * simply ignore them here, because we only invoke the verity path in
+	 * the orphan logic when i_nlink is nonzero.
+	 */
+	if (!inode->vfs_inode.i_nlink)
+		return 0;
+
+	ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
+	if (ret == -ENOENT)
+		ret = 0;
+	return ret;
+}
 /*
  * Rollback in-progress verity if we encounter an error.
  *
@@ -436,8 +479,9 @@ static int rollback_verity(struct btrfs_inode *inode)
 	}
 	/*
 	 * 1 for updating the inode flag
+	 * 1 for deleting the orphan
 	 */
-	trans = btrfs_start_transaction(root, 1);
+	trans = btrfs_start_transaction(root, 2);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
 		btrfs_handle_fs_error(root->fs_info, ret,
@@ -452,6 +496,11 @@ static int rollback_verity(struct btrfs_inode *inode)
 		btrfs_abort_transaction(trans, ret);
 		goto out;
 	}
+	ret = del_orphan(trans, inode);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
 	btrfs_end_transaction(trans);
 out:
 	return ret;
@@ -468,6 +517,7 @@ static int rollback_verity(struct btrfs_inode *inode)
  * tree:
  * - write out the descriptor items
  * - mark the inode with the verity flag
+ * - delete the orphan item
  * - mark the ro compat bit
  * - clear the in progress bit
  *
@@ -498,8 +548,9 @@ static int finish_verity(struct btrfs_inode *inode,
 
 	/*
 	 * 1 for updating the inode flag
+	 * 1 for deleting the orphan
 	 */
-	trans = btrfs_start_transaction(root, 1);
+	trans = btrfs_start_transaction(root, 2);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
 		goto out;
@@ -507,6 +558,9 @@ static int finish_verity(struct btrfs_inode *inode,
 	inode->ro_flags |= BTRFS_INODE_RO_VERITY;
 	btrfs_sync_inode_flags_to_i_flags(&inode->vfs_inode);
 	ret = btrfs_update_inode(trans, root, inode);
+	if (ret)
+		goto end_trans;
+	ret = del_orphan(trans, inode);
 	if (ret)
 		goto end_trans;
 	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
@@ -523,14 +577,16 @@ static int finish_verity(struct btrfs_inode *inode,
  *
  * @filp: the file to enable verity on
  *
- * Begin enabling fsverity for the file. We drop any existing verity items
- * and set the in progress bit.
+ * Begin enabling fsverity for the file. We drop any existing verity items, add
+ * an orphan and set the in progress bit.
  *
  * Returns 0 on success, negative error code on failure.
  */
 static int btrfs_begin_enable_verity(struct file *filp)
 {
 	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
+	struct btrfs_root *root = inode->root;
+	struct btrfs_trans_handle *trans;
 	int ret;
 
 	ASSERT(inode_is_locked(file_inode(filp)));
@@ -540,11 +596,28 @@ static int btrfs_begin_enable_verity(struct file *filp)
 		goto out;
 	}
 
+	/*
+	 * This should almost never do anything, but theoretically, it's
+	 * possible that we failed to enable verity on a file, then were
+	 * interrupted or failed while rolling back, failed to clean up the
+	 * orphan, and finally attempt to enable verity again.
+	 */
 	ret = btrfs_drop_verity_items(inode);
 	if (ret)
 		goto out;
 
-	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
+	/*
+	 * 1 for the orphan item
+	 */
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+	ret = btrfs_orphan_add(trans, inode);
+	if (!ret)
+		set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
+	btrfs_end_transaction(trans);
 out:
 	return ret;
 }
-- 
2.31.1


* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-06-30 20:01 ` [PATCH v6 2/3] btrfs: initial fsverity support Boris Burkov
@ 2021-07-11 14:52   ` Eric Biggers
  2021-07-28 14:29     ` David Sterba
  2021-09-14 18:25     ` Boris Burkov
  2021-07-28 15:05   ` David Sterba
  2021-09-14 17:32   ` Eric Biggers
  2 siblings, 2 replies; 18+ messages in thread
From: Eric Biggers @ 2021-07-11 14:52 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> Add support for fsverity in btrfs. To support the generic interface in
> fs/verity, we add two new item types in the fs tree for inodes with
> verity enabled. One stores the per-file verity descriptor and btrfs
> verity item and the other stores the Merkle tree data itself.
> 
> Verity checking is done in end_page_read just before a page is marked
> uptodate. This naturally handles a variety of edge cases like holes,
> preallocated extents, and inline extents. Some care needs to be taken to
> not try to verify pages past the end of the file, which are accessed by
> the generic buffered file reading code under some circumstances like
> reading to the end of the last page and trying to read again. Direct IO
> on a verity file falls back to buffered reads.
> 
> Verity relies on PageChecked for the Merkle tree data itself to avoid
> re-walking up shared paths in the tree. For this reason, we need to
> cache the Merkle tree data. Since the file is immutable after verity is
> turned on, we can cache it at an index past EOF.
> 
> Use the new inode ro_flags to store verity on the inode item, so that we
> can enable verity on a file, then rollback to an older kernel and still
> mount the file system and read the file. Since we can't safely write the
> file anymore without ruining the invariants of the Merkle tree, we mark
> a ro_compat flag on the file system when a file has verity enabled.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Co-developed-by: Chris Mason <clm@fb.com>
> Signed-off-by: Chris Mason <clm@fb.com>
> Signed-off-by: Boris Burkov <boris@bur.io>

Generally looks good, feel free to add:

Acked-by: Eric Biggers <ebiggers@google.com>

A few minor comments below:

> @@ -2688,7 +2677,14 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>  	       start + len <= page_offset(page) + PAGE_SIZE);
>  
>  	if (uptodate) {
> -		btrfs_page_set_uptodate(fs_info, page, start, len);
> +		if (!PageError(page) && !PageUptodate(page) &&
> +		    start < i_size_read(page->mapping->host) &&
> +		    fsverity_active(page->mapping->host) &&
> +		    !fsverity_verify_page(page)) {
> +			btrfs_page_set_error(fs_info, page, start, len);
> +		} else {
> +			btrfs_page_set_uptodate(fs_info, page, start, len);
> +		}

When is it ever the case that PageError(page) or PageUptodate(page) here?

Also: in general, fsverity_active() should be checked first, in order to avoid
any overhead when !CONFIG_FS_VERITY.
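
To illustrate the ordering point, here is a minimal userspace sketch, not
kernel code. It assumes a build where the verity check is a constant false
(as with !CONFIG_FS_VERITY) and uses hypothetical model_* helpers in place of
fsverity_active() and i_size_read(). With the verity check first, the size
lookup is never evaluated:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: models a build without CONFIG_FS_VERITY (check is constant false). */
#define MODEL_FS_VERITY_ENABLED 0

/* Counts how often the size lookup actually runs. */
int size_lookups;

/* Hypothetical stand-in for fsverity_active(). */
static inline bool model_fsverity_active(void)
{
	return MODEL_FS_VERITY_ENABLED;
}

/* Hypothetical stand-in for i_size_read(); the returned size is arbitrary. */
static unsigned long long model_i_size_read(void)
{
	size_lookups++;
	return 8192;
}

/* Verity check first: && short-circuits, so the later tests are skipped. */
bool model_needs_verification(unsigned long long start)
{
	return model_fsverity_active() && start < model_i_size_read();
}

/* Usage: with verity compiled out, no size lookup happens at all. */
void model_selfcheck(void)
{
	assert(!model_needs_verification(0));
	assert(size_lookups == 0);
}
```

In the real code the compiler can likely drop the whole branch when the first
operand is a compile-time false, which is the overhead being avoided.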

> @@ -5014,6 +5020,10 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_get_subvol_rootref(file, argp);
>  	case BTRFS_IOC_INO_LOOKUP_USER:
>  		return btrfs_ioctl_ino_lookup_user(file, argp);
> +	case FS_IOC_ENABLE_VERITY:
> +		return fsverity_ioctl_enable(file, (const void __user *)argp);
> +	case FS_IOC_MEASURE_VERITY:
> +		return fsverity_ioctl_measure(file, argp);

You could wire up FS_IOC_READ_VERITY_METADATA as well.  It should just work
without having to do anything else.

> + * The merkle tree items:
> + * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
> + *
> + * These also start at offset 0, and correspond to the merkle tree bytes.
> + * So when fsverity asks for page 0 of the merkle tree, we pull up one page
> + * starting at offset 0 for this key type.  These are also opaque to btrfs,
> + * we're blindly storing whatever fsverity sends down.
> + */

Is it defined which offsets, specifically, the Merkle tree items start at?  Or
is any arrangement valid -- say, one filesystem might use one item per Merkle
tree block, while another might have multiple blocks per item, while another
might have multiple items per block?  What about the degenerate case where there
is a separate btrfs item for each individual Merkle tree byte, and maybe even
some empty items -- is that being considered a valid/supported on-disk format,
or is there a limit?

> +static loff_t merkle_file_pos(const struct inode *inode)
> +{
> +	loff_t ret;
> +	u64 sz = inode->i_size;
> +	u64 rounded = round_up(sz, MERKLE_START_ALIGN);
> +
> +	if (rounded > inode->i_sb->s_maxbytes)
> +		return -EFBIG;
> +	ret = rounded;
> +	return ret;
> +}

The 'ret' variable is unnecessary; this can just 'return rounded'.
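
A rough userspace model of the simplified helper, for illustration only. It
takes i_size and s_maxbytes as plain arguments instead of reading them from
the inode, and the MERKLE_START_ALIGN value is an assumption, not taken from
the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed alignment; the patch's MERKLE_START_ALIGN may differ. */
#define MODEL_MERKLE_START_ALIGN 65536ULL

/* Round x up to a multiple of align, where align is a power of two. */
static uint64_t model_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Model of merkle_file_pos() without the intermediate 'ret'.
 * Callers are assumed to pass sizes small enough that rounding cannot wrap.
 */
int64_t model_merkle_file_pos(uint64_t i_size, uint64_t s_maxbytes)
{
	uint64_t rounded = model_round_up(i_size, MODEL_MERKLE_START_ALIGN);

	if (rounded > s_maxbytes)
		return -EFBIG;
	return (int64_t)rounded;
}

/* Usage: sizes round up to the next boundary, or fail past the limit. */
void model_merkle_selfcheck(void)
{
	assert(model_merkle_file_pos(1, 1ULL << 20) == 65536);
	assert(model_merkle_file_pos(65537, 65536) == -EFBIG);
}
```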

> +/*
> + * Drop all the items for this inode with this key_type.
> + *
> + * @inode: The inode to drop items for
> + * @key_type: The type of items to drop (VERITY_DESC_ITEM or
> + *            VERITY_MERKLE_ITEM)

BTRFS_VERITY_DESC_ITEM_KEY or BTRFS_VERITY_MERKLE_ITEM_KEY

> + *
> + * Before doing a verity enable we clean up any existing verity items.
> + * This is also used to clean up if a verity enable failed half way
> + * through.
> + *
> + * Returns number of dropped items on success, negative error code on failure.
> + */
> +static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)

The caller doesn't actually care about the number of dropped items, so this
could just return 0 on success or a negative error code on failure.

> +	while (1) {
> +		/*
> +		 * 1 for the item being dropped
> +		 */
> +		trans = btrfs_start_transaction(root, 1);
> +		if (IS_ERR(trans)) {
> +			ret = PTR_ERR(trans);
> +			goto out;
> +		}
> +
> +		/*
> +		 * Walk backwards through all the items until we find one
> +		 * that isn't from our key type or objectid
> +		 */
> +		key.objectid = btrfs_ino(inode);
> +		key.type = key_type;
> +		key.offset = (u64)-1;
> +
> +		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
> +		if (ret > 0) {
> +			ret = 0;
> +			/* No more keys of this type, we're done */
> +			if (path->slots[0] == 0)
> +				break;
> +			path->slots[0]--;
> +		} else if (ret < 0) {
> +			btrfs_end_transaction(trans);
> +			goto out;
> +		}

Pardon my unfamiliarity with btrfs, but it looks like if the key isn't present,
then btrfs_search_slot() returns the position where the key would be inserted.
What if the previous leaf is completely full -- does btrfs_search_slot() return
a new leaf, or does it return a pointer past the end of the previous one?  (It
looks like the latter is assumed here.)  The comment for btrfs_search_slot()
doesn't make this clear.

> +int btrfs_drop_verity_items(struct btrfs_inode *inode)
> +{
> +	int ret;
> +
> +	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
> +	if (ret < 0)
> +		goto out;
> +	ret = drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
> +	if (ret < 0)
> +		goto out;
> +	ret = 0;
> +out:
> +	return ret;
> +}

This could be simplified a bit if drop_verity_items() returned 0 on success.
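
Something along these lines, sketched in userspace with a stubbed
drop_verity_items(). The enum values and the stub_result test hook are
hypothetical; only the control flow is the point:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical key types standing in for BTRFS_VERITY_*_ITEM_KEY. */
enum model_key_type { MODEL_DESC_ITEM, MODEL_MERKLE_ITEM, MODEL_NR_KEY_TYPES };

/* Test hook: the result the stub reports for each key type. */
int stub_result[MODEL_NR_KEY_TYPES];

/* Stub with the suggested contract: 0 on success, negative errno on failure. */
static int model_drop_verity_items(enum model_key_type type)
{
	return stub_result[type];
}

/* With a 0-on-success callee, no 'out' label or 'ret = 0' fixup is needed. */
int model_drop_all_verity_items(void)
{
	int ret;

	ret = model_drop_verity_items(MODEL_DESC_ITEM);
	if (ret)
		return ret;
	return model_drop_verity_items(MODEL_MERKLE_ITEM);
}

/* Usage: the first failure is propagated unchanged. */
void model_drop_selfcheck(void)
{
	stub_result[MODEL_DESC_ITEM] = 0;
	stub_result[MODEL_MERKLE_ITEM] = -EIO;
	assert(model_drop_all_verity_items() == -EIO);
}
```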

> +/*
> + * Insert and write inode items with a given key type and offset.
> + *
> + * @inode: The inode to insert for.
> + * @key_type: The key type to insert.
> + * @offset: The item offset to insert at.
> + * @src: Source data to write.
> + * @len: Length of source data to write.
> + *
> + * Write len bytes from src into items of up to 1k length.
> + * The inserted items will have key <ino, key_type, offset + off> where
> + * off is consecutively increasing from 0 up to the last item ending at
> + * offset + len.
> + *
> + * Returns 0 on success and a negative error code on failure.
> + */
> +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> +			   const char *src, u64 len)

The comment says items of up to 1k length, but the code uses 2K.
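
For reference, the key layout the comment describes can be modeled as below.
This is a userspace sketch under the assumption of a 2K per-item cap; the
struct and function names are made up for illustration, and the real function
may limit item sizes by other factors as well:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed per-item cap of 2K; an illustration, not the kernel constant. */
#define MODEL_MAX_ITEM_BYTES 2048ULL

/* One planned item: key <ino, key_type, key_offset> holding len bytes. */
struct model_item_span {
	uint64_t key_offset;
	uint64_t len;
};

/*
 * Split len bytes starting at 'offset' into consecutive items of at most
 * MODEL_MAX_ITEM_BYTES each. Returns the number of spans written to 'out'
 * (at most max_out). A len of 0 yields no items.
 */
size_t model_plan_items(uint64_t offset, uint64_t len,
			struct model_item_span *out, size_t max_out)
{
	uint64_t off = 0;
	size_t n = 0;

	while (off < len && n < max_out) {
		uint64_t chunk = len - off;

		if (chunk > MODEL_MAX_ITEM_BYTES)
			chunk = MODEL_MAX_ITEM_BYTES;
		out[n].key_offset = offset + off;
		out[n].len = chunk;
		off += chunk;
		n++;
	}
	return n;
}

/* Usage: 5000 bytes at offset 4096 become items of 2048, 2048 and 904 bytes. */
void model_plan_selfcheck(void)
{
	struct model_item_span spans[4];

	assert(model_plan_items(4096, 5000, spans, 4) == 3);
	assert(spans[2].key_offset == 8192 && spans[2].len == 904);
}
```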

> +/*
> + * Read inode items of the given key type and offset from the btree.
> + *
> + * @inode: The inode to read items of.
> + * @key_type: The key type to read.
> + * @offset: The item offset to read from.
> + * @dest: The buffer to read into. This parameter has slightly tricky
> + *        semantics.  If it is NULL, the function will not do any copying
> + *        and will just return the size of all the items up to len bytes.
> + *        If dest_page is passed, then the function will kmap_local the
> + *        page and ignore dest, but it must still be non-NULL to avoid the
> + *        counting-only behavior.
> + * @len: Length in bytes to read.
> + * @dest_page: Copy into this page instead of the dest buffer.
> + *
> + * Helper function to read items from the btree.  This returns the number
> + * of bytes read or < 0 for errors.  We can return short reads if the
> + * items don't exist on disk or aren't big enough to fill the desired length.
> + * Supports reading into a provided buffer (dest) or into the page cache
> + *
> + * Returns number of bytes read or a negative error code on failure.
> + */
> +static int read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> +			  char *dest, u64 len, struct page *dest_page)
> +{
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	u64 item_end;
> +	u64 copy_end;
> +	int copied = 0;
> +	u32 copy_offset;
> +	unsigned long copy_bytes;
> +	unsigned long dest_offset = 0;
> +	void *data;
> +	char *kaddr = dest;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	if (dest_page)
> +		path->reada = READA_FORWARD;
> +
> +	key.objectid = btrfs_ino(inode);
> +	key.type = key_type;
> +	key.offset = offset;
> +
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0) {
> +		goto out;
> +	} else if (ret > 0) {
> +		ret = 0;
> +		if (path->slots[0] == 0)
> +			goto out;
> +		path->slots[0]--;
> +	}

Same question about btrfs_search_slot() here.  If the key isn't found and the
previous leaf is completely full, will it return a pointer past the end of it?

> +/*
> + * fsverity op that begins enabling verity.
> + *
> + * @filp: the file to enable verity on
> + *
> + * Begin enabling fsverity for the file. We drop any existing verity items
> + * and set the in progress bit.
> + *
> + * Returns 0 on success, negative error code on failure.
> + */
> +static int btrfs_begin_enable_verity(struct file *filp)
> +{
> +	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
> +	int ret;
> +
> +	ASSERT(inode_is_locked(file_inode(filp)));
> +
> +	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	ret = btrfs_drop_verity_items(inode);
> +	if (ret)
> +		goto out;
> +
> +	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
> +out:
> +	return ret;
> +}

There's no need for 'goto out' if no cleanup is being done.  Just return
directly instead.
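
Roughly like this, shown as a userspace model rather than the actual function.
The model_inode struct and its fields are hypothetical stand-ins for the
runtime flag and for btrfs_drop_verity_items():

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state the function touches. */
struct model_inode {
	bool verity_in_progress;	/* stands in for BTRFS_INODE_VERITY_IN_PROGRESS */
	int drop_result;		/* what the stubbed drop step returns */
};

/* Stub for the drop step: 0 on success, negative errno on failure. */
static int model_drop_items(struct model_inode *inode)
{
	return inode->drop_result;
}

/* Same flow as the quoted function, with direct returns and no label. */
int model_begin_enable_verity(struct model_inode *inode)
{
	int ret;

	if (inode->verity_in_progress)
		return -EBUSY;

	ret = model_drop_items(inode);
	if (ret)
		return ret;

	inode->verity_in_progress = true;
	return 0;
}

/* Usage: a second attempt while in progress is rejected. */
void model_begin_selfcheck(void)
{
	struct model_inode inode = { false, 0 };

	assert(model_begin_enable_verity(&inode) == 0);
	assert(model_begin_enable_verity(&inode) == -EBUSY);
}
```

Note that this only applies to the version in this patch; once a later patch
adds a transaction that must be ended, a cleanup path is needed again.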

> +static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
> +					       pgoff_t index,
> +					       unsigned long num_ra_pages)
> +{
> +	struct page *page;
> +	u64 off = (u64)index << PAGE_SHIFT;
> +	loff_t merkle_pos = merkle_file_pos(inode);
> +	int ret;
> +
> +	if (merkle_pos < 0)
> +		return ERR_PTR(merkle_pos);
> +	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
> +		return ERR_PTR(-EFBIG);
> +	index += merkle_pos >> PAGE_SHIFT;
> +again:
> +	page = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
> +	if (page) {
> +		if (PageUptodate(page))
> +			return page;
> +
> +		lock_page(page);
> +		/*
> +		 * We only insert uptodate pages, so !Uptodate has to be
> +		 * an error
> +		 */
> +		if (!PageUptodate(page)) {
> +			unlock_page(page);
> +			put_page(page);
> +			return ERR_PTR(-EIO);
> +		}
> +		unlock_page(page);
> +		return page;

As per the comment above, aren't the Merkle tree pages marked Uptodate before
being inserted into the page cache?  If so, isn't it unnecessary to re-check
Uptodate under the page lock?

> +struct btrfs_verity_descriptor_item {
> +	/* size of the verity descriptor in bytes */
> +	__le64 size;
> +	/*
> +	 * When we implement support for fscrypt, we will need to encrypt the
> +	 * Merkle tree for encrypted verity files. These 128 bits are for the
> +	 * eventual storage of an fscrypt initialization vector.
> +	 */
> +	__le64 reserved[2];
> +	__u8 encryption;
> +} __attribute__ ((__packed__));

Do you have something in mind for how an initialization vector stored here would
be used?  I'd have thought that if/when fscrypt support is added, you'd either
derive a new per-file key for encrypting the verity metadata specifically, or
you'd encrypt the verity metadata with the regular per-file key using IVs that
are chosen as if the verity metadata were appended to the file contents.
Neither case would require that any additional information be stored here.

- Eric

* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-07-11 14:52   ` Eric Biggers
@ 2021-07-28 14:29     ` David Sterba
  2021-09-14 18:25     ` Boris Burkov
  1 sibling, 0 replies; 18+ messages in thread
From: David Sterba @ 2021-07-28 14:29 UTC (permalink / raw)
  To: Eric Biggers; +Cc: Boris Burkov, linux-btrfs, linux-fscrypt, kernel-team

On Sun, Jul 11, 2021 at 09:52:56AM -0500, Eric Biggers wrote:
> On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> > Add support for fsverity in btrfs. To support the generic interface in
> > fs/verity, we add two new item types in the fs tree for inodes with
> > verity enabled. One stores the per-file verity descriptor and btrfs
> > verity item and the other stores the Merkle tree data itself.
> > 
> > Verity checking is done in end_page_read just before a page is marked
> > uptodate. This naturally handles a variety of edge cases like holes,
> > preallocated extents, and inline extents. Some care needs to be taken to
> > not try to verify pages past the end of the file, which are accessed by
> > the generic buffered file reading code under some circumstances like
> > reading to the end of the last page and trying to read again. Direct IO
> > on a verity file falls back to buffered reads.
> > 
> > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > re-walking up shared paths in the tree. For this reason, we need to
> > cache the Merkle tree data. Since the file is immutable after verity is
> > turned on, we can cache it at an index past EOF.
> > 
> > Use the new inode ro_flags to store verity on the inode item, so that we
> > can enable verity on a file, then rollback to an older kernel and still
> > mount the file system and read the file. Since we can't safely write the
> > file anymore without ruining the invariants of the Merkle tree, we mark
> > a ro_compat flag on the file system when a file has verity enabled.
> > 
> > Reported-by: kernel test robot <lkp@intel.com>
> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > Co-developed-by: Chris Mason <clm@fb.com>
> > Signed-off-by: Chris Mason <clm@fb.com>
> > Signed-off-by: Boris Burkov <boris@bur.io>
> 
> Generally looks good, feel free to add:
> 
> Acked-by: Eric Biggers <ebiggers@google.com>
> 
> A few minor comments below:

Thanks for the comments. Lots of them are minor fixups, I can do that
when applying the patch. There are some questions that I'll leave to
Boris to answer, I don't think they'd prevent merging the patches now
and fixing up later.

* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-06-30 20:01 ` [PATCH v6 2/3] btrfs: initial fsverity support Boris Burkov
  2021-07-11 14:52   ` Eric Biggers
@ 2021-07-28 15:05   ` David Sterba
  2021-09-14 17:32   ` Eric Biggers
  2 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2021-07-28 15:05 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> +struct btrfs_verity_descriptor_item {
> +	/* size of the verity descriptor in bytes */
> +	__le64 size;
> +	/*
> +	 * When we implement support for fscrypt, we will need to encrypt the
> +	 * Merkle tree for encrypted verity files. These 128 bits are for the
> +	 * eventual storage of an fscrypt initialization vector.
> +	 */
> +	__le64 reserved[2];

This does 2 for known extensions, do you think more would be desirable?
Eg. reserving 256 bits. We can detect that also at runtime by the item
size so it's extensible but just in case this could be done from the
beginning.

> +	__u8 encryption;
> +} __attribute__ ((__packed__));
> +
>  #endif /* _BTRFS_CTREE_H_ */
> -- 
> 2.31.1

* Re: [PATCH v6 0/3] btrfs: support fsverity
  2021-06-30 20:01 [PATCH v6 0/3] btrfs: support fsverity Boris Burkov
                   ` (2 preceding siblings ...)
  2021-06-30 20:01 ` [PATCH v6 3/3] btrfs: verity metadata orphan items Boris Burkov
@ 2021-07-28 15:24 ` David Sterba
  3 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2021-07-28 15:24 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, Jun 30, 2021 at 01:01:47PM -0700, Boris Burkov wrote:
> This patchset provides support for fsverity in btrfs.

I did one more pass and fixed issues pointed out by Eric and some other
minor style issues. Patchset has been moved from topic branch to
misc-next, please send separate patches or let me know if there are
trivial fixups needed. Thanks.

* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-06-30 20:01 ` [PATCH v6 2/3] btrfs: initial fsverity support Boris Burkov
  2021-07-11 14:52   ` Eric Biggers
  2021-07-28 15:05   ` David Sterba
@ 2021-09-14 17:32   ` Eric Biggers
  2021-09-14 17:49     ` Boris Burkov
  2 siblings, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2021-09-14 17:32 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

Hi Boris,

On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> Add support for fsverity in btrfs. To support the generic interface in
> fs/verity, we add two new item types in the fs tree for inodes with
> verity enabled. One stores the per-file verity descriptor and btrfs
> verity item and the other stores the Merkle tree data itself.
> 
> Verity checking is done in end_page_read just before a page is marked
> uptodate. This naturally handles a variety of edge cases like holes,
> preallocated extents, and inline extents. Some care needs to be taken to
> not try to verify pages past the end of the file, which are accessed by
> the generic buffered file reading code under some circumstances like
> reading to the end of the last page and trying to read again. Direct IO
> on a verity file falls back to buffered reads.
> 
> Verity relies on PageChecked for the Merkle tree data itself to avoid
> re-walking up shared paths in the tree. For this reason, we need to
> cache the Merkle tree data. Since the file is immutable after verity is
> turned on, we can cache it at an index past EOF.
> 
> Use the new inode ro_flags to store verity on the inode item, so that we
> can enable verity on a file, then rollback to an older kernel and still
> mount the file system and read the file. Since we can't safely write the
> file anymore without ruining the invariants of the Merkle tree, we mark
> a ro_compat flag on the file system when a file has verity enabled.

I want to mention the btrfs verity support in
Documentation/filesystems/fsverity.rst, and I have a couple questions:

1. Is the ro_compat filesystem flag still a thing?  The commit message claims it
   is, and BTRFS_FEATURE_COMPAT_RO_VERITY is defined in the code, but it doesn't
   seem to actually be used.  It's not needed since you found a way to make the
   inode flags ro_compat instead, right?

2. Is there a minimum version of btrfs-progs that is required to use btrfs
   verity?  With ext4 and f2fs, the fsck tools had to be updated, so there were
   minimum versions of the userspace tools required.

Thanks,

- Eric

* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-14 17:32   ` Eric Biggers
@ 2021-09-14 17:49     ` Boris Burkov
  2021-09-14 17:56       ` Eric Biggers
  2021-09-14 18:03       ` David Sterba
  0 siblings, 2 replies; 18+ messages in thread
From: Boris Burkov @ 2021-09-14 17:49 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Tue, Sep 14, 2021 at 10:32:59AM -0700, Eric Biggers wrote:
> Hi Boris,
> 
> On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> > Add support for fsverity in btrfs. To support the generic interface in
> > fs/verity, we add two new item types in the fs tree for inodes with
> > verity enabled. One stores the per-file verity descriptor and btrfs
> > verity item and the other stores the Merkle tree data itself.
> > 
> > Verity checking is done in end_page_read just before a page is marked
> > uptodate. This naturally handles a variety of edge cases like holes,
> > preallocated extents, and inline extents. Some care needs to be taken to
> > not try to verify pages past the end of the file, which are accessed by
> > the generic buffered file reading code under some circumstances like
> > reading to the end of the last page and trying to read again. Direct IO
> > on a verity file falls back to buffered reads.
> > 
> > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > re-walking up shared paths in the tree. For this reason, we need to
> > cache the Merkle tree data. Since the file is immutable after verity is
> > turned on, we can cache it at an index past EOF.
> > 
> > Use the new inode ro_flags to store verity on the inode item, so that we
> > can enable verity on a file, then rollback to an older kernel and still
> > mount the file system and read the file. Since we can't safely write the
> > file anymore without ruining the invariants of the Merkle tree, we mark
> > a ro_compat flag on the file system when a file has verity enabled.
> 
> I want to mention the btrfs verity support in
> Documentation/filesystems/fsverity.rst, and I have a couple questions:
> 
> 1. Is the ro_compat filesystem flag still a thing?  The commit message claims it
>    is, and BTRFS_FEATURE_COMPAT_RO_VERITY is defined in the code, but it doesn't
>    seem to actually be used.  It's not needed since you found a way to make the
>    inode flags ro_compat instead, right?

I believe it is still being used, unless I messed up the patch I sent in
the end. Taking a quick look, I think it's set at fs/btrfs/verity.c:558.

btrfs_set_fs_compat_ro(root->fs_info, VERITY);

I believe I still needed it because the tree checker doesn't scan every
inode on the filesystem when you mount, so it would only freak out about
a ro-compat inode later on if the inode didn't happen to be in a leaf
that was being checked at mount time.

> 
> 2. Is there a minimum version of btrfs-progs that is required to use btrfs
>    verity?  With ext4 and f2fs, the fsck tools had to be updated, so there were
>    minimum versions of the userspace tools required.

Hmm. I didn't update fsck, but now that you mention it, I think I need to...
I'll test it right away and get back to you, but I suspect I need to
hurry up and implement it.

Boris
> 
> Thanks,
> 
> - Eric

* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-14 17:49     ` Boris Burkov
@ 2021-09-14 17:56       ` Eric Biggers
  2021-09-14 18:34         ` Boris Burkov
  2021-09-14 18:03       ` David Sterba
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2021-09-14 17:56 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Tue, Sep 14, 2021 at 10:49:33AM -0700, Boris Burkov wrote:
> On Tue, Sep 14, 2021 at 10:32:59AM -0700, Eric Biggers wrote:
> > Hi Boris,
> > 
> > On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> > > Add support for fsverity in btrfs. To support the generic interface in
> > > fs/verity, we add two new item types in the fs tree for inodes with
> > > verity enabled. One stores the per-file verity descriptor and btrfs
> > > verity item and the other stores the Merkle tree data itself.
> > > 
> > > Verity checking is done in end_page_read just before a page is marked
> > > uptodate. This naturally handles a variety of edge cases like holes,
> > > preallocated extents, and inline extents. Some care needs to be taken to
> > > not try to verify pages past the end of the file, which are accessed by
> > > the generic buffered file reading code under some circumstances like
> > > reading to the end of the last page and trying to read again. Direct IO
> > > on a verity file falls back to buffered reads.
> > > 
> > > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > > re-walking up shared paths in the tree. For this reason, we need to
> > > cache the Merkle tree data. Since the file is immutable after verity is
> > > turned on, we can cache it at an index past EOF.
> > > 
> > > Use the new inode ro_flags to store verity on the inode item, so that we
> > > can enable verity on a file, then rollback to an older kernel and still
> > > mount the file system and read the file. Since we can't safely write the
> > > file anymore without ruining the invariants of the Merkle tree, we mark
> > > a ro_compat flag on the file system when a file has verity enabled.
> > 
> > I want to mention the btrfs verity support in
> > Documentation/filesystems/fsverity.rst, and I have a couple questions:
> > 
> > 1. Is the ro_compat filesystem flag still a thing?  The commit message claims it
> >    is, and BTRFS_FEATURE_COMPAT_RO_VERITY is defined in the code, but it doesn't
> >    seem to actually be used.  It's not needed since you found a way to make the
> >    inode flags ro_compat instead, right?
> 
> I believe it is still being used, unless I messed up the patch I sent in
> the end. Taking a quick look, I think it's set at fs/btrfs/verity.c:558.
> 
> btrfs_set_fs_compat_ro(root->fs_info, VERITY);
> 
> I believe I still needed it because the tree checker doesn't scan every
> inode on the filesystem when you mount, so it would only freak out about
> a ro-compat inode later on if the inode didn't happen to be in a leaf
> that was being checked at mount time.
> 

Okay, so it is used.  (Due to the macro, it didn't show up when grepping.)

Doesn't it defeat the purpose of a ro_compat inode flag if the whole filesystem
is marked with a ro_compat feature flag, though?  I thought that the point of
the ro_compat inode flag is to allow old kernels to mount the filesystem
read-write, with only verity files being forced to read-only.  That would be
more flexible than ext4's implementation of fs-verity which forces the whole
filesystem to read-only.  But it seems you're forcing the whole filesystem to
read-only anyway?

- Eric

* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-14 17:49     ` Boris Burkov
  2021-09-14 17:56       ` Eric Biggers
@ 2021-09-14 18:03       ` David Sterba
  1 sibling, 0 replies; 18+ messages in thread
From: David Sterba @ 2021-09-14 18:03 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Eric Biggers, linux-btrfs, linux-fscrypt, kernel-team

On Tue, Sep 14, 2021 at 10:49:33AM -0700, Boris Burkov wrote:
> >    inode flags ro_compat instead, right?
> 
> I believe it is still being used, unless I messed up the patch I sent in
> the end. Taking a quick look, I think it's set at fs/btrfs/verity.c:558.
> 
> btrfs_set_fs_compat_ro(root->fs_info, VERITY);
> 
> I believe I still needed it because the tree checker doesn't scan every
> inode on the filesystem when you mount, so it would only freak out about
> a ro-compat inode later on if the inode didn't happen to be in a leaf
> that was being checked at mount time.
> 
> > 
> > 2. Is there a minimum version of btrfs-progs that is required to use btrfs
> >    verity?  With ext4 and f2fs, the fsck tools had to be updated, so there were
> >    minimum versions of the userspace tools required.
> 
> Hmm. I didn't update fsck, but now that you mention it, I think I need to...
> I'll test it right away and get back to you, but I suspect I need to
> hurry up and implement it.

The timing of kernel features and btrfs-progs is to have them at the
same release number at the latest, but it could be any time earlier as
it also makes testing easier (released vs git snapshot).

* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-07-11 14:52   ` Eric Biggers
  2021-07-28 14:29     ` David Sterba
@ 2021-09-14 18:25     ` Boris Burkov
  1 sibling, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-09-14 18:25 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Sun, Jul 11, 2021 at 09:52:56AM -0500, Eric Biggers wrote:
> On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> > Add support for fsverity in btrfs. To support the generic interface in
> > fs/verity, we add two new item types in the fs tree for inodes with
> > verity enabled. One stores the per-file verity descriptor and btrfs
> > verity item and the other stores the Merkle tree data itself.
> > 
> > Verity checking is done in end_page_read just before a page is marked
> > uptodate. This naturally handles a variety of edge cases like holes,
> > preallocated extents, and inline extents. Some care needs to be taken to
> > not try to verify pages past the end of the file, which are accessed by
> > the generic buffered file reading code under some circumstances like
> > reading to the end of the last page and trying to read again. Direct IO
> > on a verity file falls back to buffered reads.
> > 
> > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > re-walking up shared paths in the tree. For this reason, we need to
> > cache the Merkle tree data. Since the file is immutable after verity is
> > turned on, we can cache it at an index past EOF.
> > 
> > Use the new inode ro_flags to store verity on the inode item, so that we
> > can enable verity on a file, then rollback to an older kernel and still
> > mount the file system and read the file. Since we can't safely write the
> > file anymore without ruining the invariants of the Merkle tree, we mark
> > a ro_compat flag on the file system when a file has verity enabled.
> > 
> > Reported-by: kernel test robot <lkp@intel.com>
> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > Co-developed-by: Chris Mason <clm@fb.com>
> > Signed-off-by: Chris Mason <clm@fb.com>
> > Signed-off-by: Boris Burkov <boris@bur.io>
> 
> Generally looks good, feel free to add:
> 
> Acked-by: Eric Biggers <ebiggers@google.com>
> 
> A few minor comments below:

I was on vacation when you sent this (and thanks again for all your
reviewing) but I forgot to get back to your questions.

Hopefully, the answers are still useful.

> 
> > @@ -2688,7 +2677,14 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> >  	       start + len <= page_offset(page) + PAGE_SIZE);
> >  
> >  	if (uptodate) {
> > -		btrfs_page_set_uptodate(fs_info, page, start, len);
> > +		if (!PageError(page) && !PageUptodate(page) &&
> > +		    start < i_size_read(page->mapping->host) &&
> > +		    fsverity_active(page->mapping->host) &&
> > +		    !fsverity_verify_page(page)) {
> > +			btrfs_page_set_error(fs_info, page, start, len);
> > +		} else {
> > +			btrfs_page_set_uptodate(fs_info, page, start, len);
> > +		}
> 
> When is it ever the case that PageError(page) or PageUptodate(page) here?

I suspect that the sub-page refactor which consolidated a ton of this
logic made some of these checks redundant. I definitely hit the
PageUptodate case while testing an earlier version, but I can't recall
the exact circumstance now.

> 
> Also: in general, fsverity_active() should be checked first, in order to avoid
> any overhead when !CONFIG_FS_VERITY.
> 
> > @@ -5014,6 +5020,10 @@ long btrfs_ioctl(struct file *file, unsigned int
> >  		return btrfs_ioctl_get_subvol_rootref(file, argp);
> >  	case BTRFS_IOC_INO_LOOKUP_USER:
> >  		return btrfs_ioctl_ino_lookup_user(file, argp);
> > +	case FS_IOC_ENABLE_VERITY:
> > +		return fsverity_ioctl_enable(file, (const void __user *)argp);
> > +	case FS_IOC_MEASURE_VERITY:
> > +		return fsverity_ioctl_measure(file, argp);
> 
> You could wire up FS_IOC_READ_VERITY_METADATA as well.  It should just work
> without having to do anything else.

Good point, will do.

> 
> > + * The merkle tree items:
> > + * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
> > + *
> > + * These also start at offset 0, and correspond to the merkle tree bytes.
> > + * So when fsverity asks for page 0 of the merkle tree, we pull up one page
> > + * starting at offset 0 for this key type.  These are also opaque to btrfs,
> > + * we're blindly storing whatever fsverity sends down.
> > + */
> 
> Is it defined which offsets, specifically, the Merkle tree items start at?  Or
> is any arrangement valid -- say, one filesystem might use one item per Merkle
> tree block, while another might have multiple blocks per item, while another
> might have multiple items per block?  What about the degenerate case where there
> is a separate btrfs item for each individual Merkle tree byte, and maybe even
> some empty items -- is that being considered a valid/supported on-disk format,
> or is there a limit?

The "offsets" here are a logical concept for arranging items by btrfs
keys in the btree. Really it's just an arbitrary u64 that is the least
significant part of the (objectid, type, offset) triple. e.g., for the
desc item, we use offset 0 to signal our internal item and offset 1+ for
the fsverity_descriptor struct.

With that said, read_key_bytes will iterate through this logical space
and shouldn't care how the items/leafs are laid out, so if we happened
to write items in the ways you described, I think it would work just
fine. I haven't tested this beyond maybe 1k vs 2k items, though.
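
To make the "iterate through the logical space" point concrete, here is a
minimal userspace sketch, not the kernel code. struct demo_item and
demo_read_bytes are hypothetical names; the sketch assumes the items are
already sorted by key offset and only models the copy loop, so it says
nothing about leaf layout or locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one btree item: a key offset plus its payload. */
struct demo_item {
	uint64_t offset;           /* logical position of data[0] */
	const unsigned char *data; /* item payload */
	size_t len;                /* payload length, may be 0 */
};

/*
 * Copy up to len bytes of the logical range starting at offset out of
 * items sorted by key offset. Item boundaries do not matter: items may be
 * tiny, large, or empty. Stops at the first gap and returns the number of
 * bytes copied, which may be a short read.
 */
static size_t demo_read_bytes(const struct demo_item *items, size_t nr,
			      uint64_t offset, unsigned char *dest, size_t len)
{
	size_t copied = 0;
	size_t i;

	assert(items != NULL || nr == 0);

	for (i = 0; i < nr && copied < len; i++) {
		uint64_t pos = offset + copied;
		uint64_t item_end = items[i].offset + items[i].len;
		size_t skip, n;

		if (item_end <= pos)
			continue; /* item ends before the cursor, or is empty */
		if (items[i].offset > pos)
			break;    /* gap in the logical space: short read */

		skip = (size_t)(pos - items[i].offset);
		n = items[i].len - skip;
		if (n > len - copied)
			n = len - copied;
		memcpy(dest + copied, items[i].data + skip, n);
		copied += n;
	}
	return copied;
}
```

With items of lengths 2, 0, 1 and 4 laid end to end from offset 0, a read
of 5 bytes at offset 1 crosses all four items and still returns one
contiguous run, which is the property described above.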

> 
> > +static loff_t merkle_file_pos(const struct inode *inode)
> > +{
> > +	loff_t ret;
> > +	u64 sz = inode->i_size;
> > +	u64 rounded = round_up(sz, MERKLE_START_ALIGN);
> > +
> > +	if (rounded > inode->i_sb->s_maxbytes)
> > +		return -EFBIG;
> > +	ret = rounded;
> > +	return ret;
> > +}
> 
> The 'ret' variable is unnecessary; this can just 'return rounded'.
> 
> > +/*
> > + * Drop all the items for this inode with this key_type.
> > + *
> > + * @inode: The inode to drop items for
> > + * @key_type: The type of items to drop (VERITY_DESC_ITEM or
> > + *            VERITY_MERKLE_ITEM)
> 
> BTRFS_VERITY_DESC_ITEM_KEY or BTRFS_VERITY_MERKLE_ITEM_KEY
> 
> > + *
> > + * Before doing a verity enable we cleanup any existing verity items.
> > + * This is also used to clean up if a verity enable failed half way
> > + * through.
> > + *
> > + * Returns number of dropped items on success, negative error code on failure.
> > + */
> > +static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)
> 
> The caller doesn't actually care about the number of dropped items, so this
> could just return 0 on success or a negative error code on failure.
> 
> > +	while (1) {
> > +		/*
> > +		 * 1 for the item being dropped
> > +		 */
> > +		trans = btrfs_start_transaction(root, 1);
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			goto out;
> > +		}
> > +
> > +		/*
> > +		 * Walk backwards through all the items until we find one
> > +		 * that isn't from our key type or objectid
> > +		 */
> > +		key.objectid = btrfs_ino(inode);
> > +		key.type = key_type;
> > +		key.offset = (u64)-1;
> > +
> > +		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
> > +		if (ret > 0) {
> > +			ret = 0;
> > +			/* No more keys of this type, we're done */
> > +			if (path->slots[0] == 0)
> > +				break;
> > +			path->slots[0]--;
> > +		} else if (ret < 0) {
> > +			btrfs_end_transaction(trans);
> > +			goto out;
> > +		}
> 
> Pardon my unfamiliarity with btrfs, but it looks like if the key isn't present,
> then btrfs_search_slot() returns the position where the key would be inserted.
> What if the previous leaf is completely full -- does btrfs_search_slot() return
> a new leaf, or does it return a pointer past the end of the previous one?  (It
> looks like the latter is assumed here.)  The comment for btrfs_search_slot()
> doesn't make this clear.

I believe that depends on the ins_len parameter. If ins_len is >0,
search_slot can do splitting and return a new leaf with the appropriate
new slot. If ins_len is <= 0, I believe it will return a slot at the end
of the leaf (see btrfs_bin_search/generic_bin_search). In this case,
ins_len is -1, so we don't expect it to split nodes/leaves.
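
As a rough illustration of the "slot at the end of the leaf" behavior, here
is a userspace sketch of a binary search over a single sorted array. It is
only a model of one leaf under my reading above: demo_bin_search and
demo_find_last_le are hypothetical helpers, not btrfs functions, and the
sketch does not cover moving between leaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Search one sorted "leaf". Returns 0 and sets *slot to the match if key is
 * present. Otherwise returns 1 and sets *slot to the index where key would
 * be inserted, which can equal nr (one past the last item).
 */
static int demo_bin_search(const uint64_t *keys, size_t nr, uint64_t key,
			   size_t *slot)
{
	size_t low = 0, high = nr;

	assert(slot != NULL);

	while (low < high) {
		size_t mid = low + (high - low) / 2;

		if (keys[mid] < key) {
			low = mid + 1;
		} else if (keys[mid] > key) {
			high = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}

/*
 * Find the last item whose key is <= the search key, mirroring the
 * "if not found, step back one slot" pattern in the patch.
 * Returns 0 on success, -1 if no such item exists in this leaf.
 */
static int demo_find_last_le(const uint64_t *keys, size_t nr, uint64_t key,
			     size_t *slot)
{
	int ret = demo_bin_search(keys, nr, key, slot);

	if (ret > 0) {
		if (*slot == 0)
			return -1; /* nothing at or before key */
		(*slot)--;
	}
	return 0;
}
```

Searching keys {10, 20, 30} for (u64)-1 misses with slot 3, one past the
end, and stepping back lands on slot 2, the last item.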
> 
> > +int btrfs_drop_verity_items(struct btrfs_inode *inode)
> > +{
> > +	int ret;
> > +
> > +	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
> > +	if (ret < 0)
> > +		goto out;
> > +	ret = drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
> > +	if (ret < 0)
> > +		goto out;
> > +	ret = 0;
> > +out:
> > +	return ret;
> > +}
> 
> This could be simplified a bit if drop_verity_items() returned 0 on success.
> 
> > +/*
> > + * Insert and write inode items with a given key type and offset.
> > + *
> > + * @inode: The inode to insert for.
> > + * @key_type: The key type to insert.
> > + * @offset: The item offset to insert at.
> > + * @src: Source data to write.
> > + * @len: Length of source data to write.
> > + *
> > + * Write len bytes from src into items of up to 1k length.
> > + * The inserted items will have key <ino, key_type, offset + off> where
> > + * off is consecutively increasing from 0 up to the last item ending at
> > + * offset + len.
> > + *
> > + * Returns 0 on success and a negative error code on failure.
> > + */
> > +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> > +			   const char *src, u64 len)
> 
> The comment says items of up to 1k length, but the code uses 2K.
> 
> > +/*
> > + * Read inode items of the given key type and offset from the btree.
> > + *
> > + * @inode: The inode to read items of.
> > + * @key_type: The key type to read.
> > + * @offset: The item offset to read from.
> > + * @dest: The buffer to read into. This parameter has slightly tricky
> > + *        semantics.  If it is NULL, the function will not do any copying
> > + *        and will just return the size of all the items up to len bytes.
> > + *        If dest_page is passed, then the function will kmap_local the
> > + *        page and ignore dest, but it must still be non-NULL to avoid the
> > + *        counting-only behavior.
> > + * @len: Length in bytes to read.
> > + * @dest_page: Copy into this page instead of the dest buffer.
> > + *
> > + * Helper function to read items from the btree.  This returns the number
> > + * of bytes read or < 0 for errors.  We can return short reads if the
> > + * items don't exist on disk or aren't big enough to fill the desired length.
> > + * Supports reading into a provided buffer (dest) or into the page cache
> > + *
> > + * Returns number of bytes read or a negative error code on failure.
> > + */
> > +static int read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> > +			  char *dest, u64 len, struct page *dest_page)
> > +{
> > +	struct btrfs_path *path;
> > +	struct btrfs_root *root = inode->root;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_key key;
> > +	u64 item_end;
> > +	u64 copy_end;
> > +	int copied = 0;
> > +	u32 copy_offset;
> > +	unsigned long copy_bytes;
> > +	unsigned long dest_offset = 0;
> > +	void *data;
> > +	char *kaddr = dest;
> > +	int ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	if (dest_page)
> > +		path->reada = READA_FORWARD;
> > +
> > +	key.objectid = btrfs_ino(inode);
> > +	key.type = key_type;
> > +	key.offset = offset;
> > +
> > +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +	if (ret < 0) {
> > +		goto out;
> > +	} else if (ret > 0) {
> > +		ret = 0;
> > +		if (path->slots[0] == 0)
> > +			goto out;
> > +		path->slots[0]--;
> > +	}
> 
> Same question about btrfs_search_slot() here.  If the key isn't found and the
> previous leaf is completely full, will it return a pointer past the end of it?
> 
> > +/*
> > + * fsverity op that begins enabling verity.
> > + *
> > + * @filp: the file to enable verity on
> > + *
> > + * Begin enabling fsverity for the file. We drop any existing verity items
> > + * and set the in progress bit.
> > + *
> > + * Returns 0 on success, negative error code on failure.
> > + */
> > +static int btrfs_begin_enable_verity(struct file *filp)
> > +{
> > +	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
> > +	int ret;
> > +
> > +	ASSERT(inode_is_locked(file_inode(filp)));
> > +
> > +	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags)) {
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	ret = btrfs_drop_verity_items(inode);
> > +	if (ret)
> > +		goto out;
> > +
> > +	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &inode->runtime_flags);
> > +out:
> > +	return ret;
> > +}
> 
> There's no need for 'goto out' if no cleanup is being done.  Just return
> directly instead.
> 
> > +static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
> > +					       pgoff_t index,
> > +					       unsigned long num_ra_pages)
> > +{
> > +	struct page *page;
> > +	u64 off = (u64)index << PAGE_SHIFT;
> > +	loff_t merkle_pos = merkle_file_pos(inode);
> > +	int ret;
> > +
> > +	if (merkle_pos < 0)
> > +		return ERR_PTR(merkle_pos);
> > +	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
> > +		return ERR_PTR(-EFBIG);
> > +	index += merkle_pos >> PAGE_SHIFT;
> > +again:
> > +	page = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
> > +	if (page) {
> > +		if (PageUptodate(page))
> > +			return page;
> > +
> > +		lock_page(page);
> > +		/*
> > +		 * We only insert uptodate pages, so !Uptodate has to be
> > +		 * an error
> > +		 */
> > +		if (!PageUptodate(page)) {
> > +			unlock_page(page);
> > +			put_page(page);
> > +			return ERR_PTR(-EIO);
> > +		}
> > +		unlock_page(page);
> > +		return page;
> 
> As per the comment above, aren't the Merkle tree pages marked Uptodate before
> being inserted into the page cache?  If so, isn't it unnecessary to re-check
> Uptodate under the page lock?

I feel like this might be caused by me confusing the metadata page in
the btree, which can get an error, with the Merkle tree page, which only
we write to. I'll think about it a little more to make sure there's
not a different explanation.

> 
> > +struct btrfs_verity_descriptor_item {
> > +	/* size of the verity descriptor in bytes */
> > +	__le64 size;
> > +	/*
> > +	 * When we implement support for fscrypt, we will need to encrypt the
> > +	 * Merkle tree for encrypted verity files. These 128 bits are for the
> > +	 * eventual storage of an fscrypt initialization vector.
> > +	 */
> > +	__le64 reserved[2];
> > +	__u8 encryption;
> > +} __attribute__ ((__packed__));
> 
> Do you have something in mind for how an initialization vector stored here would
> be used?  I'd have thought that if/when fscrypt support is added, you'd either
> derive a new per-file key for encrypting the verity metadata specifically, or
> you'd encrypt the verity metadata with the regular per-file key using IVs that
> are chosen as if the verity metadata were appended to the file contents.
> Neither case would require that any additional information be stored here.

Unfortunately, I can't give an intelligent answer to this one.

Omar Sandoval is working on fscrypt for btrfs and I spoke with him to
figure out what he needed for the verity metadata. I'll ask him to chime
in here :)

> 
> - Eric


* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-14 17:56       ` Eric Biggers
@ 2021-09-14 18:34         ` Boris Burkov
  2021-09-15 20:45           ` Eric Biggers
  0 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2021-09-14 18:34 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Tue, Sep 14, 2021 at 10:56:28AM -0700, Eric Biggers wrote:
> On Tue, Sep 14, 2021 at 10:49:33AM -0700, Boris Burkov wrote:
> > On Tue, Sep 14, 2021 at 10:32:59AM -0700, Eric Biggers wrote:
> > > Hi Boris,
> > > 
> > > On Wed, Jun 30, 2021 at 01:01:49PM -0700, Boris Burkov wrote:
> > > > Add support for fsverity in btrfs. To support the generic interface in
> > > > fs/verity, we add two new item types in the fs tree for inodes with
> > > > verity enabled. One stores the per-file verity descriptor and btrfs
> > > > verity item and the other stores the Merkle tree data itself.
> > > > 
> > > > Verity checking is done in end_page_read just before a page is marked
> > > > uptodate. This naturally handles a variety of edge cases like holes,
> > > > preallocated extents, and inline extents. Some care needs to be taken to
> > > > not try to verify pages past the end of the file, which are accessed by
> > > > the generic buffered file reading code under some circumstances like
> > > > reading to the end of the last page and trying to read again. Direct IO
> > > > on a verity file falls back to buffered reads.
> > > > 
> > > > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > > > re-walking up shared paths in the tree. For this reason, we need to
> > > > cache the Merkle tree data. Since the file is immutable after verity is
> > > > turned on, we can cache it at an index past EOF.
> > > > 
> > > > Use the new inode ro_flags to store verity on the inode item, so that we
> > > > can enable verity on a file, then rollback to an older kernel and still
> > > > mount the file system and read the file. Since we can't safely write the
> > > > file anymore without ruining the invariants of the Merkle tree, we mark
> > > > a ro_compat flag on the file system when a file has verity enabled.
> > > 
> > > I want to mention the btrfs verity support in
> > > Documentation/filesystems/fsverity.rst, and I have a couple questions:
> > > 
> > > 1. Is the ro_compat filesystem flag still a thing?  The commit message claims it
> > >    is, and BTRFS_FEATURE_COMPAT_RO_VERITY is defined in the code, but it doesn't
> > >    seem to actually be used.  It's not needed since you found a way to make the
> > >    inode flags ro_compat instead, right?
> > 
> > I believe it is still being used, unless I messed up the patch I sent in
> > the end. Taking a quick look, I think it's set at fs/btrfs/verity.c:558.
> > 
> > btrfs_set_fs_compat_ro(root->fs_info, VERITY);
> > 
> > I believe I still needed it because the tree checker doesn't scan every
> > inode on the filesystem when you mount, so it would only freak out about
> > a ro-compat inode later on if the inode didn't happen to be in a leaf
> > that was being checked at mount time.
> > 
> 
> Okay, so it is used.  (Due to the macro, it didn't show up when grepping.)
> 
> Doesn't it defeat the purpose of a ro_compat inode flag if the whole filesystem
> is marked with a ro_compat feature flag, though?  I thought that the point of
> the ro_compat inode flag is to allow old kernels to mount the filesystem
> read-write, with only verity files being forced to read-only.  That would be
> more flexible than ext4's implementation of fs-verity which forces the whole
> filesystem to read-only.  But it seems you're forcing the whole filesystem to
> read-only anyway?
> 
> - Eric

I was thinking of it in terms of "RO compat is the goal". Having new
inode flags totally broke that: an unknown flag was treated as a
corruption of the inode regardless of the fs being ro/rw. I think a check
on a live fs would just flip the fs ro, which was the goal anyway, but a
check that happened during mount would fail the mount, even for a
read-only fs.

Making it fully per file would be pretty cool! The only thing
really missing as far as I can tell is a way to mark a file read only
with the same semantics fsverity uses from within btrfs.

Boris


* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-14 18:34         ` Boris Burkov
@ 2021-09-15 20:45           ` Eric Biggers
  2021-09-15 21:01             ` Boris Burkov
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2021-09-15 20:45 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Tue, Sep 14, 2021 at 11:34:29AM -0700, Boris Burkov wrote:
> > Okay, so it is used.  (Due to the macro, it didn't show up when grepping.)
> > 
> > Doesn't it defeat the purpose of a ro_compat inode flag if the whole filesystem
> > is marked with a ro_compat feature flag, though?  I thought that the point of
> > the ro_compat inode flag is to allow old kernels to mount the filesystem
> > read-write, with only verity files being forced to read-only.  That would be
> > more flexible than ext4's implementation of fs-verity which forces the whole
> > filesystem to read-only.  But it seems you're forcing the whole filesystem to
> > read-only anyway?
> > 
> > - Eric
> 
> I was thinking of it in terms of "RO compat is the goal" and having new
> inode flags totally broke that and was treated as a corruption of the
> inode regardless of the fs being ro/rw. I think a check on a live fs
> would just flip the fs ro, which was the goal anyway, but a check that
> happened during mount would fail the mount, even for a read-only fs. 
> 
> Making it fully per file would be pretty cool! The only thing
> really missing as far as I can tell is a way to mark a file read only
> with the same semantics fsverity uses from within btrfs.

I don't understand.  Why are you bothering with the ro_compat inode flag at all
if it doesn't actually work?

- Eric


* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-15 20:45           ` Eric Biggers
@ 2021-09-15 21:01             ` Boris Burkov
  2021-09-15 21:12               ` Eric Biggers
  0 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2021-09-15 21:01 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, Sep 15, 2021 at 01:45:23PM -0700, Eric Biggers wrote:
> On Tue, Sep 14, 2021 at 11:34:29AM -0700, Boris Burkov wrote:
> > > Okay, so it is used.  (Due to the macro, it didn't show up when grepping.)
> > > 
> > > Doesn't it defeat the purpose of a ro_compat inode flag if the whole filesystem
> > > is marked with a ro_compat feature flag, though?  I thought that the point of
> > > the ro_compat inode flag is to allow old kernels to mount the filesystem
> > > read-write, with only verity files being forced to read-only.  That would be
> > > more flexible than ext4's implementation of fs-verity which forces the whole
> > > filesystem to read-only.  But it seems you're forcing the whole filesystem to
> > > read-only anyway?
> > > 
> > > - Eric
> > 
> > I was thinking of it in terms of "RO compat is the goal" and having new
> > inode flags totally broke that and was treated as a corruption of the
> > inode regardless of the fs being ro/rw. I think a check on a live fs
> > would just flip the fs ro, which was the goal anyway, but a check that
> > happened during mount would fail the mount, even for a read-only fs. 
> > 
> > Making it fully per file would be pretty cool! The only thing
> > really missing as far as I can tell is a way to mark a file read only
> > with the same semantics fsverity uses from within btrfs.
> 
> I don't understand.  Why are you bothering with the ro_compat inode flag at all
> if it doesn't actually work?
> 
> - Eric

Sorry, I explained that really badly.

My first try used only the ro-compat bit. That failed because, before my
changes, btrfs couldn't add an inode flag in a ro-compat way: an older
kernel could fail to mount.

To fix that, I had to work on the inode flag compatibility, which
evolved into this notion of inode ro-compat flags, which does work as
expected: if you see a file with an unknown ro-compat flag it's an error
if you aren't read-only. Read-only mount will never fail.

I think changing the semantics of the ro-compat inodes from:
"an unknown ro inode flag -> fs ro" to
"an unknown ro inode flag -> file ro"
could be a big win. I don't think there is a rush to do that, though? If
I add it now, on top of the existing code, then you might go back to a
kernel that can only mount the fs read-only or you might go back to one
which is clever enough to only force the file read-only.

Hope I'm making a bit more sense, now.


* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-15 21:01             ` Boris Burkov
@ 2021-09-15 21:12               ` Eric Biggers
  2021-09-15 23:14                 ` Boris Burkov
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2021-09-15 21:12 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, Sep 15, 2021 at 02:01:12PM -0700, Boris Burkov wrote:
> On Wed, Sep 15, 2021 at 01:45:23PM -0700, Eric Biggers wrote:
> > On Tue, Sep 14, 2021 at 11:34:29AM -0700, Boris Burkov wrote:
> > > > Okay, so it is used.  (Due to the macro, it didn't show up when grepping.)
> > > > 
> > > > Doesn't it defeat the purpose of a ro_compat inode flag if the whole filesystem
> > > > is marked with a ro_compat feature flag, though?  I thought that the point of
> > > > the ro_compat inode flag is to allow old kernels to mount the filesystem
> > > > read-write, with only verity files being forced to read-only.  That would be
> > > > more flexible than ext4's implementation of fs-verity which forces the whole
> > > > filesystem to read-only.  But it seems you're forcing the whole filesystem to
> > > > read-only anyway?
> > > > 
> > > > - Eric
> > > 
> > > I was thinking of it in terms of "RO compat is the goal" and having new
> > > inode flags totally broke that and was treated as a corruption of the
> > > inode regardless of the fs being ro/rw. I think a check on a live fs
> > > would just flip the fs ro, which was the goal anyway, but a check that
> > > happened during mount would fail the mount, even for a read-only fs. 
> > > 
> > > Making it fully per file would be pretty cool! The only thing
> > > really missing as far as I can tell is a way to mark a file read only
> > > with the same semantics fsverity uses from within btrfs.
> > 
> > I don't understand.  Why are you bothering with the ro_compat inode flag at all
> > if it doesn't actually work?
> > 
> > - Eric
> 
> Sorry I explained that really badly.
> 
> My first try was ro-compat bit only, that failed because btrfs couldn't
> add an inode flag in a ro-compat way before my changes, as it could
> fail to mount.
> 
> To fix that, I had to work on the inode flag compatibility, which
> evolved into this notion of inode ro-compat flags, which does work as
> expected: if you see a file with an unknown ro-compat flag it's an error
> if you aren't read-only. Read-only mount will never fail.
> 
> I think changing the semantics of the ro-compat inodes from:
> "an unknown ro inode flag -> fs ro" to
> "an unknown ro inode flag -> file ro"
> could be a big win. I don't think there is a rush to do that, though?

If you're forcing the filesystem to read-only anyway, why not just rely on the
filesystem-wide ro_compat flag, which you already implemented and which already
does that?  What benefit does the per-file ro_compat flag have, if it doesn't
actually make just the file read-only (which would be the expected behavior)?
You might as well just use a "regular" inode flag in that case.

- Eric


* Re: [PATCH v6 2/3] btrfs: initial fsverity support
  2021-09-15 21:12               ` Eric Biggers
@ 2021-09-15 23:14                 ` Boris Burkov
  0 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-09-15 23:14 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, Sep 15, 2021 at 02:12:37PM -0700, Eric Biggers wrote:
> On Wed, Sep 15, 2021 at 02:01:12PM -0700, Boris Burkov wrote:
> > On Wed, Sep 15, 2021 at 01:45:23PM -0700, Eric Biggers wrote:
> > > On Tue, Sep 14, 2021 at 11:34:29AM -0700, Boris Burkov wrote:
> > > > > Okay, so it is used.  (Due to the macro, it didn't show up when grepping.)
> > > > > 
> > > > > Doesn't it defeat the purpose of a ro_compat inode flag if the whole filesystem
> > > > > is marked with a ro_compat feature flag, though?  I thought that the point of
> > > > > the ro_compat inode flag is to allow old kernels to mount the filesystem
> > > > > read-write, with only verity files being forced to read-only.  That would be
> > > > > more flexible than ext4's implementation of fs-verity which forces the whole
> > > > > filesystem to read-only.  But it seems you're forcing the whole filesystem to
> > > > > read-only anyway?
> > > > > 
> > > > > - Eric
> > > > 
> > > > I was thinking of it in terms of "RO compat is the goal" and having new
> > > > inode flags totally broke that and was treated as a corruption of the
> > > > inode regardless of the fs being ro/rw. I think a check on a live fs
> > > > would just flip the fs ro, which was the goal anyway, but a check that
> > > > happened during mount would fail the mount, even for a read-only fs. 
> > > > 
> > > > Making it fully per file would be pretty cool! The only thing
> > > > really missing as far as I can tell is a way to mark a file read only
> > > > with the same semantics fsverity uses from within btrfs.
> > > 
> > > I don't understand.  Why are you bothering with the ro_compat inode flag at all
> > > if it doesn't actually work?
> > > 
> > > - Eric
> > 
> > Sorry I explained that really badly.
> > 
> > My first try was ro-compat bit only, that failed because btrfs couldn't
> > add an inode flag in a ro-compat way before my changes, as it could
> > fail to mount.
> > 
> > To fix that, I had to work on the inode flag compatibility, which
> > evolved into this notion of inode ro-compat flags, which does work as
> > expected: if you see a file with an unknown ro-compat flag it's an error
> > if you aren't read-only. Read-only mount will never fail.
> > 
> > I think changing the semantics of the ro-compat inodes from:
> > "an unknown ro inode flag -> fs ro" to
> > "an unknown ro inode flag -> file ro"
> > could be a big win. I don't think there is a rush to do that, though?
> 
> If you're forcing the filesystem to read-only anyway, why not just rely on the
> filesystem-wide ro_compat flag, which you already implemented and which already
> does that?  What benefit does the per-file ro_compat flag have, if it doesn't
> actually make just the file read-only (which would be the expected behavior)?
> You might as well just use a "regular" inode flag in that case.
> 
> - Eric

I couldn't use a regular inode flag because the btrfs tree checker will
call it an error when it sees a flag it doesn't recognize, regardless of
compat bits or fs read-only status. This is extra painful if the inode
with verity enabled is in a leaf that gets read in at mount time and
gets checked then.

A made-up example of what was happening:

mkfs.btrfs dev
mount dev mnt
touch /mnt/foo
fsverity enable /mnt/foo
<reboot to old kernel>
mount dev mnt
!!!FAIL!!!
mount -o ro dev mnt
!!!FAIL!!!

To get around this, I added a new flag field that wasn't checked as
aggressively -- and didn't call it an error on ro mount.

There is more excruciating detail, which I won't poorly re-create here,
in the commit message of:
"btrfs: add ro compat flags to inodes"

However, I really do agree that having done the work to add the new
class of flags, it makes sense to try to take advantage of it the way
you suggest, since per-file ro compat sounds a lot cooler than fs ro
compat. I was just trying to do what I could to make the fs compat bit
work at all.


end of thread, other threads:[~2021-09-15 23:15 UTC | newest]

Thread overview: 18+ messages
2021-06-30 20:01 [PATCH v6 0/3] btrfs: support fsverity Boris Burkov
2021-06-30 20:01 ` [PATCH v6 1/3] btrfs: add ro compat flags to inodes Boris Burkov
2021-06-30 20:01 ` [PATCH v6 2/3] btrfs: initial fsverity support Boris Burkov
2021-07-11 14:52   ` Eric Biggers
2021-07-28 14:29     ` David Sterba
2021-09-14 18:25     ` Boris Burkov
2021-07-28 15:05   ` David Sterba
2021-09-14 17:32   ` Eric Biggers
2021-09-14 17:49     ` Boris Burkov
2021-09-14 17:56       ` Eric Biggers
2021-09-14 18:34         ` Boris Burkov
2021-09-15 20:45           ` Eric Biggers
2021-09-15 21:01             ` Boris Burkov
2021-09-15 21:12               ` Eric Biggers
2021-09-15 23:14                 ` Boris Burkov
2021-09-14 18:03       ` David Sterba
2021-06-30 20:01 ` [PATCH v6 3/3] btrfs: verity metadata orphan items Boris Burkov
2021-07-28 15:24 ` [PATCH v6 0/3] btrfs: support fsverity David Sterba
