* [PATCH v4 0/5] btrfs: support fsverity
@ 2021-05-05 19:20 Boris Burkov
  0 siblings, 0 replies; 26+ messages in thread
From: Boris Burkov @ 2021-05-05 19:20 UTC
  To: linux-btrfs, linux-fscrypt, kernel-team

This patchset provides support for fsverity in btrfs.

At a high level, we store the verity descriptor and Merkle tree data
in the file system btree with the file's inode as the objectid, and
direct reads/writes to those items to implement the generic fsverity
interface required by fs/verity/.
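
As a rough illustration of that layout, here is a userspace sketch of how
a blob could be split into items keyed by (inode objectid, key type, byte
offset). It is modeled on write_key_bytes() from patch 2, which inserts at
most 1K per item. The sketch_* names and the SKETCH_MERKLE_TYPE value are
hypothetical and are not the on-disk definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ITEM_MAX    1024 /* patch 2 inserts at most 1K per item */
#define SKETCH_MERKLE_TYPE 1    /* hypothetical key type value */

struct sketch_key {
	uint64_t objectid; /* inode number of the verity file */
	uint8_t type;      /* descriptor or Merkle key type */
	uint64_t offset;   /* byte offset of this item's data in the blob */
};

/*
 * Describe the items needed to store len bytes starting at offset.
 * Fills keys[] and sizes[] and returns the number of items described,
 * which is at most max_items.
 */
static size_t sketch_split_items(uint64_t ino, uint8_t type, uint64_t offset,
				 uint64_t len, struct sketch_key *keys,
				 uint32_t *sizes, size_t max_items)
{
	size_t n = 0;

	assert(keys && sizes);
	while (len > 0 && n < max_items) {
		uint32_t chunk = len < SKETCH_ITEM_MAX ?
				 (uint32_t)len : SKETCH_ITEM_MAX;

		keys[n].objectid = ino;
		keys[n].type = type;
		keys[n].offset = offset;
		sizes[n] = chunk;

		/* the next item starts where this one ends */
		offset += chunk;
		len -= chunk;
		n++;
	}
	return n;
}
```

With this scheme, a 2500 byte blob for inode 257 would be described by
three items at offsets 0, 1024 and 2048.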

The first patch is a preparatory patch which adds a notion of
compat_flags to the btrfs_inode and inode_item in order to allow
enabling verity on a file without making the file system unmountable for
older kernels. (Otherwise it runs afoul of the leaf corruption check.)

The second patch is the bulk of the fsverity implementation. It
implements the fsverity interface and adds verity checks for the typical
file reading case.

The third patch cleans up the corner cases in readpage, covering inline
extents, preallocated extents, and holes.

The fourth patch handles direct I/O on a verity-enabled file by falling
back to buffered I/O.

The fifth patch handles crashes in the middle of a verity enable by using
orphan items.

I have tested this patch set in the following ways:
- xfstests auto group
- with a separate fix for btrfs fiemap and some light touches to the
  tests themselves: xfstests generic/572,573,574,575.
- new xfstest for btrfs specific corruptions (e.g. inline extents).
- new xfstest using dmlogwrites and dmsnapshot to exercise orphans.
- new xfstest using pwrite to exercise Merkle cache EFBIG cases.
- manual test with sleeps in kernel to force orphan vs. unlink race.
- manual end-to-end test with verity signed rpms.
--
changes for v4:
Patch 2:
- fix build without CONFIG_FS_VERITY
- fix assumption of short writes
- make true_size match the item contents in get_verity_descriptor
- rewrite overflow logic in terms of file position instead of cache index
- round up position by 64k instead of adding 2048 pages
- fix conflation of block index and page index in write_merkle_block
- ensure reserved fields are 0 in the new descriptor item.

changes for v3:
Patch 2: fix bug in overflow logic, fix interface of
get_verity_descriptor, truncate merkle cache items on failure, fix
various code/style issues.
Patch 5: fix extent data leak when verity races with unlink or O_TMPFILE
and removes a legitimate orphan, and the system is then interrupted at a
point where that orphan was needed.

changes for v2:
Patch 1: Unchanged.
Patch 2: Return EFBIG if Merkle data extends past s_maxbytes. Added special
descriptor item for encryption and to handle ERANGE case for
get_verity_descriptor. Improved function comments. Rebased onto subpage
read patches -- modified end_page_read to do verity check before marking
the page uptodate. Changed from full compat to ro_compat; merged sysfs
feature here.
Patch 3: Rebased onto subpage read patches.
Patch 4: Unchanged.
Patch 5: Used to be sysfs feature, now a new patch that handles orphaned
verity data.

Boris Burkov (4):
  btrfs: add compat_flags to btrfs_inode_item
  btrfs: check verity for reads of inline extents and holes
  btrfs: fallback to buffered io for verity files
  btrfs: verity metadata orphan items

Chris Mason (1):
  btrfs: initial fsverity support

 fs/btrfs/Makefile               |   1 +
 fs/btrfs/btrfs_inode.h          |   2 +
 fs/btrfs/ctree.h                |  32 +-
 fs/btrfs/delayed-inode.c        |   2 +
 fs/btrfs/extent_io.c            |  53 +--
 fs/btrfs/file.c                 |   9 +
 fs/btrfs/inode.c                |  25 +-
 fs/btrfs/ioctl.c                |  21 +-
 fs/btrfs/super.c                |   3 +
 fs/btrfs/sysfs.c                |   6 +
 fs/btrfs/tree-log.c             |   1 +
 fs/btrfs/verity.c               | 686 ++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h      |   2 +-
 include/uapi/linux/btrfs_tree.h |  22 +-
 14 files changed, 829 insertions(+), 36 deletions(-)
 create mode 100644 fs/btrfs/verity.c

-- 
2.30.2



* [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item
       [not found] <cover.1620241221.git.boris@bur.io>
@ 2021-05-05 19:20 ` Boris Burkov
  2021-05-11 19:11   ` David Sterba
  2021-05-25 18:12   ` Eric Biggers
  2021-05-05 19:20 ` [PATCH v4 2/5] btrfs: initial fsverity support Boris Burkov
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Boris Burkov @ 2021-05-05 19:20 UTC
  To: linux-btrfs, linux-fscrypt, kernel-team

The tree checker currently rejects unrecognized flags when it reads
btrfs_inode_item. Practically, this means that adding a new flag makes
the change backwards incompatible if the flag is ever set on a file.

Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a
new "compat_flags". These flags are zero on inode creation in btrfs and
mkfs and are ignored by an older kernel, so it should be safe to use
them in this way.
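
To make the compatibility argument concrete, here is a small userspace
sketch. It assumes, as described above, that an older kernel rejects
unknown bits in "flags" but never inspects the reserved fields. The
sketch_* names and the mask value are hypothetical and are not the real
tree-checker code or the real flag masks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration only, not the kernel's masks. */
#define SKETCH_KNOWN_FLAGS   0x00ffULL
#define SKETCH_COMPAT_VERITY (1ULL << 0)

struct sketch_inode_item {
	uint64_t flags;
	uint64_t compat_flags; /* an older kernel sees this as reserved[0] */
	uint64_t reserved[3];
};

/*
 * Model of the older kernel's check: any unrecognized bit in flags is
 * treated as corruption, while the reserved area is not looked at.
 */
static bool sketch_old_checker_accepts(const struct sketch_inode_item *item)
{
	assert(item);
	return (item->flags & ~SKETCH_KNOWN_FLAGS) == 0;
}
```

Under these assumptions, an item with a bit set in compat_flags is still
accepted by the older check, while the same information stored as a new
bit in flags would be rejected.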

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h          | 1 +
 fs/btrfs/ctree.h                | 2 ++
 fs/btrfs/delayed-inode.c        | 2 ++
 fs/btrfs/inode.c                | 3 +++
 fs/btrfs/ioctl.c                | 7 ++++---
 fs/btrfs/tree-log.c             | 1 +
 include/uapi/linux/btrfs_tree.h | 7 ++++++-
 7 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c652e19ad74e..e8dbc8e848ce 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -191,6 +191,7 @@ struct btrfs_inode {
 
 	/* flags field from the on disk inode */
 	u32 flags;
+	u64 compat_flags;
 
 	/*
 	 * Counters to keep track of the number of extent item's we may use due
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0f5b0b12762b..0546273a520b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1786,6 +1786,7 @@ BTRFS_SETGET_FUNCS(inode_gid, struct btrfs_inode_item, gid, 32);
 BTRFS_SETGET_FUNCS(inode_mode, struct btrfs_inode_item, mode, 32);
 BTRFS_SETGET_FUNCS(inode_rdev, struct btrfs_inode_item, rdev, 64);
 BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
+BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_generation, struct btrfs_inode_item,
 			 generation, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_sequence, struct btrfs_inode_item,
@@ -1803,6 +1804,7 @@ BTRFS_SETGET_STACK_FUNCS(stack_inode_gid, struct btrfs_inode_item, gid, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_mode, struct btrfs_inode_item, mode, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_rdev, struct btrfs_inode_item, rdev, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_flags, struct btrfs_inode_item, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
 BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 1a88f6214ebc..ef4e0265dbe3 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1718,6 +1718,7 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_inode_transid(inode_item, trans->transid);
 	btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
 	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
+	btrfs_set_stack_inode_compat_flags(inode_item, BTRFS_I(inode)->compat_flags);
 	btrfs_set_stack_inode_block_group(inode_item, 0);
 
 	btrfs_set_stack_timespec_sec(&inode_item->atime,
@@ -1776,6 +1777,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
 	inode->i_rdev = 0;
 	*rdev = btrfs_stack_inode_rdev(inode_item);
 	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
+	BTRFS_I(inode)->compat_flags = btrfs_stack_inode_compat_flags(inode_item);
 
 	inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
 	inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 69fcdf8f0b1c..d89000577f7f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3627,6 +3627,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
 
 	BTRFS_I(inode)->index_cnt = (u64)-1;
 	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
+	BTRFS_I(inode)->compat_flags = btrfs_inode_compat_flags(leaf, inode_item);
 
 cache_index:
 	/*
@@ -3793,6 +3794,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
 	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
@@ -8857,6 +8859,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->defrag_bytes = 0;
 	ei->disk_i_size = 0;
 	ei->flags = 0;
+	ei->compat_flags = 0;
 	ei->csum_bytes = 0;
 	ei->index_cnt = (u64)-1;
 	ei->dir_index = 0;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba0e4ddaf6b..ff335c192170 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -102,8 +102,9 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
  * Export internal inode flags to the format expected by the FS_IOC_GETFLAGS
  * ioctl.
  */
-static unsigned int btrfs_inode_flags_to_fsflags(unsigned int flags)
+static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 {
+	unsigned int flags = binode->flags;
 	unsigned int iflags = 0;
 
 	if (flags & BTRFS_INODE_SYNC)
@@ -156,7 +157,7 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
 {
 	struct btrfs_inode *binode = BTRFS_I(file_inode(file));
-	unsigned int flags = btrfs_inode_flags_to_fsflags(binode->flags);
+	unsigned int flags = btrfs_inode_flags_to_fsflags(binode);
 
 	if (copy_to_user(arg, &flags, sizeof(flags)))
 		return -EFAULT;
@@ -228,7 +229,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 
 	btrfs_inode_lock(inode, 0);
 	fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
-	old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
+	old_fsflags = btrfs_inode_flags_to_fsflags(binode);
 
 	ret = vfs_ioc_setflags_prepare(inode, old_fsflags, fsflags);
 	if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a0fc3a1390ab..3ef166a3485a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3944,6 +3944,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
 	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 58d7cff9afb1..ae25280316bd 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -574,11 +574,16 @@ struct btrfs_inode_item {
 	/* modification sequence number for NFS */
 	__le64 sequence;
 
+	/*
+	 * flags which aren't checked for corruption at mount
+	 * and can be added in a backwards compatible way
+	 */
+	__le64 compat_flags;
 	/*
 	 * a little future expansion, for more than this we can
 	 * just grow the inode item and version it
 	 */
-	__le64 reserved[4];
+	__le64 reserved[3];
 	struct btrfs_timespec atime;
 	struct btrfs_timespec ctime;
 	struct btrfs_timespec mtime;
-- 
2.30.2



* [PATCH v4 2/5] btrfs: initial fsverity support
       [not found] <cover.1620241221.git.boris@bur.io>
  2021-05-05 19:20 ` [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
@ 2021-05-05 19:20 ` Boris Burkov
  2021-05-06  0:09     ` kernel test robot
                     ` (3 more replies)
  2021-05-05 19:20 ` [PATCH v4 3/5] btrfs: check verity for reads of inline extents and holes Boris Burkov
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 26+ messages in thread
From: Boris Burkov @ 2021-05-05 19:20 UTC
  To: linux-btrfs, linux-fscrypt, kernel-team

From: Chris Mason <clm@fb.com>

Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and the other
stores the Merkle tree data itself.

Verity checking is done at the end of IOs to ensure each page is checked
before it is marked uptodate.
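
The ordering can be sketched in userspace as follows. This models only
the successful-I/O branch of end_page_read() in this patch. The struct
and the sketch_* helpers are hypothetical stand-ins for PageError(),
PageUptodate(), fsverity_active() and fsverity_verify_page().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_page_state {
	bool error;         /* stand-in for PageError() */
	bool uptodate;      /* stand-in for PageUptodate() */
	bool verity_active; /* stand-in for fsverity_active() */
};

/* True when the page must pass verification before it becomes uptodate. */
static bool sketch_needs_verity_check(const struct sketch_page_state *p,
				      uint64_t start, uint64_t i_size)
{
	assert(p);
	/* reads that start at or past i_size are skipped, as in the patch */
	return !p->error && !p->uptodate && start < i_size &&
	       p->verity_active;
}

/*
 * Returns 0 and marks the page uptodate, or -EIO if verification was
 * required and failed. verify_ok stands in for the verification result.
 */
static int sketch_end_page_read(struct sketch_page_state *p, uint64_t start,
				uint64_t i_size, bool verify_ok)
{
	if (sketch_needs_verity_check(p, start, i_size) && !verify_ok)
		return -EIO;
	p->uptodate = true;
	return 0;
}
```

The point of the ordering is that a page which fails verification never
becomes uptodate, so later readers cannot see unverified data through
the page cache.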

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.
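
As a worked example of where that cache could start, here is a userspace
sketch of the computation in merkle_file_pos() from this patch: round the
file size up to 64K and fail with EFBIG if the result is past the maximum
file size. The sketch_* name and the max_bytes parameter (a stand-in for
s_maxbytes) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MERKLE_ALIGN 65536ULL /* 64K, as in merkle_file_pos() */

/*
 * Return the file offset where the cached Merkle tree would start, or
 * -EFBIG if that offset would be past max_bytes.
 */
static int64_t sketch_merkle_file_pos(uint64_t i_size, uint64_t max_bytes)
{
	uint64_t pos;

	assert(max_bytes <= INT64_MAX);

	/* round up to the next 64K boundary (64K is a power of two) */
	pos = (i_size + SKETCH_MERKLE_ALIGN - 1) & ~(SKETCH_MERKLE_ALIGN - 1);

	/* pos < i_size catches wraparound of the addition above */
	if (pos < i_size || pos > max_bytes)
		return -EFBIG;
	return (int64_t)pos;
}
```

For example, a 1 byte file and a 65536 byte file both get a cache offset
of 65536, while a 65537 byte file gets 131072.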

Use the new inode compat_flags to record verity on the inode item, so
that we can enable verity on a file, then roll back to an older kernel
and still mount the file system and read the file. Since we can't safely
write the file anymore without ruining the invariants of the Merkle
tree, we mark a ro_compat flag on the file system when a file has verity
enabled.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/Makefile               |   1 +
 fs/btrfs/btrfs_inode.h          |   1 +
 fs/btrfs/ctree.h                |  30 +-
 fs/btrfs/extent_io.c            |  27 +-
 fs/btrfs/file.c                 |   6 +
 fs/btrfs/inode.c                |   7 +
 fs/btrfs/ioctl.c                |  14 +-
 fs/btrfs/super.c                |   3 +
 fs/btrfs/sysfs.c                |   6 +
 fs/btrfs/verity.c               | 617 ++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h      |   2 +-
 include/uapi/linux/btrfs_tree.h |  15 +
 12 files changed, 718 insertions(+), 11 deletions(-)
 create mode 100644 fs/btrfs/verity.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index cec88a66bd6c..3dcf9bcc2326 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -36,6 +36,7 @@ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
 btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
+btrfs-$(CONFIG_FS_VERITY) += verity.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index e8dbc8e848ce..4536548b9e79 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -51,6 +51,7 @@ enum {
 	 * the file range, inode's io_tree).
 	 */
 	BTRFS_INODE_NO_DELALLOC_FLUSH,
+	BTRFS_INODE_VERITY_IN_PROGRESS,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0546273a520b..c5aab6a639ef 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -279,9 +279,10 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
-#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
-	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
-	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
+#define BTRFS_FEATURE_COMPAT_RO_SUPP				\
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |		\
+	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID |	\
+	 BTRFS_FEATURE_COMPAT_RO_VERITY)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -1505,6 +1506,11 @@ do {                                                                   \
 	 BTRFS_INODE_COMPRESS |						\
 	 BTRFS_INODE_ROOT_ITEM_INIT)
 
+/*
+ * Inode compat flags
+ */
+#define BTRFS_INODE_VERITY		(1 << 0)
+
 struct btrfs_map_token {
 	struct extent_buffer *eb;
 	char *kaddr;
@@ -3766,6 +3772,24 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
 	return signal_pending(current);
 }
 
+/* verity.c */
+#ifdef CONFIG_FS_VERITY
+extern const struct fsverity_operations btrfs_verityops;
+int btrfs_drop_verity_items(struct btrfs_inode *inode);
+BTRFS_SETGET_FUNCS(verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
+		   encryption, 8);
+BTRFS_SETGET_FUNCS(verity_descriptor_size, struct btrfs_verity_descriptor_item, size, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
+			 encryption, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size, struct btrfs_verity_descriptor_item,
+			 size, 64);
+#else
+static inline int btrfs_drop_verity_items(struct btrfs_inode *inode)
+{
+	return 0;
+}
+#endif
+
 /* Sanity test specific functions */
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 void btrfs_test_destroy_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4fb33cadc41a..d1f57a4ad2fb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -13,6 +13,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/cleancache.h>
+#include <linux/fsverity.h>
 #include "misc.h"
 #include "extent_io.h"
 #include "extent-io-tree.h"
@@ -2862,15 +2863,28 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
 	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
 }
 
-static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
+static int end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	int ret = 0;
+	struct inode *inode = page->mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 
 	ASSERT(page_offset(page) <= start &&
 		start + len <= page_offset(page) + PAGE_SIZE);
 
 	if (uptodate) {
-		btrfs_page_set_uptodate(fs_info, page, start, len);
+		/*
+		 * buffered reads of a file with page alignment will issue a
+		 * 0 length read for one page past the end of file, so we must
+		 * explicitly skip checking verity on that page of zeros.
+		 */
+		if (!PageError(page) && !PageUptodate(page) &&
+		    start < i_size_read(inode) &&
+		    fsverity_active(inode) &&
+		    !fsverity_verify_page(page))
+			ret = -EIO;
+		else
+			btrfs_page_set_uptodate(fs_info, page, start, len);
 	} else {
 		btrfs_page_clear_uptodate(fs_info, page, start, len);
 		btrfs_page_set_error(fs_info, page, start, len);
@@ -2878,12 +2892,13 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 
 	if (fs_info->sectorsize == PAGE_SIZE)
 		unlock_page(page);
-	else if (is_data_inode(page->mapping->host))
+	else if (is_data_inode(inode))
 		/*
 		 * For subpage data, unlock the page if we're the last reader.
 		 * For subpage metadata, page lock is not utilized for read.
 		 */
 		btrfs_subpage_end_reader(fs_info, page, start, len);
+	return ret;
 }
 
 /*
@@ -3059,7 +3074,9 @@ static void end_bio_extent_readpage(struct bio *bio)
 		bio_offset += len;
 
 		/* Update page status and unlock */
-		end_page_read(page, uptodate, start, len);
+		ret = end_page_read(page, uptodate, start, len);
+		if (ret)
+			uptodate = 0;
 		endio_readpage_release_extent(&processed, BTRFS_I(inode),
 					      start, end, uptodate);
 	}
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3b10d98b4ebb..a99470303bd9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -16,6 +16,7 @@
 #include <linux/btrfs.h>
 #include <linux/uio.h>
 #include <linux/iversion.h>
+#include <linux/fsverity.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3593,7 +3594,12 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
+	int ret;
 	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+
+	ret = fsverity_file_open(inode, filp);
+	if (ret)
+		return ret;
 	return generic_file_open(inode, filp);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d89000577f7f..1b1101369777 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -32,6 +32,7 @@
 #include <linux/sched/mm.h>
 #include <linux/iomap.h>
 #include <asm/unaligned.h>
+#include <linux/fsverity.h>
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -5405,7 +5406,9 @@ void btrfs_evict_inode(struct inode *inode)
 
 	trace_btrfs_inode_evict(inode);
 
+
 	if (!root) {
+		fsverity_cleanup_inode(inode);
 		clear_inode(inode);
 		return;
 	}
@@ -5488,6 +5491,7 @@ void btrfs_evict_inode(struct inode *inode)
 	 * to retry these periodically in the future.
 	 */
 	btrfs_remove_delayed_node(BTRFS_I(inode));
+	fsverity_cleanup_inode(inode);
 	clear_inode(inode);
 }
 
@@ -9041,6 +9045,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(path->dentry);
 	u32 blocksize = inode->i_sb->s_blocksize;
 	u32 bi_flags = BTRFS_I(inode)->flags;
+	u32 bi_compat_flags = BTRFS_I(inode)->compat_flags;
 
 	stat->result_mask |= STATX_BTIME;
 	stat->btime.tv_sec = BTRFS_I(inode)->i_otime.tv_sec;
@@ -9053,6 +9058,8 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 		stat->attributes |= STATX_ATTR_IMMUTABLE;
 	if (bi_flags & BTRFS_INODE_NODUMP)
 		stat->attributes |= STATX_ATTR_NODUMP;
+	if (bi_compat_flags & BTRFS_INODE_VERITY)
+		stat->attributes |= STATX_ATTR_VERITY;
 
 	stat->attributes_mask |= (STATX_ATTR_APPEND |
 				  STATX_ATTR_COMPRESSED |
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ff335c192170..4b8f38fe4226 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -26,6 +26,7 @@
 #include <linux/btrfs.h>
 #include <linux/uaccess.h>
 #include <linux/iversion.h>
+#include <linux/fsverity.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "export.h"
@@ -105,6 +106,7 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
 static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 {
 	unsigned int flags = binode->flags;
+	unsigned int compat_flags = binode->compat_flags;
 	unsigned int iflags = 0;
 
 	if (flags & BTRFS_INODE_SYNC)
@@ -121,6 +123,8 @@ static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 		iflags |= FS_DIRSYNC_FL;
 	if (flags & BTRFS_INODE_NODATACOW)
 		iflags |= FS_NOCOW_FL;
+	if (compat_flags & BTRFS_INODE_VERITY)
+		iflags |= FS_VERITY_FL;
 
 	if (flags & BTRFS_INODE_NOCOMPRESS)
 		iflags |= FS_NOCOMP_FL;
@@ -148,10 +152,12 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
+	if (binode->compat_flags & BTRFS_INODE_VERITY)
+		new_fl |= S_VERITY;
 
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
-		      new_fl);
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
+		      S_VERITY, new_fl);
 }
 
 static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
@@ -5072,6 +5078,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_subvol_rootref(file, argp);
 	case BTRFS_IOC_INO_LOOKUP_USER:
 		return btrfs_ioctl_ino_lookup_user(file, argp);
+	case FS_IOC_ENABLE_VERITY:
+		return fsverity_ioctl_enable(file, (const void __user *)argp);
+	case FS_IOC_MEASURE_VERITY:
+		return fsverity_ioctl_measure(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4a396c1147f1..aa41ee30e3ca 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1365,6 +1365,9 @@ static int btrfs_fill_super(struct super_block *sb,
 	sb->s_op = &btrfs_super_ops;
 	sb->s_d_op = &btrfs_dentry_operations;
 	sb->s_export_op = &btrfs_export_ops;
+#ifdef CONFIG_FS_VERITY
+	sb->s_vop = &btrfs_verityops;
+#endif
 	sb->s_xattr = btrfs_xattr_handlers;
 	sb->s_time_gran = 1;
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 436ac7b4b334..331ea4febcb1 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -267,6 +267,9 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
 #ifdef CONFIG_BTRFS_DEBUG
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
+#ifdef CONFIG_FS_VERITY
+BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
+#endif
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -284,6 +287,9 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
 #ifdef CONFIG_BTRFS_DEBUG
 	BTRFS_FEAT_ATTR_PTR(zoned),
+#endif
+#ifdef CONFIG_FS_VERITY
+	BTRFS_FEAT_ATTR_PTR(verity),
 #endif
 	NULL
 };
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
new file mode 100644
index 000000000000..feaf5908b3d3
--- /dev/null
+++ b/fs/btrfs/verity.c
@@ -0,0 +1,617 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Facebook.  All rights reserved.
+ */
+
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/rwsem.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/iversion.h>
+#include <linux/fsverity.h>
+#include <linux/sched/mm.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "locking.h"
+
+/*
+ * Just like ext4, we cache the merkle tree in pages after EOF in the page
+ * cache.  Unlike ext4, we're storing these in dedicated btree items and
+ * not just shoving them after EOF in the file.  This means we'll need to
+ * do extra work to encrypt them once encryption is supported in btrfs,
+ * but btrfs has a lot of careful code around i_size and it seems better
+ * to make a new key type than try and adjust all of our expectations
+ * for i_size.
+ *
+ * fs verity items are stored under two different key types on disk.
+ *
+ * The descriptor items:
+ * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]
+ *
+ * At offset 0, we store a btrfs_verity_descriptor_item which tracks the
+ * size of the descriptor item and some extra data for encryption.
+ * Starting at offset 1, these hold the generic fs verity descriptor.
+ * These are opaque to btrfs, we just read and write them as a blob for
+ * the higher level verity code.  The most common size for this is 256 bytes.
+ *
+ * The merkle tree items:
+ * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
+ *
+ * These also start at offset 0, and correspond to the merkle tree bytes.
+ * So when fsverity asks for page 0 of the merkle tree, we pull up one page
+ * starting at offset 0 for this key type.  These are also opaque to btrfs,
+ * we're blindly storing whatever fsverity sends down.
+ */
+
+/*
+ * Compute the logical file offset where we cache the Merkle tree.
+ *
+ * @inode: the inode of the verity file
+ *
+ * For the purposes of caching the Merkle tree pages, as required by
+ * fs-verity, it is convenient to do size computations in terms of a file
+ * offset, rather than in terms of page indices.
+ *
+ * Returns the file offset on success, negative error code on failure.
+ */
+static loff_t merkle_file_pos(const struct inode *inode)
+{
+	u64 sz = inode->i_size;
+	u64 ret = round_up(sz, 65536);
+
+	if (ret > inode->i_sb->s_maxbytes)
+		return -EFBIG;
+	return ret;
+}
+
+/*
+ * Drop all the items for this inode with this key_type.
+ * @inode: The inode to drop items for
+ * @key_type: The type of items to drop (VERITY_DESC_ITEM or
+ *            VERITY_MERKLE_ITEM)
+ *
+ * Before doing a verity enable we cleanup any existing verity items.
+ *
+ * This is also used to clean up if a verity enable failed half way
+ * through.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (1) {
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			goto out;
+		}
+
+		/*
+		 * walk backwards through all the items until we find one
+		 * that isn't from our key type or objectid
+		 */
+		key.objectid = btrfs_ino(inode);
+		key.offset = (u64)-1;
+		key.type = key_type;
+
+		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+		if (ret > 0) {
+			ret = 0;
+			/* no more keys of this type, we're done */
+			if (path->slots[0] == 0)
+				break;
+			path->slots[0]--;
+		} else if (ret < 0) {
+			break;
+		}
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		/* no more keys of this type, we're done */
+		if (key.objectid != btrfs_ino(inode) || key.type != key_type)
+			break;
+
+		/*
+		 * this shouldn't be a performance sensitive function because
+		 * it's not used as part of truncate.  If it ever becomes
+		 * perf sensitive, change this to walk forward and bulk delete
+		 * items
+		 */
+		ret = btrfs_del_items(trans, root, path,
+				      path->slots[0], 1);
+		btrfs_release_path(path);
+		btrfs_end_transaction(trans);
+
+		if (ret)
+			goto out;
+	}
+
+	btrfs_end_transaction(trans);
+out:
+	btrfs_free_path(path);
+	return ret;
+
+}
+
+/*
+ * Insert and write inode items with a given key type and offset.
+ * @inode: The inode to insert for.
+ * @key_type: The key type to insert.
+ * @offset: The item offset to insert at.
+ * @src: Source data to write.
+ * @len: Length of source data to write.
+ *
+ * Write len bytes from src into items of up to 1k length.
+ * The inserted items will have key <ino, key_type, offset + off> where
+ * off is consecutively increasing from 0 up to the last item ending at
+ * offset + len.
+ *
+ * Returns 0 on success and a negative error code on failure.
+ */
+static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
+			   const char *src, u64 len)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_path *path;
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 copied = 0;
+	unsigned long copy_bytes;
+	unsigned long src_offset = 0;
+	void *data;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (len > 0) {
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+
+		key.objectid = btrfs_ino(inode);
+		key.offset = offset;
+		key.type = key_type;
+
+		/*
+		 * insert 1K at a time mostly to be friendly for smaller
+		 * leaf size filesystems
+		 */
+		copy_bytes = min_t(u64, len, 1024);
+
+		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
+		if (ret) {
+			btrfs_end_transaction(trans);
+			break;
+		}
+
+		leaf = path->nodes[0];
+
+		data = btrfs_item_ptr(leaf, path->slots[0], void);
+		write_extent_buffer(leaf, src + src_offset,
+				    (unsigned long)data, copy_bytes);
+		offset += copy_bytes;
+		src_offset += copy_bytes;
+		len -= copy_bytes;
+		copied += copy_bytes;
+
+		btrfs_release_path(path);
+		btrfs_end_transaction(trans);
+	}
+
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * Read inode items of the given key type and offset from the btree.
+ * @inode: The inode to read items of.
+ * @key_type: The key type to read.
+ * @offset: The item offset to read from.
+ * @dest: The buffer to read into. This parameter has slightly tricky
+ *        semantics.  If it is NULL, the function will not do any copying
+ *        and will just return the size of all the items up to len bytes.
+ *        If dest_page is passed, then the function will kmap_atomic the
+ *        page and ignore dest, but it must still be non-NULL to avoid the
+ *        counting-only behavior.
+ * @len: Length in bytes to read.
+ * @dest_page: Copy into this page instead of the dest buffer.
+ *
+ * Helper function to read items from the btree.  This returns the number
+ * of bytes read or < 0 for errors.  We can return short reads if the
+ * items don't exist on disk or aren't big enough to fill the desired length.
+ *
+ * Supports reading into a provided buffer (dest) or into the page cache
+ *
+ * Returns number of bytes read or a negative error code on failure.
+ */
+static ssize_t read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
+			  char *dest, u64 len, struct page *dest_page)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 item_end;
+	u64 copy_end;
+	u64 copied = 0;
+	u32 copy_offset;
+	unsigned long copy_bytes;
+	unsigned long dest_offset = 0;
+	void *data;
+	char *kaddr = dest;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	if (dest_page)
+		path->reada = READA_FORWARD;
+
+	key.objectid = btrfs_ino(inode);
+	key.offset = offset;
+	key.type = key_type;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0) {
+		goto out;
+	} else if (ret > 0) {
+		ret = 0;
+		if (path->slots[0] == 0)
+			goto out;
+		path->slots[0]--;
+	}
+
+	while (len > 0) {
+		leaf = path->nodes[0];
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		if (key.objectid != btrfs_ino(inode) ||
+		    key.type != key_type)
+			break;
+
+		item_end = btrfs_item_size_nr(leaf, path->slots[0]) + key.offset;
+
+		if (copied > 0) {
+			/*
+			 * once we've copied something, we want all of the items
+			 * to be sequential
+			 */
+			if (key.offset != offset)
+				break;
+		} else {
+			/*
+			 * our initial offset might be in the middle of an
+			 * item.  Make sure it all makes sense
+			 */
+			if (key.offset > offset)
+				break;
+			if (item_end <= offset)
+				break;
+		}
+
+		/* dest == NULL means just sum all the item lengths */
+		if (!dest)
+			copy_end = item_end;
+		else
+			copy_end = min(offset + len, item_end);
+
+		/* number of bytes in this item we want to copy */
+		copy_bytes = copy_end - offset;
+
+		/* offset from the start of item for copying */
+		copy_offset = offset - key.offset;
+
+		if (dest) {
+			if (dest_page)
+				kaddr = kmap_atomic(dest_page);
+
+			data = btrfs_item_ptr(leaf, path->slots[0], void);
+			read_extent_buffer(leaf, kaddr + dest_offset,
+					   (unsigned long)data + copy_offset,
+					   copy_bytes);
+
+			if (dest_page)
+				kunmap_atomic(kaddr);
+		}
+
+		offset += copy_bytes;
+		dest_offset += copy_bytes;
+		len -= copy_bytes;
+		copied += copy_bytes;
+
+		path->slots[0]++;
+		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+			/*
+			 * we've reached the last slot in this leaf and we need
+			 * to go to the next leaf.
+			 */
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0) {
+				break;
+			} else if (ret > 0) {
+				ret = 0;
+				break;
+			}
+		}
+	}
+out:
+	btrfs_free_path(path);
+	if (!ret)
+		ret = copied;
+	return ret;
+}
+
+/*
+ * Drop verity items from the btree and from the page cache
+ *
+ * @inode: the inode to drop items for
+ *
+ * If we fail partway through enabling verity, begin enabling verity while
+ * partial data is extant, or clean up orphaned verity data, we need to
+ * truncate it from the cache and delete the items themselves from the btree.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+int btrfs_drop_verity_items(struct btrfs_inode *inode)
+{
+	int ret;
+	struct inode *ino = &inode->vfs_inode;
+
+	truncate_inode_pages(ino->i_mapping, ino->i_size);
+	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
+	if (ret)
+		return ret;
+	return drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
+}
+
+/*
+ * fsverity op that begins enabling verity.
+ * fsverity calls this to ask us to setup the inode for enabling.  We
+ * drop any existing verity items and set the in progress bit.
+ */
+static int btrfs_begin_enable_verity(struct file *filp)
+{
+	struct inode *inode = file_inode(filp);
+	int ret;
+
+	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags))
+		return -EBUSY;
+
+	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
+	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
+	if (ret)
+		goto err;
+
+	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
+	return ret;
+
+}
+
+/*
+ * fsverity op that ends enabling verity.
+ * fsverity calls this when it's done with all of the pages in the file
+ * and all of the merkle items have been inserted.  We write the
+ * descriptor and update the inode in the btree to reflect its new life
+ * as a verity file.
+ */
+static int btrfs_end_enable_verity(struct file *filp, const void *desc,
+				  size_t desc_size, u64 merkle_tree_size)
+{
+	struct btrfs_trans_handle *trans;
+	struct inode *inode = file_inode(filp);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_verity_descriptor_item item;
+	int ret;
+
+	if (desc != NULL) {
+		/* write out the descriptor item */
+		memset(&item, 0, sizeof(item));
+		btrfs_set_stack_verity_descriptor_size(&item, desc_size);
+		ret = write_key_bytes(BTRFS_I(inode),
+				      BTRFS_VERITY_DESC_ITEM_KEY, 0,
+				      (const char *)&item, sizeof(item));
+		if (ret)
+			goto out;
+		/* write out the descriptor itself */
+		ret = write_key_bytes(BTRFS_I(inode),
+				      BTRFS_VERITY_DESC_ITEM_KEY, 1,
+				      desc, desc_size);
+		if (ret)
+			goto out;
+
+		/* update our inode flags to include fs verity */
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			goto out;
+		}
+		BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
+		btrfs_sync_inode_flags_to_i_flags(inode);
+		ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
+		btrfs_end_transaction(trans);
+	}
+
+out:
+	if (desc == NULL || ret) {
+		/* If we failed, drop all the verity items */
+		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
+		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+	} else
+		btrfs_set_fs_compat_ro(root->fs_info, VERITY);
+	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
+	return ret;
+}
+
+/*
+ * fsverity op that gets the struct fsverity_descriptor.
+ * fsverity reads the descriptor in two passes. The first pass calls
+ * with buf_size = 0 to query the size of the descriptor; the second
+ * pass passes a buffer of at least that size and actually reads the
+ * descriptor off disk.
+ */
+static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
+				       size_t buf_size)
+{
+	u64 true_size;
+	ssize_t ret = 0;
+	struct btrfs_verity_descriptor_item item;
+
+	memset(&item, 0, sizeof(item));
+	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
+			     0, (char *)&item, sizeof(item), NULL);
+	if (ret < 0)
+		return ret;
+
+	if (item.reserved[0] != 0 || item.reserved[1] != 0)
+		return -EUCLEAN;
+
+	true_size = btrfs_stack_verity_descriptor_size(&item);
+	if (true_size > INT_MAX)
+		return -EUCLEAN;
+
+	if (!buf_size)
+		return true_size;
+	if (buf_size < true_size)
+		return -ERANGE;
+
+	ret = read_key_bytes(BTRFS_I(inode),
+			     BTRFS_VERITY_DESC_ITEM_KEY, 1,
+			     buf, buf_size, NULL);
+	if (ret < 0)
+		return ret;
+	if (ret != true_size)
+		return -EIO;
+
+	return true_size;
+}
+
+/*
+ * fsverity op that reads and caches a merkle tree page.  These are stored
+ * in the btree, but we cache them in the inode's address space after EOF.
+ */
+static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
+					       pgoff_t index,
+					       unsigned long num_ra_pages)
+{
+	struct page *p;
+	u64 off = index << PAGE_SHIFT;
+	loff_t merkle_pos = merkle_file_pos(inode);
+	ssize_t ret;
+	int err;
+
+	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
+		return ERR_PTR(-EFBIG);
+	index += merkle_pos >> PAGE_SHIFT;
+again:
+	p = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
+	if (p) {
+		if (PageUptodate(p))
+			return p;
+
+		lock_page(p);
+		/*
+		 * we only insert uptodate pages, so !Uptodate has to be
+		 * an error
+		 */
+		if (!PageUptodate(p)) {
+			unlock_page(p);
+			put_page(p);
+			return ERR_PTR(-EIO);
+		}
+		unlock_page(p);
+		return p;
+	}
+
+	p = page_cache_alloc(inode->i_mapping);
+	if (!p)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * merkle item keys are indexed from byte 0 in the merkle tree.
+	 * they have the form:
+	 *
+	 * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset in bytes ]
+	 */
+	ret = read_key_bytes(BTRFS_I(inode),
+			     BTRFS_VERITY_MERKLE_ITEM_KEY, off,
+			     page_address(p), PAGE_SIZE, p);
+	if (ret < 0) {
+		put_page(p);
+		return ERR_PTR(ret);
+	}
+
+	/* zero fill any bytes we didn't write into the page */
+	if (ret < PAGE_SIZE) {
+		char *kaddr = kmap_atomic(p);
+
+		memset(kaddr + ret, 0, PAGE_SIZE - ret);
+		kunmap_atomic(kaddr);
+	}
+	SetPageUptodate(p);
+	err = add_to_page_cache_lru(p, inode->i_mapping, index,
+				    mapping_gfp_mask(inode->i_mapping));
+
+	if (!err) {
+		/* inserted and ready for fsverity */
+		unlock_page(p);
+	} else {
+		put_page(p);
+		/* did someone race us into inserting this page? */
+		if (err == -EEXIST)
+			goto again;
+		p = ERR_PTR(err);
+	}
+	return p;
+}
+
+/*
+ * fsverity op that writes a merkle tree block into the btree in 1k chunks.
+ */
+static int btrfs_write_merkle_tree_block(struct inode *inode, const void *buf,
+					u64 index, int log_blocksize)
+{
+	u64 off = index << log_blocksize;
+	u64 len = 1 << log_blocksize;
+
+	if (merkle_file_pos(inode) > inode->i_sb->s_maxbytes - off - len)
+		return -EFBIG;
+
+	return write_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY,
+			       off, buf, len);
+}
+
+const struct fsverity_operations btrfs_verityops = {
+	.begin_enable_verity	= btrfs_begin_enable_verity,
+	.end_enable_verity	= btrfs_end_enable_verity,
+	.get_verity_descriptor	= btrfs_get_verity_descriptor,
+	.read_merkle_tree_page	= btrfs_read_merkle_tree_page,
+	.write_merkle_tree_block = btrfs_write_merkle_tree_block,
+};
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5df73001aad4..fa21c8aac78d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -288,6 +288,7 @@ struct btrfs_ioctl_fs_info_args {
  * first mount when booting older kernel versions.
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
+#define BTRFS_FEATURE_COMPAT_RO_VERITY		(1ULL << 2)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
@@ -308,7 +309,6 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
-
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
 	__u64 compat_ro_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ae25280316bd..2be57416f886 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -118,6 +118,14 @@
 #define BTRFS_INODE_REF_KEY		12
 #define BTRFS_INODE_EXTREF_KEY		13
 #define BTRFS_XATTR_ITEM_KEY		24
+
+/*
+ * fsverity has a descriptor per file, followed by a number of
+ * sha or csum items indexed by offset into the file.
+ */
+#define BTRFS_VERITY_DESC_ITEM_KEY	36
+#define BTRFS_VERITY_MERKLE_ITEM_KEY	37
+
 #define BTRFS_ORPHAN_ITEM_KEY		48
 /* reserve 2-15 close to the inode for later flexibility */
 
@@ -996,4 +1004,11 @@ struct btrfs_qgroup_limit_item {
 	__le64 rsv_excl;
 } __attribute__ ((__packed__));
 
+struct btrfs_verity_descriptor_item {
+	/* size of the verity descriptor in bytes */
+	__le64 size;
+	__le64 reserved[2];
+	__u8 encryption;
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */
-- 
2.30.2



* [PATCH v4 3/5] btrfs: check verity for reads of inline extents and holes
       [not found] <cover.1620241221.git.boris@bur.io>
  2021-05-05 19:20 ` [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
  2021-05-05 19:20 ` [PATCH v4 2/5] btrfs: initial fsverity support Boris Burkov
@ 2021-05-05 19:20 ` Boris Burkov
  2021-05-12 17:57   ` David Sterba
  2021-05-05 19:20 ` [PATCH v4 4/5] btrfs: fallback to buffered io for verity files Boris Burkov
  2021-05-05 19:20 ` [PATCH v4 5/5] btrfs: verity metadata orphan items Boris Burkov
  4 siblings, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2021-05-05 19:20 UTC (permalink / raw)
  To: linux-btrfs, linux-fscrypt, kernel-team

The majority of reads receive a verity check after the bio is complete
as the page is marked uptodate. However, there is a class of reads which
are handled with btrfs logic in readpage, rather than by submitting a
bio. Specifically, these are inline extents, preallocated extents, and
holes. Tweak readpage so that if it is going to mark such a page
uptodate, it first checks verity on it.

Now, if a verity-enabled file has corruption in this class of EXTENT_DATA
items, it is detected at read time.

There is one annoying edge case that requires checking for start <
last_byte: if userspace reads to the end of a file with page aligned
size and then tries to keep reading (as cat does), the buffered read
code will try to read the page past the end of the file, and expects it
to be filled with 0s and marked uptodate. That bogus page is not part of
the data hashed by verity, so we have to ignore it.
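
The edge case above can be sketched in isolation. This is only an
illustrative model, not the code in the patch: sketch_page_needs_verity and
SKETCH_PAGE_SIZE are hypothetical names, and file_size stands in for what
the commit message calls last_byte (assumed here to be the file size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper: should a page that readpage filled without
 * submitting a bio (inline extent, preallocated extent, hole) be checked
 * against verity before it is marked uptodate?
 *
 * page_start is the file offset of the first byte of the page.
 */
static bool sketch_page_needs_verity(bool verity_enabled,
				     uint64_t page_start, uint64_t file_size)
{
	/* page offsets are always page aligned */
	assert(page_start % SKETCH_PAGE_SIZE == 0);

	if (!verity_enabled)
		return false;

	/*
	 * A page that starts at or past EOF is the zero-filled page the
	 * buffered read code asks for when userspace keeps reading. It is
	 * not part of the data hashed by verity, so it is skipped.
	 */
	return page_start < file_size;
}
```

With an 8192 byte file, the pages at offsets 0 and 4096 are checked, while
the page at offset 8192 is not.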

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent_io.c | 26 +++++++-------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d1f57a4ad2fb..d1493a876915 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2202,18 +2202,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-/*
- * helper function to set a given page up to date if all the
- * extents in the tree for that page are up to date
- */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
-{
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_SIZE - 1;
-	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
-		SetPageUptodate(page);
-}
-
 int free_io_failure(struct extent_io_tree *failure_tree,
 		    struct extent_io_tree *io_tree,
 		    struct io_failure_record *rec)
@@ -3467,14 +3455,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
-			end_page_read(page, true, cur, iosize);
+			ret = end_page_read(page, true, cur, iosize);
 			break;
 		}
 		em = __get_extent_map(inode, page, pg_offset, cur,
 				      end - cur + 1, em_cached);
 		if (IS_ERR_OR_NULL(em)) {
 			unlock_extent(tree, cur, end);
-			end_page_read(page, false, cur, end + 1 - cur);
+			ret = end_page_read(page, false, cur, end + 1 - cur);
 			break;
 		}
 		extent_offset = cur - em->start;
@@ -3555,9 +3543,10 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 
 			set_extent_uptodate(tree, cur, cur + iosize - 1,
 					    &cached, GFP_NOFS);
+
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
-			end_page_read(page, true, cur, iosize);
+			ret = end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3565,9 +3554,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		/* the get_extent function already copied into the page */
 		if (test_range_bit(tree, cur, cur_end,
 				   EXTENT_UPTODATE, 1, NULL)) {
-			check_page_uptodate(tree, page);
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, true, cur, iosize);
+			ret = end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3577,7 +3565,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		 */
 		if (block_start == EXTENT_MAP_INLINE) {
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, false, cur, iosize);
+			ret = end_page_read(page, false, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3595,7 +3583,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 			*bio_flags = this_bio_flag;
 		} else {
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, false, cur, iosize);
+			ret = end_page_read(page, false, cur, iosize);
 			goto out;
 		}
 		cur = cur + iosize;
-- 
2.30.2



* [PATCH v4 4/5] btrfs: fallback to buffered io for verity files
       [not found] <cover.1620241221.git.boris@bur.io>
                   ` (2 preceding siblings ...)
  2021-05-05 19:20 ` [PATCH v4 3/5] btrfs: check verity for reads of inline extents and holes Boris Burkov
@ 2021-05-05 19:20 ` Boris Burkov
  2021-05-05 19:20 ` [PATCH v4 5/5] btrfs: verity metadata orphan items Boris Burkov
  4 siblings, 0 replies; 26+ messages in thread
From: Boris Burkov @ 2021-05-05 19:20 UTC (permalink / raw)
  To: linux-btrfs, linux-fscrypt, kernel-team

Reading the contents with direct IO would circumvent verity checks, so
fall back to buffered reads. For what it's worth, this is how ext4
handles it as well.
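
As a rough userspace model of the fallback (all sketch_* names are
hypothetical and only stand in for the kernel read path): the direct read
helper reports zero bytes done for a verity file, and the caller then
services the remainder through the buffered path, which is where the verity
checks happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a file; not a kernel structure. */
struct sketch_file {
	const char *data;
	size_t size;
	bool verity;		/* models fsverity_active(inode) */
	int buffered_reads;	/* how often the buffered path ran */
};

static size_t sketch_copy(const struct sketch_file *f, char *buf,
			  size_t len, size_t pos)
{
	if (pos >= f->size)
		return 0;
	if (len > f->size - pos)
		len = f->size - pos;
	memcpy(buf, f->data + pos, len);
	return len;
}

/* Direct path: declines verity files by reporting zero bytes read. */
static size_t sketch_direct_read(struct sketch_file *f, char *buf,
				 size_t len, size_t pos)
{
	if (f->verity)
		return 0;
	return sketch_copy(f, buf, len, pos);
}

/* Buffered path: in the kernel this is where pages get verified. */
static size_t sketch_buffered_read(struct sketch_file *f, char *buf,
				   size_t len, size_t pos)
{
	f->buffered_reads++;
	return sketch_copy(f, buf, len, pos);
}

/* Caller: try direct IO if requested, then finish with buffered IO. */
static size_t sketch_read(struct sketch_file *f, char *buf, size_t len,
			  size_t pos, bool want_direct)
{
	size_t done = 0;

	assert(f != NULL && buf != NULL);
	if (want_direct) {
		done = sketch_direct_read(f, buf, len, pos);
		if (done == len)
			return done;
	}
	return done + sketch_buffered_read(f, buf + done, len - done,
					   pos + done);
}
```

The point of returning 0 rather than an error is that the caller treats it
as "nothing done yet" and continues, so userspace still gets its data.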

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/file.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a99470303bd9..34bc22fa6b1f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3628,6 +3628,9 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if (fsverity_active(inode))
+		return 0;
+
 	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
 		return 0;
 
-- 
2.30.2



* [PATCH v4 5/5] btrfs: verity metadata orphan items
       [not found] <cover.1620241221.git.boris@bur.io>
                   ` (3 preceding siblings ...)
  2021-05-05 19:20 ` [PATCH v4 4/5] btrfs: fallback to buffered io for verity files Boris Burkov
@ 2021-05-05 19:20 ` Boris Burkov
  2021-05-12 17:48   ` David Sterba
  4 siblings, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2021-05-05 19:20 UTC (permalink / raw)
  To: linux-btrfs, linux-fscrypt, kernel-team

If we don't finish creating fsverity metadata for a file, or fail to
clean up already created metadata after a failure, we could leak the
verity items.

To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
clean up the half-formed verity state.

There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.
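
A toy model of that race, assuming a single orphan flag shared by unlink
and verity (struct sketch_inode and the sketch_* functions are hypothetical,
not btrfs code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode and its single orphan item. */
struct sketch_inode {
	unsigned int nlink;
	bool has_orphan;
};

/* Starting to enable verity adds an orphan item for the inode. */
static void sketch_verity_begin(struct sketch_inode *inode)
{
	assert(inode != NULL);
	inode->has_orphan = true;
}

/* Unlinking the last name also relies on the orphan item. */
static void sketch_unlink(struct sketch_inode *inode)
{
	assert(inode != NULL);
	inode->nlink = 0;
	inode->has_orphan = true;
}

/*
 * Finishing verity deletes the orphan only if the inode still has links.
 * With nlink == 0 the orphan belongs to unlink (or O_TMPFILE) and has to
 * survive so the inode is cleaned up after a crash.
 */
static void sketch_verity_end(struct sketch_inode *inode)
{
	assert(inode != NULL);
	if (inode->nlink == 0)
		return;
	inode->has_orphan = false;
}
```

Running begin, unlink, end in that order leaves has_orphan set, which is
the outcome the nlink check is meant to guarantee.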

A final thing to note is that this is a resurrection of using orphans to
signal orphaned metadata that isn't the inode itself. This makes the
comment discussing deprecating that concept a bit messy in full context.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c  | 15 +++++++--
 fs/btrfs/verity.c | 79 ++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1b1101369777..67eba8db4b65 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3419,7 +3419,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 
 		/*
 		 * If we have an inode with links, there are a couple of
-		 * possibilities. Old kernels (before v3.12) used to create an
+		 * possibilities:
+		 *
+		 * 1. Old kernels (before v3.12) used to create an
 		 * orphan item for truncate indicating that there were possibly
 		 * extent items past i_size that needed to be deleted. In v3.12,
 		 * truncate was changed to update i_size in sync with the extent
@@ -3432,13 +3434,22 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * slim, and it's a pain to do the truncate now, so just delete
 		 * the orphan item.
 		 *
+		 * 2. We were halfway through creating fsverity metadata for the
+		 * file. In that case, the orphan item represents incomplete
+		 * fsverity metadata which must be cleaned up with
+		 * btrfs_drop_verity_items.
+		 *
 		 * It's also possible that this orphan item was supposed to be
 		 * deleted but wasn't. The inode number may have been reused,
 		 * but either way, we can delete the orphan item.
 		 */
 		if (ret == -ENOENT || inode->i_nlink) {
-			if (!ret)
+			if (!ret) {
+				ret = btrfs_drop_verity_items(BTRFS_I(inode));
 				iput(inode);
+				if (ret)
+					goto out;
+			}
 			trans = btrfs_start_transaction(root, 1);
 			if (IS_ERR(trans)) {
 				ret = PTR_ERR(trans);
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index feaf5908b3d3..3a115cdca018 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -362,6 +362,64 @@ static ssize_t read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset
 	return ret;
 }
 
+/*
+ * Helper to manage the transaction for adding an orphan item.
+ */
+static int add_orphan(struct btrfs_inode *inode)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	int ret = 0;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+	ret = btrfs_orphan_add(trans, inode);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
+	btrfs_end_transaction(trans);
+
+out:
+	return ret;
+}
+
+/*
+ * Helper to manage the transaction for deleting an orphan item.
+ */
+static int del_orphan(struct btrfs_inode *inode)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	int ret;
+
+	/*
+	 * If the inode has no links, it is either already unlinked, or was
+	 * created with O_TMPFILE. In either case, it should have an orphan from
+	 * that other operation. Rather than reference count the orphans, we
+	 * simply ignore them here, because we only invoke the verity path in
+	 * the orphan logic when i_nlink is non-zero.
+	 */
+	if (!inode->vfs_inode.i_nlink)
+		return 0;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
+
+	btrfs_end_transaction(trans);
+	return ret;
+}
+
 /*
  * Drop verity items from the btree and from the page cache
  *
@@ -399,11 +457,12 @@ static int btrfs_begin_enable_verity(struct file *filp)
 		return -EBUSY;
 
 	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
-	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
+
+	ret = btrfs_drop_verity_items(BTRFS_I(inode));
 	if (ret)
 		goto err;
 
-	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+	ret = add_orphan(BTRFS_I(inode));
 	if (ret)
 		goto err;
 
@@ -430,6 +489,7 @@ static int btrfs_end_enable_verity(struct file *filp, const void *desc,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_verity_descriptor_item item;
 	int ret;
+	int keep_orphan = 0;
 
 	if (desc != NULL) {
 		/* write out the descriptor item */
@@ -461,11 +521,20 @@ static int btrfs_end_enable_verity(struct file *filp, const void *desc,
 
 out:
 	if (desc == NULL || ret) {
-		/* If we failed, drop all the verity items */
-		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
-		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+		/*
+		 * If verity failed (here or in the generic code), drop all the
+		 * verity items.
+		 */
+		keep_orphan = btrfs_drop_verity_items(BTRFS_I(inode));
 	} else
 		btrfs_set_fs_compat_ro(root->fs_info, VERITY);
+	/*
+	 * If we are handling an error, but failed to drop the verity items,
+	 * we still need the orphan.
+	 */
+	if (!keep_orphan)
+		del_orphan(BTRFS_I(inode));
+
 	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
 	return ret;
 }
-- 
2.30.2



* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-05 19:20 ` [PATCH v4 2/5] btrfs: initial fsverity support Boris Burkov
@ 2021-05-06  0:09     ` kernel test robot
  2021-05-11 19:20   ` David Sterba
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2021-05-06  0:09 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, linux-fscrypt, kernel-team
  Cc: kbuild-all, clang-built-linux

[-- Attachment #1: Type: text/plain, Size: 4547 bytes --]

Hi Boris,

I love your patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[cannot apply to v5.12 next-20210505]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-add-compat_flags-to-btrfs_inode_item/20210506-042129
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-a014-20210505 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 8f5a2a5836cc8e4c1def2bdeb022e7b496623439)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/f61feb554b6d2710f17960a9775bf9ba41bb2dc2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Boris-Burkov/btrfs-add-compat_flags-to-btrfs_inode_item/20210506-042129
        git checkout f61feb554b6d2710f17960a9775bf9ba41bb2dc2
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/verity.c:434:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (desc != NULL) {
               ^~~~~~~~~~~~
   fs/btrfs/verity.c:470:9: note: uninitialized use occurs here
           return ret;
                  ^~~
   fs/btrfs/verity.c:434:2: note: remove the 'if' if its condition is always true
           if (desc != NULL) {
           ^~~~~~~~~~~~~~~~~~
   fs/btrfs/verity.c:432:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   1 warning generated.
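
For readers unfamiliar with this diagnostic, here is a reduced sketch of
the pattern with the initialization clang suggests applied. The function
and parameter names are hypothetical, and whether 0 is the right return
value for the desc == NULL path is a decision for the patch author; clang
only proposes it as a way to silence the warning.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reduced model of the flagged control flow: ret is assigned only inside
 * the 'if (desc != NULL)' branch but is read afterwards on every path.
 * write_result models the outcome of writing the descriptor, and
 * cleanup_calls counts how often the failure cleanup runs.
 */
static int sketch_end_enable(const void *desc, int write_result,
			     int *cleanup_calls)
{
	int ret = 0;	/* without "= 0", the desc == NULL path reads garbage */

	assert(cleanup_calls != NULL);

	if (desc != NULL)
		ret = write_result;

	/* both the NULL descriptor case and a failed write clean up */
	if (desc == NULL || ret)
		(*cleanup_calls)++;

	return ret;
}
```

Building the uninitialized variant with clang and W=1 reproduces a
-Wsometimes-uninitialized warning of the same shape as the one above.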


vim +434 fs/btrfs/verity.c

   417	
   418	/*
   419	 * fsverity op that ends enabling verity.
   420	 * fsverity calls this when it's done with all of the pages in the file
   421	 * and all of the merkle items have been inserted.  We write the
   422	 * descriptor and update the inode in the btree to reflect its new life
   423	 * as a verity file.
   424	 */
   425	static int btrfs_end_enable_verity(struct file *filp, const void *desc,
   426					  size_t desc_size, u64 merkle_tree_size)
   427	{
   428		struct btrfs_trans_handle *trans;
   429		struct inode *inode = file_inode(filp);
   430		struct btrfs_root *root = BTRFS_I(inode)->root;
   431		struct btrfs_verity_descriptor_item item;
   432		int ret;
   433	
 > 434		if (desc != NULL) {
   435			/* write out the descriptor item */
   436			memset(&item, 0, sizeof(item));
   437			btrfs_set_stack_verity_descriptor_size(&item, desc_size);
   438			ret = write_key_bytes(BTRFS_I(inode),
   439					      BTRFS_VERITY_DESC_ITEM_KEY, 0,
   440					      (const char *)&item, sizeof(item));
   441			if (ret)
   442				goto out;
   443			/* write out the descriptor itself */
   444			ret = write_key_bytes(BTRFS_I(inode),
   445					      BTRFS_VERITY_DESC_ITEM_KEY, 1,
   446					      desc, desc_size);
   447			if (ret)
   448				goto out;
   449	
   450			/* update our inode flags to include fs verity */
   451			trans = btrfs_start_transaction(root, 1);
   452			if (IS_ERR(trans)) {
   453				ret = PTR_ERR(trans);
   454				goto out;
   455			}
   456			BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
   457			btrfs_sync_inode_flags_to_i_flags(inode);
   458			ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
   459			btrfs_end_transaction(trans);
   460		}
   461	
   462	out:
   463		if (desc == NULL || ret) {
   464			/* If we failed, drop all the verity items */
   465			drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
   466			drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
   467		} else
   468			btrfs_set_fs_compat_ro(root->fs_info, VERITY);
   469		clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
   470		return ret;
   471	}
   472	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35932 bytes --]


* Re: [PATCH v4 2/5] btrfs: initial fsverity support
@ 2021-05-06  0:09     ` kernel test robot
  0 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2021-05-06  0:09 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4658 bytes --]

Hi Boris,

I love your patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[cannot apply to v5.12 next-20210505]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-add-compat_flags-to-btrfs_inode_item/20210506-042129
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-a014-20210505 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 8f5a2a5836cc8e4c1def2bdeb022e7b496623439)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/f61feb554b6d2710f17960a9775bf9ba41bb2dc2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Boris-Burkov/btrfs-add-compat_flags-to-btrfs_inode_item/20210506-042129
        git checkout f61feb554b6d2710f17960a9775bf9ba41bb2dc2
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/verity.c:434:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (desc != NULL) {
               ^~~~~~~~~~~~
   fs/btrfs/verity.c:470:9: note: uninitialized use occurs here
           return ret;
                  ^~~
   fs/btrfs/verity.c:434:2: note: remove the 'if' if its condition is always true
           if (desc != NULL) {
           ^~~~~~~~~~~~~~~~~~
   fs/btrfs/verity.c:432:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   1 warning generated.


vim +434 fs/btrfs/verity.c

   417	
   418	/*
   419	 * fsverity op that ends enabling verity.
   420	 * fsverity calls this when it's done with all of the pages in the file
   421	 * and all of the merkle items have been inserted.  We write the
   422	 * descriptor and update the inode in the btree to reflect its new life
   423	 * as a verity file.
   424	 */
   425	static int btrfs_end_enable_verity(struct file *filp, const void *desc,
   426					  size_t desc_size, u64 merkle_tree_size)
   427	{
   428		struct btrfs_trans_handle *trans;
   429		struct inode *inode = file_inode(filp);
   430		struct btrfs_root *root = BTRFS_I(inode)->root;
   431		struct btrfs_verity_descriptor_item item;
   432		int ret;
   433	
 > 434		if (desc != NULL) {
   435			/* write out the descriptor item */
   436			memset(&item, 0, sizeof(item));
   437			btrfs_set_stack_verity_descriptor_size(&item, desc_size);
   438			ret = write_key_bytes(BTRFS_I(inode),
   439					      BTRFS_VERITY_DESC_ITEM_KEY, 0,
   440					      (const char *)&item, sizeof(item));
   441			if (ret)
   442				goto out;
   443			/* write out the descriptor itself */
   444			ret = write_key_bytes(BTRFS_I(inode),
   445					      BTRFS_VERITY_DESC_ITEM_KEY, 1,
   446					      desc, desc_size);
   447			if (ret)
   448				goto out;
   449	
   450			/* update our inode flags to include fs verity */
   451			trans = btrfs_start_transaction(root, 1);
   452			if (IS_ERR(trans)) {
   453				ret = PTR_ERR(trans);
   454				goto out;
   455			}
   456			BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
   457			btrfs_sync_inode_flags_to_i_flags(inode);
   458			ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
   459			btrfs_end_transaction(trans);
   460		}
   461	
   462	out:
   463		if (desc == NULL || ret) {
   464			/* If we failed, drop all the verity items */
   465			drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
   466			drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
   467		} else
   468			btrfs_set_fs_compat_ro(root->fs_info, VERITY);
   469		clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
   470		return ret;
   471	}
   472	
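
The warning comes down to control flow: 'ret' is only assigned inside the
'if (desc != NULL)' body, yet 'return ret' runs on every path. Below is a
minimal userspace sketch of the same shape with the initialization clang
suggests. The names fake_write() and end_enable_sketch() are hypothetical
stand-ins, not code from the patch, and whether 0 is the right value to
return when desc is NULL is for the author to decide.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for write_key_bytes(): returns 0 on success or a
 * negative errno-style value (-22, hard-coded for this sketch).
 */
static int fake_write(const void *desc, size_t len)
{
	return (desc && len > 0) ? 0 : -22;
}

/*
 * Same control flow as the flagged function: ret is assigned only when
 * desc is non-NULL, so it needs an initial value for the NULL path.
 */
static int end_enable_sketch(const void *desc, size_t desc_size,
			     int *cleaned_up)
{
	int ret = 0;	/* the initialization the compiler note suggests */

	if (desc != NULL) {
		ret = fake_write(desc, desc_size);
		if (ret)
			goto out;
	}
out:
	/* Mirrors the cleanup branch: taken when desc is NULL or on error. */
	*cleaned_up = (desc == NULL || ret) ? 1 : 0;
	return ret;
}
```

With the initializer in place, the desc == NULL path returns a defined
value while still taking the cleanup branch.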

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 35932 bytes --]


* Re: [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-05-05 19:20 ` [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
@ 2021-05-11 19:11   ` David Sterba
  2021-05-17 21:48     ` David Sterba
  2021-05-25 18:12   ` Eric Biggers
  1 sibling, 1 reply; 26+ messages in thread
From: David Sterba @ 2021-05-11 19:11 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 05, 2021 at 12:20:39PM -0700, Boris Burkov wrote:
> The tree checker currently rejects unrecognized flags when it reads
> btrfs_inode_item. Practically, this means that adding a new flag makes
> the change backwards incompatible if the flag is ever set on a file.

Is there any other known problem when the verity flag is set? The tree
checker is naturally the first instance where it gets noticed and I
haven't found any other place as the flag would be just another one.

Why am I asking: allocating 8 bytes for incompat bits when we know that
likely just one will be used is wasteful. I'm exploring whether the
incompat flags can be squeezed into the existing flags. In the end the
size can be reduced to u16, u64 is really too much.

> Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a
> new "compat_flags". These flags are zero on inode creation in btrfs and
> mkfs and are ignored by an older kernel, so it should be safe to use
> them in this way.

Yeah this should be safe.

> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/btrfs_inode.h          | 1 +
>  fs/btrfs/ctree.h                | 2 ++
>  fs/btrfs/delayed-inode.c        | 2 ++
>  fs/btrfs/inode.c                | 3 +++
>  fs/btrfs/ioctl.c                | 7 ++++---
>  fs/btrfs/tree-log.c             | 1 +
>  include/uapi/linux/btrfs_tree.h | 7 ++++++-
>  7 files changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index c652e19ad74e..e8dbc8e848ce 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -191,6 +191,7 @@ struct btrfs_inode {
>  
>  	/* flags field from the on disk inode */
>  	u32 flags;
> +	u64 compat_flags;

This got me curious, u32 flags is for the in-memory inode, but the
on-disk inode_item::flags is u64

>  BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> +BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);

>  	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);

Which means we currently use only 32 bits and half of the on-disk
inode_item::flags is always zero. So the idea is to repurpose this for
the incompat bits (say upper 16 bits). With a minimal patch to the tree
checker we can make old kernels accept a verity-enabled filesystem.

It could be tricky, but for the backport only an additional bitmask would
need to be added to BTRFS_INODE_FLAG_MASK to ignore bits 48-63.

For proper support the inode_item::flags can be simply used as one space
where the split would be just logical, and IMO manageable.
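
To make the logical split concrete, here is a rough userspace sketch,
assuming the low 32 bits keep the existing flags and bits 48-63 carry the
new bits. All SKETCH_* names and helpers are made up for illustration and
are not existing btrfs definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 0-31 existing flags, bits 48-63 new bits. */
#define SKETCH_LOW_MASK		0x00000000ffffffffULL
#define SKETCH_HIGH_SHIFT	48
#define SKETCH_HIGH_MASK	(0xffffULL << SKETCH_HIGH_SHIFT)

/* Combine the two logical halves into one on-disk u64. */
static uint64_t pack_flags(uint32_t flags, uint16_t high)
{
	return (uint64_t)flags | ((uint64_t)high << SKETCH_HIGH_SHIFT);
}

static uint32_t low_flags(uint64_t on_disk)
{
	return (uint32_t)(on_disk & SKETCH_LOW_MASK);
}

static uint16_t high_flags(uint64_t on_disk)
{
	return (uint16_t)(on_disk >> SKETCH_HIGH_SHIFT);
}

/*
 * A checker that tolerates anything in bits 48-63 but still rejects
 * other unknown bits. Returns 0 if acceptable, -1 otherwise.
 */
static int check_flags(uint64_t on_disk, uint64_t known_mask)
{
	uint64_t unknown = on_disk & ~(known_mask | SKETCH_HIGH_MASK);

	return unknown ? -1 : 0;
}
```

The point of the sketch is that a backported checker only needs the extra
mask, while newer code reads the two halves through separate helpers.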

> +	btrfs_set_stack_inode_compat_flags(inode_item, BTRFS_I(inode)->compat_flags);
>  	btrfs_set_stack_inode_block_group(inode_item, 0);
>  
>  	btrfs_set_stack_timespec_sec(&inode_item->atime,
> @@ -1776,6 +1777,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
>  	inode->i_rdev = 0;
>  	*rdev = btrfs_stack_inode_rdev(inode_item);
>  	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);

As another example, the stack inode flags get trimmed from u64 to u32,
so old kernels won't notice.

> +	BTRFS_I(inode)->compat_flags = btrfs_stack_inode_compat_flags(inode_item);
>  
>  	inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
>  	inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 69fcdf8f0b1c..d89000577f7f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3627,6 +3627,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
>  
>  	BTRFS_I(inode)->index_cnt = (u64)-1;
>  	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
> +	BTRFS_I(inode)->compat_flags = btrfs_inode_compat_flags(leaf, inode_item);
>  
>  cache_index:
>  	/*
> @@ -3793,6 +3794,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
>  	btrfs_set_token_inode_transid(&token, item, trans->transid);
>  	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
>  	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
> +	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
>  	btrfs_set_token_inode_block_group(&token, item, 0);
>  }
>  
> @@ -8857,6 +8859,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>  	ei->defrag_bytes = 0;
>  	ei->disk_i_size = 0;
>  	ei->flags = 0;
> +	ei->compat_flags = 0;
>  	ei->csum_bytes = 0;
>  	ei->index_cnt = (u64)-1;
>  	ei->dir_index = 0;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0ba0e4ddaf6b..ff335c192170 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -102,8 +102,9 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
>   * Export internal inode flags to the format expected by the FS_IOC_GETFLAGS
>   * ioctl.
>   */
> -static unsigned int btrfs_inode_flags_to_fsflags(unsigned int flags)
> +static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
>  {
> +	unsigned int flags = binode->flags;

So code like the above must be careful to store the values in properly
sized integers.

>  	unsigned int iflags = 0;
>  
>  	if (flags & BTRFS_INODE_SYNC)
> @@ -156,7 +157,7 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
>  static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
>  {
>  	struct btrfs_inode *binode = BTRFS_I(file_inode(file));
> -	unsigned int flags = btrfs_inode_flags_to_fsflags(binode->flags);
> +	unsigned int flags = btrfs_inode_flags_to_fsflags(binode);

This no longer applies to 5.13-rc1: a patchset converted all the file
attributes to a common API, so btrfs_ioctl_getflags is now handled by
fileattr_fill_flags in btrfs_fileattr_get.

The fix seems to be simple as it's using the same helpers but I did not
get far enough to resolve the conflict completely, so please rebase and
resend.


* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-05 19:20 ` [PATCH v4 2/5] btrfs: initial fsverity support Boris Burkov
  2021-05-06  0:09     ` kernel test robot
@ 2021-05-11 19:20   ` David Sterba
  2021-05-11 20:31   ` David Sterba
  2021-05-12 17:34   ` David Sterba
  3 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2021-05-11 19:20 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 05, 2021 at 12:20:40PM -0700, Boris Burkov wrote:

> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d89000577f7f..1b1101369777 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9041,6 +9045,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
>  	struct inode *inode = d_inode(path->dentry);
>  	u32 blocksize = inode->i_sb->s_blocksize;
>  	u32 bi_flags = BTRFS_I(inode)->flags;
> +	u32 bi_compat_flags = BTRFS_I(inode)->compat_flags;

This is u64 -> u32, not a problem at the moment but the type width
should match.
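
A small userspace sketch of why the width matters once a flag lands above
bit 31. The flag values and helper names below are hypothetical; the patch
itself only defines bit 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag positions, chosen only to show the truncation. */
#define SKETCH_FLAG_LOW		(1ULL << 0)
#define SKETCH_FLAG_HIGH	(1ULL << 32)

/* Copies the u64 into a u32 first, like 'u32 x = inode->compat_flags'. */
static int flag_set_narrow(uint64_t compat_flags, uint64_t flag)
{
	uint32_t narrowed = (uint32_t)compat_flags;

	return (narrowed & flag) != 0;
}

/* Keeps the full width. */
static int flag_set_wide(uint64_t compat_flags, uint64_t flag)
{
	return (compat_flags & flag) != 0;
}
```

Both helpers agree for bit 0, which is why this is not a problem today,
but the narrow copy silently loses any flag at bit 32 or above.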

>  
>  	stat->result_mask |= STATX_BTIME;
>  	stat->btime.tv_sec = BTRFS_I(inode)->i_otime.tv_sec;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ff335c192170..4b8f38fe4226 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -105,6 +106,7 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
>  static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
>  {
>  	unsigned int flags = binode->flags;
> +	unsigned int compat_flags = binode->compat_flags;

And same here.

>  	unsigned int iflags = 0;
>  
>  	if (flags & BTRFS_INODE_SYNC)


* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-05 19:20 ` [PATCH v4 2/5] btrfs: initial fsverity support Boris Burkov
  2021-05-06  0:09     ` kernel test robot
  2021-05-11 19:20   ` David Sterba
@ 2021-05-11 20:31   ` David Sterba
  2021-05-11 21:52     ` Boris Burkov
  2021-05-13 19:19     ` Boris Burkov
  2021-05-12 17:34   ` David Sterba
  3 siblings, 2 replies; 26+ messages in thread
From: David Sterba @ 2021-05-11 20:31 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 05, 2021 at 12:20:40PM -0700, Boris Burkov wrote:
> From: Chris Mason <clm@fb.com>
> 
> Add support for fsverity in btrfs. To support the generic interface in
> fs/verity, we add two new item types in the fs tree for inodes with
> verity enabled. One stores the per-file verity descriptor and the other
> stores the Merkle tree data itself.
> 
> Verity checking is done at the end of IOs to ensure each page is checked
> before it is marked uptodate.
> 
> Verity relies on PageChecked for the Merkle tree data itself to avoid
> re-walking up shared paths in the tree. For this reason, we need to
> cache the Merkle tree data.

What's the estimated size of the Merkle tree data? Does the whole tree
need to be kept cached or is it only for data that are in page cache?

> Since the file is immutable after verity is
> turned on, we can cache it at an index past EOF.
> 
> Use the new inode compat_flags to store verity on the inode item, so
> that we can enable verity on a file, then rollback to an older kernel
> and still mount the file system and read the file. Since we can't safely
> write the file anymore without ruining the invariants of the Merkle
> tree, we mark a ro_compat flag on the file system when a file has verity
> enabled.
> 
> Signed-off-by: Chris Mason <clm@fb.com>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/Makefile               |   1 +
>  fs/btrfs/btrfs_inode.h          |   1 +
>  fs/btrfs/ctree.h                |  30 +-
>  fs/btrfs/extent_io.c            |  27 +-
>  fs/btrfs/file.c                 |   6 +
>  fs/btrfs/inode.c                |   7 +
>  fs/btrfs/ioctl.c                |  14 +-
>  fs/btrfs/super.c                |   3 +
>  fs/btrfs/sysfs.c                |   6 +
>  fs/btrfs/verity.c               | 617 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/btrfs.h      |   2 +-
>  include/uapi/linux/btrfs_tree.h |  15 +
>  12 files changed, 718 insertions(+), 11 deletions(-)
>  create mode 100644 fs/btrfs/verity.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index cec88a66bd6c..3dcf9bcc2326 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -36,6 +36,7 @@ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
>  btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
> +btrfs-$(CONFIG_FS_VERITY) += verity.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
>  	tests/extent-buffer-tests.o tests/btrfs-tests.o \
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index e8dbc8e848ce..4536548b9e79 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -51,6 +51,7 @@ enum {
>  	 * the file range, inode's io_tree).
>  	 */
>  	BTRFS_INODE_NO_DELALLOC_FLUSH,
> +	BTRFS_INODE_VERITY_IN_PROGRESS,

Please add a comment

>  };
>  
>  /* in memory btrfs inode */
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0546273a520b..c5aab6a639ef 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -279,9 +279,10 @@ struct btrfs_super_block {
>  #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
>  #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
>  
> -#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
> -	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
> -	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
> +#define BTRFS_FEATURE_COMPAT_RO_SUPP				\
> +	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |		\
> +	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID |	\
> +	 BTRFS_FEATURE_COMPAT_RO_VERITY)
>  
>  #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
>  #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
> @@ -1505,6 +1506,11 @@ do {                                                                   \
>  	 BTRFS_INODE_COMPRESS |						\
>  	 BTRFS_INODE_ROOT_ITEM_INIT)
>  
> +/*
> + * Inode compat flags
> + */
> +#define BTRFS_INODE_VERITY		(1 << 0)
> +
>  struct btrfs_map_token {
>  	struct extent_buffer *eb;
>  	char *kaddr;
> @@ -3766,6 +3772,24 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
>  	return signal_pending(current);
>  }
>  
> +/* verity.c */
> +#ifdef CONFIG_FS_VERITY
> +extern const struct fsverity_operations btrfs_verityops;
> +int btrfs_drop_verity_items(struct btrfs_inode *inode);
> +BTRFS_SETGET_FUNCS(verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
> +		   encryption, 8);
> +BTRFS_SETGET_FUNCS(verity_descriptor_size, struct btrfs_verity_descriptor_item, size, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
> +			 encryption, 8);
> +BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size, struct btrfs_verity_descriptor_item,
> +			 size, 64);
> +#else
> +static inline int btrfs_drop_verity_items(struct btrfs_inode *inode)
> +{
> +	return 0;
> +}
> +#endif
> +
>  /* Sanity test specific functions */
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  void btrfs_test_destroy_inode(struct inode *inode);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4fb33cadc41a..d1f57a4ad2fb 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -13,6 +13,7 @@
>  #include <linux/pagevec.h>
>  #include <linux/prefetch.h>
>  #include <linux/cleancache.h>
> +#include <linux/fsverity.h>
>  #include "misc.h"
>  #include "extent_io.h"
>  #include "extent-io-tree.h"
> @@ -2862,15 +2863,28 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
>  	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
>  }
>  
> -static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> +static int end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>  {
> -	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> +	int ret = 0;
> +	struct inode *inode = page->mapping->host;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  
>  	ASSERT(page_offset(page) <= start &&
>  		start + len <= page_offset(page) + PAGE_SIZE);
>  
>  	if (uptodate) {
> -		btrfs_page_set_uptodate(fs_info, page, start, len);
> +		/*
> +		 * buffered reads of a file with page alignment will issue a
> +		 * 0 length read for one page past the end of file, so we must
> +		 * explicitly skip checking verity on that page of zeros.
> +		 */
> +		if (!PageError(page) && !PageUptodate(page) &&
> +		    start < i_size_read(inode) &&
> +		    fsverity_active(inode) &&
> +		    !fsverity_verify_page(page))
> +			ret = -EIO;
> +		else
> +			btrfs_page_set_uptodate(fs_info, page, start, len);
>  	} else {
>  		btrfs_page_clear_uptodate(fs_info, page, start, len);
>  		btrfs_page_set_error(fs_info, page, start, len);
> @@ -2878,12 +2892,13 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>  
>  	if (fs_info->sectorsize == PAGE_SIZE)
>  		unlock_page(page);
> -	else if (is_data_inode(page->mapping->host))
> +	else if (is_data_inode(inode))
>  		/*
>  		 * For subpage data, unlock the page if we're the last reader.
>  		 * For subpage metadata, page lock is not utilized for read.
>  		 */
>  		btrfs_subpage_end_reader(fs_info, page, start, len);
> +	return ret;
>  }
>  
>  /*
> @@ -3059,7 +3074,9 @@ static void end_bio_extent_readpage(struct bio *bio)
>  		bio_offset += len;
>  
>  		/* Update page status and unlock */
> -		end_page_read(page, uptodate, start, len);
> +		ret = end_page_read(page, uptodate, start, len);
> +		if (ret)
> +			uptodate = 0;
>  		endio_readpage_release_extent(&processed, BTRFS_I(inode),
>  					      start, end, uptodate);
>  	}
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 3b10d98b4ebb..a99470303bd9 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -16,6 +16,7 @@
>  #include <linux/btrfs.h>
>  #include <linux/uio.h>
>  #include <linux/iversion.h>
> +#include <linux/fsverity.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -3593,7 +3594,12 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
>  
>  static int btrfs_file_open(struct inode *inode, struct file *filp)
>  {
> +	int ret;

Missing newline

>  	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
> +
> +	ret = fsverity_file_open(inode, filp);
> +	if (ret)
> +		return ret;
>  	return generic_file_open(inode, filp);
>  }
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d89000577f7f..1b1101369777 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -32,6 +32,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/iomap.h>
>  #include <asm/unaligned.h>
> +#include <linux/fsverity.h>
>  #include "misc.h"
>  #include "ctree.h"
>  #include "disk-io.h"
> @@ -5405,7 +5406,9 @@ void btrfs_evict_inode(struct inode *inode)
>  
>  	trace_btrfs_inode_evict(inode);
>  
> +

Extra newline

>  	if (!root) {
> +		fsverity_cleanup_inode(inode);
>  		clear_inode(inode);
>  		return;
>  	}
> @@ -5488,6 +5491,7 @@ void btrfs_evict_inode(struct inode *inode)
>  	 * to retry these periodically in the future.
>  	 */
>  	btrfs_remove_delayed_node(BTRFS_I(inode));
> +	fsverity_cleanup_inode(inode);
>  	clear_inode(inode);
>  }
>  
> @@ -9041,6 +9045,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
>  	struct inode *inode = d_inode(path->dentry);
>  	u32 blocksize = inode->i_sb->s_blocksize;
>  	u32 bi_flags = BTRFS_I(inode)->flags;
> +	u32 bi_compat_flags = BTRFS_I(inode)->compat_flags;
>  
>  	stat->result_mask |= STATX_BTIME;
>  	stat->btime.tv_sec = BTRFS_I(inode)->i_otime.tv_sec;
> @@ -9053,6 +9058,8 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
>  		stat->attributes |= STATX_ATTR_IMMUTABLE;
>  	if (bi_flags & BTRFS_INODE_NODUMP)
>  		stat->attributes |= STATX_ATTR_NODUMP;
> +	if (bi_compat_flags & BTRFS_INODE_VERITY)
> +		stat->attributes |= STATX_ATTR_VERITY;
>  
>  	stat->attributes_mask |= (STATX_ATTR_APPEND |
>  				  STATX_ATTR_COMPRESSED |
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ff335c192170..4b8f38fe4226 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -26,6 +26,7 @@
>  #include <linux/btrfs.h>
>  #include <linux/uaccess.h>
>  #include <linux/iversion.h>
> +#include <linux/fsverity.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "export.h"
> @@ -105,6 +106,7 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
>  static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
>  {
>  	unsigned int flags = binode->flags;
> +	unsigned int compat_flags = binode->compat_flags;
>  	unsigned int iflags = 0;
>  
>  	if (flags & BTRFS_INODE_SYNC)
> @@ -121,6 +123,8 @@ static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
>  		iflags |= FS_DIRSYNC_FL;
>  	if (flags & BTRFS_INODE_NODATACOW)
>  		iflags |= FS_NOCOW_FL;
> +	if (compat_flags & BTRFS_INODE_VERITY)
> +		iflags |= FS_VERITY_FL;
>  
>  	if (flags & BTRFS_INODE_NOCOMPRESS)
>  		iflags |= FS_NOCOMP_FL;
> @@ -148,10 +152,12 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
>  		new_fl |= S_NOATIME;
>  	if (binode->flags & BTRFS_INODE_DIRSYNC)
>  		new_fl |= S_DIRSYNC;
> +	if (binode->compat_flags & BTRFS_INODE_VERITY)
> +		new_fl |= S_VERITY;
>  
>  	set_mask_bits(&inode->i_flags,
> -		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> -		      new_fl);
> +		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
> +		      S_VERITY, new_fl);
>  }
>  
>  static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
> @@ -5072,6 +5078,10 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_get_subvol_rootref(file, argp);
>  	case BTRFS_IOC_INO_LOOKUP_USER:
>  		return btrfs_ioctl_ino_lookup_user(file, argp);
> +	case FS_IOC_ENABLE_VERITY:
> +		return fsverity_ioctl_enable(file, (const void __user *)argp);
> +	case FS_IOC_MEASURE_VERITY:
> +		return fsverity_ioctl_measure(file, argp);
>  	}
>  
>  	return -ENOTTY;
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 4a396c1147f1..aa41ee30e3ca 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1365,6 +1365,9 @@ static int btrfs_fill_super(struct super_block *sb,
>  	sb->s_op = &btrfs_super_ops;
>  	sb->s_d_op = &btrfs_dentry_operations;
>  	sb->s_export_op = &btrfs_export_ops;
> +#ifdef CONFIG_FS_VERITY
> +	sb->s_vop = &btrfs_verityops;
> +#endif
>  	sb->s_xattr = btrfs_xattr_handlers;
>  	sb->s_time_gran = 1;
>  #ifdef CONFIG_BTRFS_FS_POSIX_ACL
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 436ac7b4b334..331ea4febcb1 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -267,6 +267,9 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
>  #ifdef CONFIG_BTRFS_DEBUG
>  BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
>  #endif
> +#ifdef CONFIG_FS_VERITY
> +BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
> +#endif
>  
>  static struct attribute *btrfs_supported_feature_attrs[] = {
>  	BTRFS_FEAT_ATTR_PTR(mixed_backref),
> @@ -284,6 +287,9 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
>  	BTRFS_FEAT_ATTR_PTR(raid1c34),
>  #ifdef CONFIG_BTRFS_DEBUG
>  	BTRFS_FEAT_ATTR_PTR(zoned),
> +#endif
> +#ifdef CONFIG_FS_VERITY
> +	BTRFS_FEAT_ATTR_PTR(verity),
>  #endif
>  	NULL
>  };
> diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
> new file mode 100644
> index 000000000000..feaf5908b3d3
> --- /dev/null
> +++ b/fs/btrfs/verity.c
> @@ -0,0 +1,617 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2020 Facebook.  All rights reserved.
> + */

This is not necessary since we have the SPDX tags,
https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Copyright_notices_in_files.2C_SPDX

> +
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/rwsem.h>
> +#include <linux/xattr.h>
> +#include <linux/security.h>
> +#include <linux/posix_acl_xattr.h>
> +#include <linux/iversion.h>
> +#include <linux/fsverity.h>
> +#include <linux/sched/mm.h>
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "locking.h"
> +
> +/*
> + * Just like ext4, we cache the merkle tree in pages after EOF in the page
> + * cache.  Unlike ext4, we're storing these in dedicated btree items and
> + * not just shoving them after EOF in the file.  This means we'll need to
> + * do extra work to encrypt them once encryption is supported in btrfs,
> + * but btrfs has a lot of careful code around i_size and it seems better
> + * to make a new key type than try and adjust all of our expectations
> + * for i_size.

Can you please rephrase that so it starts with the actual design rather
than with what other filesystems do, and only then refer to ext4, if
needed?

> + *
> + * fs verity items are stored under two different key types on disk.
> + *
> + * The descriptor items:
> + * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]

Please put that to the key definitions

> + *
> + * At offset 0, we store a btrfs_verity_descriptor_item which tracks the
> + * size of the descriptor item and some extra data for encryption.
> + * Starting at offset 1, these hold the generic fs verity descriptor.
> + * These are opaque to btrfs, we just read and write them as a blob for
> + * the higher level verity code.  The most common size for this is 256 bytes.
> + *
> + * The merkle tree items:
> + * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
> + *
> + * These also start at offset 0, and correspond to the merkle tree bytes.
> + * So when fsverity asks for page 0 of the merkle tree, we pull up one page
> + * starting at offset 0 for this key type.  These are also opaque to btrfs,
> + * we're blindly storing whatever fsverity sends down.
> + */
> +
> +/*
> + * Compute the logical file offset where we cache the Merkle tree.
> + *
> + * @inode: the inode of the verity file
> + *
> + * For the purposes of caching the Merkle tree pages, as required by
> + * fs-verity, it is convenient to do size computations in terms of a file
> + * offset, rather than in terms of page indices.
> + *
> + * Returns the file offset on success, negative error code on failure.
> + */
> +static loff_t merkle_file_pos(const struct inode *inode)
> +{
> +	u64 sz = inode->i_size;
> +	u64 ret = round_up(sz, 65536);

What's the reason for the extra variable sz? If it is meant to make sure
the whole u64 is read consistently, then it needs protection and
i_size_read() if the status of the inode lock and the calling context are
unknown. The compiler will happily merge that into round_up(inode->i_size).

Next, what's the meaning of the constant 65536?

> +
> +	if (ret > inode->i_sb->s_maxbytes)
> +		return -EFBIG;
> +	return ret;

ret is u64 so the function should also return u64
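
One possible shape, sketched in userspace C: give the constant a name and
keep the u64 position separate from the error code by using an out
parameter. The out parameter is my assumption about how to reconcile a u64
result with the -EFBIG return, not something requested above, and
SKETCH_MERKLE_ALIGN is a made-up name for the 65536 in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical name for the alignment constant used in the patch. */
#define SKETCH_MERKLE_ALIGN	65536ULL

/*
 * Round x up to a multiple of align, which must be a power of two.
 * Assumes x + align - 1 does not overflow.
 */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Stores the aligned position in *pos and returns 0, or returns -EFBIG
 * (leaving *pos untouched) when the position exceeds max_bytes.
 */
static int merkle_pos_sketch(uint64_t i_size, uint64_t max_bytes,
			     uint64_t *pos)
{
	uint64_t ret = round_up_pow2(i_size, SKETCH_MERKLE_ALIGN);

	if (ret > max_bytes)
		return -EFBIG;
	*pos = ret;
	return 0;
}
```

This way the position never shares a signed return value with the error
code, and the meaning of 65536 is visible at the definition.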

> +}
> +
> +/*
> + * Drop all the items for this inode with this key_type.

Newline

> + * @inode: The inode to drop items for
> + * @key_type: The type of items to drop (VERITY_DESC_ITEM or
> + *            VERITY_MERKLE_ITEM)

Please format the arguments according to the description in
https://btrfs.wiki.kernel.org/index.php/Development_notes#Comments

> + *
> + * Before doing a verity enable we cleanup any existing verity items.
> + *
> + * This is also used to clean up if a verity enable failed half way
> + * through.
> + *
> + * Returns 0 on success, negative error code on failure.
> + */
> +static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	while (1) {
> +		trans = btrfs_start_transaction(root, 1);

Transaction start should document what the reserved items are, i.e. what
the 1 refers to.

> +		if (IS_ERR(trans)) {
> +			ret = PTR_ERR(trans);
> +			goto out;
> +		}
> +
> +		/*
> +		 * walk backwards through all the items until we find one

Comments should start with uppercase unless it's an identifier name.
This is in many other places so please update them as well.

> +		 * that isn't from our key type or objectid
> +		 */
> +		key.objectid = btrfs_ino(inode);
> +		key.offset = (u64)-1;
> +		key.type = key_type;

It's common to set the members in key order, i.e. objectid/type/offset;
this helps to keep the idea of the key in mind.

> +
> +		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
> +		if (ret > 0) {
> +			ret = 0;
> +			/* no more keys of this type, we're done */
> +			if (path->slots[0] == 0)
> +				break;
> +			path->slots[0]--;
> +		} else if (ret < 0) {
> +			break;
> +		}
> +
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +		/* no more keys of this type, we're done */
> +		if (key.objectid != btrfs_ino(inode) || key.type != key_type)
> +			break;
> +
> +		/*
> +		 * this shouldn't be a performance sensitive function because
> +		 * it's not used as part of truncate.  If it ever becomes
> +		 * perf sensitive, change this to walk forward and bulk delete
> +		 * items
> +		 */
> +		ret = btrfs_del_items(trans, root, path,
> +				      path->slots[0], 1);

This will probably fit on one line, no need to split the parameters.

> +		btrfs_release_path(path);
> +		btrfs_end_transaction(trans);
> +
> +		if (ret)
> +			goto out;
> +	}
> +
> +	btrfs_end_transaction(trans);
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +
> +}
> +
> +/*
> + * Insert and write inode items with a given key type and offset.
> + * @inode: The inode to insert for.
> + * @key_type: The key type to insert.
> + * @offset: The item offset to insert at.
> + * @src: Source data to write.
> + * @len: Length of source data to write.
> + *
> + * Write len bytes from src into items of up to 1k length.
> + * The inserted items will have key <ino, key_type, offset + off> where
> + * off is consecutively increasing from 0 up to the last item ending at
> + * offset + len.
> + *
> + * Returns 0 on success and a negative error code on failure.
> + */
> +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> +			   const char *src, u64 len)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	u64 copied = 0;
> +	unsigned long copy_bytes;
> +	unsigned long src_offset = 0;
> +	void *data;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	while (len > 0) {
> +		trans = btrfs_start_transaction(root, 1);

Same as before, please document what items are reserved

> +		if (IS_ERR(trans)) {
> +			ret = PTR_ERR(trans);
> +			break;
> +		}
> +
> +		key.objectid = btrfs_ino(inode);
> +		key.offset = offset;
> +		key.type = key_type;

objectid/type/offset

> +
> +		/*
> +		 * insert 1K at a time mostly to be friendly for smaller
> +		 * leaf size filesystems
> +		 */
> +		copy_bytes = min_t(u64, len, 1024);
> +
> +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> +		if (ret) {
> +			btrfs_end_transaction(trans);
> +			break;
> +		}
> +
> +		leaf = path->nodes[0];
> +
> +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> +		write_extent_buffer(leaf, src + src_offset,
> +				    (unsigned long)data, copy_bytes);
> +		offset += copy_bytes;
> +		src_offset += copy_bytes;
> +		len -= copy_bytes;
> +		copied += copy_bytes;
> +
> +		btrfs_release_path(path);
> +		btrfs_end_transaction(trans);
> +	}
> +
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +/*
> + * Read inode items of the given key type and offset from the btree.
> + * @inode: The inode to read items of.
> + * @key_type: The key type to read.
> + * @offset: The item offset to read from.
> + * @dest: The buffer to read into. This parameter has slightly tricky
> + *        semantics.  If it is NULL, the function will not do any copying
> + *        and will just return the size of all the items up to len bytes.
> + *        If dest_page is passed, then the function will kmap_atomic the
> + *        page and ignore dest, but it must still be non-NULL to avoid the
> + *        counting-only behavior.
> + * @len: Length in bytes to read.
> + * @dest_page: Copy into this page instead of the dest buffer.
> + *
> + * Helper function to read items from the btree.  This returns the number
> + * of bytes read or < 0 for errors.  We can return short reads if the
> + * items don't exist on disk or aren't big enough to fill the desired length.
> + *
> + * Supports reading into a provided buffer (dest) or into the page cache
> + *
> + * Returns number of bytes read or a negative error code on failure.
> + */
> +static ssize_t read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,

Why does this return ssize_t? The type is not utilized anywhere in the
function; an 'int' should work.

> +			  char *dest, u64 len, struct page *dest_page)
> +{
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	u64 item_end;
> +	u64 copy_end;
> +	u64 copied = 0;

Here copied is u64

> +	u32 copy_offset;
> +	unsigned long copy_bytes;
> +	unsigned long dest_offset = 0;
> +	void *data;
> +	char *kaddr = dest;
> +	int ret;

and ret is int

> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	if (dest_page)
> +		path->reada = READA_FORWARD;
> +
> +	key.objectid = btrfs_ino(inode);
> +	key.offset = offset;
> +	key.type = key_type;
> +
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0) {
> +		goto out;
> +	} else if (ret > 0) {
> +		ret = 0;
> +		if (path->slots[0] == 0)
> +			goto out;
> +		path->slots[0]--;
> +	}
> +
> +	while (len > 0) {
> +		leaf = path->nodes[0];
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> +		if (key.objectid != btrfs_ino(inode) ||
> +		    key.type != key_type)
> +			break;
> +
> +		item_end = btrfs_item_size_nr(leaf, path->slots[0]) + key.offset;
> +
> +		if (copied > 0) {
> +			/*
> +			 * once we've copied something, we want all of the items
> +			 * to be sequential
> +			 */
> +			if (key.offset != offset)
> +				break;
> +		} else {
> +			/*
> +			 * our initial offset might be in the middle of an
> +			 * item.  Make sure it all makes sense
> +			 */
> +			if (key.offset > offset)
> +				break;
> +			if (item_end <= offset)
> +				break;
> +		}
> +
> +		/* desc = NULL to just sum all the item lengths */
> +		if (!dest)
> +			copy_end = item_end;
> +		else
> +			copy_end = min(offset + len, item_end);
> +
> +		/* number of bytes in this item we want to copy */
> +		copy_bytes = copy_end - offset;
> +
> +		/* offset from the start of item for copying */
> +		copy_offset = offset - key.offset;
> +
> +		if (dest) {
> +			if (dest_page)
> +				kaddr = kmap_atomic(dest_page);

I think kmap_atomic should not be used; there was a patchset cleaning it
up and replacing it with kmap_local, so we should not introduce new
instances.

> +
> +			data = btrfs_item_ptr(leaf, path->slots[0], void);
> +			read_extent_buffer(leaf, kaddr + dest_offset,
> +					   (unsigned long)data + copy_offset,
> +					   copy_bytes);
> +
> +			if (dest_page)
> +				kunmap_atomic(kaddr);
> +		}
> +
> +		offset += copy_bytes;
> +		dest_offset += copy_bytes;
> +		len -= copy_bytes;
> +		copied += copy_bytes;
> +
> +		path->slots[0]++;
> +		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> +			/*
> +			 * we've reached the last slot in this leaf and we need
> +			 * to go to the next leaf.
> +			 */
> +			ret = btrfs_next_leaf(root, path);
> +			if (ret < 0) {
> +				break;
> +			} else if (ret > 0) {
> +				ret = 0;
> +				break;
> +			}
> +		}
> +	}
> +out:
> +	btrfs_free_path(path);
> +	if (!ret)
> +		ret = copied;
> +	return ret;

In the end it's int and copied u64 is truncated to int.
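To illustrate the truncation concern, here is a minimal userspace sketch
of a guard that refuses to narrow a 64-bit byte count silently. The helper
name bytes_to_int_ret and the choice of -EOVERFLOW are assumptions made
for this example, not something taken from the patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a 64-bit byte count into an int return value.
 * Hypothetical helper: reports an error instead of truncating when the
 * count does not fit in an int.
 */
static int bytes_to_int_ret(uint64_t copied)
{
	if (copied > INT_MAX)
		return -EOVERFLOW;
	return (int)copied;
}
```

A caller that reads at most one page or one descriptor at a time would
never hit the error path, which is why a plain int return can be enough.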

> +}
> +
> +/*
> + * Drop verity items from the btree and from the page cache
> + *
> + * @inode: the inode to drop items for
> + *
> + * If we fail partway through enabling verity, enable verity and have some
> + * partial data extant, or cleanup orphaned verity data, we need to truncate it
                   extent

> + * from the cache and delete the items themselves from the btree.
> + *
> + * Returns 0 on success, negative error code on failure.
> + */
> +int btrfs_drop_verity_items(struct btrfs_inode *inode)
> +{
> +	int ret;
> +	struct inode *ino = &inode->vfs_inode;

'ino' is usually used for the inode number, so this is a bit confusing.

> +
> +	truncate_inode_pages(ino->i_mapping, ino->i_size);
> +	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
> +	if (ret)
> +		return ret;
> +	return drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
> +}
> +
> +/*
> + * fsverity op that begins enabling verity.
> + * fsverity calls this to ask us to setup the inode for enabling.  We
> + * drop any existing verity items and set the in progress bit.

Please rephrase it so it says something like "Begin enabling verity on
an inode. We drop ... "

> + */
> +static int btrfs_begin_enable_verity(struct file *filp)
> +{
> +	struct inode *inode = file_inode(filp);

Please replace this with struct btrfs_inode * inode = ... and don't do
the BTRFS_I conversion in the rest of the function.

> +	int ret;
> +
> +	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags))
> +		return -EBUSY;
> +
> +	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);

So the test and set are separate; can this race? No, as this is called
under the inode lock, but this needs a trip to the fsverity sources to be
sure. I'd suggest adding at least an inode lock assertion, or a comment,
but a comment is weaker than a runtime check.
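As an aside, the gap between the check and the set can also be closed by
doing both in one atomic step (the kernel has test_and_set_bit() for
that). This is a different technique from the lock assertion suggested
above. Below is a userspace model using C11 atomics; the variable
runtime_flags, the bit number and the function names are stand-ins
invented for this sketch.

```c
#include <errno.h>
#include <stdatomic.h>

#define VERITY_IN_PROGRESS_BIT 0	/* hypothetical bit number */

/* Stand-in for btrfs_inode->runtime_flags in this userspace model. */
static atomic_ulong runtime_flags;

/*
 * Set the in-progress bit and learn whether it was already set, in a
 * single atomic read-modify-write, so nothing can run between the check
 * and the set.
 */
static int begin_enable(void)
{
	unsigned long mask = 1UL << VERITY_IN_PROGRESS_BIT;
	unsigned long old = atomic_fetch_or(&runtime_flags, mask);

	return (old & mask) ? -EBUSY : 0;
}

/* Clear the in-progress bit again. */
static void end_enable(void)
{
	atomic_fetch_and(&runtime_flags, ~(1UL << VERITY_IN_PROGRESS_BIT));
}
```

A second begin_enable() before end_enable() returns -EBUSY, matching the
behaviour of the quoted code.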

> +	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
> +	if (ret)
> +		goto err;
> +
> +	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +
> +err:
> +	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> +	return ret;
> +

Extra newline

> +}
> +
> +/*
> + * fsverity op that ends enabling verity.
> + * fsverity calls this when it's done with all of the pages in the file
> + * and all of the merkle items have been inserted.  We write the
> + * descriptor and update the inode in the btree to reflect its new life
> + * as a verity file.

Please rephrase

> + */
> +static int btrfs_end_enable_verity(struct file *filp, const void *desc,
> +				  size_t desc_size, u64 merkle_tree_size)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct inode *inode = file_inode(filp);

Same as above, replace by btrfs inode and drop BTRFS_I below

> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_verity_descriptor_item item;
> +	int ret;
> +
> +	if (desc != NULL) {
> +		/* write out the descriptor item */
> +		memset(&item, 0, sizeof(item));
> +		btrfs_set_stack_verity_descriptor_size(&item, desc_size);
> +		ret = write_key_bytes(BTRFS_I(inode),
> +				      BTRFS_VERITY_DESC_ITEM_KEY, 0,
> +				      (const char *)&item, sizeof(item));
> +		if (ret)
> +			goto out;
> +		/* write out the descriptor itself */
> +		ret = write_key_bytes(BTRFS_I(inode),
> +				      BTRFS_VERITY_DESC_ITEM_KEY, 1,
> +				      desc, desc_size);
> +		if (ret)
> +			goto out;
> +
> +		/* update our inode flags to include fs verity */
> +		trans = btrfs_start_transaction(root, 1);
> +		if (IS_ERR(trans)) {
> +			ret = PTR_ERR(trans);
> +			goto out;
> +		}
> +		BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
> +		btrfs_sync_inode_flags_to_i_flags(inode);
> +		ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
> +		btrfs_end_transaction(trans);
> +	}
> +
> +out:
> +	if (desc == NULL || ret) {
> +		/* If we failed, drop all the verity items */
> +		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
> +		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
> +	} else

	} else {

> +		btrfs_set_fs_compat_ro(root->fs_info, VERITY);

	}

> +	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> +	return ret;
> +}
> +
> +/*
> + * fsverity op that gets the struct fsverity_descriptor.
> + * fsverity does a two pass setup for reading the descriptor, in the first pass
> + * it calls with buf_size = 0 to query the size of the descriptor,
> + * and then in the second pass it actually reads the descriptor off
> + * disk.
> + */
> +static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
> +				       size_t buf_size)
> +{
> +	u64 true_size;
> +	ssize_t ret = 0;
> +	struct btrfs_verity_descriptor_item item;
> +
> +	memset(&item, 0, sizeof(item));
> +	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
> +			     0, (char *)&item, sizeof(item), NULL);

Given that read_key_bytes does not need to return ssize_t, you can
switch ret to int here, so the function return type actually matches what
you return.

> +	if (ret < 0)
> +		return ret;

eg. here

> +
> +	if (item.reserved[0] != 0 || item.reserved[1] != 0)
> +		return -EUCLEAN;
> +
> +	true_size = btrfs_stack_verity_descriptor_size(&item);
> +	if (true_size > INT_MAX)
> +		return -EUCLEAN;
> +
> +	if (!buf_size)
> +		return true_size;
> +	if (buf_size < true_size)
> +		return -ERANGE;
> +
> +	ret = read_key_bytes(BTRFS_I(inode),
> +			     BTRFS_VERITY_DESC_ITEM_KEY, 1,
> +			     buf, buf_size, NULL);
> +	if (ret < 0)
> +		return ret;
> +	if (ret != true_size)
> +		return -EIO;
> +
> +	return true_size;
> +}
> +
> +/*
> + * fsverity op that reads and caches a merkle tree page.  These are stored
> + * in the btree, but we cache them in the inode's address space after EOF.
> + */
> +static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
> +					       pgoff_t index,
> +					       unsigned long num_ra_pages)
> +{
> +	struct page *p;

Please don't use single letter variables

> +	u64 off = index << PAGE_SHIFT;

pgoff_t is unsigned long, so the shift will trim the high bits; you may
want to use the page_offset helper instead.
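A small userspace model of the problem, assuming a 32-bit build where
unsigned long (and so pgoff_t) is 32 bits wide and pages are 4K. The type
pgoff32_t and both function names are made up for the example; the
page_offset helper itself is not modeled here.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4K pages, assumed for this sketch */

/* Models pgoff_t on a 32-bit build, where unsigned long is 32 bits. */
typedef uint32_t pgoff32_t;

/* The shift is done in 32 bits, so the high bits are lost before widening. */
static uint64_t offset_narrow(pgoff32_t index)
{
	return index << PAGE_SHIFT;
}

/* Widen first, then shift: the full 64-bit byte offset survives. */
static uint64_t offset_wide(pgoff32_t index)
{
	return (uint64_t)index << PAGE_SHIFT;
}
```

With index = 1 << 20, offset_narrow() yields 0 while offset_wide() yields
1 << 32.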

> +	loff_t merkle_pos = merkle_file_pos(inode);

u64, that should work with comparison to loff_t

> +	ssize_t ret;
> +	int err;
> +
> +	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
> +		return ERR_PTR(-EFBIG);
> +	index += merkle_pos >> PAGE_SHIFT;
> +again:
> +	p = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
> +	if (p) {
> +		if (PageUptodate(p))
> +			return p;
> +
> +		lock_page(p);
> +		/*
> +		 * we only insert uptodate pages, so !Uptodate has to be
> +		 * an error
> +		 */
> +		if (!PageUptodate(p)) {
> +			unlock_page(p);
> +			put_page(p);
> +			return ERR_PTR(-EIO);
> +		}
> +		unlock_page(p);
> +		return p;
> +	}
> +
> +	p = page_cache_alloc(inode->i_mapping);

So this performs an allocation with GFP flags from the inode mapping.
I'm not sure if this is safe, eg. in add_ra_bio_pages we do

	page = __page_cache_alloc(mapping_gfp_constraint(mapping,
							 ~__GFP_FS));

to emulate GFP_NOFS. Either that or do the scoped nofs with
memalloc_nofs_save/_restore.

> +	if (!p)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * merkle item keys are indexed from byte 0 in the merkle tree.
> +	 * they have the form:
> +	 *
> +	 * [ inode objectid, BTRFS_MERKLE_ITEM_KEY, offset in bytes ]
> +	 */
> +	ret = read_key_bytes(BTRFS_I(inode),
> +			     BTRFS_VERITY_MERKLE_ITEM_KEY, off,
> +			     page_address(p), PAGE_SIZE, p);
> +	if (ret < 0) {
> +		put_page(p);
> +		return ERR_PTR(ret);
> +	}
> +
> +	/* zero fill any bytes we didn't write into the page */
> +	if (ret < PAGE_SIZE) {
> +		char *kaddr = kmap_atomic(p);
> +
> +		memset(kaddr + ret, 0, PAGE_SIZE - ret);
> +		kunmap_atomic(kaddr);

There's a helper, memzero_page, wrapping the kmap.

> +	}
> +	SetPageUptodate(p);
> +	err = add_to_page_cache_lru(p, inode->i_mapping, index,

Please drop err and use ret

> +				    mapping_gfp_mask(inode->i_mapping));
> +
> +	if (!err) {
> +		/* inserted and ready for fsverity */
> +		unlock_page(p);
> +	} else {
> +		put_page(p);
> +		/* did someone race us into inserting this page? */
> +		if (err == -EEXIST)
> +			goto again;
> +		p = ERR_PTR(err);
> +	}
> +	return p;
> +}
> +
> +/*
> + * fsverity op that writes a merkle tree block into the btree in 1k chunks.

Should it say "in 2^log_blocksize chunks" instead?
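For reference, a userspace sketch of how one block of 2^log_blocksize
bytes ends up as several items when the writer caps each item at 1K, as
write_key_bytes above does. The function name and the ITEM_CHUNK constant
are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define ITEM_CHUNK 1024	/* per-item cap, mirroring write_key_bytes() */

/*
 * Fill offsets[] with the key offsets of the items that one Merkle tree
 * block is split into, and return the number of items (at most max).
 */
static size_t merkle_block_item_offsets(uint64_t index, int log_blocksize,
					uint64_t *offsets, size_t max)
{
	uint64_t off = index << log_blocksize;
	uint64_t len = 1ULL << log_blocksize;
	size_t n = 0;

	while (len > 0 && n < max) {
		uint64_t chunk = len < ITEM_CHUNK ? len : ITEM_CHUNK;

		offsets[n++] = off;
		off += chunk;
		len -= chunk;
	}
	return n;
}
```

For index 2 and log_blocksize 12 this yields four items at offsets 8192,
9216, 10240 and 11264, so the block arrives as one 2^log_blocksize unit
but is stored in 1K pieces.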

> + */
> +static int btrfs_write_merkle_tree_block(struct inode *inode, const void *buf,
> +					u64 index, int log_blocksize)
> +{
> +	u64 off = index << log_blocksize;
> +	u64 len = 1 << log_blocksize;
> +
> +	if (merkle_file_pos(inode) > inode->i_sb->s_maxbytes - off - len)
> +		return -EFBIG;
> +
> +	return write_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY,
> +			       off, buf, len);
> +}
> +
> +const struct fsverity_operations btrfs_verityops = {
> +	.begin_enable_verity	= btrfs_begin_enable_verity,
> +	.end_enable_verity	= btrfs_end_enable_verity,
> +	.get_verity_descriptor	= btrfs_get_verity_descriptor,
> +	.read_merkle_tree_page	= btrfs_read_merkle_tree_page,
> +	.write_merkle_tree_block = btrfs_write_merkle_tree_block,
> +};
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 5df73001aad4..fa21c8aac78d 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -288,6 +288,7 @@ struct btrfs_ioctl_fs_info_args {
>   * first mount when booting older kernel versions.
>   */
>  #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
> +#define BTRFS_FEATURE_COMPAT_RO_VERITY		(1ULL << 2)
>  
>  #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
>  #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
> @@ -308,7 +309,6 @@ struct btrfs_ioctl_fs_info_args {
>  #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
>  #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
>  #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
> -

Keep the newline please

>  struct btrfs_ioctl_feature_flags {
>  	__u64 compat_flags;
>  	__u64 compat_ro_flags;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index ae25280316bd..2be57416f886 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -118,6 +118,14 @@
>  #define BTRFS_INODE_REF_KEY		12
>  #define BTRFS_INODE_EXTREF_KEY		13
>  #define BTRFS_XATTR_ITEM_KEY		24
> +
> +/*
> + * fsverity has a descriptor per file, and then
> + * a number of sha or csum items indexed by offset in to the file.
> + */
> +#define BTRFS_VERITY_DESC_ITEM_KEY	36
> +#define BTRFS_VERITY_MERKLE_ITEM_KEY	37
> +
>  #define BTRFS_ORPHAN_ITEM_KEY		48
>  /* reserve 2-15 close to the inode for later flexibility */
>  
> @@ -996,4 +1004,11 @@ struct btrfs_qgroup_limit_item {
>  	__le64 rsv_excl;
>  } __attribute__ ((__packed__));
>  
> +struct btrfs_verity_descriptor_item {
> +	/* size of the verity descriptor in bytes */
> +	__le64 size;
> +	__le64 reserved[2];

Is the reserved space "just in case" or are there plans to use it? For
items the extension and compatibility can be done by checking the item
size, without further flags or bits set to distinguish that.

If the extension happens rarely it's manageable to do the size check
instead of reserving the space.

The reserved space must otherwise be zero if not used; this serves as
the way to check compatibility. It may still need additional code to
make sure an old kernel recognizes unknown contents and eg. refuses to
work. I can imagine that in the context of verity this could be significant.
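A userspace sketch of the two checks discussed here: reject an item whose
size is not the one this reader knows, and reject non-zero reserved
fields. The struct mirrors the layout quoted above; the function name and
the strict size comparison are assumptions for illustration, not code
from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, defined here for portability */
#endif

/* Userspace mirror of the on-disk item layout quoted above. */
struct verity_desc_item {
	uint64_t size;
	uint64_t reserved[2];
	uint8_t encryption;
} __attribute__((__packed__));

/*
 * Validate a descriptor item read from disk.
 * Returns 0 if usable, -EUCLEAN for anything this reader does not
 * understand.
 */
static int check_desc_item(const void *buf, size_t item_size)
{
	struct verity_desc_item item;

	/* an unexpected size means a layout this reader does not know */
	if (item_size != sizeof(item))
		return -EUCLEAN;

	memcpy(&item, buf, sizeof(item));

	/* reserved fields must stay zero until a format defines them */
	if (item.reserved[0] != 0 || item.reserved[1] != 0)
		return -EUCLEAN;

	return 0;
}
```

With a size check like the first one, a later format could extend the
item without needing the reserved fields at all.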

> +	__u8 encryption;
> +} __attribute__ ((__packed__));
> +
>  #endif /* _BTRFS_CTREE_H_ */




* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-11 20:31   ` David Sterba
@ 2021-05-11 21:52     ` Boris Burkov
  2021-05-12 17:10       ` David Sterba
  2021-05-13 19:19     ` Boris Burkov
  1 sibling, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2021-05-11 21:52 UTC (permalink / raw)
  To: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Tue, May 11, 2021 at 10:31:43PM +0200, David Sterba wrote:
> On Wed, May 05, 2021 at 12:20:40PM -0700, Boris Burkov wrote:
> > From: Chris Mason <clm@fb.com>
> > 
> > Add support for fsverity in btrfs. To support the generic interface in
> > fs/verity, we add two new item types in the fs tree for inodes with
> > verity enabled. One stores the per-file verity descriptor and the other
> > stores the Merkle tree data itself.
> > 
> > Verity checking is done at the end of IOs to ensure each page is checked
> > before it is marked uptodate.
> > 
> > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > re-walking up shared paths in the tree. For this reason, we need to
> > cache the Merkle tree data.
> 
> What's the estimated size of the Merkle tree data? Does the whole tree
> need to be kept cached or is it only for data that are in page cache?

With the default of SHA256 and 4K blocks, we have 32 byte digests, which
fits 128 digests per block, so the Merkle tree will be almost exactly
1/127 of the size of the file.
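The 1/127 figure is the sum 1/128 + 1/128^2 + ..., one term per tree
level. As a rough check, here is a userspace sketch that counts the tree
blocks level by level. It assumes every tree block is block_size bytes
and that levels are added until a single block remains; the function name
is made up for this example.

```c
#include <stdint.h>

/*
 * Estimate the total Merkle tree size in bytes for a file.
 * Hypothetical helper: each tree block holds block_size / digest_size
 * digests, and levels are built until one block is left.
 */
static uint64_t merkle_tree_bytes(uint64_t file_size, uint32_t block_size,
				  uint32_t digest_size)
{
	uint64_t hashes_per_block = block_size / digest_size;
	/* number of blocks at the level currently being hashed */
	uint64_t blocks = (file_size + block_size - 1) / block_size;
	uint64_t total_blocks = 0;

	while (blocks > 1) {
		/* each tree block covers hashes_per_block blocks below it */
		blocks = (blocks + hashes_per_block - 1) / hashes_per_block;
		total_blocks += blocks;
	}
	return total_blocks * block_size;
}
```

For a 1 GiB file with 4K blocks and 32 byte digests this gives 2065 tree
blocks (2048 + 16 + 1), roughly 8 MiB, which is close to 1 GiB / 127.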

As far as I know, there is no special requirement that the Merkle tree
data stays cached. If a Merkle tree block is evicted, and then a data
block is evicted and re-read, we would need to read the Merkle tree block
again, and possibly blocks further up the path to the root, until we
reach a cached block with PageChecked set.

> 
> > Since the file is immutable after verity is
> > turned on, we can cache it at an index past EOF.
> > 
> > Use the new inode compat_flags to store verity on the inode item, so
> > that we can enable verity on a file, then rollback to an older kernel
> > and still mount the file system and read the file. Since we can't safely
> > write the file anymore without ruining the invariants of the Merkle
> > tree, we mark a ro_compat flag on the file system when a file has verity
> > enabled.
> > 
> > Signed-off-by: Chris Mason <clm@fb.com>
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/Makefile               |   1 +
> >  fs/btrfs/btrfs_inode.h          |   1 +
> >  fs/btrfs/ctree.h                |  30 +-
> >  fs/btrfs/extent_io.c            |  27 +-
> >  fs/btrfs/file.c                 |   6 +
> >  fs/btrfs/inode.c                |   7 +
> >  fs/btrfs/ioctl.c                |  14 +-
> >  fs/btrfs/super.c                |   3 +
> >  fs/btrfs/sysfs.c                |   6 +
> >  fs/btrfs/verity.c               | 617 ++++++++++++++++++++++++++++++++
> >  include/uapi/linux/btrfs.h      |   2 +-
> >  include/uapi/linux/btrfs_tree.h |  15 +
> >  12 files changed, 718 insertions(+), 11 deletions(-)
> >  create mode 100644 fs/btrfs/verity.c
> > 
> > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> > index cec88a66bd6c..3dcf9bcc2326 100644
> > --- a/fs/btrfs/Makefile
> > +++ b/fs/btrfs/Makefile
> > @@ -36,6 +36,7 @@ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> >  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> >  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> >  btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
> > +btrfs-$(CONFIG_FS_VERITY) += verity.o
> >  
> >  btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
> >  	tests/extent-buffer-tests.o tests/btrfs-tests.o \
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index e8dbc8e848ce..4536548b9e79 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -51,6 +51,7 @@ enum {
> >  	 * the file range, inode's io_tree).
> >  	 */
> >  	BTRFS_INODE_NO_DELALLOC_FLUSH,
> > +	BTRFS_INODE_VERITY_IN_PROGRESS,
> 
> Please add a comment
> 
> >  };
> >  
> >  /* in memory btrfs inode */
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 0546273a520b..c5aab6a639ef 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -279,9 +279,10 @@ struct btrfs_super_block {
> >  #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
> >  #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
> >  
> > -#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
> > -	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
> > -	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
> > +#define BTRFS_FEATURE_COMPAT_RO_SUPP				\
> > +	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |		\
> > +	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID |	\
> > +	 BTRFS_FEATURE_COMPAT_RO_VERITY)
> >  
> >  #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
> >  #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
> > @@ -1505,6 +1506,11 @@ do {                                                                   \
> >  	 BTRFS_INODE_COMPRESS |						\
> >  	 BTRFS_INODE_ROOT_ITEM_INIT)
> >  
> > +/*
> > + * Inode compat flags
> > + */
> > +#define BTRFS_INODE_VERITY		(1 << 0)
> > +
> >  struct btrfs_map_token {
> >  	struct extent_buffer *eb;
> >  	char *kaddr;
> > @@ -3766,6 +3772,24 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
> >  	return signal_pending(current);
> >  }
> >  
> > +/* verity.c */
> > +#ifdef CONFIG_FS_VERITY
> > +extern const struct fsverity_operations btrfs_verityops;
> > +int btrfs_drop_verity_items(struct btrfs_inode *inode);
> > +BTRFS_SETGET_FUNCS(verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
> > +		   encryption, 8);
> > +BTRFS_SETGET_FUNCS(verity_descriptor_size, struct btrfs_verity_descriptor_item, size, 64);
> > +BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
> > +			 encryption, 8);
> > +BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size, struct btrfs_verity_descriptor_item,
> > +			 size, 64);
> > +#else
> > +static inline int btrfs_drop_verity_items(struct btrfs_inode *inode)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  /* Sanity test specific functions */
> >  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> >  void btrfs_test_destroy_inode(struct inode *inode);
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 4fb33cadc41a..d1f57a4ad2fb 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/pagevec.h>
> >  #include <linux/prefetch.h>
> >  #include <linux/cleancache.h>
> > +#include <linux/fsverity.h>
> >  #include "misc.h"
> >  #include "extent_io.h"
> >  #include "extent-io-tree.h"
> > @@ -2862,15 +2863,28 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
> >  	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
> >  }
> >  
> > -static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> > +static int end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> >  {
> > -	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> > +	int ret = 0;
> > +	struct inode *inode = page->mapping->host;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  
> >  	ASSERT(page_offset(page) <= start &&
> >  		start + len <= page_offset(page) + PAGE_SIZE);
> >  
> >  	if (uptodate) {
> > -		btrfs_page_set_uptodate(fs_info, page, start, len);
> > +		/*
> > +		 * buffered reads of a file with page alignment will issue a
> > +		 * 0 length read for one page past the end of file, so we must
> > +		 * explicitly skip checking verity on that page of zeros.
> > +		 */
> > +		if (!PageError(page) && !PageUptodate(page) &&
> > +		    start < i_size_read(inode) &&
> > +		    fsverity_active(inode) &&
> > +		    !fsverity_verify_page(page))
> > +			ret = -EIO;
> > +		else
> > +			btrfs_page_set_uptodate(fs_info, page, start, len);
> >  	} else {
> >  		btrfs_page_clear_uptodate(fs_info, page, start, len);
> >  		btrfs_page_set_error(fs_info, page, start, len);
> > @@ -2878,12 +2892,13 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> >  
> >  	if (fs_info->sectorsize == PAGE_SIZE)
> >  		unlock_page(page);
> > -	else if (is_data_inode(page->mapping->host))
> > +	else if (is_data_inode(inode))
> >  		/*
> >  		 * For subpage data, unlock the page if we're the last reader.
> >  		 * For subpage metadata, page lock is not utilized for read.
> >  		 */
> >  		btrfs_subpage_end_reader(fs_info, page, start, len);
> > +	return ret;
> >  }
> >  
> >  /*
> > @@ -3059,7 +3074,9 @@ static void end_bio_extent_readpage(struct bio *bio)
> >  		bio_offset += len;
> >  
> >  		/* Update page status and unlock */
> > -		end_page_read(page, uptodate, start, len);
> > +		ret = end_page_read(page, uptodate, start, len);
> > +		if (ret)
> > +			uptodate = 0;
> >  		endio_readpage_release_extent(&processed, BTRFS_I(inode),
> >  					      start, end, uptodate);
> >  	}
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 3b10d98b4ebb..a99470303bd9 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/btrfs.h>
> >  #include <linux/uio.h>
> >  #include <linux/iversion.h>
> > +#include <linux/fsverity.h>
> >  #include "ctree.h"
> >  #include "disk-io.h"
> >  #include "transaction.h"
> > @@ -3593,7 +3594,12 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
> >  
> >  static int btrfs_file_open(struct inode *inode, struct file *filp)
> >  {
> > +	int ret;
> 
> Missing newline

Weird, I ran checkpatch so many times.. My bad.

> 
> >  	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
> > +
> > +	ret = fsverity_file_open(inode, filp);
> > +	if (ret)
> > +		return ret;
> >  	return generic_file_open(inode, filp);
> >  }
> >  
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index d89000577f7f..1b1101369777 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/sched/mm.h>
> >  #include <linux/iomap.h>
> >  #include <asm/unaligned.h>
> > +#include <linux/fsverity.h>
> >  #include "misc.h"
> >  #include "ctree.h"
> >  #include "disk-io.h"
> > @@ -5405,7 +5406,9 @@ void btrfs_evict_inode(struct inode *inode)
> >  
> >  	trace_btrfs_inode_evict(inode);
> >  
> > +
> 
> Extra newline
> 
> >  	if (!root) {
> > +		fsverity_cleanup_inode(inode);
> >  		clear_inode(inode);
> >  		return;
> >  	}
> > @@ -5488,6 +5491,7 @@ void btrfs_evict_inode(struct inode *inode)
> >  	 * to retry these periodically in the future.
> >  	 */
> >  	btrfs_remove_delayed_node(BTRFS_I(inode));
> > +	fsverity_cleanup_inode(inode);
> >  	clear_inode(inode);
> >  }
> >  
> > @@ -9041,6 +9045,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
> >  	struct inode *inode = d_inode(path->dentry);
> >  	u32 blocksize = inode->i_sb->s_blocksize;
> >  	u32 bi_flags = BTRFS_I(inode)->flags;
> > +	u32 bi_compat_flags = BTRFS_I(inode)->compat_flags;
> >  
> >  	stat->result_mask |= STATX_BTIME;
> >  	stat->btime.tv_sec = BTRFS_I(inode)->i_otime.tv_sec;
> > @@ -9053,6 +9058,8 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
> >  		stat->attributes |= STATX_ATTR_IMMUTABLE;
> >  	if (bi_flags & BTRFS_INODE_NODUMP)
> >  		stat->attributes |= STATX_ATTR_NODUMP;
> > +	if (bi_compat_flags & BTRFS_INODE_VERITY)
> > +		stat->attributes |= STATX_ATTR_VERITY;
> >  
> >  	stat->attributes_mask |= (STATX_ATTR_APPEND |
> >  				  STATX_ATTR_COMPRESSED |
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index ff335c192170..4b8f38fe4226 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/btrfs.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/iversion.h>
> > +#include <linux/fsverity.h>
> >  #include "ctree.h"
> >  #include "disk-io.h"
> >  #include "export.h"
> > @@ -105,6 +106,7 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
> >  static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
> >  {
> >  	unsigned int flags = binode->flags;
> > +	unsigned int compat_flags = binode->compat_flags;
> >  	unsigned int iflags = 0;
> >  
> >  	if (flags & BTRFS_INODE_SYNC)
> > @@ -121,6 +123,8 @@ static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
> >  		iflags |= FS_DIRSYNC_FL;
> >  	if (flags & BTRFS_INODE_NODATACOW)
> >  		iflags |= FS_NOCOW_FL;
> > +	if (compat_flags & BTRFS_INODE_VERITY)
> > +		iflags |= FS_VERITY_FL;
> >  
> >  	if (flags & BTRFS_INODE_NOCOMPRESS)
> >  		iflags |= FS_NOCOMP_FL;
> > @@ -148,10 +152,12 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
> >  		new_fl |= S_NOATIME;
> >  	if (binode->flags & BTRFS_INODE_DIRSYNC)
> >  		new_fl |= S_DIRSYNC;
> > +	if (binode->compat_flags & BTRFS_INODE_VERITY)
> > +		new_fl |= S_VERITY;
> >  
> >  	set_mask_bits(&inode->i_flags,
> > -		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> > -		      new_fl);
> > +		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
> > +		      S_VERITY, new_fl);
> >  }
> >  
> >  static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
> > @@ -5072,6 +5078,10 @@ long btrfs_ioctl(struct file *file, unsigned int
> >  		return btrfs_ioctl_get_subvol_rootref(file, argp);
> >  	case BTRFS_IOC_INO_LOOKUP_USER:
> >  		return btrfs_ioctl_ino_lookup_user(file, argp);
> > +	case FS_IOC_ENABLE_VERITY:
> > +		return fsverity_ioctl_enable(file, (const void __user *)argp);
> > +	case FS_IOC_MEASURE_VERITY:
> > +		return fsverity_ioctl_measure(file, argp);
> >  	}
> >  
> >  	return -ENOTTY;
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 4a396c1147f1..aa41ee30e3ca 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -1365,6 +1365,9 @@ static int btrfs_fill_super(struct super_block *sb,
> >  	sb->s_op = &btrfs_super_ops;
> >  	sb->s_d_op = &btrfs_dentry_operations;
> >  	sb->s_export_op = &btrfs_export_ops;
> > +#ifdef CONFIG_FS_VERITY
> > +	sb->s_vop = &btrfs_verityops;
> > +#endif
> >  	sb->s_xattr = btrfs_xattr_handlers;
> >  	sb->s_time_gran = 1;
> >  #ifdef CONFIG_BTRFS_FS_POSIX_ACL
> > diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> > index 436ac7b4b334..331ea4febcb1 100644
> > --- a/fs/btrfs/sysfs.c
> > +++ b/fs/btrfs/sysfs.c
> > @@ -267,6 +267,9 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
> >  #ifdef CONFIG_BTRFS_DEBUG
> >  BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
> >  #endif
> > +#ifdef CONFIG_FS_VERITY
> > +BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
> > +#endif
> >  
> >  static struct attribute *btrfs_supported_feature_attrs[] = {
> >  	BTRFS_FEAT_ATTR_PTR(mixed_backref),
> > @@ -284,6 +287,9 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
> >  	BTRFS_FEAT_ATTR_PTR(raid1c34),
> >  #ifdef CONFIG_BTRFS_DEBUG
> >  	BTRFS_FEAT_ATTR_PTR(zoned),
> > +#endif
> > +#ifdef CONFIG_FS_VERITY
> > +	BTRFS_FEAT_ATTR_PTR(verity),
> >  #endif
> >  	NULL
> >  };
> > diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
> > new file mode 100644
> > index 000000000000..feaf5908b3d3
> > --- /dev/null
> > +++ b/fs/btrfs/verity.c
> > @@ -0,0 +1,617 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2020 Facebook.  All rights reserved.
> > + */
> 
> This is not necessary since we have the SPDX tags,
> https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Copyright_notices_in_files.2C_SPDX
> 
> > +
> > +#include <linux/init.h>
> > +#include <linux/fs.h>
> > +#include <linux/slab.h>
> > +#include <linux/rwsem.h>
> > +#include <linux/xattr.h>
> > +#include <linux/security.h>
> > +#include <linux/posix_acl_xattr.h>
> > +#include <linux/iversion.h>
> > +#include <linux/fsverity.h>
> > +#include <linux/sched/mm.h>
> > +#include "ctree.h"
> > +#include "btrfs_inode.h"
> > +#include "transaction.h"
> > +#include "disk-io.h"
> > +#include "locking.h"
> > +
> > +/*
> > + * Just like ext4, we cache the merkle tree in pages after EOF in the page
> > + * cache.  Unlike ext4, we're storing these in dedicated btree items and
> > + * not just shoving them after EOF in the file.  This means we'll need to
> > + * do extra work to encrypt them once encryption is supported in btrfs,
> > + * but btrfs has a lot of careful code around i_size and it seems better
> > + * to make a new key type than try and adjust all of our expectations
> > + * for i_size.
> 
> Can you please rephrase that so it does not start with what other
> filesystems do but what is the actual design and put references to ext4
> eventually?
> 
> > + *
> > + * fs verity items are stored under two different key types on disk.
> > + *
> > + * The descriptor items:
> > + * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]
> 
> Please put that to the key definitions

Do you mean to move this whole comment to btrfs_tree.h?

> 
> > + *
> > + * At offset 0, we store a btrfs_verity_descriptor_item which tracks the
> > + * size of the descriptor item and some extra data for encryption.
> > + * Starting at offset 1, these hold the generic fs verity descriptor.
> > + * These are opaque to btrfs, we just read and write them as a blob for
> > + * the higher level verity code.  The most common size for this is 256 bytes.
> > + *
> > + * The merkle tree items:
> > + * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
> > + *
> > + * These also start at offset 0, and correspond to the merkle tree bytes.
> > + * So when fsverity asks for page 0 of the merkle tree, we pull up one page
> > + * starting at offset 0 for this key type.  These are also opaque to btrfs,
> > + * we're blindly storing whatever fsverity sends down.
> > + */
> > +
> > +/*
> > + * Compute the logical file offset where we cache the Merkle tree.
> > + *
> > + * @inode: the inode of the verity file
> > + *
> > + * For the purposes of caching the Merkle tree pages, as required by
> > + * fs-verity, it is convenient to do size computations in terms of a file
> > + * offset, rather than in terms of page indices.
> > + *
> > + * Returns the file offset on success, negative error code on failure.
> > + */
> > +static loff_t merkle_file_pos(const struct inode *inode)
> > +{
> > +	u64 sz = inode->i_size;
> > +	u64 ret = round_up(sz, 65536);
> 
> What's the reason for the extra variable sz? If that is meant to make
> sure the whole u64 is read consistently, then it needs protection and
> i_size_read() if the status of the inode lock and the context of the call
> are unknown. The compiler will happily merge that to round_up(inode->i_size).

This was the result of getting a bit lazy reading assembly. My intent
was to ensure that we don't overflow the round_up, which is a macro that
depends on the type of the input. I was messing around figuring out what
effect casting had on it but gave up and just put it in a u64 before
calling it.

> 
> Next, what's the meaning of the constant 65536?
> 

It's arbitrary, and copied from ext4. I _believe_ the idea behind it is
that it should be a fixed constant to avoid making the page size change
the maximum file size subtly, but should be big enough to be a fresh
page truly past the end of the file pages on a 64K page size system.

> > +
> > +	if (ret > inode->i_sb->s_maxbytes)
> > +		return -EFBIG;
> > +	return ret;
> 
> ret is u64 so the function should also return u64

This was intentional as we do want an loff_t (long long) returned, but
use the u64 for the overflow checking above.

> 
> > +}
> > +
> > +/*
> > + * Drop all the items for this inode with this key_type.
> 
> Newline
> 
> > + * @inode: The inode to drop items for
> > + * @key_type: The type of items to drop (VERITY_DESC_ITEM or
> > + *            VERITY_MERKLE_ITEM)
> 
> Please format the arguments according to the description in
> https://btrfs.wiki.kernel.org/index.php/Development_notes#Comments
> 
> > + *
> > + * Before doing a verity enable we cleanup any existing verity items.
> > + *
> > + * This is also used to clean up if a verity enable failed half way
> > + * through.
> > + *
> > + * Returns 0 on success, negative error code on failure.
> > + */
> > +static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_root *root = inode->root;
> > +	struct btrfs_path *path;
> > +	struct btrfs_key key;
> > +	int ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	while (1) {
> > +		trans = btrfs_start_transaction(root, 1);
> 
> Transaction start should document what are the reserved items, ie. what
> is the 1 related to.
> 
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			goto out;
> > +		}
> > +
> > +		/*
> > +		 * walk backwards through all the items until we find one
> 
> Comments should start with uppercase unless it's an identifier name.
> This is in many other places so please update them as well.
> 
> > +		 * that isn't from our key type or objectid
> > +		 */
> > +		key.objectid = btrfs_ino(inode);
> > +		key.offset = (u64)-1;
> > +		key.type = key_type;
> 
> It's common to sort the members as they go in order so
> objectid/type/offset, this helps to keep the idea of the key.
> 
> > +
> > +		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
> > +		if (ret > 0) {
> > +			ret = 0;
> > +			/* no more keys of this type, we're done */
> > +			if (path->slots[0] == 0)
> > +				break;
> > +			path->slots[0]--;
> > +		} else if (ret < 0) {
> > +			break;
> > +		}
> > +
> > +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > +
> > +		/* no more keys of this type, we're done */
> > +		if (key.objectid != btrfs_ino(inode) || key.type != key_type)
> > +			break;
> > +
> > +		/*
> > +		 * this shouldn't be a performance sensitive function because
> > +		 * it's not used as part of truncate.  If it ever becomes
> > +		 * perf sensitive, change this to walk forward and bulk delete
> > +		 * items
> > +		 */
> > +		ret = btrfs_del_items(trans, root, path,
> > +				      path->slots[0], 1);
> 
> This will probably fit on one line, no need to split the parameters.
> 
> > +		btrfs_release_path(path);
> > +		btrfs_end_transaction(trans);
> > +
> > +		if (ret)
> > +			goto out;
> > +	}
> > +
> > +	btrfs_end_transaction(trans);
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +
> > +}
> > +
> > +/*
> > + * Insert and write inode items with a given key type and offset.
> > + * @inode: The inode to insert for.
> > + * @key_type: The key type to insert.
> > + * @offset: The item offset to insert at.
> > + * @src: Source data to write.
> > + * @len: Length of source data to write.
> > + *
> > + * Write len bytes from src into items of up to 1k length.
> > + * The inserted items will have key <ino, key_type, offset + off> where
> > + * off is consecutively increasing from 0 up to the last item ending at
> > + * offset + len.
> > + *
> > + * Returns 0 on success and a negative error code on failure.
> > + */
> > +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> > +			   const char *src, u64 len)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_path *path;
> > +	struct btrfs_root *root = inode->root;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_key key;
> > +	u64 copied = 0;
> > +	unsigned long copy_bytes;
> > +	unsigned long src_offset = 0;
> > +	void *data;
> > +	int ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	while (len > 0) {
> > +		trans = btrfs_start_transaction(root, 1);
> 
> Same as before, please document what items are reserved
> 
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			break;
> > +		}
> > +
> > +		key.objectid = btrfs_ino(inode);
> > +		key.offset = offset;
> > +		key.type = key_type;
> 
> objectid/type/offset
> 
> > +
> > +		/*
> > +		 * insert 1K at a time mostly to be friendly for smaller
> > +		 * leaf size filesystems
> > +		 */
> > +		copy_bytes = min_t(u64, len, 1024);
> > +
> > +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> > +		if (ret) {
> > +			btrfs_end_transaction(trans);
> > +			break;
> > +		}
> > +
> > +		leaf = path->nodes[0];
> > +
> > +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> > +		write_extent_buffer(leaf, src + src_offset,
> > +				    (unsigned long)data, copy_bytes);
> > +		offset += copy_bytes;
> > +		src_offset += copy_bytes;
> > +		len -= copy_bytes;
> > +		copied += copy_bytes;
> > +
> > +		btrfs_release_path(path);
> > +		btrfs_end_transaction(trans);
> > +	}
> > +
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Read inode items of the given key type and offset from the btree.
> > + * @inode: The inode to read items of.
> > + * @key_type: The key type to read.
> > + * @offset: The item offset to read from.
> > + * @dest: The buffer to read into. This parameter has slightly tricky
> > + *        semantics.  If it is NULL, the function will not do any copying
> > + *        and will just return the size of all the items up to len bytes.
> > + *        If dest_page is passed, then the function will kmap_atomic the
> > + *        page and ignore dest, but it must still be non-NULL to avoid the
> > + *        counting-only behavior.
> > + * @len: Length in bytes to read.
> > + * @dest_page: Copy into this page instead of the dest buffer.
> > + *
> > + * Helper function to read items from the btree.  This returns the number
> > + * of bytes read or < 0 for errors.  We can return short reads if the
> > + * items don't exist on disk or aren't big enough to fill the desired length.
> > + *
> > + * Supports reading into a provided buffer (dest) or into the page cache
> > + *
> > + * Returns number of bytes read or a negative error code on failure.
> > + */
> > +static ssize_t read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> 
> Why does this return ssize_t? The type is not utilized anywhere in the
> function, an 'int' should work.
> 
> > +			  char *dest, u64 len, struct page *dest_page)
> > +{
> > +	struct btrfs_path *path;
> > +	struct btrfs_root *root = inode->root;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_key key;
> > +	u64 item_end;
> > +	u64 copy_end;
> > +	u64 copied = 0;
> 
> Here copied is u64
> 
> > +	u32 copy_offset;
> > +	unsigned long copy_bytes;
> > +	unsigned long dest_offset = 0;
> > +	void *data;
> > +	char *kaddr = dest;
> > +	int ret;
> 
> and ret is int
> 
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	if (dest_page)
> > +		path->reada = READA_FORWARD;
> > +
> > +	key.objectid = btrfs_ino(inode);
> > +	key.offset = offset;
> > +	key.type = key_type;
> > +
> > +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +	if (ret < 0) {
> > +		goto out;
> > +	} else if (ret > 0) {
> > +		ret = 0;
> > +		if (path->slots[0] == 0)
> > +			goto out;
> > +		path->slots[0]--;
> > +	}
> > +
> > +	while (len > 0) {
> > +		leaf = path->nodes[0];
> > +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +
> > +		if (key.objectid != btrfs_ino(inode) ||
> > +		    key.type != key_type)
> > +			break;
> > +
> > +		item_end = btrfs_item_size_nr(leaf, path->slots[0]) + key.offset;
> > +
> > +		if (copied > 0) {
> > +			/*
> > +			 * once we've copied something, we want all of the items
> > +			 * to be sequential
> > +			 */
> > +			if (key.offset != offset)
> > +				break;
> > +		} else {
> > +			/*
> > +			 * our initial offset might be in the middle of an
> > +			 * item.  Make sure it all makes sense
> > +			 */
> > +			if (key.offset > offset)
> > +				break;
> > +			if (item_end <= offset)
> > +				break;
> > +		}
> > +
> > +		/* desc = NULL to just sum all the item lengths */
> > +		if (!dest)
> > +			copy_end = item_end;
> > +		else
> > +			copy_end = min(offset + len, item_end);
> > +
> > +		/* number of bytes in this item we want to copy */
> > +		copy_bytes = copy_end - offset;
> > +
> > +		/* offset from the start of item for copying */
> > +		copy_offset = offset - key.offset;
> > +
> > +		if (dest) {
> > +			if (dest_page)
> > +				kaddr = kmap_atomic(dest_page);
> 
> I think the kmap_atomic should not be used, there was a patchset
> cleaning it up and replacing by kmap_local so we should not introduce
> new instances.
> 
> > +
> > +			data = btrfs_item_ptr(leaf, path->slots[0], void);
> > +			read_extent_buffer(leaf, kaddr + dest_offset,
> > +					   (unsigned long)data + copy_offset,
> > +					   copy_bytes);
> > +
> > +			if (dest_page)
> > +				kunmap_atomic(kaddr);
> > +		}
> > +
> > +		offset += copy_bytes;
> > +		dest_offset += copy_bytes;
> > +		len -= copy_bytes;
> > +		copied += copy_bytes;
> > +
> > +		path->slots[0]++;
> > +		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> > +			/*
> > +			 * we've reached the last slot in this leaf and we need
> > +			 * to go to the next leaf.
> > +			 */
> > +			ret = btrfs_next_leaf(root, path);
> > +			if (ret < 0) {
> > +				break;
> > +			} else if (ret > 0) {
> > +				ret = 0;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +out:
> > +	btrfs_free_path(path);
> > +	if (!ret)
> > +		ret = copied;
> > +	return ret;
> 
> In the end it's int and copied u64 is truncated to int.
> 
> > +}
> > +
> > +/*
> > + * Drop verity items from the btree and from the page cache
> > + *
> > + * @inode: the inode to drop items for
> > + *
> > + * If we fail partway through enabling verity, enable verity and have some
> > + * partial data extant, or cleanup orphaned verity data, we need to truncate it
>                    extent
> 
> > + * from the cache and delete the items themselves from the btree.
> > + *
> > + * Returns 0 on success, negative error code on failure.
> > + */
> > +int btrfs_drop_verity_items(struct btrfs_inode *inode)
> > +{
> > +	int ret;
> > +	struct inode *ino = &inode->vfs_inode;
> 
> 'ino' is usually used for inode number so this is a bit confusing,
> 
> > +
> > +	truncate_inode_pages(ino->i_mapping, ino->i_size);
> > +	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
> > +	if (ret)
> > +		return ret;
> > +	return drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
> > +}
> > +
> > +/*
> > + * fsverity op that begins enabling verity.
> > + * fsverity calls this to ask us to setup the inode for enabling.  We
> > + * drop any existing verity items and set the in progress bit.
> 
> Please rephrase it so it says something like "Begin enabling verity on
> an inode. We drop ... "
> 
> > + */
> > +static int btrfs_begin_enable_verity(struct file *filp)
> > +{
> > +	struct inode *inode = file_inode(filp);
> 
> Please replace this with struct btrfs_inode * inode = ... and don't do
> the BTRFS_I conversion in the rest of the function.
> 
> > +	int ret;
> > +
> > +	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags))
> > +		return -EBUSY;
> > +
> > +	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> 
> So the test and set are separate, can this race? No, as this is called
> under the inode lock but this needs a trip to fsverity sources to be
> sure. I'd suggest to put at least an inode lock assertion, or a comment
> but this is weaker than a runtime check.
> 
> > +	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
> > +	if (ret)
> > +		goto err;
> > +
> > +	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
> > +	if (ret)
> > +		goto err;
> > +
> > +	return 0;
> > +
> > +err:
> > +	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> > +	return ret;
> > +
> 
> Extra newline
> 
> > +}
> > +
> > +/*
> > + * fsverity op that ends enabling verity.
> > + * fsverity calls this when it's done with all of the pages in the file
> > + * and all of the merkle items have been inserted.  We write the
> > + * descriptor and update the inode in the btree to reflect its new life
> > + * as a verity file.
> 
> Please rephrase
> 
> > + */
> > +static int btrfs_end_enable_verity(struct file *filp, const void *desc,
> > +				  size_t desc_size, u64 merkle_tree_size)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct inode *inode = file_inode(filp);
> 
> Same as above, replace by btrfs inode and drop BTRFS_I below
> 
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > +	struct btrfs_verity_descriptor_item item;
> > +	int ret;
> > +
> > +	if (desc != NULL) {
> > +		/* write out the descriptor item */
> > +		memset(&item, 0, sizeof(item));
> > +		btrfs_set_stack_verity_descriptor_size(&item, desc_size);
> > +		ret = write_key_bytes(BTRFS_I(inode),
> > +				      BTRFS_VERITY_DESC_ITEM_KEY, 0,
> > +				      (const char *)&item, sizeof(item));
> > +		if (ret)
> > +			goto out;
> > +		/* write out the descriptor itself */
> > +		ret = write_key_bytes(BTRFS_I(inode),
> > +				      BTRFS_VERITY_DESC_ITEM_KEY, 1,
> > +				      desc, desc_size);
> > +		if (ret)
> > +			goto out;
> > +
> > +		/* update our inode flags to include fs verity */
> > +		trans = btrfs_start_transaction(root, 1);
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			goto out;
> > +		}
> > +		BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
> > +		btrfs_sync_inode_flags_to_i_flags(inode);
> > +		ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
> > +		btrfs_end_transaction(trans);
> > +	}
> > +
> > +out:
> > +	if (desc == NULL || ret) {
> > +		/* If we failed, drop all the verity items */
> > +		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
> > +		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
> > +	} else
> 
> 	} else {
> 
> > +		btrfs_set_fs_compat_ro(root->fs_info, VERITY);
> 
> 	}
> 
> > +	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * fsverity op that gets the struct fsverity_descriptor.
> > + * fsverity does a two pass setup for reading the descriptor, in the first pass
> > + * it calls with buf_size = 0 to query the size of the descriptor,
> > + * and then in the second pass it actually reads the descriptor off
> > + * disk.
> > + */
> > +static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
> > +				       size_t buf_size)
> > +{
> > +	u64 true_size;
> > +	ssize_t ret = 0;
> > +	struct btrfs_verity_descriptor_item item;
> > +
> > +	memset(&item, 0, sizeof(item));
> > +	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
> > +			     0, (char *)&item, sizeof(item), NULL);
> 
> Given that read_key_bytes does not need to return ssize_t, you can
> switch ret to 0 here, so the function return type actually matches what
> you return.

I apologize, I don't think I understand this one. Do you mean to change
ret (and read_key_bytes) from ssize_t to int? Or is there something else
I should do here as well?

> 
> > +	if (ret < 0)
> > +		return ret;
> 
> eg. here
> 
> > +
> > +	if (item.reserved[0] != 0 || item.reserved[1] != 0)
> > +		return -EUCLEAN;
> > +
> > +	true_size = btrfs_stack_verity_descriptor_size(&item);
> > +	if (true_size > INT_MAX)
> > +		return -EUCLEAN;
> > +
> > +	if (!buf_size)
> > +		return true_size;
> > +	if (buf_size < true_size)
> > +		return -ERANGE;
> > +
> > +	ret = read_key_bytes(BTRFS_I(inode),
> > +			     BTRFS_VERITY_DESC_ITEM_KEY, 1,
> > +			     buf, buf_size, NULL);
> > +	if (ret < 0)
> > +		return ret;
> > +	if (ret != true_size)
> > +		return -EIO;
> > +
> > +	return true_size;
> > +}
> > +
> > +/*
> > + * fsverity op that reads and caches a merkle tree page.  These are stored
> > + * in the btree, but we cache them in the inode's address space after EOF.
> > + */
> > +static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
> > +					       pgoff_t index,
> > +					       unsigned long num_ra_pages)
> > +{
> > +	struct page *p;
> 
> Please don't use single letter variables
> 
> > +	u64 off = index << PAGE_SHIFT;
> 
> pgoff_t is unsigned long, the shift will trim high bytes, you may want
> to use the page_offset helper instead.

Ah, my intent was that it should all fit in off, but yes this does seem
like it could lose bytes (I thiiiink we might be safe because a write
would fail first, but I would like this code to be correct). I'll look
into the helper.
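
To make the truncation concern concrete, here is a minimal userspace
sketch. It assumes a 32-bit page index (as pgoff_t would be on a 32-bit
architecture), a 32-bit int, and 4K pages; the names SKETCH_PAGE_SHIFT,
offset_narrow_shift and offset_wide_shift are made up for illustration
and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K pages */

/*
 * The shift is evaluated in the 32-bit type of index, so any bits
 * shifted past bit 31 are lost before the result is widened to 64 bits.
 */
static uint64_t offset_narrow_shift(uint32_t index)
{
	return index << SKETCH_PAGE_SHIFT;
}

/* Widen to 64 bits first, then shift: no bits are lost. */
static uint64_t offset_wide_shift(uint32_t index)
{
	return (uint64_t)index << SKETCH_PAGE_SHIFT;
}
```

With index 0x100001 the narrow variant yields 0x1000 while the wide
variant yields 0x100001000, which is the difference between reading the
second page of the tree and reading a page 4G further in.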

> 
> > +	loff_t merkle_pos = merkle_file_pos(inode);
> 
> u64, that should work with comparison to loff_t
> 
> > +	ssize_t ret;
> > +	int err;
> > +
> > +	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
> > +		return ERR_PTR(-EFBIG);
> > +	index += merkle_pos >> PAGE_SHIFT;
> > +again:
> > +	p = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
> > +	if (p) {
> > +		if (PageUptodate(p))
> > +			return p;
> > +
> > +		lock_page(p);
> > +		/*
> > +		 * we only insert uptodate pages, so !Uptodate has to be
> > +		 * an error
> > +		 */
> > +		if (!PageUptodate(p)) {
> > +			unlock_page(p);
> > +			put_page(p);
> > +			return ERR_PTR(-EIO);
> > +		}
> > +		unlock_page(p);
> > +		return p;
> > +	}
> > +
> > +	p = page_cache_alloc(inode->i_mapping);
> 
> So this performs an allocation with GFP flags from the inode mapping.
> I'm not sure if this is safe, eg. in add_ra_bio_pages we do 
> 
> 	page = __page_cache_alloc(mapping_gfp_constraint(mapping,
> 							 ~__GFP_FS));
> 
> to emulate GFP_NOFS. Either that or do the scoped nofs with
> memalloc_nofs_save/_restore.
> 
> > +	if (!p)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/*
> > +	 * merkle item keys are indexed from byte 0 in the merkle tree.
> > +	 * they have the form:
> > +	 *
> > +	 * [ inode objectid, BTRFS_MERKLE_ITEM_KEY, offset in bytes ]
> > +	 */
> > +	ret = read_key_bytes(BTRFS_I(inode),
> > +			     BTRFS_VERITY_MERKLE_ITEM_KEY, off,
> > +			     page_address(p), PAGE_SIZE, p);
> > +	if (ret < 0) {
> > +		put_page(p);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	/* zero fill any bytes we didn't write into the page */
> > +	if (ret < PAGE_SIZE) {
> > +		char *kaddr = kmap_atomic(p);
> > +
> > +		memset(kaddr + ret, 0, PAGE_SIZE - ret);
> > +		kunmap_atomic(kaddr);
> 
> There's helper memzero_page wrapping the kmap
> 
> > +	}
> > +	SetPageUptodate(p);
> > +	err = add_to_page_cache_lru(p, inode->i_mapping, index,
> 
> Please drop err and use ret
> 
> > +				    mapping_gfp_mask(inode->i_mapping));
> > +
> > +	if (!err) {
> > +		/* inserted and ready for fsverity */
> > +		unlock_page(p);
> > +	} else {
> > +		put_page(p);
> > +		/* did someone race us into inserting this page? */
> > +		if (err == -EEXIST)
> > +			goto again;
> > +		p = ERR_PTR(err);
> > +	}
> > +	return p;
> > +}
> > +
> > +/*
> > + * fsverity op that writes a merkle tree block into the btree in 1k chunks.
> 
> Should it say "in 2^log_blocksize chunks" instead?

I was trying to highlight that though we are writing a 4K merkle block,
we will write it in 1k pieces (per write_key_bytes). Happy to change
this comment to be more useful, though.
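
As an illustration of that point, the following userspace sketch lists
the (key offset, length) pairs that one Merkle block would be split
into, following the loop structure of write_key_bytes in the patch. The
struct and function names are hypothetical, and the 1024 byte limit is
taken from the patch as posted.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CHUNK 1024	/* per-item payload limit in write_key_bytes */

struct sketch_chunk {
	uint64_t key_offset;	/* offset field of the item key */
	uint64_t len;		/* payload bytes stored in the item */
};

/*
 * Split Merkle block number index into items of at most SKETCH_CHUNK
 * bytes.  Returns the number of entries written to out.
 */
static int split_merkle_block(uint64_t index, int log_blocksize,
			      struct sketch_chunk *out, int max_chunks)
{
	uint64_t off, len;
	int n = 0;

	assert(log_blocksize >= 0 && log_blocksize < 64);
	off = index << log_blocksize;
	len = 1ULL << log_blocksize;

	while (len > 0 && n < max_chunks) {
		uint64_t copy_bytes = len < SKETCH_CHUNK ? len : SKETCH_CHUNK;

		out[n].key_offset = off;
		out[n].len = copy_bytes;
		off += copy_bytes;
		len -= copy_bytes;
		n++;
	}
	return n;
}
```

For block index 3 with log_blocksize 12 this produces four items at key
offsets 12288, 13312, 14336 and 15360, each 1024 bytes long.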

> 
> > + */
> > +static int btrfs_write_merkle_tree_block(struct inode *inode, const void *buf,
> > +					u64 index, int log_blocksize)
> > +{
> > +	u64 off = index << log_blocksize;
> > +	u64 len = 1 << log_blocksize;
> > +
> > +	if (merkle_file_pos(inode) > inode->i_sb->s_maxbytes - off - len)
> > +		return -EFBIG;
> > +
> > +	return write_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY,
> > +			       off, buf, len);
> > +}
> > +
> > +const struct fsverity_operations btrfs_verityops = {
> > +	.begin_enable_verity	= btrfs_begin_enable_verity,
> > +	.end_enable_verity	= btrfs_end_enable_verity,
> > +	.get_verity_descriptor	= btrfs_get_verity_descriptor,
> > +	.read_merkle_tree_page	= btrfs_read_merkle_tree_page,
> > +	.write_merkle_tree_block = btrfs_write_merkle_tree_block,
> > +};
> > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> > index 5df73001aad4..fa21c8aac78d 100644
> > --- a/include/uapi/linux/btrfs.h
> > +++ b/include/uapi/linux/btrfs.h
> > @@ -288,6 +288,7 @@ struct btrfs_ioctl_fs_info_args {
> >   * first mount when booting older kernel versions.
> >   */
> >  #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
> > +#define BTRFS_FEATURE_COMPAT_RO_VERITY		(1ULL << 2)
> >  
> >  #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
> >  #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
> > @@ -308,7 +309,6 @@ struct btrfs_ioctl_fs_info_args {
> >  #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
> >  #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
> >  #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
> > -
> 
> Keep the newline please
> 
> >  struct btrfs_ioctl_feature_flags {
> >  	__u64 compat_flags;
> >  	__u64 compat_ro_flags;
> > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > index ae25280316bd..2be57416f886 100644
> > --- a/include/uapi/linux/btrfs_tree.h
> > +++ b/include/uapi/linux/btrfs_tree.h
> > @@ -118,6 +118,14 @@
> >  #define BTRFS_INODE_REF_KEY		12
> >  #define BTRFS_INODE_EXTREF_KEY		13
> >  #define BTRFS_XATTR_ITEM_KEY		24
> > +
> > +/*
> > + * fsverity has a descriptor per file, and then
> > + * a number of sha or csum items indexed by offset in to the file.
> > + */
> > +#define BTRFS_VERITY_DESC_ITEM_KEY	36
> > +#define BTRFS_VERITY_MERKLE_ITEM_KEY	37
> > +
> >  #define BTRFS_ORPHAN_ITEM_KEY		48
> >  /* reserve 2-15 close to the inode for later flexibility */
> >  
> > @@ -996,4 +1004,11 @@ struct btrfs_qgroup_limit_item {
> >  	__le64 rsv_excl;
> >  } __attribute__ ((__packed__));
> >  
> > +struct btrfs_verity_descriptor_item {
> > +	/* size of the verity descriptor in bytes */
> > +	__le64 size;
> > +	__le64 reserved[2];
> 
> Is the reserved space "just in case" or are there plans to use it? For
> items the extension and compatibility can be done by checking the item
> size, without further flags or bits set to distinguish that.
> 
> If the extension happens rarely it's manageable to do the size check
> instead of reserving the space.
> 
> The reserved space must be otherwise zero if not used, this serves as
> the way to check the compatibility. It still may need additional code to
> make sure an old kernel does recognize unknown contents and eg. refuses to
> work. I can imagine in the context of verity it could be significant.
> 
> > +	__u8 encryption;
> > +} __attribute__ ((__packed__));
> > +
> >  #endif /* _BTRFS_CTREE_H_ */
> 
> 

Thank you for the in-depth review, and sorry for the sloppy newline
stuff. Everything I didn't explicitly respond to, I'll either fix or
study further.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-11 21:52     ` Boris Burkov
@ 2021-05-12 17:10       ` David Sterba
  0 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2021-05-12 17:10 UTC (permalink / raw)
  To: Boris Burkov; +Cc: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Tue, May 11, 2021 at 02:52:15PM -0700, Boris Burkov wrote:
> On Tue, May 11, 2021 at 10:31:43PM +0200, David Sterba wrote:
> > On Wed, May 05, 2021 at 12:20:40PM -0700, Boris Burkov wrote:
> > > From: Chris Mason <clm@fb.com>
> > > 
> > > Add support for fsverity in btrfs. To support the generic interface in
> > > fs/verity, we add two new item types in the fs tree for inodes with
> > > verity enabled. One stores the per-file verity descriptor and the other
> > > stores the Merkle tree data itself.
> > > 
> > > Verity checking is done at the end of IOs to ensure each page is checked
> > > before it is marked uptodate.
> > > 
> > > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > > re-walking up shared paths in the tree. For this reason, we need to
> > > cache the Merkle tree data.
> > 
> > What's the estimated size of the Merkle tree data? Does the whole tree
> > need to be kept cached or is it only for data that are in page cache?
> 
> With the default of SHA256 and 4K blocks, we have 32 byte digests, which
> fits 128 digests per block, so the Merkle tree will be almost
> exactly 1/127 of the size of the file.

Thanks, so it's roughly 8K per 1M.
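
The arithmetic behind that estimate can be sketched in userspace C. This
is only an illustration under the assumptions stated above (32 byte
digests, 4K blocks) and under the assumption that every tree level is
padded to whole blocks; merkle_tree_size is a hypothetical helper, not a
function from fs/verity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total bytes of Merkle tree data for a file of file_size bytes,
 * summing all levels with each level rounded up to whole blocks.
 */
static uint64_t merkle_tree_size(uint64_t file_size, uint32_t block_size,
				 uint32_t digest_size)
{
	uint64_t hashes_per_block, blocks, total = 0;

	assert(digest_size > 0 && block_size / digest_size > 1);
	hashes_per_block = block_size / digest_size;
	blocks = (file_size + block_size - 1) / block_size;

	/* Each level stores one digest per block of the level below. */
	while (blocks > 1) {
		blocks = (blocks + hashes_per_block - 1) / hashes_per_block;
		total += blocks * block_size;
	}
	return total;
}
```

For a 1G file this gives 2065 blocks (2048 + 16 + 1), about 8.07M, which
is close to the 1/127 ratio. Small files deviate because of the
per-level rounding: a file of exactly 1M needs 3 blocks (12K) in this
model, and a single-block file needs no tree blocks at all.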

> As far as I know, there is no special requirement that the Merkle tree
> data stays cached. If a Merkle tree block is evicted, then a data block
> is evicted and re-read, we would need to read the Merkle tree block
> again and possibly up the path to the root until a cached block with
> PageChecked.
> 
> > > +++ b/fs/btrfs/file.c
> > > @@ -16,6 +16,7 @@
> > >  #include <linux/btrfs.h>
> > >  #include <linux/uio.h>
> > >  #include <linux/iversion.h>
> > > +#include <linux/fsverity.h>
> > >  #include "ctree.h"
> > >  #include "disk-io.h"
> > >  #include "transaction.h"
> > > @@ -3593,7 +3594,12 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
> > >  
> > >  static int btrfs_file_open(struct inode *inode, struct file *filp)
> > >  {
> > > +	int ret;
> > 
> > Missing newline
> 
> Weird, I ran checkpatch so many times.. My bad.

No big deal, such things can slip in. I would not point them out unless
there's another reason to resend the patches. Once all the real things
are done I do a pass just fixing style.

> > > +#include <linux/sched/mm.h>
> > > +#include "ctree.h"
> > > +#include "btrfs_inode.h"
> > > +#include "transaction.h"
> > > +#include "disk-io.h"
> > > +#include "locking.h"
> > > +
> > > +/*
> > > + * Just like ext4, we cache the merkle tree in pages after EOF in the page
> > > + * cache.  Unlike ext4, we're storing these in dedicated btree items and
> > > + * not just shoving them after EOF in the file.  This means we'll need to
> > > + * do extra work to encrypt them once encryption is supported in btrfs,
> > > + * but btrfs has a lot of careful code around i_size and it seems better
> > > + * to make a new key type than try and adjust all of our expectations
> > > + * for i_size.
> > 
> > Can you please rephrase that so it does not start with what other
> > filesystems do but what is the actual design and put references to ext4
> > eventually?
> > 
> > > + *
> > > + * fs verity items are stored under two different key types on disk.
> > > + *
> > > + * The descriptor items:
> > > + * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]
> > 
> > Please put that to the key definitions
> 
> Do you mean to move this whole comment to btrfs_tree.h?

Yeah, there are already some key descriptions, so that should be the
place where to look. You can of course repeat what you need in this
comment for context and so that the text is understandable without going
to other files.

> > > +static loff_t merkle_file_pos(const struct inode *inode)
> > > +{
> > > +	u64 sz = inode->i_size;
> > > +	u64 ret = round_up(sz, 65536);
> > 
> > > What's the reason for the extra variable sz? If that is meant to make
> > > sure the whole u64 is read consistently, then it needs protection and
> > > i_size_read() if the status of the inode lock and the context of the
> > > call are unknown. The compiler will happily merge that to
> > > round_up(inode->i_size).
> 
> This was the result of getting a bit lazy reading assembly. My intent
> was to ensure that we don't overflow the round_up, which is a macro that
> depends on the type of the input. I was messing around figuring out what
> effect casting had on it but gave up and just put it in a u64 before
> calling it.

Forcing the type by assigning it to another variable is ok. The problem
is potential multiple evaluation of inode->i_size. Here's a real example
of how this can lead to bugs:
https://git.kernel.org/linus/d98da49977f67394db492 , in that case it was
inside max(). Reading that again, just using i_size_read is not
sufficient because it still does not have READ_ONCE. We had to emulate
it using a compiler barrier.

But now I realize the i_size can't change because fsverity works in
read-only mode once enabled on the inode, right? In that case the only
concern would indeed be the type safety for round_up.
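
To illustrate the type-safety point: round_up() computes in the type of
its first argument, so a narrow input can wrap where a widened one does
not. Below is a user-space sketch, assuming GCC or Clang for __typeof__.
The macro bodies mirror the kernel's power-of-two round_up(), while
round_up_narrow() and round_up_wide() are hypothetical names used only
for this demonstration, and uint32_t merely stands in for a type that is
too narrow for the result.

```c
#include <assert.h>
#include <stdint.h>

/* User-space copies of the kernel's power-of-two rounding macros. */
#define round_mask(x, y) ((__typeof__(x))((y) - 1))
#define round_up(x, y) ((((x) - 1) | round_mask(x, y)) + 1)

/* Hypothetical: round in 32-bit arithmetic; the result wraps near UINT32_MAX. */
static uint32_t round_up_narrow(uint32_t size, uint32_t align)
{
	assert(align && !(align & (align - 1)));	/* power of two only */
	return round_up(size, align);
}

/* Hypothetical: widen to 64 bits first, like the u64 temporary in the patch. */
static uint64_t round_up_wide(uint32_t size, uint32_t align)
{
	uint64_t sz = size;

	assert(align && !(align & (align - 1)));	/* power of two only */
	return round_up(sz, align);
}
```

With size = 0xFFFF0001 and align = 65536, the narrow variant wraps to 0
while the wide one yields 0x100000000.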

> > Next, what's the meaning of the constant 65536?
> > 
> 
> It's arbitrary, and copied from ext4. I _believe_ the idea behind it is
> that it should be a fixed constant to avoid making the page size change
> the maximum file size subtly, but should be big enough to be a fresh
> page truly past the end of the file pages on a 64K page size system.

Yeah, according to commit c93d8f88580921c84d it seems to be the maximum
expected page size, in that case a symbolic name would be more
appropriate.

fs/ext4/verity.c:
* Using a 64K boundary rather than a 4K one keeps things ready for
* architectures with 64K pages, and it doesn't necessarily waste space on-disk
* since there can be a hole between i_size and the start of the Merkle tree.

> > > +	if (ret > inode->i_sb->s_maxbytes)
> > > +		return -EFBIG;
> > > +	return ret;
> > 
> > ret is u64 so the function should also return u64
> 
> This was intentional as we do want an loff_t (long long) returned, but
> use the u64 for the overflow checking above.

So the reason why you probably want to use loff_t is
inode->i_sb->s_maxbytes, which is loff_t. As this is a constant
MAX_LFS_FILESIZE (sb initialized in btrfs_fill_super), I think you can
replace all instances of sb->s_maxbytes by that, or add something like
BTRFS_MAX_INODE_SIZE that will be u64.

The VFS interface has to use loff_t but as we use u64 in btrfs for
almost everything I'd rather unify that according to the style used in
btrfs and make sure that the conversion is OK on the VFS->FS boundary.
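
A sketch of how the helper could look with a named constant and u64
throughout, leaving any conversion to the VFS type to the caller. This
is user-space C under stated assumptions: MERKLE_START_ALIGN and
DEMO_MAX_FILE_SIZE are made-up names (the latter stands in for
MAX_LFS_FILESIZE on a 64-bit kernel), and returning the position through
a pointer is just one possible shape, not what the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MERKLE_START_ALIGN	65536ULL		/* hypothetical name for the 64K constant */
#define DEMO_MAX_FILE_SIZE	((uint64_t)INT64_MAX)	/* stand-in for MAX_LFS_FILESIZE */

/* Store the first aligned offset at or past i_size in *pos, or fail with -EFBIG. */
static int merkle_file_pos(uint64_t i_size, uint64_t *pos)
{
	uint64_t rounded;

	assert(pos);
	if (i_size > DEMO_MAX_FILE_SIZE)
		return -EFBIG;
	/* Cannot wrap: i_size <= INT64_MAX, so adding 65535 still fits in u64. */
	rounded = (i_size + MERKLE_START_ALIGN - 1) & ~(MERKLE_START_ALIGN - 1);
	if (rounded > DEMO_MAX_FILE_SIZE)
		return -EFBIG;
	*pos = rounded;
	return 0;
}
```

Keeping the error code and the position separate avoids squeezing a
negative errno and a u64 offset through the same return type.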

> > > +static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
> > > +					       pgoff_t index,
> > > +					       unsigned long num_ra_pages)
> > > +{
> > > +	struct page *p;
> > 
> > Please don't use single letter variables
> > 
> > > +	u64 off = index << PAGE_SHIFT;
> > 
> > pgoff_t is unsigned long, the shift will trim high bytes, you may want
> > to use the page_offset helper instead.
> 
> Ah, my intent was that it should all fit in off, but yes this does seem
> like it could lose bytes (I thiiiink we might be safe because a write
> would fail first, but I would like this code to be correct). I'll look
> into the helper.

The index gets cast to loff_t before shift, so that's what you'd have to
do here too.
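
A user-space sketch of the difference, where uint32_t models pgoff_t
(unsigned long) on a 32-bit kernel. DEMO_PAGE_SHIFT and both function
names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumes 4K pages */

/* The shift happens in 32 bits; high bits are lost before the widening. */
static uint64_t offset_shift_then_widen(uint32_t index)
{
	return index << DEMO_PAGE_SHIFT;
}

/* Widen first, as page_offset() does with its loff_t cast. */
static uint64_t offset_widen_then_shift(uint32_t index)
{
	uint64_t off = (uint64_t)index << DEMO_PAGE_SHIFT;

	assert((off >> DEMO_PAGE_SHIFT) == index);	/* no bits were dropped */
	return off;
}
```

For index 0x00100001 the first variant returns 0x1000, the second
0x100001000.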

> Thank you for the in-depth review, and sorry for the sloppy newline
> stuff. Everything I didn't explicitly respond to, I'll either fix or
> study further.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-05 19:20 ` [PATCH v4 2/5] btrfs: initial fsverity support Boris Burkov
                     ` (2 preceding siblings ...)
  2021-05-11 20:31   ` David Sterba
@ 2021-05-12 17:34   ` David Sterba
  3 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2021-05-12 17:34 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 05, 2021 at 12:20:40PM -0700, Boris Burkov wrote:
> +/*
> + * Insert and write inode items with a given key type and offset.
> + * @inode: The inode to insert for.
> + * @key_type: The key type to insert.
> + * @offset: The item offset to insert at.
> + * @src: Source data to write.
> + * @len: Length of source data to write.
> + *
> + * Write len bytes from src into items of up to 1k length.
> + * The inserted items will have key <ino, key_type, offset + off> where
> + * off is consecutively increasing from 0 up to the last item ending at
> + * offset + len.
> + *
> + * Returns 0 on success and a negative error code on failure.
> + */
> +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> +			   const char *src, u64 len)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	u64 copied = 0;
> +	unsigned long copy_bytes;
> +	unsigned long src_offset = 0;
> +	void *data;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	while (len > 0) {
> +		trans = btrfs_start_transaction(root, 1);

This starts a transaction for each 1K of written data, which can become
slow. In btrfs_end_enable_verity it's called 3 times and then there's
another transaction started to set the VERITY bit on the inode. There's
no commit called so it's not forced, but one could happen at any time
independently. So this could result in partial verity data being stored.

We can't use join_transaction, or not without some block reserve magic.

> +		if (IS_ERR(trans)) {
> +			ret = PTR_ERR(trans);
> +			break;
> +		}
> +
> +		key.objectid = btrfs_ino(inode);
> +		key.offset = offset;
> +		key.type = key_type;
> +
> +		/*
> +		 * insert 1K at a time mostly to be friendly for smaller
> +		 * leaf size filesystems
> +		 */
> +		copy_bytes = min_t(u64, len, 1024);

The smallest we consider is 4K, I'm not sure if we would do eg. 2K to
allow testing the subpage blocksize even on x86_64. Otherwise I'd target
4K and adjust the limits accordingly. To reduce the transaction start
count, eg. 2K per round could halve the number.
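
For a sense of scale: the loop in write_key_bytes starts one transaction
per chunk, so the count is the length divided by the chunk size, rounded
up. A trivial user-space sketch; transaction_starts() is a hypothetical
name and the lengths below are made-up inputs, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Loop iterations (and so transaction starts) needed to write len bytes. */
static uint64_t transaction_starts(uint64_t len, uint64_t chunk)
{
	assert(chunk > 0);
	return len / chunk + (len % chunk != 0);
}
```

For example, 8458240 bytes take 8260 starts at 1K per item and 4130 at
2K.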

> +
> +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> +		if (ret) {

Does this also need to abort the transaction? This could leave the
filesystem in an incomplete state. If the whole operation is restartable
then it could avoid the abort and just return the error, but it must also
undo all changes. This is not always possible, so aborting is the only
option left.

> +			btrfs_end_transaction(trans);
> +			break;
> +		}
> +
> +		leaf = path->nodes[0];
> +
> +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> +		write_extent_buffer(leaf, src + src_offset,
> +				    (unsigned long)data, copy_bytes);
> +		offset += copy_bytes;
> +		src_offset += copy_bytes;
> +		len -= copy_bytes;
> +		copied += copy_bytes;
> +
> +		btrfs_release_path(path);
> +		btrfs_end_transaction(trans);
> +	}
> +
> +	btrfs_free_path(path);
> +	return ret;
> +}

* Re: [PATCH v4 5/5] btrfs: verity metadata orphan items
  2021-05-05 19:20 ` [PATCH v4 5/5] btrfs: verity metadata orphan items Boris Burkov
@ 2021-05-12 17:48   ` David Sterba
  2021-05-12 18:08     ` Boris Burkov
  0 siblings, 1 reply; 26+ messages in thread
From: David Sterba @ 2021-05-12 17:48 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 05, 2021 at 12:20:43PM -0700, Boris Burkov wrote:
> +/*
> + * Helper to manage the transaction for adding an orphan item.
> + */
> +static int add_orphan(struct btrfs_inode *inode)

I wonder if this helper is useful, it's used only once and the code is
not long. Simply wrapping btrfs_orphan_add into a transaction is short
enough to be in btrfs_begin_enable_verity.

> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_root *root = inode->root;
> +	int ret = 0;
> +
> +	trans = btrfs_start_transaction(root, 1);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		goto out;
> +	}
> +	ret = btrfs_orphan_add(trans, inode);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		goto out;
> +	}
> +	btrfs_end_transaction(trans);
> +
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Helper to manage the transaction for deleting an orphan item.
> + */
> +static int del_orphan(struct btrfs_inode *inode)

Same here.

> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_root *root = inode->root;
> +	int ret;
> +
> +	/*
> +	 * If the inode has no links, it is either already unlinked, or was
> +	 * created with O_TMPFILE. In either case, it should have an orphan from
> +	 * that other operation. Rather than reference count the orphans, we
> +	 * simply ignore them here, because we only invoke the verity path in
> +	 * the orphan logic when i_nlink is 0.
> +	 */
> +	if (!inode->vfs_inode.i_nlink)
> +		return 0;
> +
> +	trans = btrfs_start_transaction(root, 1);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +
> +	ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		return ret;
> +	}
> +
> +	btrfs_end_transaction(trans);
> +	return ret;
> +}

* Re: [PATCH v4 3/5] btrfs: check verity for reads of inline extents and holes
  2021-05-05 19:20 ` [PATCH v4 3/5] btrfs: check verity for reads of inline extents and holes Boris Burkov
@ 2021-05-12 17:57   ` David Sterba
  2021-05-12 18:25     ` Boris Burkov
  0 siblings, 1 reply; 26+ messages in thread
From: David Sterba @ 2021-05-12 17:57 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 05, 2021 at 12:20:41PM -0700, Boris Burkov wrote:
> The majority of reads receive a verity check after the bio is complete
> as the page is marked uptodate. However, there is a class of reads which
> are handled with btrfs logic in readpage, rather than by submitting a
> bio. Specifically, these are inline extents, preallocated extents, and
> holes. Tweak readpage so that if it is going to mark such a page
> uptodate, it first checks verity on it.

So verity works with inline extents and fills the unused space with zeros
before hashing?

> Now if a veritied file has corruption to this class of EXTENT_DATA
> items, it will be detected at read time.
> 
> There is one annoying edge case that requires checking for start <
> last_byte: if userspace reads to the end of a file with page aligned
> size and then tries to keep reading (as cat does), the buffered read
> code will try to read the page past the end of the file, and expects it
> to be filled with 0s and marked uptodate. That bogus page is not part of
> the data hashed by verity, so we have to ignore it.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/extent_io.c | 26 +++++++-------------------
>  1 file changed, 7 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index d1f57a4ad2fb..d1493a876915 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2202,18 +2202,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	return bitset;
>  }
>  
> -/*
> - * helper function to set a given page up to date if all the
> - * extents in the tree for that page are up to date
> - */
> -static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
> -{
> -	u64 start = page_offset(page);
> -	u64 end = start + PAGE_SIZE - 1;
> -	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
> -		SetPageUptodate(page);
> -}
> -
>  int free_io_failure(struct extent_io_tree *failure_tree,
>  		    struct extent_io_tree *io_tree,
>  		    struct io_failure_record *rec)
> @@ -3467,14 +3455,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>  					    &cached, GFP_NOFS);
>  			unlock_extent_cached(tree, cur,
>  					     cur + iosize - 1, &cached);
> -			end_page_read(page, true, cur, iosize);
> +			ret = end_page_read(page, true, cur, iosize);

Latest version of end_page_read does not return any value.

>  			break;
>  		}
>  		em = __get_extent_map(inode, page, pg_offset, cur,
>  				      end - cur + 1, em_cached);
>  		if (IS_ERR_OR_NULL(em)) {
>  			unlock_extent(tree, cur, end);
> -			end_page_read(page, false, cur, end + 1 - cur);
> +			ret = end_page_read(page, false, cur, end + 1 - cur);
>  			break;
>  		}
>  		extent_offset = cur - em->start;
> @@ -3555,9 +3543,10 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>  
>  			set_extent_uptodate(tree, cur, cur + iosize - 1,
>  					    &cached, GFP_NOFS);
> +
>  			unlock_extent_cached(tree, cur,
>  					     cur + iosize - 1, &cached);
> -			end_page_read(page, true, cur, iosize);
> +			ret = end_page_read(page, true, cur, iosize);

And if it would, you'd have to check it in all cases when it's not
followed by break, like here.

>  			cur = cur + iosize;
>  			pg_offset += iosize;
>  			continue;
> @@ -3565,9 +3554,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>  		/* the get_extent function already copied into the page */
>  		if (test_range_bit(tree, cur, cur_end,
>  				   EXTENT_UPTODATE, 1, NULL)) {
> -			check_page_uptodate(tree, page);
>  			unlock_extent(tree, cur, cur + iosize - 1);
> -			end_page_read(page, true, cur, iosize);
> +			ret = end_page_read(page, true, cur, iosize);
>  			cur = cur + iosize;
>  			pg_offset += iosize;
>  			continue;
> @@ -3577,7 +3565,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>  		 */
>  		if (block_start == EXTENT_MAP_INLINE) {
>  			unlock_extent(tree, cur, cur + iosize - 1);
> -			end_page_read(page, false, cur, iosize);
> +			ret = end_page_read(page, false, cur, iosize);
>  			cur = cur + iosize;
>  			pg_offset += iosize;
>  			continue;
> @@ -3595,7 +3583,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>  			*bio_flags = this_bio_flag;
>  		} else {
>  			unlock_extent(tree, cur, cur + iosize - 1);
> -			end_page_read(page, false, cur, iosize);
> +			ret = end_page_read(page, false, cur, iosize);
>  			goto out;
>  		}
>  		cur = cur + iosize;
> -- 
> 2.30.2

* Re: [PATCH v4 5/5] btrfs: verity metadata orphan items
  2021-05-12 17:48   ` David Sterba
@ 2021-05-12 18:08     ` Boris Burkov
  2021-05-12 23:36       ` David Sterba
  0 siblings, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2021-05-12 18:08 UTC (permalink / raw)
  To: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 12, 2021 at 07:48:27PM +0200, David Sterba wrote:
> On Wed, May 05, 2021 at 12:20:43PM -0700, Boris Burkov wrote:
> > +/*
> > + * Helper to manage the transaction for adding an orphan item.
> > + */
> > +static int add_orphan(struct btrfs_inode *inode)
> 
> I wonder if this helper is useful, it's used only once and the code is
> not long. Simply wrapping btrfs_orphan_add into a transaction is short
> enough to be in btrfs_begin_enable_verity.
> 

I agree that just the plain transaction logic is not a big deal, and I
couldn't figure out how to phrase the comment so I left it at that,
which is unhelpful.

With that said, I found that pulling it out into a helper function
significantly reduced the gross-ness of the error handling in the
callsites. Especially for del_orphan in end verity which tries to
handle failures deleting the orphans, which quickly got tangled up with
other errors in the function and the possible transaction errors.

Honestly, I was surprised just how much it helped, and couldn't really
figure out why. If a helper being really beneficial is abnormal, I can
try again to figure out a clean way to write the code with the
transaction in-line.

> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_root *root = inode->root;
> > +	int ret = 0;
> > +
> > +	trans = btrfs_start_transaction(root, 1);
> > +	if (IS_ERR(trans)) {
> > +		ret = PTR_ERR(trans);
> > +		goto out;
> > +	}
> > +	ret = btrfs_orphan_add(trans, inode);
> > +	if (ret) {
> > +		btrfs_abort_transaction(trans, ret);
> > +		goto out;
> > +	}
> > +	btrfs_end_transaction(trans);
> > +
> > +out:
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Helper to manage the transaction for deleting an orphan item.
> > + */
> > +static int del_orphan(struct btrfs_inode *inode)
> 
> Same here.

My comment is dumb again, but the nlink check does make this function
marginally more useful for re-use/correctness.

> 
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_root *root = inode->root;
> > +	int ret;
> > +
> > +	/*
> > +	 * If the inode has no links, it is either already unlinked, or was
> > +	 * created with O_TMPFILE. In either case, it should have an orphan from
> > +	 * that other operation. Rather than reference count the orphans, we
> > +	 * simply ignore them here, because we only invoke the verity path in
> > +	 * the orphan logic when i_nlink is 0.
> > +	 */
> > +	if (!inode->vfs_inode.i_nlink)
> > +		return 0;
> > +
> > +	trans = btrfs_start_transaction(root, 1);
> > +	if (IS_ERR(trans))
> > +		return PTR_ERR(trans);
> > +
> > +	ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
> > +	if (ret) {
> > +		btrfs_abort_transaction(trans, ret);
> > +		return ret;
> > +	}
> > +
> > +	btrfs_end_transaction(trans);
> > +	return ret;
> > +}

* Re: [PATCH v4 3/5] btrfs: check verity for reads of inline extents and holes
  2021-05-12 17:57   ` David Sterba
@ 2021-05-12 18:25     ` Boris Burkov
  0 siblings, 0 replies; 26+ messages in thread
From: Boris Burkov @ 2021-05-12 18:25 UTC (permalink / raw)
  To: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 12, 2021 at 07:57:54PM +0200, David Sterba wrote:
> On Wed, May 05, 2021 at 12:20:41PM -0700, Boris Burkov wrote:
> > The majority of reads receive a verity check after the bio is complete
> > as the page is marked uptodate. However, there is a class of reads which
> > are handled with btrfs logic in readpage, rather than by submitting a
> > bio. Specifically, these are inline extents, preallocated extents, and
> > holes. Tweak readpage so that if it is going to mark such a page
> > uptodate, it first checks verity on it.
> 
> So verity works with inline extents and fills the unused space with zeros
> before hashing?

There is no special logic to zero the unused space for verity, we just
ship the page off to the VFS verity code before marking it Uptodate. The
inline extent logic in btrfs_get_extent does zero the parts of the page
past the data copied in.
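
Put differently, the page handed to verity looks like the result of the
following simplified user-space model of that copy-and-zero step.
fill_page_from_inline() and DEMO_PAGE_SIZE are hypothetical names, and
the real btrfs_get_extent path has more to handle (compression, for one)
than this shows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096	/* assumes 4K pages */

/* Copy the inline data to the start of the page and zero the rest. */
static void fill_page_from_inline(unsigned char *page,
				  const unsigned char *data, size_t len)
{
	assert(len <= DEMO_PAGE_SIZE);
	memcpy(page, data, len);
	memset(page + len, 0, DEMO_PAGE_SIZE - len);
}
```

Since the tail is always zeroed, hashing the full page gives the same
result every time the page is read back in.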

> 
> > Now if a veritied file has corruption to this class of EXTENT_DATA
> > items, it will be detected at read time.
> > 
> > There is one annoying edge case that requires checking for start <
> > last_byte: if userspace reads to the end of a file with page aligned
> > size and then tries to keep reading (as cat does), the buffered read
> > code will try to read the page past the end of the file, and expects it
> > to be filled with 0s and marked uptodate. That bogus page is not part of
> > the data hashed by verity, so we have to ignore it.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/extent_io.c | 26 +++++++-------------------
> >  1 file changed, 7 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index d1f57a4ad2fb..d1493a876915 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2202,18 +2202,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
> >  	return bitset;
> >  }
> >  
> > -/*
> > - * helper function to set a given page up to date if all the
> > - * extents in the tree for that page are up to date
> > - */
> > -static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
> > -{
> > -	u64 start = page_offset(page);
> > -	u64 end = start + PAGE_SIZE - 1;
> > -	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
> > -		SetPageUptodate(page);
> > -}
> > -
> >  int free_io_failure(struct extent_io_tree *failure_tree,
> >  		    struct extent_io_tree *io_tree,
> >  		    struct io_failure_record *rec)
> > @@ -3467,14 +3455,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
> >  					    &cached, GFP_NOFS);
> >  			unlock_extent_cached(tree, cur,
> >  					     cur + iosize - 1, &cached);
> > -			end_page_read(page, true, cur, iosize);
> > +			ret = end_page_read(page, true, cur, iosize);
> 
> Latest version of end_page_read does not return any value.

In case you missed it, I modified it to return a value in the second
patch (btrfs: initial support for fsverity)

> 
> >  			break;
> >  		}
> >  		em = __get_extent_map(inode, page, pg_offset, cur,
> >  				      end - cur + 1, em_cached);
> >  		if (IS_ERR_OR_NULL(em)) {
> >  			unlock_extent(tree, cur, end);
> > -			end_page_read(page, false, cur, end + 1 - cur);
> > +			ret = end_page_read(page, false, cur, end + 1 - cur);
> >  			break;
> >  		}
> >  		extent_offset = cur - em->start;
> > @@ -3555,9 +3543,10 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
> >  
> >  			set_extent_uptodate(tree, cur, cur + iosize - 1,
> >  					    &cached, GFP_NOFS);
> > +
> >  			unlock_extent_cached(tree, cur,
> >  					     cur + iosize - 1, &cached);
> > -			end_page_read(page, true, cur, iosize);
> > +			ret = end_page_read(page, true, cur, iosize);
> 
> And if it would, you'd have to check it in all cases when it's not
> followed by break, like here.

Agreed. I think I got "lucky" because the continues all break the loop in
the cases I've tried. Thinking about it more, it looks like I need to set
the error bit on the page too, so that might work without end_page_read
having a return value.
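
The hazard with the continue paths, reduced to a user-space sketch:
results[] stands in for the per-block return values of end_page_read (as
modified in patch 2), and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy shape: ret is overwritten each iteration, only the last result survives. */
static int read_blocks_unchecked(const int *results, size_t n)
{
	int ret = 0;
	size_t i;

	assert(results || n == 0);
	for (i = 0; i < n; i++) {
		ret = results[i];
		continue;
	}
	return ret;
}

/* Fixed shape: stop at the first failure so it reaches the caller. */
static int read_blocks_checked(const int *results, size_t n)
{
	int ret = 0;
	size_t i;

	assert(results || n == 0);
	for (i = 0; i < n; i++) {
		ret = results[i];
		if (ret)
			break;
	}
	return ret;
}
```

With results { 0, -5, 0 } the unchecked loop returns 0 and the error is
lost, while the checked one returns -5.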

> 
> >  			cur = cur + iosize;
> >  			pg_offset += iosize;
> >  			continue;
> > @@ -3565,9 +3554,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
> >  		/* the get_extent function already copied into the page */
> >  		if (test_range_bit(tree, cur, cur_end,
> >  				   EXTENT_UPTODATE, 1, NULL)) {
> > -			check_page_uptodate(tree, page);
> >  			unlock_extent(tree, cur, cur + iosize - 1);
> > -			end_page_read(page, true, cur, iosize);
> > +			ret = end_page_read(page, true, cur, iosize);
> >  			cur = cur + iosize;
> >  			pg_offset += iosize;
> >  			continue;
> > @@ -3577,7 +3565,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
> >  		 */
> >  		if (block_start == EXTENT_MAP_INLINE) {
> >  			unlock_extent(tree, cur, cur + iosize - 1);
> > -			end_page_read(page, false, cur, iosize);
> > +			ret = end_page_read(page, false, cur, iosize);
> >  			cur = cur + iosize;
> >  			pg_offset += iosize;
> >  			continue;
> > @@ -3595,7 +3583,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
> >  			*bio_flags = this_bio_flag;
> >  		} else {
> >  			unlock_extent(tree, cur, cur + iosize - 1);
> > -			end_page_read(page, false, cur, iosize);
> > +			ret = end_page_read(page, false, cur, iosize);
> >  			goto out;
> >  		}
> >  		cur = cur + iosize;
> > -- 
> > 2.30.2

* Re: [PATCH v4 5/5] btrfs: verity metadata orphan items
  2021-05-12 18:08     ` Boris Burkov
@ 2021-05-12 23:36       ` David Sterba
  0 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2021-05-12 23:36 UTC (permalink / raw)
  To: Boris Burkov; +Cc: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 12, 2021 at 11:08:57AM -0700, Boris Burkov wrote:
> On Wed, May 12, 2021 at 07:48:27PM +0200, David Sterba wrote:
> > On Wed, May 05, 2021 at 12:20:43PM -0700, Boris Burkov wrote:
> > > +/*
> > > + * Helper to manage the transaction for adding an orphan item.
> > > + */
> > > +static int add_orphan(struct btrfs_inode *inode)
> > 
> > I wonder if this helper is useful, it's used only once and the code is
> > not long. Simply wrapping btrfs_orphan_add into a transaction is short
> > enough to be in btrfs_begin_enable_verity.
> 
> I agree that just the plain transaction logic is not a big deal, and I
> couldn't figure out how to phrase the comment so I left it at that,
> which is unhelpful.
> 
> With that said, I found that pulling it out into a helper function
> significantly reduced the gross-ness of the error handling in the
> callsites. Especially for del_orphan in end verity which tries to
> handle failures deleting the orphans, which quickly got tangled up with
> other errors in the function and the possible transaction errors.
> 
> Honestly, I was surprised just how much it helped, and couldn't really
> figure out why. If a helper being really beneficial is abnormal, I can
> try again to figure out a clean way to write the code with the
> transaction in-line.

This gives me an impression that the helper in your view helps
readability and that's something I'm fine with. In the past we got
cleanups that remove one time helpers so I'm affected by that. Also the
helpers hide some details like the transaction start that could be
considered heavy so the helper kind of obscures that. But there's
another aspect, again readability, "do that and the caller does not need
to care", and when the helper is static in the same file it's easy to
look up and not a big deal.

> > > +{
> > > +	struct btrfs_trans_handle *trans;
> > > +	struct btrfs_root *root = inode->root;
> > > +	int ret = 0;
> > > +
> > > +	trans = btrfs_start_transaction(root, 1);
> > > +	if (IS_ERR(trans)) {
> > > +		ret = PTR_ERR(trans);
> > > +		goto out;
> > > +	}
> > > +	ret = btrfs_orphan_add(trans, inode);
> > > +	if (ret) {
> > > +		btrfs_abort_transaction(trans, ret);
> > > +		goto out;
> > > +	}
> > > +	btrfs_end_transaction(trans);
> > > +
> > > +out:
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * Helper to manage the transaction for deleting an orphan item.
> > > + */
> > > +static int del_orphan(struct btrfs_inode *inode)
> > 
> > Same here.
> 
> My comment is dumb again, but the nlink check does make this function
> marginally more useful for re-use/correctness.

I don't think it's dumb, the nlink check is one line with several lines
of comment explaining and also described in the changelog as a corner
case and it's not obvious. For that reason a helper is fine and let's
keep the helpers as they are, so it's consistent. It's just when I'm
reading the code I'm questioning everything but it does not mean that
all of that needs to be done the way I see it, in the comments I'm
just exploring the possibility to do so.

* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-11 20:31   ` David Sterba
  2021-05-11 21:52     ` Boris Burkov
@ 2021-05-13 19:19     ` Boris Burkov
  2021-05-17 21:40       ` David Sterba
  1 sibling, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2021-05-13 19:19 UTC (permalink / raw)
  To: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Tue, May 11, 2021 at 10:31:43PM +0200, David Sterba wrote:
> On Wed, May 05, 2021 at 12:20:40PM -0700, Boris Burkov wrote:
> > From: Chris Mason <clm@fb.com>
> > 
> > Add support for fsverity in btrfs. To support the generic interface in
> > fs/verity, we add two new item types in the fs tree for inodes with
> > verity enabled. One stores the per-file verity descriptor and the other
> > stores the Merkle tree data itself.
> > 
> > Verity checking is done at the end of IOs to ensure each page is checked
> > before it is marked uptodate.
> > 
> > Verity relies on PageChecked for the Merkle tree data itself to avoid
> > re-walking up shared paths in the tree. For this reason, we need to
> > cache the Merkle tree data.
> 
> What's the estimated size of the Merkle tree data? Does the whole tree
> need to be kept cached or is it only for data that are in page cache?
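
A rough way to estimate the size, assuming fs-verity's common
parameters of SHA-256 (32-byte digests) and 4K blocks, which are not
stated in this thread: each tree block holds block_size / digest_size
hashes, so every level is that factor smaller than the one below it,
until a single block remains. merkle_tree_bytes() is a hypothetical
user-space helper, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes of Merkle tree blocks for a file of file_size bytes. */
static uint64_t merkle_tree_bytes(uint64_t file_size, uint32_t block_size,
				  uint32_t digest_size)
{
	uint64_t hashes_per_block, blocks, total = 0;

	/* At least two hashes per block, otherwise the levels never shrink. */
	assert(digest_size > 0 && block_size / digest_size >= 2);
	hashes_per_block = block_size / digest_size;
	blocks = file_size / block_size + (file_size % block_size != 0);
	while (blocks > 1) {
		/* One hash per block of the level below, packed into blocks. */
		blocks = blocks / hashes_per_block + (blocks % hashes_per_block != 0);
		total += blocks;
	}
	return total * block_size;
}
```

Under those assumptions a 1 GiB file needs 2048 + 16 + 1 = 2065 tree
blocks, i.e. 8458240 bytes, a bit under 1% of the file size.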
> 
> > Since the file is immutable after verity is
> > turned on, we can cache it at an index past EOF.
> > 
> > Use the new inode compat_flags to store verity on the inode item, so
> > that we can enable verity on a file, then rollback to an older kernel
> > and still mount the file system and read the file. Since we can't safely
> > write the file anymore without ruining the invariants of the Merkle
> > tree, we mark a ro_compat flag on the file system when a file has verity
> > enabled.
> > 
> > Signed-off-by: Chris Mason <clm@fb.com>
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/Makefile               |   1 +
> >  fs/btrfs/btrfs_inode.h          |   1 +
> >  fs/btrfs/ctree.h                |  30 +-
> >  fs/btrfs/extent_io.c            |  27 +-
> >  fs/btrfs/file.c                 |   6 +
> >  fs/btrfs/inode.c                |   7 +
> >  fs/btrfs/ioctl.c                |  14 +-
> >  fs/btrfs/super.c                |   3 +
> >  fs/btrfs/sysfs.c                |   6 +
> >  fs/btrfs/verity.c               | 617 ++++++++++++++++++++++++++++++++
> >  include/uapi/linux/btrfs.h      |   2 +-
> >  include/uapi/linux/btrfs_tree.h |  15 +
> >  12 files changed, 718 insertions(+), 11 deletions(-)
> >  create mode 100644 fs/btrfs/verity.c
> > 
> > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> > index cec88a66bd6c..3dcf9bcc2326 100644
> > --- a/fs/btrfs/Makefile
> > +++ b/fs/btrfs/Makefile
> > @@ -36,6 +36,7 @@ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> >  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> >  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> >  btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
> > +btrfs-$(CONFIG_FS_VERITY) += verity.o
> >  
> >  btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
> >  	tests/extent-buffer-tests.o tests/btrfs-tests.o \
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index e8dbc8e848ce..4536548b9e79 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -51,6 +51,7 @@ enum {
> >  	 * the file range, inode's io_tree).
> >  	 */
> >  	BTRFS_INODE_NO_DELALLOC_FLUSH,
> > +	BTRFS_INODE_VERITY_IN_PROGRESS,
> 
> Please add a comment
> 
> >  };
> >  
> >  /* in memory btrfs inode */
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 0546273a520b..c5aab6a639ef 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -279,9 +279,10 @@ struct btrfs_super_block {
> >  #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
> >  #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
> >  
> > -#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
> > -	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
> > -	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
> > +#define BTRFS_FEATURE_COMPAT_RO_SUPP				\
> > +	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |		\
> > +	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID |	\
> > +	 BTRFS_FEATURE_COMPAT_RO_VERITY)
> >  
> >  #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
> >  #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
> > @@ -1505,6 +1506,11 @@ do {                                                                   \
> >  	 BTRFS_INODE_COMPRESS |						\
> >  	 BTRFS_INODE_ROOT_ITEM_INIT)
> >  
> > +/*
> > + * Inode compat flags
> > + */
> > +#define BTRFS_INODE_VERITY		(1 << 0)
> > +
> >  struct btrfs_map_token {
> >  	struct extent_buffer *eb;
> >  	char *kaddr;
> > @@ -3766,6 +3772,24 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
> >  	return signal_pending(current);
> >  }
> >  
> > +/* verity.c */
> > +#ifdef CONFIG_FS_VERITY
> > +extern const struct fsverity_operations btrfs_verityops;
> > +int btrfs_drop_verity_items(struct btrfs_inode *inode);
> > +BTRFS_SETGET_FUNCS(verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
> > +		   encryption, 8);
> > +BTRFS_SETGET_FUNCS(verity_descriptor_size, struct btrfs_verity_descriptor_item, size, 64);
> > +BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
> > +			 encryption, 8);
> > +BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size, struct btrfs_verity_descriptor_item,
> > +			 size, 64);
> > +#else
> > +static inline int btrfs_drop_verity_items(struct btrfs_inode *inode)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  /* Sanity test specific functions */
> >  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> >  void btrfs_test_destroy_inode(struct inode *inode);
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 4fb33cadc41a..d1f57a4ad2fb 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/pagevec.h>
> >  #include <linux/prefetch.h>
> >  #include <linux/cleancache.h>
> > +#include <linux/fsverity.h>
> >  #include "misc.h"
> >  #include "extent_io.h"
> >  #include "extent-io-tree.h"
> > @@ -2862,15 +2863,28 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
> >  	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
> >  }
> >  
> > -static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> > +static int end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> >  {
> > -	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> > +	int ret = 0;
> > +	struct inode *inode = page->mapping->host;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  
> >  	ASSERT(page_offset(page) <= start &&
> >  		start + len <= page_offset(page) + PAGE_SIZE);
> >  
> >  	if (uptodate) {
> > -		btrfs_page_set_uptodate(fs_info, page, start, len);
> > +		/*
> > +		 * buffered reads of a file with page alignment will issue a
> > +		 * 0 length read for one page past the end of file, so we must
> > +		 * explicitly skip checking verity on that page of zeros.
> > +		 */
> > +		if (!PageError(page) && !PageUptodate(page) &&
> > +		    start < i_size_read(inode) &&
> > +		    fsverity_active(inode) &&
> > +		    !fsverity_verify_page(page))
> > +			ret = -EIO;
> > +		else
> > +			btrfs_page_set_uptodate(fs_info, page, start, len);
> >  	} else {
> >  		btrfs_page_clear_uptodate(fs_info, page, start, len);
> >  		btrfs_page_set_error(fs_info, page, start, len);
> > @@ -2878,12 +2892,13 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
> >  
> >  	if (fs_info->sectorsize == PAGE_SIZE)
> >  		unlock_page(page);
> > -	else if (is_data_inode(page->mapping->host))
> > +	else if (is_data_inode(inode))
> >  		/*
> >  		 * For subpage data, unlock the page if we're the last reader.
> >  		 * For subpage metadata, page lock is not utilized for read.
> >  		 */
> >  		btrfs_subpage_end_reader(fs_info, page, start, len);
> > +	return ret;
> >  }
> >  
> >  /*
> > @@ -3059,7 +3074,9 @@ static void end_bio_extent_readpage(struct bio *bio)
> >  		bio_offset += len;
> >  
> >  		/* Update page status and unlock */
> > -		end_page_read(page, uptodate, start, len);
> > +		ret = end_page_read(page, uptodate, start, len);
> > +		if (ret)
> > +			uptodate = 0;
> >  		endio_readpage_release_extent(&processed, BTRFS_I(inode),
> >  					      start, end, uptodate);
> >  	}
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 3b10d98b4ebb..a99470303bd9 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/btrfs.h>
> >  #include <linux/uio.h>
> >  #include <linux/iversion.h>
> > +#include <linux/fsverity.h>
> >  #include "ctree.h"
> >  #include "disk-io.h"
> >  #include "transaction.h"
> > @@ -3593,7 +3594,12 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
> >  
> >  static int btrfs_file_open(struct inode *inode, struct file *filp)
> >  {
> > +	int ret;
> 
> Missing newline
> 
> >  	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
> > +
> > +	ret = fsverity_file_open(inode, filp);
> > +	if (ret)
> > +		return ret;
> >  	return generic_file_open(inode, filp);
> >  }
> >  
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index d89000577f7f..1b1101369777 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/sched/mm.h>
> >  #include <linux/iomap.h>
> >  #include <asm/unaligned.h>
> > +#include <linux/fsverity.h>
> >  #include "misc.h"
> >  #include "ctree.h"
> >  #include "disk-io.h"
> > @@ -5405,7 +5406,9 @@ void btrfs_evict_inode(struct inode *inode)
> >  
> >  	trace_btrfs_inode_evict(inode);
> >  
> > +
> 
> Extra newline
> 
> >  	if (!root) {
> > +		fsverity_cleanup_inode(inode);
> >  		clear_inode(inode);
> >  		return;
> >  	}
> > @@ -5488,6 +5491,7 @@ void btrfs_evict_inode(struct inode *inode)
> >  	 * to retry these periodically in the future.
> >  	 */
> >  	btrfs_remove_delayed_node(BTRFS_I(inode));
> > +	fsverity_cleanup_inode(inode);
> >  	clear_inode(inode);
> >  }
> >  
> > @@ -9041,6 +9045,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
> >  	struct inode *inode = d_inode(path->dentry);
> >  	u32 blocksize = inode->i_sb->s_blocksize;
> >  	u32 bi_flags = BTRFS_I(inode)->flags;
> > +	u32 bi_compat_flags = BTRFS_I(inode)->compat_flags;
> >  
> >  	stat->result_mask |= STATX_BTIME;
> >  	stat->btime.tv_sec = BTRFS_I(inode)->i_otime.tv_sec;
> > @@ -9053,6 +9058,8 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
> >  		stat->attributes |= STATX_ATTR_IMMUTABLE;
> >  	if (bi_flags & BTRFS_INODE_NODUMP)
> >  		stat->attributes |= STATX_ATTR_NODUMP;
> > +	if (bi_compat_flags & BTRFS_INODE_VERITY)
> > +		stat->attributes |= STATX_ATTR_VERITY;
> >  
> >  	stat->attributes_mask |= (STATX_ATTR_APPEND |
> >  				  STATX_ATTR_COMPRESSED |
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index ff335c192170..4b8f38fe4226 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/btrfs.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/iversion.h>
> > +#include <linux/fsverity.h>
> >  #include "ctree.h"
> >  #include "disk-io.h"
> >  #include "export.h"
> > @@ -105,6 +106,7 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
> >  static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
> >  {
> >  	unsigned int flags = binode->flags;
> > +	unsigned int compat_flags = binode->compat_flags;
> >  	unsigned int iflags = 0;
> >  
> >  	if (flags & BTRFS_INODE_SYNC)
> > @@ -121,6 +123,8 @@ static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
> >  		iflags |= FS_DIRSYNC_FL;
> >  	if (flags & BTRFS_INODE_NODATACOW)
> >  		iflags |= FS_NOCOW_FL;
> > +	if (compat_flags & BTRFS_INODE_VERITY)
> > +		iflags |= FS_VERITY_FL;
> >  
> >  	if (flags & BTRFS_INODE_NOCOMPRESS)
> >  		iflags |= FS_NOCOMP_FL;
> > @@ -148,10 +152,12 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
> >  		new_fl |= S_NOATIME;
> >  	if (binode->flags & BTRFS_INODE_DIRSYNC)
> >  		new_fl |= S_DIRSYNC;
> > +	if (binode->compat_flags & BTRFS_INODE_VERITY)
> > +		new_fl |= S_VERITY;
> >  
> >  	set_mask_bits(&inode->i_flags,
> > -		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> > -		      new_fl);
> > +		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
> > +		      S_VERITY, new_fl);
> >  }
> >  
> >  static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
> > @@ -5072,6 +5078,10 @@ long btrfs_ioctl(struct file *file, unsigned int
> >  		return btrfs_ioctl_get_subvol_rootref(file, argp);
> >  	case BTRFS_IOC_INO_LOOKUP_USER:
> >  		return btrfs_ioctl_ino_lookup_user(file, argp);
> > +	case FS_IOC_ENABLE_VERITY:
> > +		return fsverity_ioctl_enable(file, (const void __user *)argp);
> > +	case FS_IOC_MEASURE_VERITY:
> > +		return fsverity_ioctl_measure(file, argp);
> >  	}
> >  
> >  	return -ENOTTY;
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 4a396c1147f1..aa41ee30e3ca 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -1365,6 +1365,9 @@ static int btrfs_fill_super(struct super_block *sb,
> >  	sb->s_op = &btrfs_super_ops;
> >  	sb->s_d_op = &btrfs_dentry_operations;
> >  	sb->s_export_op = &btrfs_export_ops;
> > +#ifdef CONFIG_FS_VERITY
> > +	sb->s_vop = &btrfs_verityops;
> > +#endif
> >  	sb->s_xattr = btrfs_xattr_handlers;
> >  	sb->s_time_gran = 1;
> >  #ifdef CONFIG_BTRFS_FS_POSIX_ACL
> > diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> > index 436ac7b4b334..331ea4febcb1 100644
> > --- a/fs/btrfs/sysfs.c
> > +++ b/fs/btrfs/sysfs.c
> > @@ -267,6 +267,9 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
> >  #ifdef CONFIG_BTRFS_DEBUG
> >  BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
> >  #endif
> > +#ifdef CONFIG_FS_VERITY
> > +BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
> > +#endif
> >  
> >  static struct attribute *btrfs_supported_feature_attrs[] = {
> >  	BTRFS_FEAT_ATTR_PTR(mixed_backref),
> > @@ -284,6 +287,9 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
> >  	BTRFS_FEAT_ATTR_PTR(raid1c34),
> >  #ifdef CONFIG_BTRFS_DEBUG
> >  	BTRFS_FEAT_ATTR_PTR(zoned),
> > +#endif
> > +#ifdef CONFIG_FS_VERITY
> > +	BTRFS_FEAT_ATTR_PTR(verity),
> >  #endif
> >  	NULL
> >  };
> > diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
> > new file mode 100644
> > index 000000000000..feaf5908b3d3
> > --- /dev/null
> > +++ b/fs/btrfs/verity.c
> > @@ -0,0 +1,617 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2020 Facebook.  All rights reserved.
> > + */
> 
> This is not necessary since we have the SPDX tags,
> https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Copyright_notices_in_files.2C_SPDX
> 
> > +
> > +#include <linux/init.h>
> > +#include <linux/fs.h>
> > +#include <linux/slab.h>
> > +#include <linux/rwsem.h>
> > +#include <linux/xattr.h>
> > +#include <linux/security.h>
> > +#include <linux/posix_acl_xattr.h>
> > +#include <linux/iversion.h>
> > +#include <linux/fsverity.h>
> > +#include <linux/sched/mm.h>
> > +#include "ctree.h"
> > +#include "btrfs_inode.h"
> > +#include "transaction.h"
> > +#include "disk-io.h"
> > +#include "locking.h"
> > +
> > +/*
> > + * Just like ext4, we cache the merkle tree in pages after EOF in the page
> > + * cache.  Unlike ext4, we're storing these in dedicated btree items and
> > + * not just shoving them after EOF in the file.  This means we'll need to
> > + * do extra work to encrypt them once encryption is supported in btrfs,
> > + * but btrfs has a lot of careful code around i_size and it seems better
> > + * to make a new key type than try and adjust all of our expectations
> > + * for i_size.
> 
> Can you please rephrase that so it starts with the actual design rather
> than with what other filesystems do, and put any references to ext4 after
> that?
> 
> > + *
> > + * fs verity items are stored under two different key types on disk.
> > + *
> > + * The descriptor items:
> > + * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]
> 
> Please move that to the key definitions
> 
> > + *
> > + * At offset 0, we store a btrfs_verity_descriptor_item which tracks the
> > + * size of the descriptor item and some extra data for encryption.
> > + * Starting at offset 1, these hold the generic fs verity descriptor.
> > + * These are opaque to btrfs, we just read and write them as a blob for
> > + * the higher level verity code.  The most common size for this is 256 bytes.
> > + *
> > + * The merkle tree items:
> > + * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
> > + *
> > + * These also start at offset 0, and correspond to the merkle tree bytes.
> > + * So when fsverity asks for page 0 of the merkle tree, we pull up one page
> > + * starting at offset 0 for this key type.  These are also opaque to btrfs,
> > + * we're blindly storing whatever fsverity sends down.
> > + */
> > +
> > +/*
> > + * Compute the logical file offset where we cache the Merkle tree.
> > + *
> > + * @inode: the inode of the verity file
> > + *
> > + * For the purposes of caching the Merkle tree pages, as required by
> > + * fs-verity, it is convenient to do size computations in terms of a file
> > + * offset, rather than in terms of page indices.
> > + *
> > + * Returns the file offset on success, negative error code on failure.
> > + */
> > +static loff_t merkle_file_pos(const struct inode *inode)
> > +{
> > +	u64 sz = inode->i_size;
> > +	u64 ret = round_up(sz, 65536);
> 
> What's the reason for the extra variable sz? If it is meant to make sure
> the whole u64 is read consistently, then it needs protection and
> i_size_read() if the status of the inode lock and the calling context are
> unknown. The compiler will happily merge that to round_up(inode->i_size).
> 
> Next, what's the meaning of the constant 65536?
> 
> > +
> > +	if (ret > inode->i_sb->s_maxbytes)
> > +		return -EFBIG;
> > +	return ret;
> 
> ret is u64 so the function should also return u64
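
A minimal user-space sketch of how the comments above could be combined:
the 64K alignment gets a name, and the position comes back through a u64
out-parameter so the error code no longer shares the return value.
MERKLE_ALIGN and the out-parameter signature are assumptions made for
illustration, not what the patch does.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical name for the alignment the patch spells as 65536. */
#define MERKLE_ALIGN 65536ULL

/*
 * Round i_size up to MERKLE_ALIGN and store the result in *pos.
 * Returns 0 on success, -EFBIG if the result would exceed max_bytes.
 */
static int merkle_file_pos(uint64_t i_size, uint64_t max_bytes, uint64_t *pos)
{
	uint64_t rounded;

	/* Guard the addition inside the round-up against u64 wrap-around. */
	if (i_size > UINT64_MAX - (MERKLE_ALIGN - 1))
		return -EFBIG;

	rounded = (i_size + MERKLE_ALIGN - 1) & ~(MERKLE_ALIGN - 1);
	if (rounded > max_bytes)
		return -EFBIG;

	*pos = rounded;
	return 0;
}
```
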
> 
> > +}
> > +
> > +/*
> > + * Drop all the items for this inode with this key_type.
> 
> Newline
> 
> > + * @inode: The inode to drop items for
> > + * @key_type: The type of items to drop (VERITY_DESC_ITEM or
> > + *            VERITY_MERKLE_ITEM)
> 
> Please format the arguments according to the description in
> https://btrfs.wiki.kernel.org/index.php/Development_notes#Comments
> 
> > + *
> > + * Before doing a verity enable we cleanup any existing verity items.
> > + *
> > + * This is also used to clean up if a verity enable failed half way
> > + * through.
> > + *
> > + * Returns 0 on success, negative error code on failure.
> > + */
> > +static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_root *root = inode->root;
> > +	struct btrfs_path *path;
> > +	struct btrfs_key key;
> > +	int ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	while (1) {
> > +		trans = btrfs_start_transaction(root, 1);
> 
> Transaction start should document what the reserved items are, i.e. what
> the 1 relates to.
> 
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			goto out;
> > +		}
> > +
> > +		/*
> > +		 * walk backwards through all the items until we find one
> 
> Comments should start with uppercase unless it's an identifier name.
> This is in many other places so please update them as well.
> 
> > +		 * that isn't from our key type or objectid
> > +		 */
> > +		key.objectid = btrfs_ino(inode);
> > +		key.offset = (u64)-1;
> > +		key.type = key_type;
> 
> It's common to set the members in key order, objectid/type/offset; this
> helps to keep the idea of the key.
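
For context, btrfs keys are compared as an (objectid, type, offset)
tuple, so filling the members in that order mirrors how the search sees
them. A small user-space sketch, with struct demo_key and demo_key_cmp()
as hypothetical stand-ins for the kernel's key type and comparison:

```c
#include <stdint.h>

/* Hypothetical stand-in for a CPU-order btrfs key. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Compare by objectid first, then type, then offset. */
static int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

With offset set to (u64)-1, the search key sorts after every real item
of the same objectid and type, which is why the loop in the patch steps
back one slot to reach the last item.
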
> 
> > +
> > +		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
> > +		if (ret > 0) {
> > +			ret = 0;
> > +			/* no more keys of this type, we're done */
> > +			if (path->slots[0] == 0)
> > +				break;
> > +			path->slots[0]--;
> > +		} else if (ret < 0) {
> > +			break;
> > +		}
> > +
> > +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> > +
> > +		/* no more keys of this type, we're done */
> > +		if (key.objectid != btrfs_ino(inode) || key.type != key_type)
> > +			break;
> > +
> > +		/*
> > +		 * this shouldn't be a performance sensitive function because
> > +		 * it's not used as part of truncate.  If it ever becomes
> > +		 * perf sensitive, change this to walk forward and bulk delete
> > +		 * items
> > +		 */
> > +		ret = btrfs_del_items(trans, root, path,
> > +				      path->slots[0], 1);
> 
> This will probably fit on one line, no need to split the parameters.
> 
> > +		btrfs_release_path(path);
> > +		btrfs_end_transaction(trans);
> > +
> > +		if (ret)
> > +			goto out;
> > +	}
> > +
> > +	btrfs_end_transaction(trans);
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +
> > +}
> > +
> > +/*
> > + * Insert and write inode items with a given key type and offset.
> > + * @inode: The inode to insert for.
> > + * @key_type: The key type to insert.
> > + * @offset: The item offset to insert at.
> > + * @src: Source data to write.
> > + * @len: Length of source data to write.
> > + *
> > + * Write len bytes from src into items of up to 1k length.
> > + * The inserted items will have key <ino, key_type, offset + off> where
> > + * off is consecutively increasing from 0 up to the last item ending at
> > + * offset + len.
> > + *
> > + * Returns 0 on success and a negative error code on failure.
> > + */
> > +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> > +			   const char *src, u64 len)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_path *path;
> > +	struct btrfs_root *root = inode->root;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_key key;
> > +	u64 copied = 0;
> > +	unsigned long copy_bytes;
> > +	unsigned long src_offset = 0;
> > +	void *data;
> > +	int ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	while (len > 0) {
> > +		trans = btrfs_start_transaction(root, 1);
> 
> Same as before, please document what items are reserved
> 
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			break;
> > +		}
> > +
> > +		key.objectid = btrfs_ino(inode);
> > +		key.offset = offset;
> > +		key.type = key_type;
> 
> objectid/type/offset
> 
> > +
> > +		/*
> > +		 * insert 1K at a time mostly to be friendly for smaller
> > +		 * leaf size filesystems
> > +		 */
> > +		copy_bytes = min_t(u64, len, 1024);
> > +
> > +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> > +		if (ret) {
> > +			btrfs_end_transaction(trans);
> > +			break;
> > +		}
> > +
> > +		leaf = path->nodes[0];
> > +
> > +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> > +		write_extent_buffer(leaf, src + src_offset,
> > +				    (unsigned long)data, copy_bytes);
> > +		offset += copy_bytes;
> > +		src_offset += copy_bytes;
> > +		len -= copy_bytes;
> > +		copied += copy_bytes;
> > +
> > +		btrfs_release_path(path);
> > +		btrfs_end_transaction(trans);
> > +	}
> > +
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Read inode items of the given key type and offset from the btree.
> > + * @inode: The inode to read items of.
> > + * @key_type: The key type to read.
> > + * @offset: The item offset to read from.
> > + * @dest: The buffer to read into. This parameter has slightly tricky
> > + *        semantics.  If it is NULL, the function will not do any copying
> > + *        and will just return the size of all the items up to len bytes.
> > + *        If dest_page is passed, then the function will kmap_atomic the
> > + *        page and ignore dest, but it must still be non-NULL to avoid the
> > + *        counting-only behavior.
> > + * @len: Length in bytes to read.
> > + * @dest_page: Copy into this page instead of the dest buffer.
> > + *
> > + * Helper function to read items from the btree.  This returns the number
> > + * of bytes read or < 0 for errors.  We can return short reads if the
> > + * items don't exist on disk or aren't big enough to fill the desired length.
> > + *
> > + * Supports reading into a provided buffer (dest) or into the page cache
> > + *
> > + * Returns number of bytes read or a negative error code on failure.
> > + */
> > +static ssize_t read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> 
> Why does this return ssize_t? The type is not utilized anywhere in the
> function, an 'int' should work.
> 
> > +			  char *dest, u64 len, struct page *dest_page)
> > +{
> > +	struct btrfs_path *path;
> > +	struct btrfs_root *root = inode->root;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_key key;
> > +	u64 item_end;
> > +	u64 copy_end;
> > +	u64 copied = 0;
> 
> Here copied is u64
> 
> > +	u32 copy_offset;
> > +	unsigned long copy_bytes;
> > +	unsigned long dest_offset = 0;
> > +	void *data;
> > +	char *kaddr = dest;
> > +	int ret;
> 
> and ret is int
> 
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	if (dest_page)
> > +		path->reada = READA_FORWARD;
> > +
> > +	key.objectid = btrfs_ino(inode);
> > +	key.offset = offset;
> > +	key.type = key_type;
> > +
> > +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +	if (ret < 0) {
> > +		goto out;
> > +	} else if (ret > 0) {
> > +		ret = 0;
> > +		if (path->slots[0] == 0)
> > +			goto out;
> > +		path->slots[0]--;
> > +	}
> > +
> > +	while (len > 0) {
> > +		leaf = path->nodes[0];
> > +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +
> > +		if (key.objectid != btrfs_ino(inode) ||
> > +		    key.type != key_type)
> > +			break;
> > +
> > +		item_end = btrfs_item_size_nr(leaf, path->slots[0]) + key.offset;
> > +
> > +		if (copied > 0) {
> > +			/*
> > +			 * once we've copied something, we want all of the items
> > +			 * to be sequential
> > +			 */
> > +			if (key.offset != offset)
> > +				break;
> > +		} else {
> > +			/*
> > +			 * our initial offset might be in the middle of an
> > +			 * item.  Make sure it all makes sense
> > +			 */
> > +			if (key.offset > offset)
> > +				break;
> > +			if (item_end <= offset)
> > +				break;
> > +		}
> > +
> > +		/* desc = NULL to just sum all the item lengths */
> > +		if (!dest)
> > +			copy_end = item_end;
> > +		else
> > +			copy_end = min(offset + len, item_end);
> > +
> > +		/* number of bytes in this item we want to copy */
> > +		copy_bytes = copy_end - offset;
> > +
> > +		/* offset from the start of item for copying */
> > +		copy_offset = offset - key.offset;
> > +
> > +		if (dest) {
> > +			if (dest_page)
> > +				kaddr = kmap_atomic(dest_page);
> 
> I think kmap_atomic should not be used, there was a patchset cleaning
> it up and replacing it with kmap_local, so we should not introduce new
> instances.
> 
> > +
> > +			data = btrfs_item_ptr(leaf, path->slots[0], void);
> > +			read_extent_buffer(leaf, kaddr + dest_offset,
> > +					   (unsigned long)data + copy_offset,
> > +					   copy_bytes);
> > +
> > +			if (dest_page)
> > +				kunmap_atomic(kaddr);
> > +		}
> > +
> > +		offset += copy_bytes;
> > +		dest_offset += copy_bytes;
> > +		len -= copy_bytes;
> > +		copied += copy_bytes;
> > +
> > +		path->slots[0]++;
> > +		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> > +			/*
> > +			 * we've reached the last slot in this leaf and we need
> > +			 * to go to the next leaf.
> > +			 */
> > +			ret = btrfs_next_leaf(root, path);
> > +			if (ret < 0) {
> > +				break;
> > +			} else if (ret > 0) {
> > +				ret = 0;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +out:
> > +	btrfs_free_path(path);
> > +	if (!ret)
> > +		ret = copied;
> > +	return ret;
> 
> In the end it's int and copied u64 is truncated to int.
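
To illustrate the truncation: a u64 count above INT_MAX does not survive
the assignment to an int ret. One way to keep an int return type,
sketched here in user space, is to bound the count before converting it.
bytes_to_ret() is a hypothetical helper, and clamping is only one
possible policy (returning an error would be another).

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn a u64 byte count into an int return value,
 * clamping to INT_MAX instead of silently dropping the high bits.
 */
static int bytes_to_ret(uint64_t copied)
{
	if (copied > (uint64_t)INT_MAX)
		return INT_MAX;
	return (int)copied;
}
```
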
> 
> > +}
> > +
> > +/*
> > + * Drop verity items from the btree and from the page cache
> > + *
> > + * @inode: the inode to drop items for
> > + *
> > + * If we fail partway through enabling verity, enable verity and have some
> > + * partial data extant, or cleanup orphaned verity data, we need to truncate it
>                    extent
> 
> > + * from the cache and delete the items themselves from the btree.
> > + *
> > + * Returns 0 on success, negative error code on failure.
> > + */
> > +int btrfs_drop_verity_items(struct btrfs_inode *inode)
> > +{
> > +	int ret;
> > +	struct inode *ino = &inode->vfs_inode;
> 
> 'ino' is usually used for inode number so this is a bit confusing.
> 
> > +
> > +	truncate_inode_pages(ino->i_mapping, ino->i_size);
> > +	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
> > +	if (ret)
> > +		return ret;
> > +	return drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
> > +}
> > +
> > +/*
> > + * fsverity op that begins enabling verity.
> > + * fsverity calls this to ask us to setup the inode for enabling.  We
> > + * drop any existing verity items and set the in progress bit.
> 
> Please rephrase it so it says something like "Begin enabling verity on
> an inode. We drop ... "
> 
> > + */
> > +static int btrfs_begin_enable_verity(struct file *filp)
> > +{
> > +	struct inode *inode = file_inode(filp);
> 
> Please replace this with struct btrfs_inode * inode = ... and don't do
> the BTRFS_I conversion in the rest of the function.
> 
> > +	int ret;
> > +
> > +	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags))
> > +		return -EBUSY;
> > +
> > +	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> 
> So the test and set are separate, can this race? No, as this is called
> under the inode lock, but it takes a trip to the fsverity sources to be
> sure. I'd suggest adding at least an inode lock assertion; a comment
> would also do, but it is weaker than a runtime check.
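
If the lock ever stopped covering this path, the check and the set could
be folded into one atomic step (in the kernel that would be
test_and_set_bit()). A user-space analogue using C11 atomics, where
VERITY_IN_PROGRESS and begin_enable() are made-up names for the sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for the in-progress runtime flag. */
#define VERITY_IN_PROGRESS (1UL << 0)

/*
 * Atomically set the flag and report whether this caller was the one
 * that set it. Returns false if an enable was already in progress.
 */
static bool begin_enable(atomic_ulong *flags)
{
	unsigned long old = atomic_fetch_or(flags, VERITY_IN_PROGRESS);

	return !(old & VERITY_IN_PROGRESS);
}
```
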
> 
> > +	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
> > +	if (ret)
> > +		goto err;
> > +
> > +	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
> > +	if (ret)
> > +		goto err;
> > +
> > +	return 0;
> > +
> > +err:
> > +	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> > +	return ret;
> > +
> 
> Extra newline
> 
> > +}
> > +
> > +/*
> > + * fsverity op that ends enabling verity.
> > + * fsverity calls this when it's done with all of the pages in the file
> > + * and all of the merkle items have been inserted.  We write the
> > + * descriptor and update the inode in the btree to reflect its new life
> > + * as a verity file.
> 
> Please rephrase
> 
> > + */
> > +static int btrfs_end_enable_verity(struct file *filp, const void *desc,
> > +				  size_t desc_size, u64 merkle_tree_size)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct inode *inode = file_inode(filp);
> 
> Same as above, replace by btrfs inode and drop BTRFS_I below
> 
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > +	struct btrfs_verity_descriptor_item item;
> > +	int ret;
> > +
> > +	if (desc != NULL) {
> > +		/* write out the descriptor item */
> > +		memset(&item, 0, sizeof(item));
> > +		btrfs_set_stack_verity_descriptor_size(&item, desc_size);
> > +		ret = write_key_bytes(BTRFS_I(inode),
> > +				      BTRFS_VERITY_DESC_ITEM_KEY, 0,
> > +				      (const char *)&item, sizeof(item));
> > +		if (ret)
> > +			goto out;
> > +		/* write out the descriptor itself */
> > +		ret = write_key_bytes(BTRFS_I(inode),
> > +				      BTRFS_VERITY_DESC_ITEM_KEY, 1,
> > +				      desc, desc_size);
> > +		if (ret)
> > +			goto out;
> > +
> > +		/* update our inode flags to include fs verity */
> > +		trans = btrfs_start_transaction(root, 1);
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			goto out;
> > +		}
> > +		BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
> > +		btrfs_sync_inode_flags_to_i_flags(inode);
> > +		ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
> > +		btrfs_end_transaction(trans);
> > +	}
> > +
> > +out:
> > +	if (desc == NULL || ret) {
> > +		/* If we failed, drop all the verity items */
> > +		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
> > +		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
> > +	} else
> 
> 	} else {
> 
> > +		btrfs_set_fs_compat_ro(root->fs_info, VERITY);
> 
> 	}
> 
> > +	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * fsverity op that gets the struct fsverity_descriptor.
> > + * fsverity does a two pass setup for reading the descriptor, in the first pass
> > + * it calls with buf_size = 0 to query the size of the descriptor,
> > + * and then in the second pass it actually reads the descriptor off
> > + * disk.
> > + */
> > +static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
> > +				       size_t buf_size)
> > +{
> > +	u64 true_size;
> > +	ssize_t ret = 0;
> > +	struct btrfs_verity_descriptor_item item;
> > +
> > +	memset(&item, 0, sizeof(item));
> > +	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
> > +			     0, (char *)&item, sizeof(item), NULL);
> 
> Given that read_key_bytes does not need to return ssize_t, you can
> switch ret to int here, so the function return type actually matches
> what you return.
> 
> > +	if (ret < 0)
> > +		return ret;
> 
> eg. here
> 
> > +
> > +	if (item.reserved[0] != 0 || item.reserved[1] != 0)
> > +		return -EUCLEAN;
> > +
> > +	true_size = btrfs_stack_verity_descriptor_size(&item);
> > +	if (true_size > INT_MAX)
> > +		return -EUCLEAN;
> > +
> > +	if (!buf_size)
> > +		return true_size;
> > +	if (buf_size < true_size)
> > +		return -ERANGE;
> > +
> > +	ret = read_key_bytes(BTRFS_I(inode),
> > +			     BTRFS_VERITY_DESC_ITEM_KEY, 1,
> > +			     buf, buf_size, NULL);
> > +	if (ret < 0)
> > +		return ret;
> > +	if (ret != true_size)
> > +		return -EIO;
> > +
> > +	return true_size;
> > +}
> > +
> > +/*
> > + * fsverity op that reads and caches a merkle tree page.  These are stored
> > + * in the btree, but we cache them in the inode's address space after EOF.
> > + */
> > +static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
> > +					       pgoff_t index,
> > +					       unsigned long num_ra_pages)
> > +{
> > +	struct page *p;
> 
> Please don't use single letter variables
> 
> > +	u64 off = index << PAGE_SHIFT;
> 
> pgoff_t is unsigned long, so on 32bit the shift will trim the high
> bits; you may want to use the page_offset helper instead.
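
A user-space sketch of the difference, modelling a 32-bit pgoff_t with
uint32_t. The typedef, the PAGE_SHIFT value of 12 and both function
names are assumptions for the example only.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Models pgoff_t on an architecture where unsigned long is 32 bits. */
typedef uint32_t pgoff32_t;

/* Shift done in 32 bits, then widened: the high bits are already gone. */
static uint64_t offset_truncating(pgoff32_t index)
{
	return (uint32_t)(index << PAGE_SHIFT);
}

/* Widen first, then shift: the full 64-bit byte offset survives. */
static uint64_t offset_widened(pgoff32_t index)
{
	return (uint64_t)index << PAGE_SHIFT;
}
```

For index 0x100000 (the page at the 4G mark) the first variant yields 0
while the second yields 0x100000000.
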
> 
> > +	loff_t merkle_pos = merkle_file_pos(inode);
> 
> u64, that should work with comparison to loff_t
> 
> > +	ssize_t ret;
> > +	int err;
> > +
> > +	if (merkle_pos > inode->i_sb->s_maxbytes - off - PAGE_SIZE)
> > +		return ERR_PTR(-EFBIG);
> > +	index += merkle_pos >> PAGE_SHIFT;
> > +again:
> > +	p = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
> > +	if (p) {
> > +		if (PageUptodate(p))
> > +			return p;
> > +
> > +		lock_page(p);
> > +		/*
> > +		 * we only insert uptodate pages, so !Uptodate has to be
> > +		 * an error
> > +		 */
> > +		if (!PageUptodate(p)) {
> > +			unlock_page(p);
> > +			put_page(p);
> > +			return ERR_PTR(-EIO);
> > +		}
> > +		unlock_page(p);
> > +		return p;
> > +	}
> > +
> > +	p = page_cache_alloc(inode->i_mapping);
> 
> So this performs an allocation with GFP flags from the inode mapping.
> I'm not sure if this is safe, eg. in add_ra_bio_pages we do
> 
> 	page = __page_cache_alloc(mapping_gfp_constraint(mapping,
> 							 ~__GFP_FS));
> 
> to emulate GFP_NOFS. Either that or do the scoped nofs with
> memalloc_nofs_save/_restore.
> 
> > +	if (!p)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/*
> > +	 * merkle item keys are indexed from byte 0 in the merkle tree.
> > +	 * they have the form:
> > +	 *
> > +	 * [ inode objectid, BTRFS_MERKLE_ITEM_KEY, offset in bytes ]
> > +	 */
> > +	ret = read_key_bytes(BTRFS_I(inode),
> > +			     BTRFS_VERITY_MERKLE_ITEM_KEY, off,
> > +			     page_address(p), PAGE_SIZE, p);
> > +	if (ret < 0) {
> > +		put_page(p);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	/* zero fill any bytes we didn't write into the page */
> > +	if (ret < PAGE_SIZE) {
> > +		char *kaddr = kmap_atomic(p);
> > +
> > +		memset(kaddr + ret, 0, PAGE_SIZE - ret);
> > +		kunmap_atomic(kaddr);
> 
> There's a helper, memzero_page, wrapping the kmap

FWIW, that helper uses kmap_atomic, not kmap_local_page. Would you
prefer I use the helper or not introduce new uses of kmap_atomic?

> 
> > +	}
> > +	SetPageUptodate(p);
> > +	err = add_to_page_cache_lru(p, inode->i_mapping, index,
> 
> Please drop err and use ret
> 
> > +				    mapping_gfp_mask(inode->i_mapping));
> > +
> > +	if (!err) {
> > +		/* inserted and ready for fsverity */
> > +		unlock_page(p);
> > +	} else {
> > +		put_page(p);
> > +		/* did someone race us into inserting this page? */
> > +		if (err == -EEXIST)
> > +			goto again;
> > +		p = ERR_PTR(err);
> > +	}
> > +	return p;
> > +}
> > +
> > +/*
> > + * fsverity op that writes a merkle tree block into the btree in 1k chunks.
> 
> Should it say "in 2^log_blocksize chunks" instead?
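
Both sizes are in play here: fsverity hands over one block of
2^log_blocksize bytes, and write_key_bytes() in this patch then splits
it into items of at most 1K. A sketch of that arithmetic, with struct
block_span and merkle_block_span() as hypothetical names:

```c
#include <stdint.h>

/* Per-item payload limit used by write_key_bytes() in the patch. */
#define ITEM_CHUNK 1024ULL

/* Byte range of one Merkle tree block and the number of items it needs. */
struct block_span {
	uint64_t off;
	uint64_t len;
	uint64_t nr_items;
};

static struct block_span merkle_block_span(uint64_t index, int log_blocksize)
{
	struct block_span span;

	span.off = index << log_blocksize;
	/* 1ULL keeps the shift in 64 bits regardless of log_blocksize. */
	span.len = 1ULL << log_blocksize;
	span.nr_items = (span.len + ITEM_CHUNK - 1) / ITEM_CHUNK;
	return span;
}
```

With a 4K block (log_blocksize = 12), block index 3 starts at byte 12288
and is stored as four 1K items.
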
> 
> > + */
> > +static int btrfs_write_merkle_tree_block(struct inode *inode, const void *buf,
> > +					u64 index, int log_blocksize)
> > +{
> > +	u64 off = index << log_blocksize;
> > +	u64 len = 1 << log_blocksize;
> > +
> > +	if (merkle_file_pos(inode) > inode->i_sb->s_maxbytes - off - len)
> > +		return -EFBIG;
> > +
> > +	return write_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY,
> > +			       off, buf, len);
> > +}
> > +
> > +const struct fsverity_operations btrfs_verityops = {
> > +	.begin_enable_verity	= btrfs_begin_enable_verity,
> > +	.end_enable_verity	= btrfs_end_enable_verity,
> > +	.get_verity_descriptor	= btrfs_get_verity_descriptor,
> > +	.read_merkle_tree_page	= btrfs_read_merkle_tree_page,
> > +	.write_merkle_tree_block = btrfs_write_merkle_tree_block,
> > +};
> > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> > index 5df73001aad4..fa21c8aac78d 100644
> > --- a/include/uapi/linux/btrfs.h
> > +++ b/include/uapi/linux/btrfs.h
> > @@ -288,6 +288,7 @@ struct btrfs_ioctl_fs_info_args {
> >   * first mount when booting older kernel versions.
> >   */
> >  #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
> > +#define BTRFS_FEATURE_COMPAT_RO_VERITY		(1ULL << 2)
> >  
> >  #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
> >  #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
> > @@ -308,7 +309,6 @@ struct btrfs_ioctl_fs_info_args {
> >  #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
> >  #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
> >  #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
> > -
> 
> Keep the newline please
> 
> >  struct btrfs_ioctl_feature_flags {
> >  	__u64 compat_flags;
> >  	__u64 compat_ro_flags;
> > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > index ae25280316bd..2be57416f886 100644
> > --- a/include/uapi/linux/btrfs_tree.h
> > +++ b/include/uapi/linux/btrfs_tree.h
> > @@ -118,6 +118,14 @@
> >  #define BTRFS_INODE_REF_KEY		12
> >  #define BTRFS_INODE_EXTREF_KEY		13
> >  #define BTRFS_XATTR_ITEM_KEY		24
> > +
> > +/*
> > + * fsverity has a descriptor per file, and then
> > + * a number of sha or csum items indexed by offset in to the file.
> > + */
> > +#define BTRFS_VERITY_DESC_ITEM_KEY	36
> > +#define BTRFS_VERITY_MERKLE_ITEM_KEY	37
> > +
> >  #define BTRFS_ORPHAN_ITEM_KEY		48
> >  /* reserve 2-15 close to the inode for later flexibility */
> >  
> > @@ -996,4 +1004,11 @@ struct btrfs_qgroup_limit_item {
> >  	__le64 rsv_excl;
> >  } __attribute__ ((__packed__));
> >  
> > +struct btrfs_verity_descriptor_item {
> > +	/* size of the verity descriptor in bytes */
> > +	__le64 size;
> > +	__le64 reserved[2];
> 
> Is the reserved space "just in case" or are there plans to use it? For
> items the extension and compatibility can be done by checking the item
> size, without further flags or bits set to distinguish that.
> 
> If the extension happens rarely it's manageable to do the size check
> instead of reserving the space.
> 
> The reserved space must otherwise be zero if not used; this serves as
> the way to check compatibility. It may still need additional code to
> make sure an old kernel recognizes unknown contents and e.g. refuses to
> work. I can imagine that in the context of verity this could be significant.
> 
> > +	__u8 encryption;
> > +} __attribute__ ((__packed__));
> > +
> >  #endif /* _BTRFS_CTREE_H_ */
> 
> 


* Re: [PATCH v4 2/5] btrfs: initial fsverity support
  2021-05-13 19:19     ` Boris Burkov
@ 2021-05-17 21:40       ` David Sterba
  0 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2021-05-17 21:40 UTC (permalink / raw)
  To: Boris Burkov; +Cc: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Thu, May 13, 2021 at 12:19:38PM -0700, Boris Burkov wrote:
> On Tue, May 11, 2021 at 10:31:43PM +0200, David Sterba wrote:
> > On Wed, May 05, 2021 at 12:20:40PM -0700, Boris Burkov wrote:
> > > +	/* zero fill any bytes we didn't write into the page */
> > > +	if (ret < PAGE_SIZE) {
> > > +		char *kaddr = kmap_atomic(p);
> > > +
> > > +		memset(kaddr + ret, 0, PAGE_SIZE - ret);
> > > +		kunmap_atomic(kaddr);
> > 
> > There's helper memzero_page wrapping the kmap
> 
> FWIW, that helper uses kmap_atomic, not kmap_local_page. Would you
> prefer I use the helper or not introduce new uses of kmap_atomic?

Please use the memzero_page helper, see d048b9c2a737eb791a5e9 "btrfs:
use memzero_page() instead of open coded kmap pattern" for more details.
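
For illustration only, the tail zero-fill under discussion can be modeled in
userspace. This is a sketch under assumptions: zero_fill_tail() and
MODEL_PAGE_SIZE are hypothetical names that do not appear in the patch, and a
plain buffer stands in for the struct page. In the kernel, the equivalent
would presumably be a single memzero_page(p, ret, PAGE_SIZE - ret) call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PAGE_SIZE; the real value is arch-dependent. */
#define MODEL_PAGE_SIZE 4096

/*
 * Zero the part of a page-sized buffer that a short read did not fill.
 * 'filled' models the byte count returned by the read helper.
 * Returns the number of bytes that were zeroed.
 */
static size_t zero_fill_tail(unsigned char *page, size_t filled)
{
	if (filled >= MODEL_PAGE_SIZE)
		return 0;

	memset(page + filled, 0, MODEL_PAGE_SIZE - filled);
	return MODEL_PAGE_SIZE - filled;
}
```

The point of the helper is the same as in the review comment: the caller
states offset and length once and no longer open-codes the map/memset/unmap
sequence.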


* Re: [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-05-11 19:11   ` David Sterba
@ 2021-05-17 21:48     ` David Sterba
  2021-05-19 21:45       ` Boris Burkov
  0 siblings, 1 reply; 26+ messages in thread
From: David Sterba @ 2021-05-17 21:48 UTC (permalink / raw)
  To: dsterba, Boris Burkov, linux-btrfs, linux-fscrypt, kernel-team

On Tue, May 11, 2021 at 09:11:08PM +0200, David Sterba wrote:
> On Wed, May 05, 2021 at 12:20:39PM -0700, Boris Burkov wrote:
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -191,6 +191,7 @@ struct btrfs_inode {
> >  
> >  	/* flags field from the on disk inode */
> >  	u32 flags;
> > +	u64 compat_flags;
> 
> This got me curious, u32 flags is for the in-memory inode, but the
> on-disk inode_item::flags is u64
> 
> >  BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
>                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> > +BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
> 
> >  	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
> 
> Which means we currently use only 32 bits and half of the on-disk
> inode_item::flags is always zero. So the idea is to repurpose this for
> the incompat bits (say upper 16 bits). With a minimal patch to tree
> checker we can make old kernels accept a verity-enabled kernel.
> 
> It could be tricky, but for backport only additional bitmask would be
> added to BTRFS_INODE_FLAG_MASK to ignore bits 48-63.
> 
> For proper support the inode_item::flags can be simply used as one space
> where the split would be just logical, and IMO manageable.

To demonstrate the idea, here's a compile-tested patch, based on
current misc-next but the verity bits are easy to match to your
patchset:

- btrfs_inode::ro_flags - in-memory representation of the ro flags
- tree-checker verifies the flags separately
  - errors if there are unknown flags (u32)
  - errors if ro_flags don't match fs ro_compat bits
- inode_item::flags gets synced with btrfs_inode::flags + ro_flags
- the split of inode_item::flags is 32/32 for simplicity as it matches
  the current type, we can make it 48/16 if that would work (maybe not)
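
The 32/32 split described in the list above can be sketched in plain C. This
is a userspace model, not kernel code: the helper names are hypothetical (the
patch open-codes the same shifts and casts), and fixed-width stdint types
stand in for u32/u64.

```c
#include <stdint.h>

/* Hypothetical helper names; the patch open-codes these shifts and casts. */

/* Lower 32 bits of the on-disk value: the regular inode flags. */
static uint32_t inode_flags_lower(uint64_t disk_flags)
{
	return (uint32_t)disk_flags;
}

/* Upper 32 bits of the on-disk value: the read-only compat flags. */
static uint32_t inode_flags_upper(uint64_t disk_flags)
{
	return (uint32_t)(disk_flags >> 32);
}

/* Recombine both in-memory halves into the 64-bit on-disk value. */
static uint64_t inode_flags_merge(uint32_t flags, uint32_t ro_flags)
{
	return (uint64_t)flags | ((uint64_t)ro_flags << 32);
}
```

Note the cast to a 64-bit type before shifting ro_flags; shifting a 32-bit
value left by 32 would be undefined.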

---

--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -189,8 +189,10 @@ struct btrfs_inode {
 	 */
 	u64 csum_bytes;
 
-	/* flags field from the on disk inode */
+	/* Flags field from the on disk inode, lower half of inode_item::flags  */
 	u32 flags;
+	/* Read-only compatibility flags, upper half of inode_item::flags */
+	u32 ro_flags;
 
 	/*
 	 * Counters to keep track of the number of extent item's we may use due
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -281,7 +281,8 @@ struct btrfs_super_block {
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
 	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
-	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
+	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID | \
+	 BTRFS_FEATURE_COMPAT_RO_VERITY)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -1490,6 +1491,8 @@ do {                                                                   \
 
 #define BTRFS_INODE_ROOT_ITEM_INIT	(1 << 31)
 
+#define BTRFS_INODE_RO_VERITY		(1ULL << 32)
+
 #define BTRFS_INODE_FLAG_MASK						\
 	(BTRFS_INODE_NODATASUM |					\
 	 BTRFS_INODE_NODATACOW |					\
@@ -1505,6 +1508,9 @@ do {                                                                   \
 	 BTRFS_INODE_COMPRESS |						\
 	 BTRFS_INODE_ROOT_ITEM_INIT)
 
+#define BTRFS_INODE_FLAG_INCOMPAT_MASK			(0x00000000FFFFFFFF)
+#define BTRFS_INODE_FLAG_RO_COMPAT_MASK			(0xFFFFFFFF00000000)
+
 struct btrfs_map_token {
 	struct extent_buffer *eb;
 	char *kaddr;
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1717,7 +1717,8 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
 				       inode_peek_iversion(inode));
 	btrfs_set_stack_inode_transid(inode_item, trans->transid);
 	btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
-	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
+	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags |
+			((u64)BTRFS_I(inode)->ro_flags << 32));
 	btrfs_set_stack_inode_block_group(inode_item, 0);
 
 	btrfs_set_stack_timespec_sec(&inode_item->atime,
@@ -1775,7 +1776,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
 				   btrfs_stack_inode_sequence(inode_item));
 	inode->i_rdev = 0;
 	*rdev = btrfs_stack_inode_rdev(inode_item);
-	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
+	BTRFS_I(inode)->flags = (u32)btrfs_stack_inode_flags(inode_item);
+	BTRFS_I(inode)->ro_flags = (u32)(btrfs_stack_inode_flags(inode_item) >> 32);
 
 	inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
 	inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3630,7 +3630,8 @@ static int btrfs_read_locked_inode(struct inode *inode,
 	rdev = btrfs_inode_rdev(leaf, inode_item);
 
 	BTRFS_I(inode)->index_cnt = (u64)-1;
-	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
+	BTRFS_I(inode)->flags = (u32)btrfs_inode_flags(leaf, inode_item);
+	BTRFS_I(inode)->ro_flags = (u32)(btrfs_inode_flags(leaf, inode_item) >> 32);
 
 cache_index:
 	/*
@@ -3796,7 +3797,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode));
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
-	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags |
+			((u64)BTRFS_I(inode)->ro_flags << 32));
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -378,7 +378,7 @@ static int check_csum_item(struct extent_buffer *leaf, struct btrfs_key *key,
 
 /* Inode item error output has the same format as dir_item_err() */
 #define inode_item_err(eb, slot, fmt, ...)			\
-	dir_item_err(eb, slot, fmt, __VA_ARGS__)
+	dir_item_err(eb, slot, fmt, ## __VA_ARGS__)
 
 static int check_inode_key(struct extent_buffer *leaf, struct btrfs_key *key,
 			   int slot)
@@ -999,6 +999,7 @@ static int check_inode_item(struct extent_buffer *leaf,
 	u32 valid_mask = (S_IFMT | S_ISUID | S_ISGID | S_ISVTX | 0777);
 	u32 mode;
 	int ret;
+	u64 inode_flags;
 
 	ret = check_inode_key(leaf, key, slot);
 	if (unlikely(ret < 0))
@@ -1054,13 +1055,22 @@ static int check_inode_item(struct extent_buffer *leaf,
 			btrfs_inode_nlink(leaf, iitem));
 		return -EUCLEAN;
 	}
-	if (unlikely(btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)) {
+	inode_flags = btrfs_inode_flags(leaf, iitem);
+	if (unlikely(inode_flags & BTRFS_INODE_FLAG_INCOMPAT_MASK & ~BTRFS_INODE_FLAG_MASK)) {
 		inode_item_err(leaf, slot,
-			       "unknown flags detected: 0x%llx",
-			       btrfs_inode_flags(leaf, iitem) &
-			       ~BTRFS_INODE_FLAG_MASK);
+			       "unknown incompat flags detected: 0x%llx",
+			       inode_flags & BTRFS_INODE_FLAG_INCOMPAT_MASK & ~BTRFS_INODE_FLAG_MASK);
 		return -EUCLEAN;
 	}
+	if (unlikely(inode_flags & BTRFS_INODE_FLAG_RO_COMPAT_MASK)) {
+		if (unlikely(inode_flags & BTRFS_INODE_RO_VERITY)) {
+			if (!btrfs_fs_compat_ro(fs_info, VERITY)) {
+				inode_item_err(leaf, slot,
+			"inode ro compat VERITY flag set but not on filesystem");
+				return -EUCLEAN;
+			}
+		}
+	}
 	return 0;
 }
 
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3941,7 +3941,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode));
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
-	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags |
+			((u64)BTRFS_I(inode)->ro_flags << 32));
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -288,6 +288,7 @@ struct btrfs_ioctl_fs_info_args {
  * first mount when booting older kernel versions.
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
+#define BTRFS_FEATURE_COMPAT_RO_VERITY			(1ULL << 2)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
-- 
2.29.2



* Re: [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-05-17 21:48     ` David Sterba
@ 2021-05-19 21:45       ` Boris Burkov
  2021-06-07 21:43         ` David Sterba
  0 siblings, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2021-05-19 21:45 UTC (permalink / raw)
  To: dsterba, linux-btrfs, linux-fscrypt, kernel-team

On Mon, May 17, 2021 at 11:48:59PM +0200, David Sterba wrote:
> On Tue, May 11, 2021 at 09:11:08PM +0200, David Sterba wrote:
> > On Wed, May 05, 2021 at 12:20:39PM -0700, Boris Burkov wrote:
> > > --- a/fs/btrfs/btrfs_inode.h
> > > +++ b/fs/btrfs/btrfs_inode.h
> > > @@ -191,6 +191,7 @@ struct btrfs_inode {
> > >  
> > >  	/* flags field from the on disk inode */
> > >  	u32 flags;
> > > +	u64 compat_flags;
> > 
> > This got me curious, u32 flags is for the in-memory inode, but the
> > on-disk inode_item::flags is u64
> > 
> > >  BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
> >                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > > +BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
> > 
> > >  	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
> > 
> > Which means we currently use only 32 bits and half of the on-disk
> > inode_item::flags is always zero. So the idea is to repurpose this for
> > the incompat bits (say upper 16 bits). With a minimal patch to tree
> > checker we can make old kernels accept a verity-enabled kernel.
> > 
> > It could be tricky, but for backport only additional bitmask would be
> > added to BTRFS_INODE_FLAG_MASK to ignore bits 48-63.
> > 
> > For proper support the inode_item::flags can be simply used as one space
> > where the split would be just logical, and IMO manageable.
> 
> To demonstrate the idea, here's a compile-tested patch, based on
> current misc-next but the verity bits are easy to match to your
> patchset:

Thanks for taking the time to prove this idea out. However, I'd still
like to discuss the pros/cons of this approach for this application.

As far as I can tell, the two issues at hand are ensuring compatibility
and using fewer of the reserved bits. Your proposal uses 0 reserved
bits, which is great, but is still quite a headache for compatibility,
as an administrator would have to backport the compat patch on any kernel
they wanted to roll back to before the one this went out on.

This is especially painful for less well-loved things like
dracut/systemd mounting the root filesystem and doing a pivot_root during
boot. You would have to make sure that any machine using fsverity btrfs
files has an updated initramfs kernel or it won't be able to boot.

Alternatively, we could have our cake and eat it too if we separate the
idea of unlocking the top 32 bits of the inode flags from adding compat
flags.

If we:
1. take a u16 or a u32 out of reserved and make it compat flags (my
patch, but shrinking from u64)
2. implement something similar to your patch, but don't use those 32
bits just yet

Then we are setup to more conveniently use the freed-up 32 bits in the
future, as the application which wants reserved bytes then will have a
buffer of kernel versions to trivially roll back into, which may cover
most practical rollbacks.

For what it's worth, I do like that your proposal stuffs inode flags and
inode compat flags together, which is certainly neater than turning the
upper 32 of inode flags into general reserved bits. But I'm just not
sure that the aesthetic benefit is worth the real pain now.



* Re: [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-05-05 19:20 ` [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
  2021-05-11 19:11   ` David Sterba
@ 2021-05-25 18:12   ` Eric Biggers
  2021-06-07 21:10     ` David Sterba
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Biggers @ 2021-05-25 18:12 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, linux-fscrypt, kernel-team

On Wed, May 05, 2021 at 12:20:39PM -0700, Boris Burkov wrote:
> The tree checker currently rejects unrecognized flags when it reads
> btrfs_inode_item. Practically, this means that adding a new flag makes
> the change backwards incompatible if the flag is ever set on a file.
> 
> Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a
> new "compat_flags". These flags are zero on inode creation in btrfs and
> mkfs and are ignored by an older kernel, so it should be safe to use
> them in this way.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

This patchset doesn't have a cover letter anymore for some reason.

Also, please mention where this patchset applies to.  I tried mainline and
btrfs/for-next, but neither works.

- Eric


* Re: [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-05-25 18:12   ` Eric Biggers
@ 2021-06-07 21:10     ` David Sterba
  0 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2021-06-07 21:10 UTC (permalink / raw)
  To: Eric Biggers; +Cc: Boris Burkov, linux-btrfs, linux-fscrypt, kernel-team

On Tue, May 25, 2021 at 11:12:21AM -0700, Eric Biggers wrote:
> On Wed, May 05, 2021 at 12:20:39PM -0700, Boris Burkov wrote:
> > The tree checker currently rejects unrecognized flags when it reads
> > btrfs_inode_item. Practically, this means that adding a new flag makes
> > the change backwards incompatible if the flag is ever set on a file.
> > 
> > Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a
> > new "compat_flags". These flags are zero on inode creation in btrfs and
> > mkfs and are ignored by an older kernel, so it should be safe to use
> > them in this way.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> 
> This patchset doesn't have a cover letter anymore for some reason.
> 
> Also, please mention where this patchset applies to.  I tried mainline and
> btrfs/for-next, but neither works.

There was a parallel change updating file attributes that conflicts
with the patchset as sent. Boris is aware of that, and the new version
will be based on something that applies on top of the btrfs
development branch again.


* Re: [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-05-19 21:45       ` Boris Burkov
@ 2021-06-07 21:43         ` David Sterba
  0 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2021-06-07 21:43 UTC (permalink / raw)
  To: Boris Burkov; +Cc: dsterba, linux-btrfs, linux-fscrypt, kernel-team

Hi,

sorry I did not notice you replied some time ago.

On Wed, May 19, 2021 at 02:45:23PM -0700, Boris Burkov wrote:
> On Mon, May 17, 2021 at 11:48:59PM +0200, David Sterba wrote:
> > On Tue, May 11, 2021 at 09:11:08PM +0200, David Sterba wrote:
> > > On Wed, May 05, 2021 at 12:20:39PM -0700, Boris Burkov wrote:
> > > > --- a/fs/btrfs/btrfs_inode.h
> > > > +++ b/fs/btrfs/btrfs_inode.h
> > > > @@ -191,6 +191,7 @@ struct btrfs_inode {
> > > >  
> > > >  	/* flags field from the on disk inode */
> > > >  	u32 flags;
> > > > +	u64 compat_flags;
> > > 
> > > This got me curious, u32 flags is for the in-memory inode, but the
> > > on-disk inode_item::flags is u64
> > > 
> > > >  BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
> > >                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > 
> > > > +BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
> > > 
> > > >  	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
> > > 
> > > Which means we currently use only 32 bits and half of the on-disk
> > > inode_item::flags is always zero. So the idea is to repurpose this for
> > > the incompat bits (say upper 16 bits). With a minimal patch to tree
> > > checker we can make old kernels accept a verity-enabled kernel.
> > > 
> > > It could be tricky, but for backport only additional bitmask would be
> > > added to BTRFS_INODE_FLAG_MASK to ignore bits 48-63.
> > > 
> > > For proper support the inode_item::flags can be simply used as one space
> > > where the split would be just logical, and IMO manageable.
> > 
> > To demonstrate the idea, here's a compile-tested patch, based on
> > current misc-next but the verity bits are easy to match to your
> > patchset:
> 
> Thanks for taking the time to prove this idea out. However, I'd still
> like to discuss the pros/cons of this approach for this application.
> 
> As far as I can tell, the two issues at hand are ensuring compatibility
> and using fewer of the reserved bits. Your proposal uses 0 reserved
> bits, which is great, but is still quite a headache for compatibility,
> as an administrator would have to backport the compat patch on any kernel
> they wanted to roll back to before the one this went out on.

The compatibility problems are there for any new feature and usually
it's strict no mount, while here we can do a read-only compat mode at
least. Deploying a new feature should always take the fallback mount
into account, so it's advisable to wait a few releases eg. up to the
next stable release.

Luckily in that case we can backport the compatibility to the older
stable trees so the fallback would work after a minor release.
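
To make the "read-only compat" fallback concrete, here is a small userspace
model of the mount-time decision: unknown compat_ro bits do not block a
read-only mount, they only block a read-write one. The function, enum, and
MODEL_* names are hypothetical; the bit positions mirror the
BTRFS_FEATURE_COMPAT_RO_* defines quoted earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the BTRFS_FEATURE_COMPAT_RO_* defines. */
#define MODEL_COMPAT_RO_FREE_SPACE_TREE		(1ULL << 0)
#define MODEL_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
#define MODEL_COMPAT_RO_VERITY			(1ULL << 2)

enum mount_result { MOUNT_OK, MOUNT_REFUSED };

/*
 * 'sb_compat_ro' models the compat_ro flags stored in the superblock,
 * 'supported' the set this kernel understands.
 */
static enum mount_result check_compat_ro(uint64_t sb_compat_ro,
					 uint64_t supported, bool read_only)
{
	uint64_t unknown = sb_compat_ro & ~supported;

	/* Unknown ro-compat features only forbid read-write mounts. */
	if (unknown && !read_only)
		return MOUNT_REFUSED;

	return MOUNT_OK;
}
```

An old kernel that does not know the verity bit would thus still mount the
filesystem read-only, provided its tree checker also tolerates the per-inode
bit, which is what the proposed backport is for.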

> This is especially painful for less well-loved things like
> dracut/systemd mounting the root filesystem and doing a pivot_root during
> boot. You would have to make sure that any machine using fsverity btrfs
> files has an updated initramfs kernel or it won't be able to boot.

So I hope this would get covered by the backports, as discussed, to 5.4
and 5.10.

> Alternatively, we could have our cake and eat it too if we separate the
> idea of unlocking the top 32 bits of the inode flags from adding compat
> flags.
> 
> If we:
> 1. take a u16 or a u32 out of reserved and make it compat flags (my
> patch, but shrinking from u64)
> 2. implement something similar to your patch, but don't use those 32
> bits just yet
> 
> Then we are setup to more conveniently use the freed-up 32 bits in the
> future, as the application which wants reserved bytes then will have a
> buffer of kernel versions to trivially roll back into, which may cover
> most practical rollbacks.
> 
> For what it's worth, I do like that your proposal stuffs inode flags and
> inode compat flags together, which is certainly neater than turning the
> upper 32 of inode flags into general reserved bits. But I'm just not
> sure that the aesthetic benefit is worth the real pain now.

My motivation is not aesthetic, rather I'm very conservative when
on-disk structures get changed, and inode is the core structure.
Curiously, you can thank Josef who switched the per-inode compat flags
to whole-filesystem only in f2b636e80d8206dd40 "Btrfs: add support for
compat flags to btrfs". But that was in 2008 and was a hard incompatible
change that led to the last major format change (the _BHRfS_M
signature).

If the incompat change can be squeezed into the existing structure, it
leaves the reserved fields untouched. Right now we have 4x u64. Any
other change requires increasing the item size, which is ultimately
possible but brings other problems. So if there's a possibility not to go
to the next level, I'll pursue it. Right now the major objection is the
problem with deployment and fallback mount, but I think this is solved.

So far I haven't found any problem with merging the ro compat flags into
the normal flags as such, so as agreed offline, we're going to do that.


end of thread, other threads:[~2021-06-07 21:46 UTC | newest]

Thread overview: 26+ messages
     [not found] <cover.1620241221.git.boris@bur.io>
2021-05-05 19:20 ` [PATCH v4 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
2021-05-11 19:11   ` David Sterba
2021-05-17 21:48     ` David Sterba
2021-05-19 21:45       ` Boris Burkov
2021-06-07 21:43         ` David Sterba
2021-05-25 18:12   ` Eric Biggers
2021-06-07 21:10     ` David Sterba
2021-05-05 19:20 ` [PATCH v4 2/5] btrfs: initial fsverity support Boris Burkov
2021-05-06  0:09   ` kernel test robot
2021-05-06  0:09     ` kernel test robot
2021-05-11 19:20   ` David Sterba
2021-05-11 20:31   ` David Sterba
2021-05-11 21:52     ` Boris Burkov
2021-05-12 17:10       ` David Sterba
2021-05-13 19:19     ` Boris Burkov
2021-05-17 21:40       ` David Sterba
2021-05-12 17:34   ` David Sterba
2021-05-05 19:20 ` [PATCH v4 3/5] btrfs: check verity for reads of inline extents and holes Boris Burkov
2021-05-12 17:57   ` David Sterba
2021-05-12 18:25     ` Boris Burkov
2021-05-05 19:20 ` [PATCH v4 4/5] btrfs: fallback to buffered io for verity files Boris Burkov
2021-05-05 19:20 ` [PATCH v4 5/5] btrfs: verity metadata orphan items Boris Burkov
2021-05-12 17:48   ` David Sterba
2021-05-12 18:08     ` Boris Burkov
2021-05-12 23:36       ` David Sterba
2021-05-05 19:20 [PATCH v4 0/5] btrfs: support fsverity Boris Burkov
